HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository",},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions"
)

If launching from the command line, you can use:

model=Alibaba-NLP/gte-multilingual-base
volume=$PWD/data
revision=refs/pr/7

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision=$revision
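Once the container is up, a quick way to confirm it is serving embeddings is to POST to TEI's `/embed` route. The sketch below assumes the server is reachable at `http://localhost:8080` (matching `-p 8080:80` above); the helper names `build_embed_request` and `embed` are made up for illustration and are not part of TEI or `huggingface_hub`.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for TEI's /embed route.

    base_url is an assumption based on the `-p 8080:80` port mapping.
    """
    payload = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://localhost:8080", timeout=30):
    """Send the request and return the parsed JSON body (one vector per input)."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `embed(["Hello world", "Bonjour le monde"])` should return a list with one embedding per input string.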

nbroad changed pull request title from remove token cls architecture to HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS Aug 16

nbroad

Aug 16

•

edited Aug 16

In short, I had to:

Remove NewModelForTokenClassification from architectures in config.json.
Rename the keys in the safetensors file so that they no longer start with "new" (compare the new keys with the old keys).
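The two steps above can be sketched roughly as follows. This is not the exact script used for this branch: the `"new."` prefix (with a trailing dot), the file paths, and the helper names are assumptions for illustration, so check them against the actual key names before running anything.

```python
import json


def strip_prefix(tensors, prefix="new."):
    """Return a copy of the mapping with `prefix` removed from matching keys.

    The default prefix "new." is an assumption; inspect the real keys first.
    """
    renamed = {}
    for key, value in tensors.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            raise ValueError(f"renaming {key!r} would collide with {new_key!r}")
        renamed[new_key] = value
    return renamed


def drop_architecture(config, name="NewModelForTokenClassification"):
    """Return a copy of the config dict without `name` in `architectures`."""
    updated = dict(config)
    updated["architectures"] = [
        arch for arch in config.get("architectures", []) if arch != name
    ]
    return updated


def rewrite_checkpoint(src_path, dst_path, prefix="new."):
    """Load a safetensors file, rename its keys, and save the result."""
    # Imported lazily so the helpers above work without safetensors/torch.
    from safetensors.torch import load_file, save_file

    tensors = strip_prefix(load_file(src_path), prefix)
    save_file(tensors, dst_path, metadata={"format": "pt"})


def rewrite_config(src_path, dst_path):
    """Read config.json, drop the token-classification architecture, write it back."""
    with open(src_path, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(dst_path, "w", encoding="utf-8") as handle:
        json.dump(drop_architecture(config), handle, indent=2)
```

Printing the key lists before and after `strip_prefix` is an easy way to do the old-versus-new comparison mentioned above.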

izhx

Alibaba-NLP org Aug 17

Huge thanks!
But we would prefer to keep ForTokenClassification in config.json for sparse weight prediction, in case it is needed for automatic model loading with AutoModelForTokenClassification.
I will try to make the existing structure work with TEI, if that is possible.

Will get back to you.

nbroad

Aug 17

You don’t need to merge this. People can use this branch for TEI or Inference Endpoints.

sigridjineth

Aug 17

•

edited Aug 17

@nbroad @izhx I want to run https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base/tree/refs%2Fpr%2F3 and am trying to set id2label and label2id; I am still working on it. Have you tried it?

Also, do you mean that with this branch the model will NOT be able to run inference with sparse weights?

(I am running some experiments here, but nothing fruitful has come out of them yet: https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base/discussions/3)

izhx pinned discussion Aug 17

Maku319

Aug 20

•

edited Aug 20

I'm really sorry to bother you. I’ve tried running TEI using Docker and Cargo, but in Docker it keeps saying that the ONNX model is missing.

docker run -p 8080:80 -v $volume:/data ${local-image} --model-id $model --revision=$revision
2024-08-20T09:32:14.646243Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "Ali****-***/***-************-*ase", revision: Some("refs/pr/7"), tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "127b8c571d1b", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-08-20T09:32:14.646508Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-20T09:32:14.678359Z  INFO download_pool_config: text_embeddings_core::download: core/src/download.rs:38: Downloading `1_Pooling/config.json`
2024-08-20T09:32:17.348216Z  INFO download_new_st_config: text_embeddings_core::download: core/src/download.rs:62: Downloading `config_sentence_transformers.json`
2024-08-20T09:32:17.704659Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:21: Starting download
2024-08-20T09:32:17.704698Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:23: Downloading `config.json`
2024-08-20T09:32:18.507387Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Downloading `tokenizer.json`
2024-08-20T09:32:22.272394Z  INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:368: Downloading `model.onnx`
2024-08-20T09:32:22.635404Z  WARN download_artifacts: text_embeddings_backend: backends/src/lib.rs:372: Could not download `model.onnx`: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/model.onnx)
2024-08-20T09:32:22.635437Z  INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:373: Downloading `onnx/model.onnx`
thread 'main' panicked at /usr/src/backends/src/lib.rs:316:17:
failed to download `model.onnx` or `model.onnx_data`. Check the onnx file exists in the repository. request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/onnx/model.onnx)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

nbroad

Aug 20

This is how I made it work:

model=Alibaba-NLP/gte-multilingual-base
revision=refs/pr/7
volume=/tmp

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision $revision

@Maku319, what is ${local-image}?

Maku319

Aug 21

@nbroad Thank you for your reply! Since TEI doesn't provide an image for Macs with M-series chips, I built the image locally from the official TEI repository, and that's the local-image.

nbroad

Aug 21

•

edited Aug 21

@Maku319,

I'm not sure if there is a solution that works on Mac chips yet. The simplest option to get embeddings quickly would probably be to create an endpoint using Inference Endpoints. You can use the UI here or use the following code to create an endpoint.

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository",},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions"
)

Maku319

Aug 22

@nbroad Thank you so much for your reply. I think the ONNX model is only necessary when running on the CPU. When I switched to the GPU, everything seemed to work fine, but now I need to figure out the issue with the container not recognizing CUDA after it starts up. Thanks again!

Maku319

Aug 23

@nbroad Thank you for your patient guidance. The images for both the GPU and CPU versions have been deployed successfully and are accepting requests. However, I have a question: my ONNX model was converted using the configuration from the main branch, so why does it run with the configuration from the pr/7 version you provided?
The command I ran was: docker run -p 9090:80 -v ${PWD}:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.5-grpc --model-id /data/gte-multilingual-base. The converted ONNX model is located at: gte-multilingual-base\onnx\.

Also, why does it fail when I use the repository from the main branch instead?
The error message is as follows:

INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/dat*/***-************-*ase", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "bbf17dcff344", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
Error: `config.json` does not contain `id2label`

nbroad

Aug 25

I think it's because of the architectures listed in the config file.
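To check a local config.json for this situation, something like the sketch below might help. It rests on an assumption that this thread does not confirm: that TEI treats any architecture whose name ends in "Classification" as a classifier and then requires `id2label`. The function name and the example configs are hypothetical.

```python
def classifier_check(config):
    """Report classification-style architectures and whether id2label is missing.

    Assumption (not confirmed here): TEI picks the classifier path when an
    architecture name ends with "Classification" and then expects `id2label`.
    """
    architectures = config.get("architectures") or []
    classifier_archs = [a for a in architectures if a.endswith("Classification")]
    missing_id2label = bool(classifier_archs) and "id2label" not in config
    return classifier_archs, missing_id2label


# Illustrative configs only; the real files contain many more fields.
with_token_cls = {"architectures": ["NewModel", "NewModelForTokenClassification"]}
without_token_cls = {"architectures": ["NewModel"]}
```

If the second return value is true for your config.json, either removing the classification architecture (as this branch does) or adding `id2label` would be the things to try.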

