The intfloat/multilingual-e5-small model converted to ONNX format for use with Vespa embedding. This repository contains:
- intfloat-multilingual-e5-small.onnx
- intfloat-multilingual-e5-small_fp16.onnx
- intfloat-multilingual-e5-small_quantized.onnx
  - int8-quantized; running it in Python produces slightly different results (see the comparison sketch below)
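As a rough illustration (not part of the original card), the drift introduced by int8 quantization can be measured by embedding the same text with the fp32 and quantized files and comparing the cosine similarity. The `embed` helper here is only for this sketch; the loading pattern mirrors the full Python example further below.

```python
import torch
import torch.nn.functional as F
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Hypothetical comparison sketch: how far do the int8 embeddings drift from fp32?
repo = "hotchpotch/vespa-onnx-intfloat-multilingual-e5-small"
tokenizer = AutoTokenizer.from_pretrained(repo)
batch = tokenizer("query: 日本の首都は?", return_tensors="pt")
if "token_type_ids" not in batch:
    batch["token_type_ids"] = torch.zeros_like(batch["input_ids"])


def embed(file_name: str) -> torch.Tensor:
    model = ORTModelForSequenceClassification.from_pretrained(repo, file_name=file_name)
    model.output_names["logits"] = 0  # expose last_hidden_state as "logits"
    hidden = model(**batch).logits
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
    return F.normalize(pooled, p=2, dim=1)


fp32 = embed("intfloat-multilingual-e5-small.onnx")
int8 = embed("intfloat-multilingual-e5-small_quantized.onnx")
print(F.cosine_similarity(fp32, int8))  # expect close to, but not exactly, 1.0
```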
Python can also produce the same vectors as Vespa's embeddings. Note that `normalize` must be set to `true` in Vespa's services.xml for the embedding output to match Python. Example services.xml configuration:
<component id="me5_small_q" type="hugging-face-embedder">
<transformer-model path="me5/intfloat-multilingual-e5-small_quantized.onnx" />
<tokenizer-model path="me5/tokenizer.json" />
<normalize>true</normalize>
<pooling-strategy>mean</pooling-strategy>
</component>
<component id="me5_small" type="hugging-face-embedder">
<transformer-model path="me5/intfloat-multilingual-e5-small.onnx" />
<tokenizer-model path="me5/tokenizer.json" />
<normalize>true</normalize>
<pooling-strategy>mean</pooling-strategy>
</component>
Alternatively, the models can be referenced by URL:
<component id="me5_small_fp16" type="hugging-face-embedder">
<transformer-model
url="https://huggingface.co/hotchpotch/vespa-onnx-intfloat-multilingual-e5-small/resolve/main/intfloat-multilingual-e5-small_fp16.onnx" />
<tokenizer-model
url="https://huggingface.co/hotchpotch/vespa-onnx-intfloat-multilingual-e5-small/resolve/main/tokenizer.json" />
<normalize>true</normalize>
<pooling-strategy>mean</pooling-strategy>
</component>
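Once a component is configured, Vespa can call the embedder at query time with the `embed` expression. The sketch below is not part of the original card: the `embedding` tensor field, the `semantic` rank profile, and the endpoint are hypothetical placeholders for your own application, and the `query:` prefix is required by E5 models.

```python
import requests

# Hypothetical usage sketch: ask Vespa to embed the query text with the
# "me5_small" component and run a nearest-neighbor search against a tensor
# field. "embedding", "semantic", and the endpoint are placeholders.
body = {
    "yql": "select * from sources * where {targetHits: 10}nearestNeighbor(embedding, q)",
    "input.query(q)": 'embed(me5_small, "query: 日本の首都は?")',
    "ranking": "semantic",
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())
```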
The following Python snippet reproduces the same embeddings outside Vespa:

```python
import torch
import torch.nn.functional as F
from optimum.onnxruntime import ORTModelForSequenceClassification
from torch import Tensor
from transformers import AutoTokenizer

model_name = "hotchpotch/vespa-onnx-intfloat-multilingual-e5-small"
onnx_file_name = "intfloat-multilingual-e5-small.onnx"

model = ORTModelForSequenceClassification.from_pretrained(
    model_name, file_name=onnx_file_name
)
# Override so that the "logits" output exposes last_hidden_state
model.output_names["logits"] = 0
tokenizer = AutoTokenizer.from_pretrained(model_name)


def average_pool(last_hidden_state: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = [
    "query: What is the capital of Japan?",
    "query: 日本の首都は?",  # "What is the capital of Japan?" in Japanese
    "passage: ニューヨークは大きな都市です",  # "New York is a big city" in Japanese
    "passage: 東京は良い場所です",  # "Tokyo is a good place" in Japanese; Tokyo is the capital of Japan
]

batch_dict = tokenizer(
    input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt"
)
if "token_type_ids" not in batch_dict:
    batch_dict["token_type_ids"] = torch.zeros_like(batch_dict["input_ids"])

# "logits" is last_hidden_state (see the output_names override above)
last_hidden_states = model(**batch_dict).logits
embeddings = average_pool(last_hidden_states, batch_dict["attention_mask"])
# Normalize so the vectors match Vespa's embeddings (normalize=true)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Similarity scores between queries and passages
print(embeddings[:2] @ embeddings[2:].T)
```
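As an optional sanity check (not part of the original card), the ONNX output can be compared to the original intfloat/multilingual-e5-small PyTorch checkpoint; for the fp32 ONNX file the two should agree to within floating-point tolerance. This sketch continues from the snippet above:

```python
from transformers import AutoModel

# Compare against the original PyTorch checkpoint (reuses batch_dict,
# average_pool, and embeddings from the snippet above).
pt_model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")
with torch.no_grad():
    pt_hidden = pt_model(**batch_dict).last_hidden_state
pt_embeddings = F.normalize(
    average_pool(pt_hidden, batch_dict["attention_mask"]), p=2, dim=1
)
print(torch.max(torch.abs(pt_embeddings - embeddings)))  # expect a value near zero
```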
License
Same as E5 (MIT).
Attribution
All credit for this model goes to the authors of Multilingual-E5 and the associated researchers and organizations. When using this model, please be sure to attribute the original authors.