KaLM-Embedding
KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs with superior training data.
KaLM-embedding-multilingual-mini is initialized from Qwen/Qwen2-0.5B and trained with massive weakly-supervised pre-training followed by supervised fine-tuning.
📑 Open-source Plan
- Model Checkpoint
  - KaLM-embedding-multilingual-mini-v1
  - KaLM-embedding-multilingual-mini-instruct-v1
  - KaLM-embedding-multilingual-max-v1
- Technical Report
- Training and Evaluation Code
- Training Data
Evaluation
| Model Name | Model Size | C-MTEB(35) | MTEB(56) | Avg. |
|---|---|---|---|---|
| multilingual-e5-large | 560M | 58.81 | 61.50 | 60.16 |
| bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |
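Here C-MTEB(35) and MTEB(56) are average scores over the 35 Chinese and 56 English MTEB datasets, respectively. Scores of this kind can be computed with the mteb package; below is a minimal sketch for a single task (the task choice and output folder are illustrative, not the exact setup used for the table above):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('{MODEL_NAME_OR_PATH}')  # Do NOT set trust_remote_code
model.max_seq_length = 512

# Run one example task; the table above averages over the full benchmark suites.
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/kalm-embedding-mini")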
Requirements
Since this model is based on Qwen2, we advise you to install transformers>=4.37.0; otherwise you may encounter the following error:
KeyError: 'qwen2'
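To upgrade, run pip install -U "transformers>=4.37.0". If you want to verify the installed version programmatically, a minimal check (packaging is already a dependency of transformers):

import transformers
from packaging import version

# Qwen2 support landed in transformers 4.37.0.
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), \
    f"transformers {transformers.__version__} is too old for Qwen2-based models"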
Usage
This model is easy to use once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('{MODEL_NAME_OR_PATH}')  # Do NOT set trust_remote_code
model.max_seq_length = 512

embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
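Because normalize_embeddings=True returns unit-length vectors, the dot product of two embeddings equals their cosine similarity. Continuing from the snippet above:

# The embeddings are unit-normalized, so a dot product is a cosine similarity.
similarity = embeddings[0] @ embeddings[1]
print(similarity)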
We add instructions for classification and clustering tasks. If you want to prepend an instruction to the query (with no instruction for the corpus), you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('{MODEL_NAME_OR_PATH}')  # Do NOT set trust_remote_code
model.max_seq_length = 512

prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
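The same pattern covers retrieval-style usage: prepend the instruction on the query side only, encode the corpus without a prompt, and rank by dot product. A minimal sketch (the instruction wording and documents below are illustrative, not from the model card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('{MODEL_NAME_OR_PATH}')  # Do NOT set trust_remote_code
model.max_seq_length = 512

queries = ["how do I install sentence-transformers?"]
corpus = [
    "pip install -U sentence-transformers",
    "The capital of France is Paris.",
]

# Instruction on the query side only; the corpus is encoded without a prompt.
query_prompt = "Instruct: Given a web search query, retrieve relevant passages. \n Query: "
query_emb = model.encode(queries, prompt=query_prompt, normalize_embeddings=True)
corpus_emb = model.encode(corpus, normalize_embeddings=True)

# Normalized embeddings: dot product equals cosine similarity.
scores = query_emb @ corpus_emb.T
print(scores)  # shape (1, 2); a higher score means a more relevant document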
Contact
If you encounter any issues, feel free to contact us via email: [email protected]
Evaluation results
Self-reported scores on the MTEB AmazonCounterfactualClassification test set:

| Metric | en-ext | en |
|---|---|---|
| accuracy | 74.160 | 72.358 |
| ap | 22.731 | 34.130 |
| ap_weighted | 22.731 | 34.130 |
| f1 | 61.311 | 65.911 |
| f1_weighted | 78.921 | |
| main_score | 74.160 | |