docs: update readme
README.md CHANGED
@@ -21540,9 +21540,9 @@ The easiest way to start using `jina-embeddings-v3` is with the [Jina Embedding
 
 
 `jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
-Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
+Based on the [Jina-XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
 this model supports [Rotary Position Embeddings (RoPE)](https://arxiv.org/abs/2104.09864) to handle long input sequences up to **8192 tokens**.
-Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.
+Additionally, it features 5 [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.
 
 ### Key Features:
 - **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
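The `task_type` argument shown in the usage snippet below routes the forward pass through one of those LoRA adapters. As a minimal sketch of how the pieces fit together (the tokenizer call is our assumption; the model call, `trust_remote_code`, and the `retrieval.query` / `text-matching` task names come from this README):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code pulls in the Jina-XLM-RoBERTa implementation that
# ships with the repository rather than with the transformers library.
# (Assumption: the tokenizer loads from the same repo.)
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# RoPE lets the encoder accept inputs up to the 8192-token limit.
encoded = tokenizer(["an arbitrarily long document ..."], max_length=8192,
                    truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    # task_type selects one of the task-specific LoRA adapters,
    # e.g. 'retrieval.query' for search queries or 'text-matching' for STS.
    output = model(**encoded, task_type="retrieval.query")
```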
@@ -21554,13 +21554,8 @@ Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to g
 - `text-matching`: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
 - **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
 
-### Model Lineage:
-
-The `jina-embeddings-v3` model is an enhancement of the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, initially trained on 100 languages. This model's functionality has been extended through an additional pretraining phase using the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset. Additionally, LoRA was employed to increase the context length to 8192 tokens. For further optimization, contrastive fine-tuning was performed across 30 languages, improving its performance in both monolingual and cross-lingual embedding tasks.
-
-
 ### Supported Languages:
-While the
+While the foundation model supports 89 languages, we've focused our tuning efforts on the following 30 languages:
 **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
 Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
 Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
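Because the embeddings are Matryoshka-trained, the leading dimensions carry the most information: you can keep the first `k` dimensions of a 1024-d vector and re-normalize instead of training a smaller model. A minimal sketch of that truncation step (the helper name and dummy tensor are ours, not part of the model API):

```python
import torch
import torch.nn.functional as F

def truncate_embeddings(embeddings: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Keep the first `dim` Matryoshka dimensions, then re-normalize to unit length."""
    return F.normalize(embeddings[:, :dim], p=2, dim=1)

full = F.normalize(torch.randn(4, 1024), p=2, dim=1)  # stand-in for real model output
small = truncate_embeddings(full, dim=256)            # 256-d vectors for a cheaper index
```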
@@ -21610,7 +21605,7 @@ model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
 
 with torch.no_grad():
-    model_output = model(**encoded_input)
+    model_output = model(**encoded_input, task_type='retrieval.query')
 
 embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
 embeddings = F.normalize(embeddings, p=2, dim=1)
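This snippet relies on the `mean_pooling` helper defined earlier in the README. For reference, the standard mask-weighted version looks like this (a sketch of the usual recipe, which may differ cosmetically from the model card's helper):

```python
import torch

def mean_pooling(model_output, attention_mask):
    """Average token embeddings over the sequence, ignoring padding positions."""
    token_embeddings = model_output[0]  # last hidden state: (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)
```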
@@ -21703,16 +21698,16 @@ embeddings = model.encode(
 | jina-embeddings-v3 | 1024 | **65.60** | **82.58**| 45.27| 84.01| 58.13| 53.87| **85.8** | 30.98|
 | jina-embeddings-v2-en | 768 | 58.12 | 68.82 | 40.08| 84.44| 55.09| 45.64| 80.00| 30.56|
 | text-embedding-3-large | 3072 | 62.03 | 75.45 | 49.01| 84.22| 59.16| 55.44| 81.04| 29.92|
-| multilingual-e5-large-instruct |
-| Cohere-embed-multilingual-v3.0 |
+| multilingual-e5-large-instruct | 1024 | 64.41 | 77.56 | 47.1 | 86.19| 58.58| 52.47| 84.78| 30.39|
+| Cohere-embed-multilingual-v3.0 | 1024 | 60.08 | 64.01 | 46.6 | 86.15| 57.86| 53.84| 83.15| 30.99|
 
 ### Multilingual MTEB
 
 | Model | Dimension | Average | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
 |:------------------------------:|:---------:|:---------:|:--------------:|:----------:|:-------------------:|:---------:|:---------:|:---------:|:-------------:|
 | jina-embeddings-v3 | 1024 | **64.44** | **71.46** | 46.71 | 76.91 | 63.98 | 57.98 | **69.83** | - |
-| multilingual-e5-large |
-| multilingual-e5-large-instruct |
+| multilingual-e5-large | 1024 | 59.58 | 65.22 | 42.12 | 76.95 | 63.4 | 52.37 | 64.65 | - |
+| multilingual-e5-large-instruct | 1024 | 64.25 | 67.45 | **52.12** | 77.79 | **69.02** | **58.38** | 68.77 | - |
 
 
 ### Long Context Tasks (LongEmbed)