bwang0911 committed
Commit 7ba833c
1 Parent(s): 5fd4882

docs: update readme

Files changed (1):
README.md +8 -13
README.md CHANGED
@@ -21540,9 +21540,9 @@ The easiest way to start using `jina-embeddings-v3` is with the [Jina Embedding
 
 
 `jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
-Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
+Based on the [Jina-XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
 this model supports [Rotary Position Embeddings (RoPE)](https://arxiv.org/abs/2104.09864) to handle long input sequences up to **8192 tokens**.
-Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.
+Additionally, it features 5 [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.
 
 ### Key Features:
 - **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
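The RoPE scheme cited in this hunk encodes position by rotating pairs of feature dimensions through position-dependent angles, so attention scores depend only on relative offsets rather than absolute positions. A minimal numpy sketch of that idea (an illustrative toy, not the model's actual implementation):

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x by position-dependent angles (toy RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The dot product of two rotated vectors depends only on their relative offset
# (7 - 3 == 107 - 103), which is what lets RoPE-based models handle sequences
# far longer than those seen with absolute position embeddings.
q, k = np.random.randn(2, 64)
assert np.isclose(rope_rotate(q, 3) @ rope_rotate(k, 7),
                  rope_rotate(q, 103) @ rope_rotate(k, 107))
```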
@@ -21554,13 +21554,8 @@ Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to g
 - `text-matching`: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
 - **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
 
-### Model Lineage:
-
-The `jina-embeddings-v3` model is an enhancement of the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, initially trained on 100 languages. This model's functionality has been extended through an additional pretraining phase using the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset. Additionally, LoRA was employed to increase the context length to 8192 tokens. For further optimization, contrastive fine-tuning was performed across 30 languages, improving its performance in both monolingual and cross-lingual embedding tasks.
-
-
 ### Supported Languages:
-While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages:
+While the foundation model supports 89 languages, we've focused our tuning efforts on the following 30 languages:
 **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
 Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
 Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
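Matryoshka truncation, mentioned in the hunk above, amounts to keeping the leading dimensions of each vector and re-normalizing. A minimal sketch, assuming downstream cosine similarity over unit-norm vectors:

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize.

    `dim` should be one of the supported sizes: 32, 64, 128, 256, 512, 768, 1024.
    """
    truncated = emb[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# e.g. shrink 1024-d vectors to 128-d before indexing to cut vector-store memory 8x
emb_1024 = np.random.randn(4, 1024)           # stand-in for real model output
emb_128 = truncate_embeddings(emb_1024, 128)  # shape (4, 128), unit-norm rows
```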
@@ -21610,7 +21605,7 @@ model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
 
 with torch.no_grad():
-    model_output = model(**encoded_input)
+    model_output = model(**encoded_input, task_type='retrieval.query')
 
 embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
 embeddings = F.normalize(embeddings, p=2, dim=1)
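The snippet this hunk edits relies on a `mean_pooling` helper defined earlier in the README. For context, a self-contained sketch of the full flow, reassembled from the pieces visible here, using the standard mean-pooling formulation and a placeholder sentence:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padded positions.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

sentences = ["How is the weather today?"]  # placeholder input
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # task_type selects one of the task-specific LoRA adapters, per the change above.
    model_output = model(**encoded_input, task_type="retrieval.query")

embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
```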
@@ -21703,16 +21698,16 @@ embeddings = model.encode(
 | jina-embeddings-v3 | 1024 | **65.60** | **82.58**| 45.27| 84.01| 58.13| 53.87| **85.8** | 30.98|
 | jina-embeddings-v2-en | 768 | 58.12 | 68.82 | 40.08| 84.44| 55.09| 45.64| 80.00| 30.56|
 | text-embedding-3-large | 3072 | 62.03 | 75.45 | 49.01| 84.22| 59.16| 55.44| 81.04| 29.92|
-| multilingual-e5-large-instruct | 4096 | 64.41 | 77.56 | 47.1 | 86.19| 58.58| 52.47| 84.78| 30.39|
-| Cohere-embed-multilingual-v3.0 | 4096 | 60.08 | 64.01 | 46.6 | 86.15| 57.86| 53.84| 83.15| 30.99|
+| multilingual-e5-large-instruct | 1024 | 64.41 | 77.56 | 47.1 | 86.19| 58.58| 52.47| 84.78| 30.39|
+| Cohere-embed-multilingual-v3.0 | 1024 | 60.08 | 64.01 | 46.6 | 86.15| 57.86| 53.84| 83.15| 30.99|
 
 ### Multilingual MTEB
 
 | Model | Dimension | Average | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
 |:------------------------------:|:---------:|:---------:|:--------------:|:----------:|:-------------------:|:---------:|:---------:|:---------:|:-------------:|
 | jina-embeddings-v3 | 1024 | **64.44** | **71.46** | 46.71 | 76.91 | 63.98 | 57.98 | **69.83** | - |
-| multilingual-e5-large | 4096 | 59.58 | 65.22 | 42.12 | 76.95 | 63.4 | 52.37 | 64.65 | - |
-| multilingual-e5-large-instruct | 4096 | 64.25 | 67.45 | **52.12** | 77.79 | **69.02** | **58.38** | 68.77 | - |
+| multilingual-e5-large | 1024 | 59.58 | 65.22 | 42.12 | 76.95 | 63.4 | 52.37 | 64.65 | - |
+| multilingual-e5-large-instruct | 1024 | 64.25 | 67.45 | **52.12** | 77.79 | **69.02** | **58.38** | 68.77 | - |
 
 
 ### Long Context Tasks (LongEmbed)
 