jupyterjazz committed
Commit: 3468cf0
Parent(s): e50debb
Update README.md

README.md
### Key Features:
- **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
- **Task-Specific Embedding:** Customize embeddings through the `task_type` argument with the following options:
  - `retrieval.query`: Used for query embeddings in asymmetric retrieval tasks
  - `retrieval.passage`: Used for passage embeddings in asymmetric retrieval tasks
  - `separation`: Used for embeddings in clustering and re-ranking applications
  - `classification`: Used for embeddings in classification tasks
  - `text-matching`: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
- **Matryoshka Embeddings:** Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing you to truncate embeddings to fit your application.
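The Matryoshka idea above can be sketched in a few lines: keep only the leading components of a full embedding and re-normalize, so cosine similarity remains well-defined at the smaller size. The random vectors below are placeholders standing in for actual model outputs, and the helper function is illustrative, not part of the model's API:

```python
import numpy as np

# Placeholder 1024-dim vectors standing in for jina-embeddings-v3 outputs.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka truncation: keep the first `dim` components,
    then re-normalize each row to unit length."""
    truncated = emb[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Truncate to one of the supported sizes.
small = truncate_and_normalize(full, 256)
print(small.shape)                    # (2, 256)
print(np.linalg.norm(small, axis=1))  # rows are unit length
```

Because Matryoshka-trained models concentrate the most important information in the leading dimensions, the truncated vectors trade a small amount of quality for a 4x reduction in storage and search cost.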

### Model Lineage:

`jina-embeddings-v3` builds upon the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, which was originally trained on 100 languages.
We extended its capabilities with an extra pretraining phase on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset,
then contrastively fine-tuned it on 30 languages for enhanced performance on embedding tasks in both monolingual and cross-lingual setups.

### Supported Languages:
While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages to maximize performance: