jupyterjazz committed
Commit: ad320ec
Parent(s): 521abc0

adjust readme

Signed-off-by: [email protected] <[email protected]>
README.md
CHANGED
@@ -21524,7 +21524,7 @@ model-index:
 
 
 <p align="center">
-<b>The embedding
+<b>The embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
 </p>
 
 <p align="center">
@@ -21555,7 +21555,7 @@ Additionally, it features 5 LoRA adapters to generate task-specific embeddings e
 - **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
 
 ### Supported Languages:
-While the foundation model supports
+While the foundation model supports 100 languages, we've focused our tuning efforts on the following 30 languages:
 **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
 Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
 Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
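The Matryoshka bullet in the hunk above refers to truncating embeddings to a smaller size. As a rough sketch of what that usually amounts to (keep the leading dimensions, then re-normalize; this is the standard recipe and an assumption here, not a statement about the model's internals), with a placeholder batch:

```python
import torch
import torch.nn.functional as F

# Placeholder batch standing in for the 1024-dimensional embeddings
# produced by the snippets later in this diff.
embeddings = torch.randn(4, 1024)

truncate_dim = 256  # one of the supported sizes: 32, 64, 128, 256, 512, 768, 1024
truncated = F.normalize(embeddings[:, :truncate_dim], p=2, dim=1)
print(truncated.shape)  # torch.Size([4, 256])
```

The `truncate_dim` argument that appears in the context of the last hunk below performs this truncation for you.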
@@ -21598,9 +21598,11 @@ tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
 model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
-
+task = 'retrieval.query'
+task_id = model._adaptation_map[task]
+adapter_mask = torch.full((len(sentences),), task_id, dtype=torch.int32)
 with torch.no_grad():
-    model_output = model(**encoded_input,
+    model_output = model(**encoded_input, adapter_mask=adapter_mask)
 
 embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
 embeddings = F.normalize(embeddings, p=2, dim=1)
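The snippet in this hunk calls a `mean_pooling` helper that is defined earlier in the README and is untouched by this commit. A minimal sketch of the conventional masked mean-pooling implementation, assuming the standard pattern rather than the README's exact code:

```python
import torch

def mean_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state: (batch, seq_len, hidden_dim).
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the hidden dimension so that padding
    # tokens contribute nothing to the sum.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts
```

Masking before averaging keeps padding tokens from diluting the sentence embedding.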
@@ -21661,9 +21663,6 @@ embeddings = model.encode(['Sample text'], truncate_dim=256)
 ```
 
 
-Note that the `truncate_dim` could be any integer between 1 and 1024 for the `separation`, `classification`, and `text-matching` tasks. As for the `retrieval.passage` and `retrieval.query` tasks, the value must be larger than the length of the instruction prompt. By default, the value must be larger than 9 for the `retrieval.passage` task and larger than 12 for the `retrieval.query` task.
-
-
 The latest version (3.1.0) of [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) also supports `jina-embeddings-v3`:
 
 ```bash