Update README.md

## Intended Usage & Model Info

`jina-embedding-b-en-v2` is an English, monolingual **embedding model** supporting a **sequence length of 8192**.
It is based on a BERT architecture (JinaBert) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths.
The backbone `jina-bert-b-en-v2` is pretrained on the C4 dataset.
The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.

The embedding model was trained with a sequence length of 512, but it extrapolates to a sequence length of 8k (or even longer) thanks to ALiBi.
This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search.
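
To give an intuition for the mechanism, below is a minimal, illustrative NumPy sketch of a symmetric bidirectional ALiBi bias. It is a simplification for readability, not the exact implementation inside `jina-bert-b-en-v2`: each attention head adds a linear penalty proportional to the distance between positions, in both directions.

```python
import numpy as np

def symmetric_alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Illustrative symmetric ALiBi bias added to the attention logits before softmax."""
    # Head-specific geometric slopes, as in the ALiBi paper (num_heads assumed to be a power of 2).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    # Symmetric (bidirectional) variant: the penalty depends on |i - j|, not on direction.
    distance = np.abs(positions[None, :] - positions[:, None])    # (seq_len, seq_len)
    return -slopes[:, None, None] * distance[None, :, :]          # (num_heads, seq_len, seq_len)

bias = symmetric_alibi_bias(seq_len=8, num_heads=4)
print(bias.shape)  # (4, 8, 8)
```

Because the penalty depends only on relative distance, it is defined for any sequence length, which is what allows training at 512 tokens while extrapolating to 8k at inference time.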

With a standard size of 137 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
Additionally, we provide the following embedding models (the V2 models also support the 8k sequence length):

### V1 (Based on T5)

- [`jina-embedding-s-en-v1`](https://huggingface.co/jinaai/jina-embedding-s-en-v1): 35 million parameters.
- [`jina-embedding-b-en-v1`](https://huggingface.co/jinaai/jina-embedding-b-en-v1): 110 million parameters.
- [`jina-embedding-l-en-v1`](https://huggingface.co/jinaai/jina-embedding-l-en-v1): 330 million parameters.

### V2 (Based on JinaBert)

- [`jina-embedding-s-en-v2`](https://huggingface.co/jinaai/jina-embedding-s-en-v2): 33 million parameters.
- [`jina-embedding-b-en-v2`](https://huggingface.co/jinaai/jina-embedding-b-en-v2): 137 million parameters **(you are here)**.
- [`jina-embedding-l-en-v2`](https://huggingface.co/jinaai/jina-embedding-l-en-v2): 435 million parameters.

## Data & Parameters

The Jina Embedding V2 technical report is coming soon.
The Jina Embedding V1 [technical report](https://arxiv.org/abs/2307.11224) is available on arXiv.

## Usage

```python
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embedding-b-en-v2', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
```
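
Building on the snippet above, here is an illustrative (unofficial) sketch that ranks a few placeholder documents against a query by cosine similarity, reusing the same `model` and `cos_sim` defined there:

```python
import numpy as np

# Hypothetical query and documents, purely for illustration.
query = model.encode(['What will the weather be like tomorrow?'])[0]
docs = [
    'The forecast predicts light rain and mild temperatures tomorrow afternoon.',
    'Quarterly revenue grew by twelve percent compared to last year.',
    'Remember to water the plants twice a week during the summer.',
]
doc_embeddings = model.encode(docs)

# Rank documents by cosine similarity to the query, highest first.
scores = [cos_sim(query, d) for d in doc_embeddings]
for idx in np.argsort(scores)[::-1]:
    print(f'{scores[idx]:.4f}  {docs[idx]}')
```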

If you only want to handle shorter sequences, such as up to 2k tokens, pass the `max_length` parameter to the `encode` function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```
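
For documents that exceed even the 8k window, one simple strategy (an assumption of this sketch, not an official recommendation) is to split the text into chunks, embed each chunk, and mean-pool the chunk embeddings; the word-based splitter below is a naive placeholder:

```python
import numpy as np

def embed_long_text(text, words_per_chunk=1000):
    """Naive long-document embedding: split on whitespace, encode each chunk, mean-pool."""
    words = text.split()
    chunks = [
        ' '.join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ] or ['']
    chunk_embeddings = model.encode(chunks)
    return np.mean(chunk_embeddings, axis=0)

document_embedding = embed_long_text('Very long ... document')
```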

## Fine-tuning
Please consider [Finetuner](https://github.com/jina-ai/finetuner).

## Plans

The development of new bilingual models is currently underway. We will mainly target German and Spanish. The upcoming models will be called `jina-embedding-b-de/es-v2`.

## Contact