license: cc-by-nc-4.0
tags:
- feature-extraction
- sentence-similarity
- mteb
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
inference: false
library_name: transformers
The embedding set trained by Jina AI.
# Jina Embedding V3: A Multilingual Multi-Task Embedding Model

## Quick Start
The easiest way to start using `jina-embeddings-v3` is Jina AI's Embedding API.
## Intended Usage & Model Info
`jina-embeddings-v3` is a multilingual text embedding model supporting sequence lengths of up to 8192 tokens.
It is based on an XLM-RoBERTa architecture (`JinaXLMRoBERTa`) that uses Rotary Position Embeddings (RoPE) to support longer sequences.
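Rotary Position Embeddings encode a token's position by rotating consecutive pairs of feature dimensions through position-dependent angles, so that dot products between vectors depend on their relative offset. A minimal NumPy sketch of the idea (illustrative only, not the model's actual implementation):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Each pair of dimensions (2i, 2i+1) is rotated by the angle
    pos * base**(-2i/dim), so position is encoded as a rotation and
    the norm of every vector is preserved.
    """
    seq_len, dim = x.shape
    half = dim // 2
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    freqs = base ** (-np.arange(half) * 2.0 / dim)    # (half,)
    angles = pos * freqs                              # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each step is a pure rotation, the transformation never changes a vector's length, and the token at position 0 is left unchanged.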
The backbone `JinaXLMRoBERTa` is pretrained on variable-length textual data with a masked language modeling objective for 160k steps across 89 languages.
The model is further trained on Jina AI's collection of more than 500 million multilingual sentence pairs and hard negatives.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
`jina-embeddings-v3` has 5 task-specific LoRA adapters tuned on top of the backbone; pass `task_type` as an additional parameter when using the model:
- `query`: Handles incoming user queries at search time.
- `index`: Manages user documents submitted for indexing.
- `text-matching`: Processes symmetric text-similarity tasks, short or long, such as STS (Semantic Textual Similarity).
- `classification`: Classifies user inputs into predefined categories.
- `clustering`: Facilitates the clustering of embeddings for further analysis.
`jina-embeddings-v3` supports Matryoshka representation learning. We recommend an embedding size of 128 or higher (1024 provides optimal performance) for storing your embeddings.
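With Matryoshka embeddings, a smaller vector is obtained by simply keeping the leading dimensions of the full 1024-dimensional embedding and re-normalizing, so dot products remain cosine similarities. A minimal sketch; the helper name and the mock vectors (standing in for real model outputs) are illustrative:

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize
    so that dot products are still cosine similarities."""
    e = emb[..., :dim]
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Mock 1024-dim embeddings standing in for model outputs
rng = np.random.default_rng(42)
full = rng.normal(size=(2, 1024))

small = truncate_embedding(full, 128)  # 8x smaller storage
print(small.shape, float(small[0] @ small[1]))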
## Data & Parameters

Coming soon.
## Usage
- The easiest way to start using `jina-embeddings-v3` is Jina AI's Embeddings API.
- Alternatively, you can use `jina-embeddings-v3` directly via the `transformers` package.
```bash
pip install transformers einops flash_attn
```
```python
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

# Example sentences in several languages
sentences = [
    "Organic skincare for sensitive skin with aloe vera and chamomile.",
    "New makeup trends focus on bold colors and innovative techniques",
    "Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
    "Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
    "Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
    "Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
    "针对敏感肌专门设计的天然有机护肤产品",
    "新的化妆趋势注重鲜艳的颜色和创新的技巧",
    "敏感肌のために特別に設計された天然有機スキンケア製品",
    "新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
]

# Encode sentences
embeddings = model.encode(sentences, truncate_dim=1024, task_type='index')

# Compute similarities
print(embeddings[0] @ embeddings[1].T)
```
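`embeddings[i] @ embeddings[j].T` gives a single pairwise score; to compare every sentence against every other one, you can compute the full similarity matrix in one matrix product. A sketch using small mock normalized vectors in place of real model outputs:

```python
import numpy as np

# Mock normalized embeddings standing in for model.encode(...) outputs
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 1024))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pairwise cosine similarity matrix: sims[i, j] = cos(emb_i, emb_j)
sims = embeddings @ embeddings.T

# Rank all sentences by similarity to the first one (most similar first)
ranking = np.argsort(-sims[0])
print(sims.shape, ranking)
```

Because the vectors are normalized, the matrix is symmetric with ones on the diagonal, and each sentence is most similar to itself.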
## Performance

Coming soon.
## Contact
Join our Discord community and chat with other community members about ideas.
## Citation
If you find `jina-embeddings-v3` useful in your research, please cite the following paper: