Update README.md
README.md CHANGED
@@ -1083,7 +1083,7 @@ It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence length.
 We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed Chinese-English input without bias.
 Additionally, we provide the following embedding models:
 
-`jina-embeddings-v2-base-zh`
+`jina-embeddings-v2-base-zh` is a bilingual Chinese-English text **embedding** model that supports encoding texts of up to **8192 characters**.
 The model is built on a BERT architecture (JinaBERT); JinaBERT improves on the BERT architecture and is the first to apply [ALiBi](https://arxiv.org/abs/2108.12409) to an encoder architecture to support longer sequences.
 Unlike previous monolingual/multilingual embedding models, we designed this bilingual model to better support monolingual (Chinese-to-Chinese) as well as cross-lingual (Chinese-to-English) document retrieval.
 In addition, we also provide other embedding models:
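The 8192-character claim above is the part of this change most worth sanity-checking. Here is a minimal sketch (not part of the diff) that loads the model the same way the README's later examples do and caps input length through the tokenizer's standard `max_length` parameter; the repeated-string long input is purely illustrative.

```python
# Minimal sketch (not part of the diff): encode a long Chinese input.
# Assumes the loading pattern shown later in this README; max_length=8192
# mirrors the advertised limit, and ALiBi lets the encoder run past 512 tokens.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh',
                                  trust_remote_code=True)  # custom JinaBERT code

long_doc = '今天天气怎么样?' * 1000  # illustrative long document
inputs = tokenizer(long_doc, truncation=True, max_length=8192,
                   return_tensors='pt')
outputs = model(**inputs)  # last_hidden_state: (1, seq_len, hidden_dim)
print(outputs.last_hidden_state.shape)
```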
@@ -1121,10 +1121,10 @@ def mean_pooling(model_output, attention_mask):
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
-sentences = ['How is the weather today?', '
+sentences = ['How is the weather today?', '今天天气怎么样?']
 
-tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-
+tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
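The hunk stops at the tokenizer call. For readers following along, the conventional continuation of this mean-pooling recipe is sketched below (not part of the diff); `model`, `encoded_input`, and `mean_pooling` are the names defined in the code above.

```python
# Sketch of the steps that conventionally follow the hunk above
# (not part of the diff): forward pass, mean pooling, L2 normalization.
import torch
import torch.nn.functional as F

with torch.no_grad():
    model_output = model(**encoded_input)  # output carries last_hidden_state

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-length sentence vectors
```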
@@ -1145,8 +1145,8 @@ from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-
-embeddings = model.encode(['How is the weather today?', '
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
+embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
 
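The `encode` helper in this last hunk comes from the model's remote code, which is why both `+` lines pass `trust_remote_code=True`. As a hedged alternative sketch (an assumption, not part of this diff): recent sentence-transformers releases (>= 2.3.0) forward that flag to Hugging Face, so the same bilingual pair can be embedded without defining a `cos_sim` lambda by hand.

```python
# Hedged alternative (not part of the diff): load via sentence-transformers,
# assuming a release >= 2.3.0 that forwards trust_remote_code downstream.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh',
                            trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity tensor
```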