Update README.md
README.md CHANGED
@@ -1083,7 +1083,7 @@ It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence length.
 We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed Chinese-English input without bias.
 Additionally, we provide the following embedding models:
 
-`jina-embeddings-v2-base-zh`
+`jina-embeddings-v2-base-zh` is a bilingual Chinese-English text **embedding** model that supports encoding texts of up to **8192 characters**.
 The model is built on a BERT architecture (JinaBERT); JinaBERT improves on the BERT architecture and is the first to apply [ALiBi](https://arxiv.org/abs/2108.12409) to an encoder architecture to support longer sequences.
 Unlike previous monolingual/multilingual embedding models, we designed this bilingual model to better support monolingual (Chinese-to-Chinese) as well as cross-lingual (Chinese-to-English) document retrieval.
 In addition, we also provide other embedding models:
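The 8192-character claim above is the part of this change most worth sanity-checking. Here is a minimal sketch (not part of the diff) that loads the model the same way the README's later examples do and caps input length through the tokenizer's standard `max_length` parameter; the repeated-string long input is purely illustrative.

```python
# Minimal sketch (not part of the diff): encode a long Chinese input.
# Assumes the loading pattern shown later in this README; max_length=8192
# mirrors the advertised limit, and ALiBi lets the encoder run past 512 tokens.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh',
                                  trust_remote_code=True)  # custom JinaBERT code

long_doc = '今天天气怎么样?' * 1000  # illustrative long document
inputs = tokenizer(long_doc, truncation=True, max_length=8192,
                   return_tensors='pt')
outputs = model(**inputs)  # last_hidden_state: (1, seq_len, hidden_dim)
print(outputs.last_hidden_state.shape)
```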
@@ -1121,10 +1121,10 @@ def mean_pooling(model_output, attention_mask):
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
-sentences = ['How is the weather today?', '
+sentences = ['How is the weather today?', '今天天气怎么样?']
 
-tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-
+tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
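The hunk stops at the tokenizer call. For readers following along, the conventional continuation of this mean-pooling recipe is sketched below (not part of the diff); `model`, `encoded_input`, and `mean_pooling` are the names defined in the code above.

```python
# Sketch of the steps that conventionally follow the hunk above
# (not part of the diff): forward pass, mean pooling, L2 normalization.
import torch
import torch.nn.functional as F

with torch.no_grad():
    model_output = model(**encoded_input)  # output carries last_hidden_state

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-length sentence vectors
```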
@@ -1145,8 +1145,8 @@ from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-
-embeddings = model.encode(['How is the weather today?', '
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
+embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
 
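The `encode` helper in this last hunk comes from the model's remote code, which is why both `+` lines pass `trust_remote_code=True`. As a hedged alternative sketch (an assumption, not part of this diff): recent sentence-transformers releases (>= 2.3.0) forward that flag to Hugging Face, so the same bilingual pair can be embedded without defining a `cos_sim` lambda by hand.

```python
# Hedged alternative (not part of the diff): load via sentence-transformers,
# assuming a release >= 2.3.0 that forwards trust_remote_code downstream.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh',
                            trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity tensor
```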