---

# bge-m3 model for English and Russian

This is a tokenizer-shrunk version of [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae).

Only English and Russian tokens have been kept in the vocabulary. As a result, the vocabulary is 21% of the original, and the full model has 63.3% of the original parameter count, with no loss in the quality of English and Russian embeddings.

A notebook with the shrinking code is available [here](https://github.com/BlessedTatonka/pet_projects/tree/main/huggingface/bge-m3-shrinking).

<!--- Describe your model here -->
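The shrinking idea can be sketched in a few lines: keep only the embedding rows whose token ids survive the English + Russian filter, and record an old-id to new-id mapping so the tokenizer can be rebuilt consistently. The sizes and the id filter below are illustrative placeholders, not the actual procedure from the notebook.

```python
import numpy as np

# Toy embedding matrix standing in for the model's token embeddings
# (illustrative sizes; the real model uses a ~250k vocabulary, dim 1024).
vocab_size, hidden = 1000, 16
embeddings = np.random.rand(vocab_size, hidden).astype(np.float32)

# Pretend ids divisible by 5 are the ones kept after filtering to en + ru text.
kept_ids = np.arange(0, vocab_size, 5)

# Old-id -> new-id mapping, needed to remap the tokenizer's vocabulary.
old_to_new = {int(old): new for new, old in enumerate(kept_ids)}

shrunk = embeddings[kept_ids]  # rows for surviving tokens only
print(shrunk.shape, round(shrunk.shape[0] / vocab_size, 2))  # (200, 16) 0.2
```

The embedding matrix dominates the parameter count in multilingual encoders, which is why cutting the vocabulary to 21% shrinks the whole model to 63.3% while the transformer layers stay untouched.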

## Usage (Sentence-Transformers)
print(sentence_embeddings)
```

## Specs

Other bge-m3 models have also been shrunk:

| Model name |
|---------------------------|
| [bge-m3-retromae_en_ru](https://huggingface.co/TatonkaHF/bge-m3-retromae_en_ru) |
| [bge-m3-unsupervised_en_ru](https://huggingface.co/TatonkaHF/bge-m3-unsupervised_en_ru) |
| [bge-m3_en_ru](https://huggingface.co/TatonkaHF/bge-m3_en_ru) |

## Evaluation Results

For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=TatonkaHF/bge-m3-retromae_en_ru)

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
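The Pooling module above is configured for mean-token pooling (`pooling_mode_mean_tokens: True`): token embeddings are averaged, with padded positions masked out via the attention mask. A minimal sketch with toy numbers (not taken from the model):

```python
import numpy as np

# Toy token embeddings for one sequence: 4 positions, dim 3.
# The last position is padding (attention_mask == 0) and must be ignored.
token_embeddings = np.array([[1.0, 2.0, 3.0],
                             [3.0, 2.0, 1.0],
                             [2.0, 2.0, 2.0],
                             [9.0, 9.0, 9.0]])  # padding row
attention_mask = np.array([1, 1, 1, 0])

mask = attention_mask[:, None]                  # (seq_len, 1), broadcastable
summed = (token_embeddings * mask).sum(axis=0)  # masked sum over tokens
counts = mask.sum()                             # number of real tokens
sentence_embedding = summed / counts

print(sentence_embedding)  # [2. 2. 2.]
```

Without the mask, the padding row would leak into the average, which is why the Pooling module consumes the attention mask rather than a plain `mean` over the sequence axis.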

## Reference

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu. [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/abs/2402.03216).

Inspired by [LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and the [tokenizer shrinking recipes](https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/1) thread.

License: [MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)

<!--- Describe where people can find more information -->