Update README.md
---

# bge-m3-unsupervised model for English and Russian
This is a tokenizer-shrunken version of [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised).

The model keeps only English and Russian tokens in the vocabulary. As a result, the vocabulary is 21% of the original, and the whole model has 63.3% of the original parameters, without any loss in the quality of English and Russian embeddings.
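
As a rough sanity check of that parameter figure, the arithmetic below uses assumed round numbers that are not taken from this card: bge-m3 inherits XLM-RoBERTa-large's ~250k-token vocabulary, hidden size 1024, and roughly 303M non-embedding parameters.

```python
hidden = 1024                        # assumed hidden size (XLM-R large)
old_vocab = 250_002                  # assumed XLM-RoBERTa vocabulary size
new_vocab = round(0.21 * old_vocab)  # 21% of tokens kept for English + Russian
non_embedding = 303_000_000          # assumed non-embedding parameter count

old_total = old_vocab * hidden + non_embedding
new_total = new_vocab * hidden + non_embedding
print(f"{new_total / old_total:.1%}")  # prints 63.8% under these assumptions
```

With these assumed sizes the ratio lands close to the card's 63.3%; the small gap comes from the rounded non-embedding count.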

A notebook with the code is available [here](https://github.com/BlessedTatonka/pet_projects/tree/main/huggingface/bge-m3-shrinking).
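
The notebook has the full procedure; as a minimal sketch of the core idea (toy sizes and a hypothetical `kept_ids` list, not the notebook's actual code), shrinking keeps only the embedding rows of surviving tokens and remaps their ids:

```python
import numpy as np

# Toy sketch: keep only the embedding rows of tokens that survive in the
# reduced vocabulary, then remap old token ids to their new positions.
rng = np.random.default_rng(0)
old_vocab, hidden = 1_000, 16                 # toy sizes for illustration
embeddings = rng.normal(size=(old_vocab, hidden))

# Hypothetical ids to keep: special tokens plus tokens seen in en/ru text.
kept_ids = [0, 1, 2, 3, 42, 100, 500, 999]

new_embeddings = embeddings[kept_ids]                     # surviving rows only
old2new = {old: new for new, old in enumerate(kept_ids)}  # id remapping

# Rows are copied unchanged, so embeddings of kept tokens are identical.
assert np.array_equal(new_embeddings[old2new[500]], embeddings[500])
print(new_embeddings.shape)  # (8, 16)
```

The real procedure also rebuilds the tokenizer (the SentencePiece vocabulary) with the same id mapping, so that token ids produced by the new tokenizer index the shrunken matrix correctly.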
<!--- Describe your model here -->

## Usage (Sentence-Transformers)
```
print(sentence_embeddings)
```
## Specs

The other bge-m3 models have also been shrunken:

| Model name |
|---------------------------|
| [bge-m3-retromae_en_ru](https://huggingface.co/TatonkaHF/bge-m3-retromae_en_ru) |
| [bge-m3-unsupervised_en_ru](https://huggingface.co/TatonkaHF/bge-m3-unsupervised_en_ru) |
| [bge-m3_en_ru](https://huggingface.co/TatonkaHF/bge-m3_en_ru) |
## Full Model Architecture

```
SentenceTransformer(
  …
)
```
## Reference

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu. [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/abs/2402.03216).

Inspired by [LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and the [tokenizer shrinking recipes thread](https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/1).

License: [MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)
<!--- Describe where people can find more information -->