TatonkaHF committed
Commit edb0e91
1 Parent(s): 3c8e2ea

Update README.md

Files changed (1)
  1. README.md +23 -6
README.md CHANGED
@@ -9,10 +9,14 @@ tags:

  ---

- # TatonkaHF/bge-m3-retromae_en_ru

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.

  <!--- Describe your model here -->

  ## Usage (Sentence-Transformers)
@@ -72,6 +76,15 @@ print("Sentence embeddings:")
  print(sentence_embeddings)
  ```

  ## Evaluation Results
@@ -80,16 +93,20 @@ print(sentence_embeddings)

  For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=TatonkaHF/bge-m3-retromae_en_ru)

-
-
  ## Full Model Architecture
  ```
  SentenceTransformer(
- (0): Transformer({'max_seq_length': 8194, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```

- ## Citing & Authors

  <!--- Describe where people can find more information -->
 

  ---

+ # bge-m3 model for English and Russian

+ This is a tokenizer-shrunk version of [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae).

+ The current model keeps only the English and Russian tokens in the vocabulary.
+ The vocabulary is thus 21% of the original, and the number of parameters in the whole model is 63.3% of the original, without any loss in the quality of English and Russian embeddings.
+
+ A notebook with the shrinking code is available [here](https://github.com/BlessedTatonka/pet_projects/tree/main/huggingface/bge-m3-shrinking); a minimal sketch of the idea follows.
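
For illustration, the core of the shrinking idea can be sketched as follows. This is a minimal sketch, not the notebook's exact code: `corpus` here is a hypothetical list of English and Russian strings, and rebuilding the underlying SentencePiece tokenizer (so that text maps into the new, smaller id space) is omitted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical corpus used to decide which token ids to keep.
corpus = ["Example sentence.", "Пример предложения."]

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3-retromae")
model = AutoModel.from_pretrained("BAAI/bge-m3-retromae")

# Keep all special tokens plus every token id that occurs in the corpus.
keep_ids = set(tokenizer.all_special_ids)
for text in corpus:
    keep_ids.update(tokenizer(text, add_special_tokens=False)["input_ids"])
keep_ids = sorted(keep_ids)

# Slice the word-embedding matrix down to the kept rows. Only the embedding
# parameters shrink; every other weight is reused unchanged, which is why
# embeddings for the kept tokens are identical to the original model's.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(keep_ids), old_embeddings.size(1))
new_embeddings.weight.data = old_embeddings[keep_ids].clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(keep_ids)
```

The parameter figure is plausible on a back-of-the-envelope check: XLM-RoBERTa-large-style models spend roughly 256M of their ~568M parameters on the 250k-row embedding matrix, so keeping 21% of the vocabulary leaves roughly two thirds of the total parameters.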
  <!--- Describe your model here -->

  ## Usage (Sentence-Transformers)
 
  print(sentence_embeddings)
  ```

+ ## Specs
+
+ Other bge-m3 models have also been shrunk:
+
+ | Model name |
+ |---------------------------|
+ | [bge-m3-retromae_en_ru](https://huggingface.co/TatonkaHF/bge-m3-retromae_en_ru) |
+ | [bge-m3-unsupervised_en_ru](https://huggingface.co/TatonkaHF/bge-m3-unsupervised_en_ru) |
+ | [bge-m3_en_ru](https://huggingface.co/TatonkaHF/bge-m3_en_ru) |

  ## Evaluation Results
 
  For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=TatonkaHF/bge-m3-retromae_en_ru)
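
As a quick, hand-rolled sanity check of the "no quality loss" claim, the shrunk model can be compared with the original on a few English and Russian sentences. This is a sketch, assuming both checkpoints load as sentence-transformers models; cosine similarities close to 1.0 are expected for the kept languages.

```python
from sentence_transformers import SentenceTransformer, util

sentences = [
    "A test sentence about semantic search.",
    "Тестовое предложение о семантическом поиске.",
]

original = SentenceTransformer("BAAI/bge-m3-retromae")
shrunk = SentenceTransformer("TatonkaHF/bge-m3-retromae_en_ru")

emb_original = original.encode(sentences, normalize_embeddings=True)
emb_shrunk = shrunk.encode(sentences, normalize_embeddings=True)

# Cosine similarity between each sentence's embedding under the two models.
for sentence, a, b in zip(sentences, emb_original, emb_shrunk):
    print(f"{util.cos_sim(a, b).item():.4f}  {sentence}")
```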
  ## Full Model Architecture
  ```
  SentenceTransformer(
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```
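
The Pooling module mean-pools the token embeddings (masking out padding) into a single 1024-dimensional sentence vector. A minimal sketch of the equivalent computation with plain `transformers`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TatonkaHF/bge-m3-retromae_en_ru")
model = AutoModel.from_pretrained("TatonkaHF/bge-m3-retromae_en_ru")

encoded = tokenizer(["A sentence to embed."], padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq, 1024)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # torch.Size([1, 1024])
```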

+ ## Reference
+
+ Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu. [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/abs/2402.03216).
+
+ Inspired by [LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and the [tokenizer shrinking recipes](https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/1) thread.
+
+ License: [MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)

  <!--- Describe where people can find more information -->