Update README.md
---

# bge-m3-unsupervised model for English and Russian
This is a tokenizer-shrunken version of [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised).

The model keeps only English and Russian tokens in the vocabulary. As a result, the vocabulary is 21% of the original, and the whole model has 63.3% of the original parameters, without any loss in the quality of English and Russian embeddings.
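
As a rough sanity check of that parameter figure, the arithmetic below uses assumed round numbers that are not taken from this card: bge-m3 inherits XLM-RoBERTa-large's ~250k-token vocabulary, hidden size 1024, and roughly 303M non-embedding parameters.

```python
hidden = 1024                        # assumed hidden size (XLM-R large)
old_vocab = 250_002                  # assumed XLM-RoBERTa vocabulary size
new_vocab = round(0.21 * old_vocab)  # 21% of tokens kept for English + Russian
non_embedding = 303_000_000          # assumed non-embedding parameter count

old_total = old_vocab * hidden + non_embedding
new_total = new_vocab * hidden + non_embedding
print(f"{new_total / old_total:.1%}")  # prints 63.8% under these assumptions
```

With these assumed sizes the ratio lands close to the card's 63.3%; the small gap comes from the rounded non-embedding count.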

A notebook with the code is available [here](https://github.com/BlessedTatonka/pet_projects/tree/main/huggingface/bge-m3-shrinking).
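
The notebook has the full procedure; as a minimal sketch of the core idea (toy sizes and a hypothetical `kept_ids` list, not the notebook's actual code), shrinking keeps only the embedding rows of surviving tokens and remaps their ids:

```python
import numpy as np

# Toy sketch: keep only the embedding rows of tokens that survive in the
# reduced vocabulary, then remap old token ids to their new positions.
rng = np.random.default_rng(0)
old_vocab, hidden = 1_000, 16                 # toy sizes for illustration
embeddings = rng.normal(size=(old_vocab, hidden))

# Hypothetical ids to keep: special tokens plus tokens seen in en/ru text.
kept_ids = [0, 1, 2, 3, 42, 100, 500, 999]

new_embeddings = embeddings[kept_ids]                     # surviving rows only
old2new = {old: new for new, old in enumerate(kept_ids)}  # id remapping

# Rows are copied unchanged, so embeddings of kept tokens are identical.
assert np.array_equal(new_embeddings[old2new[500]], embeddings[500])
print(new_embeddings.shape)  # (8, 16)
```

The real procedure also rebuilds the tokenizer (the SentencePiece vocabulary) with the same id mapping, so that token ids produced by the new tokenizer index the shrunken matrix correctly.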
<!--- Describe your model here -->

## Usage (Sentence-Transformers)
```
print(sentence_embeddings)
```
## Specs

The other bge-m3 models have also been shrunken:

| Model name |
|---------------------------|
| [bge-m3-retromae_en_ru](https://huggingface.co/TatonkaHF/bge-m3-retromae_en_ru) |
| [bge-m3-unsupervised_en_ru](https://huggingface.co/TatonkaHF/bge-m3-unsupervised_en_ru) |
| [bge-m3_en_ru](https://huggingface.co/TatonkaHF/bge-m3_en_ru) |
## Full Model Architecture

```
SentenceTransformer(
  …
)
```
## Reference

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu. [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/abs/2402.03216).

Inspired by [LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and the [tokenizer shrinking recipes thread](https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/1).

License: [MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)
<!--- Describe where people can find more information -->