carlesoctav committed on
Commit
7a00f13
1 Parent(s): 8d924b5

Update README.md

Files changed (1)
  1. README.md +82 -22
README.md CHANGED
@@ -1,47 +1,107 @@
  ---
  tags:
  - generated_from_keras_callback
- model-index:
- - name: multiqa-Mmini-L6-H384
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information Keras had access to. You should
- probably proofread and complete it, then remove this comment. -->

- # multiqa-Mmini-L6-H384

- This model is a fine-tuned version of [nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) on an unknown dataset.
- It achieves the following results on the evaluation set:

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - optimizer: None
- - training_precision: float32

- ### Training results

- ### Framework versions

  - Transformers 4.29.1
  - TensorFlow 2.12.0
  - Datasets 2.12.0
- - Tokenizers 0.13.3

  ---
  tags:
  - generated_from_keras_callback
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
  ---

+ # MultiQA-mMini-L6-H384

+ This model is a fine-tuned version of [nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) on the [carlesoctav/en-id-parallel-sentences](https://huggingface.co/datasets/carlesoctav/en-id-parallel-sentences) dataset, following the procedure described in [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://paperswithcode.com/paper/making-monolingual-sentence-embeddings). It achieves 92% accuracy on the validation split of the dataset for the English-Indonesian language pair on the bitext mining task.
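+ Bitext mining here means matching each English sentence to its Indonesian translation by nearest-neighbour search over the embeddings. Below is a rough sketch of how such an accuracy can be computed; it is a simplified illustration only, not the exact evaluation protocol used for the number above:

+ ```python
+ import torch
+
+ def bitext_mining_accuracy(en_emb, id_emb):
+     # en_emb[i] and id_emb[i] are embeddings of a translation pair.
+     # For every English sentence, retrieve the closest Indonesian sentence
+     # by dot product and check whether it is the true translation.
+     scores = en_emb @ id_emb.T                 # (n, n) similarity matrix
+     predictions = scores.argmax(dim=1)         # best Indonesian match per English sentence
+     targets = torch.arange(en_emb.size(0))
+     return (predictions == targets).float().mean().item()
+ ```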
 
+ ## Model Description

+ Since we followed the approach outlined in [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://paperswithcode.com/paper/making-monolingual-sentence-embeddings), we used [sentence-transformers/multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) as the teacher model and [nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) as the (multilingual) student model.
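
+ The core of the procedure is that the student learns to map both an English sentence and its Indonesian translation onto the teacher's embedding of the English sentence. Below is a minimal PyTorch sketch of that objective; it only illustrates the paper's loss and is not the actual training code (training here was done with Keras/TensorFlow, see `train.py`), and the function name is ours:

+ ```python
+ import torch.nn.functional as F
+
+ def distillation_loss(teacher_en_emb, student_en_emb, student_id_emb):
+     # MSE objective from the multilingual distillation paper: the student should
+     # reproduce the teacher's English embedding both for the English sentence
+     # and for its Indonesian translation.
+     return (F.mse_loss(student_en_emb, teacher_en_emb)
+             + F.mse_loss(student_id_emb, teacher_en_emb))
+ ```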

+ Example of usage:

+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # CLS pooling - take the output of the first token
+ def cls_pooling(model_output):
+     return model_output.last_hidden_state[:, 0]
+
+ # Encode text
+ def encode(texts):
+     # Tokenize sentences
+     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+
+     # Compute token embeddings
+     with torch.no_grad():
+         model_output = model(**encoded_input, return_dict=True)
+
+     # Perform pooling
+     embeddings = cls_pooling(model_output)
+
+     return embeddings
+
+ # Sentences we want sentence embeddings for
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district", "sekitar 9 juta orang tinggal di london", "London terkenal sebagai distrik finansial"]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("carlesoctav/multi-qa-en-id-mMiniLMv2-L6-H384")
+ model = AutoModel.from_pretrained("carlesoctav/multi-qa-en-id-mMiniLMv2-L6-H384", from_tf=True)
+
+ # Encode query and docs
+ query_emb = encode(query)
+ doc_emb = encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```

+ Take a look at the demo on Google Colab [here](https://colab.research.google.com/drive/1EZb0qACRIug9BVRX7LziKPchpYBUru9e#scrollTo=tZAjbx-_AOsg).
72
+
73
+ ## Intended Uses & Limitations
74
+
75
+ Our model is intended to be used for semantic search. It encodes queries/questions and text paragraphs into dense vectors, allowing it to find relevant documents based on the given passages.
76
+
77
+ The model is designed to create sentence embeddings specifically for semantic search and information retrieval tasks. As the student model, it inherits this capability from the fine-tuned teacher model. It supports both English and Indonesian languages, making it suitable for cross-lingual information retrieval tasks.
78
+
79
+ Please note that there is a limit of 256 word pieces, and any text longer than that will be truncated. Additionally, the model was trained using input text up to 80 word pieces, so it may not perform optimally on longer text.
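
+ If you want to enforce this limit explicitly when tokenizing, you can pass `max_length` yourself. A small sketch, reusing the `tokenizer` and `texts` from the usage example above (256 is the hard limit mentioned here; the exact value to pick is up to you):

+ ```python
+ # Cap inputs at the 256 word-piece limit; move max_length toward 80 to stay
+ # closer to the length regime the model was trained on.
+ encoded_input = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors='pt')
+ ```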

+ Some technical details on how the model should be used:

+ | Setting | Value |
+ | --- | :---: |
+ | Dimensions | 384 |
+ | Produces normalized embeddings | No |
+ | Pooling method | CLS pooling |
+ | Suitable score functions | dot-product (e.g. `util.dot_score`) |
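
+ If you work with the `sentence-transformers` utilities, the dot-product scoring above can also be computed with `util.dot_score` on the embeddings returned by the `encode` helper from the usage example; a small, equivalent sketch:

+ ```python
+ from sentence_transformers import util
+
+ # Equivalent to the torch.mm call in the usage example above.
+ scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
+ ```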

+ ----

+ ## Training and Evaluation Data

+ We used the [carlesoctav/en-id-parallel-sentences](https://huggingface.co/datasets/carlesoctav/en-id-parallel-sentences) dataset for training and evaluation. The data was dynamically split into 95% for training and 5% for validation.
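
+ A minimal sketch of such a dynamic split with the `datasets` library is shown below; the split name is an assumption, so check the dataset card for the exact configuration:

+ ```python
+ from datasets import load_dataset
+
+ # Load the parallel-sentence dataset (split name assumed; see the dataset card).
+ dataset = load_dataset("carlesoctav/en-id-parallel-sentences", split="train")
+
+ # Dynamically split into 95% training / 5% validation.
+ splits = dataset.train_test_split(test_size=0.05, seed=42)
+ train_ds, val_ds = splits["train"], splits["test"]
+ ```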

+ ## Training Procedure

+ The complete training script can be found in this repository as `train.py`.

+ ### Framework Versions

+ The following framework versions were used:

  - Transformers 4.29.1
  - TensorFlow 2.12.0
  - Datasets 2.12.0
+ - Tokenizers 0.13.3