File size: 4,691 Bytes
f081e16
 
 
7a00f13
 
771c94b
 
 
 
 
 
 
f081e16
 
 
7a00f13
f081e16
7a00f13
f081e16
7a00f13
f081e16
7a00f13
f081e16
7a00f13
 
f081e16
7a00f13
 
f081e16
7a00f13
 
 
f081e16
7a00f13
 
 
 
f081e16
7a00f13
 
 
f081e16
7a00f13
 
f081e16
7a00f13
f081e16
 
7a00f13
 
 
f081e16
7a00f13
 
f081e16
7a00f13
f081e16
7a00f13
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f081e16
 
 
 
7a00f13
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
---
tags:
- generated_from_keras_callback
- feature-extraction
- sentence-similarity
license: apache-2.0
datasets:
- carlesoctav/en-id-parallel-sentences-embedding
- carlesoctav/en-id-parallel-sentences
language:
- en
- id
---


# MultiQA-mMini-L6-H384

This model is a fine-tuned version of [nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) on the [carlesoctav/en-id-parallel-sentences](https://huggingface.co/datasets/carlesoctav/en-id-parallel-sentences) dataset using the following procedure: [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://paperswithcode.com/paper/making-monolingual-sentence-embeddings). It achieves 92% accuracy on the validation split of the dataset for the English-Indonesian language pair in the bitext mining task.

## Model Description

Since we followed the approach outlined in [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://paperswithcode.com/paper/making-monolingual-sentence-embeddings), we used [sentence-transformers/multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) as the teacher model and [nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) as the student model (multilingual).

Example of usage:
```python

from transformers import AutoTokenizer, AutoModel
import torch

#CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]

#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district", "sekitar 9 juta orang tinggal di london", " London terkenal sebagai distrik finansial"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("carlesoctav/multi-qa-en-id-mMiniLMv2-L6-H384")

model = AutoModel.from_pretrained("carlesoctav/multi-qa-en-id-mMiniLMv2-L6-H384", from_tf = True)

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```


Take a look at the demo on Google Colab [here](https://colab.research.google.com/drive/1EZb0qACRIug9BVRX7LziKPchpYBUru9e#scrollTo=tZAjbx-_AOsg).

## Intended Uses & Limitations

Our model is intended to be used for semantic search. It encodes queries/questions and text paragraphs into dense vectors, allowing it to find relevant documents based on the given passages.

The model is designed to create sentence embeddings specifically for semantic search and information retrieval tasks. As the student model, it inherits this capability from the fine-tuned teacher model. It supports both English and Indonesian languages, making it suitable for cross-lingual information retrieval tasks.

Please note that there is a limit of 256 word pieces, and any text longer than that will be truncated. Additionally, the model was trained using input text up to 80 word pieces, so it may not perform optimally on longer text.

In the following some technical details how this model must be used:

| Setting | Value |
| --- | :---: |
| Dimensions | 384 |
| Produces normalized embeddings | No |
| Pooling-Method | CLS pooling |
| Suitable score functions | dot-product (e.g. `util.dot_score`) |

----

## Training and Evaluation Data

We utilized the [carlesoctav/en-id-parallel-sentences](https://huggingface.co/datasets/carlesoctav/en-id-parallel-sentences) dataset for training and evaluation purposes. The data was dynamically split into 95% for training and 5% for validation.

## Training Procedure

The complete training script can be found in the current repository under the name `train.py`.

### Framework Versions

The following framework versions were used:

- Transformers 4.29.1
- TensorFlow 2.12.0
- Datasets 2.12.0
- Tokenizers 0.13.3