---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
library_name: sentence-transformers
---
# crossencoder-mMiniLMv2-L12-mmarcoFR

This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, score it against a set of candidate passages -- e.g., retrieved with BM25 or a biencoder -- then sort the passages in decreasing order of relevance according to the model's predictions (see the reranking example under Usage below).

## Usage
***

#### Sentence-Transformers

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import CrossEncoder

# Each pair is (query, passage); the model outputs one relevance score per pair.
pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]

model = CrossEncoder('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')
scores = model.predict(pairs)  # higher score = more relevant
print(scores)
```
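
To rerank candidate passages as described above, you can simply sort them by their predicted scores. A minimal sketch (the query and passages below are made-up placeholders, e.g. the top results returned by BM25 or a biencoder):

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')

# Hypothetical query and candidate passages.
query = 'Quelle est la capitale de la France ?'
passages = [
    'Paris est la capitale de la France.',
    'Le Mont Blanc est le plus haut sommet des Alpes.',
    "La France est un pays d'Europe de l'Ouest.",
]

scores = model.predict([(query, p) for p in passages])
ranking = np.argsort(scores)[::-1]  # passage indices, most relevant first
for idx in ranking:
    print(f'{scores[idx]:.3f}\t{passages[idx]}')
```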

#### 🤗 Transformers

Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')

pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    scores = model(**features).logits  # raw logits, one per pair
# Applying torch.sigmoid(scores) maps the logits to [0, 1], matching the scores
# returned by the sentence-transformers CrossEncoder above.
print(scores)
```

## Evaluation
***

We evaluated the model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant passage and up to 200 irrelevant ones.

Below, we compare its performance to that of other cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR@10), and recall at various cut-offs (R@k).

|    | model                                                                                                                        | Vocab. | #Param. |  Size |     RP |   MRR@10 |  R@10(↑) |   R@20 |   R@50 |   R@100 |
|---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
|  1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)             |     fr |    110M | 443MB |  35.65 |    50.44 |    82.95 |  91.50 |  96.80 |   98.80 |
|  2 | **crossencoder-mMiniLMv2-L12-mmarcoFR**                                                                                      | fr,99+ |    118M | 471MB |  34.37 |    51.01 |    82.23 |  90.60 |  96.45 |   98.40 |
|  3 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR)                     |     en |    109M | 438MB |  29.68 |    46.13 |    80.45 |  87.90 |  93.15 |   96.60 |
|  4 | [crossencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR)           |     fr |     68M | 272MB |  27.28 |    43.71 |    80.30 |  89.10 |  95.55 |   98.60 |
|  5 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR)   |     fr |    110M | 443MB |  28.32 |    45.28 |    79.22 |  87.15 |  93.15 |   95.75 |
|  6 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR)                 | fr,99+ |    107M | 428MB |  33.92 |    49.33 |    79.00 |  88.35 |  94.80 |   98.20 |
|  7 | [crossencoder-MiniLM-L12-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L12-msmarco-mmarcoFR)     |     en |     33M | 134MB |  29.07 |    44.41 |    77.83 |  88.10 |  95.55 |   99.00 |
|  8 | [crossencoder-MiniLM-L6-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-msmarco-mmarcoFR)       |     en |     23M |  91MB |  32.92 |    47.56 |    77.27 |  88.15 |  94.85 |   98.15 |
|  9 | [crossencoder-MiniLM-L4-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L4-msmarco-mmarcoFR)       |     en |     19M |  77MB |  30.98 |    46.22 |    76.35 |  85.80 |  94.35 |   97.55 |
| 10 | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR)       |     en |     15M |  62MB |  30.82 |    44.30 |    72.03 |  82.65 |  93.35 |   98.10 |
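
For reference, here is a minimal sketch of how MRR@k and R@k can be computed from the reranked lists, assuming each query's passages have been sorted by decreasing cross-encoder score and come with binary relevance labels (the data layout below is illustrative, not the actual evaluation script):

```python
import numpy as np

def mrr_at_k(rankings, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    rr = []
    for labels in rankings:  # labels: 0/1 relevance of passages, sorted by score desc.
        hits = np.where(np.asarray(labels[:k]) == 1)[0]
        rr.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(rr))

def recall_at_k(rankings, k=10):
    """Fraction of relevant passages found in the top k, averaged over queries."""
    rec = []
    for labels in rankings:
        labels = np.asarray(labels)
        total = labels.sum()
        rec.append(labels[:k].sum() / total if total else 0.0)
    return float(np.mean(rec))

# rankings[i] holds the relevance labels of query i's passages,
# sorted by decreasing cross-encoder score (toy example).
rankings = [[0, 1, 0, 0], [1, 0, 0, 0]]
print(mrr_at_k(rankings, k=10), recall_at_k(rankings, k=10))
```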

## Training
***

#### Background

We used the [nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large) model and fine-tuned it with a binary cross-entropy loss on 1M French question-passage pairs, with one relevant passage for every three irrelevant ones (i.e., 25% of the pairs are relevant and 75% are irrelevant).
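
With the sentence-transformers library, such pairs can be expressed as `InputExample` objects with binary labels; when the `CrossEncoder` is configured with a single output label, `fit()` uses a `BCEWithLogitsLoss` by default. A minimal sketch of what the training samples could look like (queries and passages are made-up placeholders, not from mMARCO):

```python
from sentence_transformers import InputExample

# One positive (label=1.0) followed by three negatives (label=0.0) per query,
# i.e. 25% relevant / 75% irrelevant pairs.
train_samples = [
    InputExample(texts=['Quelle est la capitale de la France ?',
                        'Paris est la capitale de la France.'], label=1.0),
    InputExample(texts=['Quelle est la capitale de la France ?',
                        'Le Rhin traverse plusieurs pays.'], label=0.0),
    InputExample(texts=['Quelle est la capitale de la France ?',
                        'La baguette est un pain français.'], label=0.0),
    InputExample(texts=['Quelle est la capitale de la France ?',
                        'Les Alpes culminent au Mont Blanc.'], label=0.0),
]
```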

#### Hyperparameters

We trained the model on a single Tesla V100 GPU with 32GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate afterwards. The sequence length was limited to 512 tokens.
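
Below is a minimal sketch of a comparable fine-tuning setup with the sentence-transformers `CrossEncoder.fit()` API, plugging in the hyperparameters above; it reuses the hypothetical `train_samples` list from the previous sketch and is not the exact training script used for this model:

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    'nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large',
    num_labels=1,    # single relevance logit, trained with BCEWithLogitsLoss
    max_length=512,  # sequences truncated to 512 tokens
)

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

model.fit(
    train_dataloader=train_dataloader,
    epochs=10,
    warmup_steps=500,               # linear warmup, then linear decay ('WarmupLinear' scheduler)
    optimizer_params={'lr': 2e-5},  # AdamW is the default optimizer
    weight_decay=0.01,
    output_path='crossencoder-mMiniLMv2-L12-mmarcoFR',
)
```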

#### Data

We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.

## Citation
***

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {crossencoder-mMiniLMv2-L12-mmarcoFR: A Cross-Encoder Model Trained on 1M Sentence Pairs in French},
   publisher = {Hugging Face},
   month     = {September},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR},
}
```