---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
library_name: sentence-transformers
---
# crossencoder-mMiniLMv2-L12-mmarcoFR
This is a [sentence-transformers](https://www.SBERT.net) cross-encoder model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model is intended for passage re-ranking in [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, score it against each candidate passage (e.g., retrieved with BM25 or a bi-encoder), then sort the passages in decreasing order of relevance according to the model's predictions.
## Usage
***
#### Sentence-Transformers
Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:
```bash
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import CrossEncoder

# Each pair is (query, candidate passage).
pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]

model = CrossEncoder('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')
scores = model.predict(pairs)  # one relevance score per pair
print(scores)
```
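To make the re-ranking step concrete, here is a minimal sketch of sorting candidate passages by score. The `rerank` helper and the `toy_score` word-overlap scorer are hypothetical stand-ins so that the snippet runs offline; in practice, `score_fn` would be `model.predict` from the snippet above.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and return passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().replace(".", "").split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


passages = [
    "Paris est la capitale de la France.",
    "Le chat dort sur le canapé.",
    "La capitale de la Belgique est Bruxelles.",
]
ranked = rerank("capitale de la France", passages, toy_score)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```

Passing the scorer as a callable keeps the sorting logic independent of the model, so the same helper could be reused with any function that returns one score per pair.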
#### 🤗 Transformers
Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR')

# Each pair is (query, candidate passage).
pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    # Raw logits, one per pair; apply torch.sigmoid to map them to the 0-1 range.
    scores = model(**features).logits
print(scores)
```
## Evaluation
***
We evaluated the model on 500 random queries from the mMARCO-fr train set, which were held out from training. Each of these queries has at least one relevant passage and up to 200 irrelevant passages.
Below, we compare the model's performance with that of other cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
|    | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
|---:|:------|:-------|--------:|-----:|---:|-------:|--------:|-----:|-----:|------:|
| 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 |
| 2 | **crossencoder-mMiniLMv2-L12-mmarcoFR** | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 |
| 3 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | en | 109M | 438MB | 29.68 | 46.13 | 80.45 | 87.90 | 93.15 | 96.60 |
| 4 | [crossencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 |
| 5 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR) | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
| 6 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 |
| 7 | [crossencoder-MiniLM-L12-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L12-msmarco-mmarcoFR) | en | 33M | 134MB | 29.07 | 44.41 | 77.83 | 88.10 | 95.55 | 99.00 |
| 8 | [crossencoder-MiniLM-L6-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-msmarco-mmarcoFR) | en | 23M | 91MB | 32.92 | 47.56 | 77.27 | 88.15 | 94.85 | 98.15 |
| 9 | [crossencoder-MiniLM-L4-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L4-msmarco-mmarcoFR) | en | 19M | 77MB | 30.98 | 46.22 | 76.35 | 85.80 | 94.35 | 97.55 |
| 10 | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR) | en | 15M | 62MB | 30.82 | 44.30 | 72.03 | 82.65 | 93.35 | 98.10 |
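For reference, the metrics reported in the table can be sketched in a few lines of plain Python. The function names and the toy ranking below are illustrative assumptions, not the evaluation code used for this model; the definitions follow the standard ones for recall at k, reciprocal rank, and R-precision.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant passage within the top k (0 if none)."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision at cut-off R, where R is the number of relevant passages."""
    r = len(relevant)
    hits = sum(1 for pid in ranked[:r] if pid in relevant)
    return hits / r


# Toy example for a single query: passage ids sorted by decreasing model score.
ranked = ["p1", "p3", "p7", "p2", "p9"]
relevant = {"p1", "p2"}

print("R@3   :", recall_at_k(ranked, relevant, k=3))
print("R@5   :", recall_at_k(ranked, relevant, k=5))
print("MRR@10:", mrr_at_k(ranked, relevant, k=10))
print("RP    :", r_precision(ranked, relevant))
```

Per-query values like these are typically averaged over all evaluation queries to obtain a single figure per model.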
## Training
***
#### Background
We used the [nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French, of which one in four is relevant (i.e., 25% of the pairs are relevant and 75% are irrelevant).
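As an illustration of the training objective, the binary cross-entropy loss on a question-passage pair can be written out directly. This is a plain-Python sketch for intuition, not the training code; the function names and the example logits are made up, and the toy batch simply mirrors the 25%/75% split described above.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a relevance probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_loss(logit: float, label: int) -> float:
    """Binary cross-entropy for one pair; label 1 = relevant, 0 = irrelevant."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical batch: one relevant pair followed by three irrelevant ones.
logits = [2.0, -1.5, -0.3, 0.8]
labels = [1, 0, 0, 0]

losses = [bce_loss(z, y) for z, y in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)
print("per-pair losses:", [round(loss, 4) for loss in losses])
print("mean loss:", round(mean_loss, 4))
```

The loss is smallest when a relevant pair gets a high logit or an irrelevant pair gets a low one, which is what pushes the sigmoid output toward 1 and 0 respectively.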
#### Hyperparameters
We trained the model on a single Tesla V100 GPU with 32 GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
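The learning rate schedule described above can be sketched as a small function. This is an approximation under stated assumptions: the decay is assumed to reach zero at the final step, and `total_steps` uses the rounded 312.4k figure from the text; the `learning_rate` function itself is hypothetical and not taken from the training code.

```python
def learning_rate(
    step: int,
    base_lr: float = 2e-5,
    warmup_steps: int = 500,
    total_steps: int = 312_400,  # rounded figure from the text
) -> float:
    """Linear warmup to base_lr, then linear decay (assumed to end at 0)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: shrink linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 156_450, 312_400):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```

The peak value of 2e-05 is reached at step 500, after which the rate falls steadily for the rest of training.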
#### Data
We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multilingual, machine-translated version of the MS MARCO dataset, a popular large-scale information retrieval (IR) dataset.
## Citation
***
```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {crossencoder-mMiniLMv2-L12-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French},
   publisher = {Hugging Face},
   month     = {september},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR},
}
```