---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
  - unicamp-dl/mmarco
metrics:
  - recall
tags:
  - feature-extraction
  - sentence-similarity
library_name: sentence-transformers
---

<h1 align="center">biencoder-camembert-L8-mmarcoFR</h1>

<h4 align="center">
  <p>
    <a href="#usage">🛠️ Usage</a> |
    <a href="#evaluation">📊 Evaluation</a> |
    <a href="#training">🤖 Training</a> |
    <a href="#citation">🔗 Citation</a>
  </p>
</h4>

This is a [sentence-transformers](https://www.SBERT.net) model. It maps questions and paragraphs to 768-dimensional dense vectors and is intended for semantic search.
The model uses a [CamemBERT-L8](https://huggingface.co/antoinelouis/camembert-L8) backbone, a pruned version of the pre-trained [CamemBERT](https://huggingface.co/camembert-base)
checkpoint with 26% fewer parameters, obtained by [dropping the top layers](https://doi.org/10.48550/arXiv.2004.03844) from the original model.
The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) retrieval dataset.

## Usage

Here are some examples of using this model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-transformers).

#### Using Sentence-Transformers

Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = SentenceTransformer('antoinelouis/biencoder-camembert-L8-mmarcoFR')

# Normalized embeddings make the dot product below equal to the cosine similarity.
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
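
Each row of `similarity` holds one query's scores against every passage, so retrieval amounts to sorting a row in descending order. The sketch below illustrates this on a hand-written score matrix; the `rank_passages` helper and the toy values are hypothetical and are not produced by the model.

```python
def rank_passages(similarity, top_k=2):
    """Return, for each query row, the indices of the top_k highest-scoring passages."""
    rankings = []
    for row in similarity:
        # Sort passage indices by decreasing score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        rankings.append(order[:top_k])
    return rankings


# Hypothetical cosine similarities: 2 queries x 3 passages.
toy_similarity = [
    [0.12, 0.83, 0.40],
    [0.65, 0.08, 0.71],
]
print(rank_passages(toy_similarity))  # [[1, 2], [2, 0]]
```

The helper only iterates over rows and indexes into them, so it should also accept a 2-D array of scores in place of the nested lists.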

#### Using FlagEmbedding

Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:

```python
from FlagEmbedding import FlagModel

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = FlagModel('antoinelouis/biencoder-camembert-L8-mmarcoFR')

q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```

#### Using Transformers

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    """Mean-pool the contextualized token embeddings, ignoring padding positions."""
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-camembert-L8-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-camembert-L8-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# Pool token embeddings into one vector per text, then L2-normalize.
q_embeddings = mean_pooling(q_output, q_input['attention_mask'])
q_embeddings = normalize(q_embeddings, p=2, dim=1)
p_embeddings = mean_pooling(p_output, p_input['attention_mask'])
p_embeddings = normalize(p_embeddings, p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
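
To see why the attention mask matters in `mean_pooling`, here is a minimal pure-Python sketch of the same computation for a single sequence. The `masked_mean` helper and the toy vectors are hypothetical; the large values at the padding position are chosen only to show that they do not leak into the average.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the vectors whose mask is 1; positions with mask 0 (padding) are skipped."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i in range(dim):
                total[i] += vec[i]
    count = max(count, 1e-9)  # same guard against division by zero as torch.clamp above
    return [t / count for t in total]


# Hypothetical 2-d token embeddings: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```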

***

## Evaluation

We evaluate the model on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with other CamemBERT-based biencoder models fine-tuned on the same dataset. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).

|    | model                                                                                                        | #Param. |  Size |  R@500 | R@100(↑) |   R@10 | MRR@10 | NDCG@10 | MAP@10 |
|---:|:-------------------------------------------------------------------------------------------------------------|--------:|------:|-------:|---------:|-------:|-------:|--------:|-------:|
|  1 | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)   |    111M | 445MB |   89.1 |     77.8 |   51.5 |   28.5 |    33.7 |   27.9 |
|  2 | [biencoder-camembert-L10-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L10-mmarcoFR)     |     96M | 386MB |   87.8 |     76.7 |   49.5 |   27.5 |    32.5 |   27.0 |
|  3 | **biencoder-camembert-L8-mmarcoFR**                                                                          |     82M | 329MB |   87.4 |     75.9 |   48.9 |   26.7 |    31.8 |   26.2 |
|  4 | [biencoder-camembert-L6-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L6-mmarcoFR)       |     68M | 272MB |   86.7 |     74.9 |   46.7 |   25.7 |    30.4 |   25.1 |
|  5 | [biencoder-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L4-mmarcoFR)       |     54M | 216MB |   85.4 |     72.1 |   44.2 |   23.7 |    28.3 |   23.2 |
|  6 | [biencoder-camembert-L2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L2-mmarcoFR)       |     40M | 159MB |   81.0 |     66.3 |   38.5 |   20.1 |    24.3 |   19.7 |
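
For readers unfamiliar with the metrics, the sketch below computes R@k and MRR@k from their standard definitions on a tiny hand-made run. The evaluation script behind the table is not shown here, so the function names, the `run` and `qrels` dictionaries, and the passage ids are all hypothetical. The table reports these averages as percentages.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear among the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage within the top k, or 0.0 if none."""
    relevant = set(relevant_ids)
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical run: passage ids ranked per query, and the judged relevant ids.
run = {"q1": ["p7", "p2", "p9"], "q2": ["p4", "p5", "p1"]}
qrels = {"q1": ["p2"], "q2": ["p8"]}

mean_recall = sum(recall_at_k(run[q], qrels[q], 3) for q in run) / len(run)
mean_mrr = sum(mrr_at_k(run[q], qrels[q], 10) for q in run) / len(run)
print(mean_recall, mean_mrr)  # 0.5 0.25
```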

***

## Training

#### Data

We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
that contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official dataset but instead sample harder negatives mined
from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) distillation dataset.
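
As a rough illustration of sampling from negatives mined by several retrievers, the sketch below pools the candidates of every system, removes positives and duplicates, and draws a random subset. The exact sampling strategy used for this model is not described, so this is only one plausible approach; the `sample_negatives` helper, the record layout (`qid`, `pos`, and a `neg` mapping from system name to passage ids), the retriever names, and the ids are assumptions made for illustration.

```python
import random


def sample_negatives(negatives_by_system, positives, num_negatives, rng):
    """Pool the negatives mined by every retriever, drop positives and duplicates, then sample."""
    pool = []
    seen = set(positives)
    for system in sorted(negatives_by_system):
        for pid in negatives_by_system[system]:
            if pid not in seen:
                seen.add(pid)
                pool.append(pid)
    return rng.sample(pool, min(num_negatives, len(pool)))


# Hypothetical record: retriever names and passage ids are made up for illustration.
record = {
    "qid": 1,
    "pos": [10],
    "neg": {"retriever_a": [11, 12, 10], "retriever_b": [12, 13]},
}
rng = random.Random(0)  # fixed seed so the draw is reproducible
print(sample_negatives(record["neg"], record["pos"], num_negatives=2, rng=rng))
```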

#### Implementation

The model is initialized from the [camembert-L8](https://huggingface.co/antoinelouis/camembert-L8) checkpoint and optimized via the cross-entropy loss
(as in [DPR](https://doi.org/10.48550/arXiv.2004.04906)) with a temperature of 0.05. It is fine-tuned on one 32GB NVIDIA V100 GPU for 39k steps (or 40 epochs)
using the AdamW optimizer with a batch size of 512 and a peak learning rate of 2e-5, with warm-up over the first 3,900 steps and linear scheduling.
We set the maximum sequence length for both questions and passages to 128 tokens. We use cosine similarity to compute relevance scores.
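
The two numeric ingredients of this recipe can be sketched in a few lines of plain Python. The sketch is written under stated assumptions, not taken from the training code: `contrastive_loss` treats the candidates of one query as a softmax classification problem over temperature-scaled cosine similarities (in DPR, the other passages in the batch serve as negatives), and `learning_rate` assumes that "linear scheduling" means a linear decay to zero at the final step. Both function names and the toy similarities are hypothetical.

```python
import math


def contrastive_loss(similarities, positive_index, temperature=0.05):
    """Cross-entropy over temperature-scaled similarities of one query to its candidates."""
    logits = [s / temperature for s in similarities]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_norm - logits[positive_index]


def learning_rate(step, peak_lr=2e-5, warmup_steps=3900, total_steps=39000):
    """Linear warm-up to peak_lr, then linear decay (assumed to reach zero at total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# Hypothetical cosine similarities of one query to four passages; index 0 is the positive.
print(contrastive_loss([0.9, 0.3, 0.2, 0.1], positive_index=0))
# Learning rate halfway through warm-up, at the peak, and at the last step.
print(learning_rate(1950), learning_rate(3900), learning_rate(39000))
```

With a temperature of 0.05, a similarity gap of 0.6 becomes a logit gap of 12, so the loss is already close to zero once the positive is clearly ahead of the negatives.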

***

## Citation

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {biencoder-camembert-L8-mmarcoFR: A Biencoder Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = {may},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/biencoder-camembert-L8-mmarcoFR},
}
```