metadata

pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
  - unicamp-dl/mmarco
metrics:
  - recall
tags:
  - colbert
  - passage-retrieval
base_model: camembert-base
library_name: RAGatouille
inference: false
model-index:
  - name: colbertv1-camembert-base-mmarcoFR
    results:
      - task:
          type: sentence-similarity
          name: Passage Retrieval
        dataset:
          type: unicamp-dl/mmarco
          name: mMARCO-fr
          config: french
          split: validation
        metrics:
          - type: recall_at_1000
            name: Recall@1000
            value: 89.7
          - type: recall_at_500
            name: Recall@500
            value: 88.4
          - type: recall_at_100
            name: Recall@100
            value: 80
          - type: recall_at_10
            name: Recall@10
            value: 54.2
          - type: mrr_at_10
            name: MRR@10
            value: 29.5

colbertv1-camembert-base-mmarcoFR

This is a ColBERTv1 model for French that can be used for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
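
To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It is an illustration, not the model's actual implementation: it assumes cosine similarity between token embeddings, and the function names and toy vectors are made up for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_emb, passage_emb):
    """Late-interaction score: for each query token, take the maximum
    cosine similarity against any passage token, then sum over query tokens."""
    q = [l2_normalize(v) for v in query_emb]
    p = [l2_normalize(v) for v in passage_emb]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


# Toy 2-dimensional "token embeddings" (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # covers both query tokens
passage_b = [[1.0, 0.0]]                          # covers only the first one

print(maxsim_score(query, passage_a), maxsim_score(query, passage_b))  # prints "2.0 1.0"
```

Because each query token is matched independently, a passage scores highest when every query token has a close counterpart somewhere in the passage.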

Usage

Here are some examples for using the model with RAGatouille or colbert-ai.

Using RAGatouille

First, install RAGatouille:

pip install -U ragatouille

Then, you can use the model like this:

from ragatouille import RAGPretrainedModel

index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)

# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
results = RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)

Using colbert-ai

First, you will need to install the following libraries:

pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2

Then, you can use the model like this:

from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # No need to specify the checkpoint again: the model name is stored in the index.
    results = searcher.search("Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
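
The search output can be turned into readable rows with a small helper. This is a sketch under two assumptions: that `results` holds three parallel sequences (passage ids, ranks, scores), which is how the upstream ColBERT examples consume it via `zip(*results)`, and that each passage id is the position of the passage in the `documents` list passed at indexing time. The helper name `top_passages` is hypothetical.

```python
def top_passages(results, documents):
    """Pair each retrieved passage id with its rank, score, and text.

    Assumes `results` is (passage_ids, ranks, scores) and that a passage id
    is an index into `documents`.
    """
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "score": score, "passage_id": pid, "text": documents[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Stand-in values shaped like a search result (not real model output).
documents = ["Ceci est un premier document.", "Voici un second document.", "etc."]
fake_results = ([1, 0], [1, 2], [12.5, 9.0])

for row in top_passages(fake_results, documents):
    print(row["rank"], row["passage_id"], row["score"], row["text"])
```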

Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k). Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French, check out the DécouvrIR leaderboard.
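
For readers unfamiliar with these metrics, the sketch below shows how MRR@10 and R@k are conventionally computed from a ranked list per query. It is not the evaluation script used for this model; the `run` and `qrels` structures and all function names are assumptions chosen for illustration. The values come out as fractions in [0, 1], whereas the table below lists them as percentages.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, k_mrr=10, recall_cutoffs=(10, 100, 500, 1000)):
    """Average per-query metrics.

    run:   query id -> list of passage ids, best first
    qrels: query id -> set of relevant passage ids
    """
    n = len(qrels)
    report = {
        f"MRR@{k_mrr}": sum(
            reciprocal_rank_at_k(run.get(q, []), rel, k_mrr) for q, rel in qrels.items()
        ) / n
    }
    for k in recall_cutoffs:
        report[f"R@{k}"] = sum(
            recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()
        ) / n
    return report


# Two toy queries: the first finds its relevant passage at rank 2, the second misses.
toy_run = {"q1": [3, 7, 1], "q2": [5, 2]}
toy_qrels = {"q1": {7}, "q2": {9}}
print(evaluate(toy_run, toy_qrels, recall_cutoffs=(10,)))  # {'MRR@10': 0.25, 'R@10': 0.5}
```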

model	#Param.(↓)	Size	Dim.	Index	R@1000	R@500	R@100	R@10	MRR@10
colbertv2-camembert-L4-mmarcoFR	54M	0.2GB	32	9GB	91.9	90.3	81.9	56.7	32.3
FraColBERTv2	111M	0.4GB	128	28GB	90.0	88.9	81.2	57.1	32.4
colbertv1-camembert-base-mmarcoFR	111M	0.4GB	128	28GB	89.7	88.4	80.0	54.2	29.5

NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.

Training

Data

We use the French training set from the mMARCO dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We sample 12.8M (q, p+, p-) triples from the official ~39.8M training triples.
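
The 12.8M figure is consistent with the training schedule given under Implementation (200k steps at a batch size of 64), assuming each batch element is one triple. A quick check of that arithmetic:

```python
steps = 200_000                 # training steps (from the Implementation section)
batch_size = 64                 # triples per step, assuming one triple per batch element
official_triples = 39_800_000   # approximate size of the official triple set

triples_seen = steps * batch_size
fraction_of_official = triples_seen / official_triples

print(f"{triples_seen:,} triples, about {fraction_of_official:.1%} of the official set")
# prints "12,800,000 triples, about 32.2% of the official set"
```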

Implementation

The model is initialized from the camembert-base checkpoint and optimized via a combination of the pairwise softmax cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in ColBERTv1) and the in-batch sampled softmax cross-entropy loss (as in ColBERTv2). It was trained on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
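
The two loss terms can be sketched in plain Python as follows. This is only an illustration of the general technique, not the training code. How the two terms are weighted is not stated here, so the sketch simply adds them, which is an assumption. It also assumes the in-batch term treats the positive passages of the other queries in the batch as negatives. All function names are hypothetical.

```python
import math


def log_softmax_at(scores, target):
    """Log-probability of scores[target] under a softmax over `scores`."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def pairwise_loss(pos_score, neg_score):
    """Softmax cross-entropy over one (positive, hard negative) score pair."""
    return -log_softmax_at([pos_score, neg_score], 0)


def in_batch_loss(score_matrix):
    """score_matrix[i][j] is the score of query i against the positive passage
    of query j; the correct "class" for query i is column i."""
    n = len(score_matrix)
    return -sum(log_softmax_at(score_matrix[i], i) for i in range(n)) / n


def combined_loss(pos_scores, neg_scores, score_matrix):
    """Mean pairwise loss plus in-batch loss (unweighted sum: an assumption)."""
    pairwise = sum(
        pairwise_loss(p, n) for p, n in zip(pos_scores, neg_scores)
    ) / len(pos_scores)
    return pairwise + in_batch_loss(score_matrix)


# Toy batch of two queries with made-up relevance scores.
scores = [[5.0, 1.0],
          [0.5, 4.0]]
print(combined_loss(pos_scores=[5.0, 4.0], neg_scores=[2.0, 3.0], score_matrix=scores))
```

Both terms shrink as the positive passage's score rises relative to the competing passages, so minimizing their sum pushes relevant passages above both the sampled hard negative and the other passages in the batch.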

Citation

@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = mar,
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}