	---
	pipeline_tag: sentence-similarity
	language: fr
	license: mit
	datasets:
	- unicamp-dl/mmarco
	metrics:
	- recall
	tags:
	- sentence-similarity
	- colbert
	base_model: antoinelouis/camembert-L4
	library_name: RAGatouille
	inference: false
	---

	# 🇫🇷 colbertv2-camembert-L4-mmarcoFR

	This is a lightweight [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488) model for French that can be used for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
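To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring: each query token keeps its best cosine similarity over all passage tokens, and these maxima are summed. The function names and the toy 2-dimensional embeddings are illustrative assumptions, not part of the colbert-ai or RAGatouille APIs.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the max cosine similarity to any passage token."""
    q = [normalize(v) for v in query_embs]
    p = [normalize(v) for v in passage_embs]
    score = 0.0
    for q_vec in q:
        score += max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
    return score

# Toy token embeddings (hypothetical values; the real model uses 32 dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has an exact match for each query token
passage_b = [[1.0, 1.0]]                          # only a partial match for each query token

print(maxsim_score(query, passage_a))
print(maxsim_score(query, passage_b))
```

In this toy case `passage_a` scores higher than `passage_b` because every query token finds a perfectly aligned passage token.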

	## Usage

	Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).

	### Using ColBERT-AI

	First, you will need to install the following libraries:

	```bash
	pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
	```

	Then, you can use the model like this:

	```python
	from colbert import Indexer, Searcher
	from colbert.infra import Run, RunConfig

	n_gpu: int = 1 # Set your number of available GPUs
	experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
	index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
	documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

	# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
	with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
	    indexer = Indexer(checkpoint="antoinelouis/colbertv2-camembert-L4-mmarcoFR")
	    indexer.index(name=index_name, collection=documents)

	# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
	with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
	    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
	    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
	    # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
	```

	### Using RAGatouille

	First, you will need to install the following libraries:

	```bash
	pip install -U ragatouille
	```

	Then, you can use the model like this:

	```python
	from ragatouille import RAGPretrainedModel

	index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
	documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

	# Step 1: Indexing.
	RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")
	RAG.index(index_name=index_name, collection=documents)

	# Step 2: Searching.
	RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
	RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
	```

	***

	## Evaluation

	The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its
	performance with other publicly available 🇫🇷 ColBERT models (as well as one single-vector representation model) fine-tuned on the same dataset. We report the
	mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

	| model | #Param.(↓) | Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
	|:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
	| colbertv2-camembert-L4-mmarcoFR | 54M | 0.2GB | 32 | GB | 91.9 | 90.3 | 81.9 | 56.7 | 32.3 |
	| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2) | 111M | 0.4GB | 128 | 28GB | 90.0 | 88.9 | 81.2 | 57.1 | 32.4 |
	| [colbertv1-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR) | 111M | 0.4GB | 128 | 28GB | 89.7 | 88.4 | 80.0 | 54.2 | 29.5 |
	| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 111M | 0.4GB | 128 | 28GB | - | 89.1 | 77.8 | 51.5 | 28.5 |

	NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.
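For readers who want to reproduce this kind of evaluation, the sketch below shows one common way to compute MRR@k and R@k from ranked passage ids. It is not the evaluation script used for the table: the `run` and `qrels` structures, the function names, and the toy ids are assumptions made for illustration, and recall is taken here as the fraction of a query's relevant passages found in the top k.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant passage within the top k, or 0 if none is found."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def evaluate(run, qrels, mrr_k=10, recall_ks=(10, 100, 500, 1000)):
    """Average both metrics over all queries that have relevance judgments."""
    n = len(qrels)
    metrics = {
        f"MRR@{mrr_k}": sum(
            reciprocal_rank_at_k(run.get(q, []), rel, mrr_k) for q, rel in qrels.items()
        ) / n
    }
    for k in recall_ks:
        metrics[f"R@{k}"] = sum(
            recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()
        ) / n
    return metrics

# Toy data (hypothetical ids): qrels maps query -> relevant ids, run maps query -> ranked ids.
qrels = {"q1": {3}, "q2": {7, 9}}
run = {"q1": [5, 3, 8], "q2": [7, 1, 2]}
print(evaluate(run, qrels, mrr_k=10, recall_ks=(1, 3)))
```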

	***

	## Training

	#### Data

	We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of
	MS MARCO that contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official [triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset)
	but instead sample, for each query, 62 harder negatives mined from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
	distillation dataset. Next, we collect the relevance scores of an expressive [cross-encoder reranker](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)
	for all our (query, passage) pairs using the [cross-encoder-ms-marco-MiniLM-L-6-v2-scores](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#cross-encoder-ms-marco-minilm-l-6-v2-scorespklgz) dataset.
	This yields 10.4M different 64-way tuples of the form [query, (pos, pos_score), (neg1, neg1_score), ..., (neg62, neg62_score)] for training the model.
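As a rough illustration of the tuple format described above, the following sketch assembles one training example from a query, its positive passage, a pool of mined hard negatives, and a lookup of cross-encoder scores. The function name, the `ce_scores` dictionary layout, the fixed seed, and all toy ids and scores are hypothetical; the actual preprocessing pipeline may differ.

```python
import random

def build_training_tuple(query, positive, hard_negatives, ce_scores, n_negatives=62, seed=0):
    """Return [query, (pos, pos_score), (neg1, neg1_score), ..., (negN, negN_score)].

    ce_scores maps (query, passage) pairs to cross-encoder relevance scores.
    """
    rng = random.Random(seed)
    # random.sample raises ValueError if fewer than n_negatives candidates are available.
    sampled = rng.sample(hard_negatives, n_negatives)
    example = [query, (positive, ce_scores[(query, positive)])]
    example.extend((neg, ce_scores[(query, neg)]) for neg in sampled)
    return example

# Toy identifiers and scores (hypothetical values, for illustration only).
query_id, positive_id = 0, 1
candidate_negatives = list(range(100, 200))
scores = {(query_id, positive_id): 9.0}
scores.update({(query_id, pid): -0.1 * (pid - 100) for pid in candidate_negatives})

example = build_training_tuple(query_id, positive_id, candidate_negatives, scores)
print(len(example))  # 64: the query, one positive, and 62 negatives
```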

	#### Implementation

	The model is initialized from the [camembert-L4](https://huggingface.co/antoinelouis/camembert-L4) checkpoint and optimized with a combination of two losses: a KL-divergence loss
	that distills the cross-encoder scores into the model, and an in-batch sampled softmax cross-entropy loss applied to the positive score of each query against all
	passages corresponding to other queries in the same batch (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). The model is fine-tuned on one 80GB NVIDIA
	H100 GPU for 325k steps using the AdamW optimizer with a batch size of 32 and a peak learning rate of 1e-5, with warm-up over the first 20k steps and linear scheduling.
	The embedding dimension is set to 32, and the maximum sequence lengths for questions and passages are fixed to 32 and 160 tokens, respectively. We use
	cosine similarity to compute relevance scores.
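The training objective described above can be sketched in plain Python as follows. This is a simplified illustration rather than the training code: it assumes the two losses are added with equal weight (the weighting is not stated here), that the positive passage sits at index 0 of each score list, and that the batch layout and all names are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def kl_distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over one query's candidate passages."""
    log_s = log_softmax(student_scores)
    log_t = log_softmax(teacher_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

def in_batch_ce_loss(positive_score, other_scores):
    """Cross-entropy with the positive as target among itself and in-batch passages."""
    return -log_softmax([positive_score] + list(other_scores))[0]

def combined_loss(batch):
    """Mean over queries of distillation loss plus in-batch loss (equal weights assumed)."""
    total = 0.0
    for ex in batch:
        total += kl_distillation_loss(ex["student"], ex["teacher"])
        total += in_batch_ce_loss(ex["student"][0], ex["in_batch"])
    return total / len(batch)

# Toy batch (hypothetical scores): "student" holds the model's scores for the positive and
# its own negatives, "teacher" the cross-encoder scores for the same passages, and
# "in_batch" the model's scores against passages belonging to other queries.
batch = [
    {"student": [0.9, 0.4, 0.1], "teacher": [8.0, 1.0, -3.0], "in_batch": [0.2, 0.0]},
    {"student": [0.7, 0.6, 0.2], "teacher": [6.0, 2.0, -1.0], "in_batch": [0.1, 0.3]},
]
print(combined_loss(batch))
```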

	***

	## Citation

	```bibtex
	@online{louis2024,
	   author    = {Antoine Louis},
	   title     = {colbertv2-camembert-L4-mmarcoFR: A Lightweight ColBERTv2 Model for French},
	   publisher = {Hugging Face},
	   month     = {mar},
	   year      = {2024},
	   url       = {https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR},
	}
	```