pipeline_tag: feature-extraction
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
colbertv1-camembert-base-mmarcoFR
This is a ColBERTv1 model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the French portion of the mMARCO dataset.
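To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring: for each query token, take the best cosine similarity against any passage token, then sum over query tokens. The function names and the toy 2-dimensional embeddings are made up for illustration; the real model produces higher-dimensional embeddings and the library computes this on batched tensors.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the best cosine similarity with any passage token."""
    q = [l2_normalize(v) for v in query_embs]
    p = [l2_normalize(v) for v in passage_embs]
    score = 0.0
    for q_vec in q:
        # Best-matching passage token for this query token.
        score += max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
    return score

# Toy embeddings (hypothetical values, one row per token).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
passage_b = [[-1.0, 0.0], [1.0, 1.0]]

print(maxsim_score(query, passage_a))
print(maxsim_score(query, passage_b))
```

In this toy case `passage_a` contains an exact directional match for both query tokens, so it scores higher than `passage_b`.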
Usage
Using ColBERT on a dataset typically involves the following steps:
Step 1: Preprocess your collection. At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., collection.tsv) will contain all passages and another (e.g., queries.tsv) will contain a set of queries for searching the collection.
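As a rough sketch of this preprocessing step, the snippet below writes both files with one `id<TAB>text` record per line. The two-column layout and the use of the 0-based line index as passage id are assumptions about what the ColBERT loaders expect; `write_tsv`, the sample texts, and the query id are hypothetical.

```python
import os
import tempfile

def write_tsv(path, rows):
    """Write (id, text) pairs as one `id<TAB>text` line each."""
    with open(path, "w", encoding="utf-8") as f:
        for row_id, text in rows:
            # Collapse tabs and newlines so each record stays on a single line.
            clean = " ".join(text.split())
            f.write(f"{row_id}\t{clean}\n")

# Hypothetical sample data.
passages = [
    "Paris est la capitale de la France.",
    "Le camembert est un fromage\tnormand.",
]
queries = [(101, "Quelle est la capitale de la France ?")]

out_dir = tempfile.mkdtemp()
collection_path = os.path.join(out_dir, "collection.tsv")
queries_path = os.path.join(out_dir, "queries.tsv")

# Assumption: passage ids are the 0-based line index of each passage.
write_tsv(collection_path, enumerate(passages))
write_tsv(queries_path, queries)

with open(collection_path, encoding="utf-8") as f:
    print(f.read())
```

Stripping stray tabs and newlines from the text matters here, since either one would silently break the one-record-per-line structure.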
Step 2: Index your collection. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            nbits=2,
            root="/path/to/experiments",
        )
        # Encode every passage and write the index under the experiment root.
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
Step 3: Search the collection with your queries. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        # Load the index built in Step 2 (same index name and experiment).
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        # Retrieve the top-100 passages for every query and save the ranking.
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")
Evaluation
We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages.
[...]
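For readers unfamiliar with the recall metric listed for this model, the following is a minimal sketch of recall@k averaged over queries. It is not the script used to produce the reported numbers; the function names, the dictionary layout, and the toy ids are assumptions made for illustration.

```python
def recall_at_k(ranked_pids, relevant_pids, k):
    """Fraction of a query's relevant passages that appear in its top-k results."""
    if not relevant_pids:
        return 0.0
    top_k = set(ranked_pids[:k])
    return len(top_k & set(relevant_pids)) / len(relevant_pids)

def mean_recall_at_k(rankings, qrels, k):
    """Average recall@k over all queries that have relevance judgments."""
    scores = [recall_at_k(rankings.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)

# Hypothetical data: query id -> ranked passage ids, and query id -> relevant passage ids.
rankings = {1: [10, 11, 12], 2: [20, 21, 22]}
qrels = {1: {11}, 2: {20, 25}}

print(mean_recall_at_k(rankings, qrels, k=1))
print(mean_recall_at_k(rankings, qrels, k=3))
```

Queries with no retrieved results count as zero recall here, which keeps the average comparable across systems that fail on different queries.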
Training
Background
We used the camembert-base model and fine-tuned it on a dataset of 500K French sentence triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query.
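The pairwise softmax cross-entropy objective can be sketched as follows: for each triple, the positive and negative scores are passed through a two-way softmax and the loss is the negative log-probability assigned to the positive passage. This is an illustrative scalar version with hypothetical function names and scores, not the training code itself.

```python
import math

def pairwise_softmax_ce(pos_score, neg_score):
    """-log(exp(pos) / (exp(pos) + exp(neg))), computed as softplus(neg - pos)."""
    diff = neg_score - pos_score
    # Numerically stable log(1 + exp(diff)).
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def batch_loss(score_pairs):
    """Mean loss over a batch of (positive score, negative score) pairs."""
    return sum(pairwise_softmax_ce(p, n) for p, n in score_pairs) / len(score_pairs)

# Hypothetical scores for three triples.
batch = [(5.0, 1.0), (2.0, 2.0), (1.0, 5.0)]
for pos, neg in batch:
    print(pos, neg, pairwise_softmax_ce(pos, neg))
print(batch_loss(batch))
```

The loss approaches zero when the positive passage outscores the negative by a wide margin, equals log 2 when the two scores tie, and grows roughly linearly with the gap when the negative is ranked higher.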
Hyperparameters
We trained the model on a single Tesla V100 GPU with 32 GB of memory for 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
Data
We used the French version of the mMARCO dataset to fine-tune our model. mMARCO is a multilingual, machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
- a corpus of 8.8M passages;
- a training set of ~533k queries (with at least one relevant passage);
- a development set of ~101k queries;
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
Link: https://ir-datasets.com/mmarco.html#mmarco/v2/fr/
Citation
@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = {dec},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}