---
pipeline_tag: feature-extraction
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
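In toy form, the MaxSim operator mentioned above can be sketched as follows. The vectors here are illustrative stand-ins, not the model's actual embeddings:

```python
# MaxSim, sketched: each query token embedding is matched against its
# best-scoring document token embedding, and those maxima are summed.
# Toy 2-d vectors only; the real model uses learned embeddings.

def maxsim_score(query_embs, doc_embs):
    """Sum over query tokens of the max dot product with any doc token."""
    score = 0.0
    for q in query_embs:
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in doc_embs)
        score += best
    return score

query = [[1.0, 0.0], [0.0, 1.0]]              # 2 query token embeddings
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # 3 passage token embeddings

score = maxsim_score(query, doc)  # best matches are 0.9 and 0.8
```

Because each query token independently picks its best match, the operator captures fine-grained token-level interactions while remaining easy to accelerate with vector search.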

## Usage
***

Using ColBERT on a dataset typically involves the following steps:

**Step 1: Preprocess your collection.** At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., `collection.tsv`) will contain all passages and another (e.g., `queries.tsv`) will contain a set of queries for searching the collection.
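As a minimal sketch of this preprocessing step, assuming the common `pid<TAB>passage` layout for the collection file (and, analogously, `qid<TAB>query` for the queries file):

```python
# Illustrative only: write a tiny collection.tsv where each line is
# "<pid>\t<passage>". The passages below are hypothetical examples.

passages = [
    "Paris est la capitale de la France.",
    "Le Louvre est un musée situé à Paris.",
]

with open("collection.tsv", "w", encoding="utf-8") as f:
    for pid, passage in enumerate(passages):
        f.write(f"{pid}\t{passage}\n")
```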

**Step 2: Index your collection.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            nbits=2,
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
```

**Step 3: Search the collection with your queries.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```python
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")
```
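The saved run file can then be post-processed with ordinary tooling. The sketch below assumes (this is not stated in the card) that each line of the ranking TSV has the form `qid<TAB>pid<TAB>rank<TAB>score`:

```python
# Parse a ranking TSV into a per-query list of (pid, score) pairs.
# The "qid\tpid\trank\tscore" layout is an assumption for illustration.
from collections import defaultdict

def read_ranking(path):
    """Map each query id to its ranked list of (pid, score) pairs."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, pid, rank, score = line.rstrip("\n").split("\t")
            run[qid].append((pid, float(score)))
    return run

# Tiny hand-written file in the assumed format:
with open("example.ranking.tsv", "w", encoding="utf-8") as f:
    f.write("0\t42\t1\t31.5\n0\t7\t2\t29.8\n")

run = read_ranking("example.ranking.tsv")
```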


## Evaluation
***

We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages.
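For reference, the recall metric listed in the card's metadata can be computed from a run as sketched below. The variable names (`run`, `qrels`) are illustrative, not part of the ColBERT API:

```python
# Recall@k: for each query, the fraction of its relevant passages found
# in the top-k retrieved passages, averaged over queries.

def recall_at_k(run, qrels, k):
    """run: qid -> ranked list of pids; qrels: qid -> set of relevant pids."""
    total = 0.0
    for qid, relevant in qrels.items():
        retrieved = set(run.get(qid, [])[:k])
        total += len(retrieved & relevant) / len(relevant)
    return total / len(qrels)

# Toy example: q1's relevant passage is ranked 2nd, q2's is missed.
run = {"q1": ["p1", "p2", "p3"], "q2": ["p9", "p8"]}
qrels = {"q1": {"p2"}, "q2": {"p5"}}
```

On MS MARCO-style data, where most queries have a single relevant passage, Recall@k reduces to the fraction of queries whose relevant passage appears in the top k.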

[...]
|
67 |
+
|
68 |
+
## Training
|
69 |
+
***
|
70 |
+
|
71 |
+
#### Background
|
72 |
+
|
73 |
+
We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence triples dataset in French via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query.
|
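The loss described above can be written out in a few lines. This is a pure-Python illustration of the formula only; actual training operates on batched tensors:

```python
# Pairwise softmax cross-entropy: treat the scores of the positive and
# negative passage as logits [s_pos, s_neg] and take the negative log
# softmax probability of the positive passage.
import math

def pairwise_softmax_ce(s_pos, s_neg):
    return -math.log(math.exp(s_pos) / (math.exp(s_pos) + math.exp(s_neg)))
```

When the positive passage scores far above the negative, the loss approaches 0; equal scores give log 2, so minimizing it pushes the positive's score above the negative's.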

#### Hyperparameters

We trained the model on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
|
78 |
+
|
79 |
+
#### Data
|
80 |
+
|
81 |
+
We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
|
82 |
+
- a corpus of 8.8M passages;
|
83 |
+
- a training set of ~533k queries (with at least one relevant passage);
|
84 |
+
- a development set of ~101k queries;
|
85 |
+
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
|
86 |
+
Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
|
87 |
+
|
88 |
+
## Citation
|
89 |
+
|
90 |
+
```bibtex
|
91 |
+
@online{louis2023,
|
92 |
+
author = 'Antoine Louis',
|
93 |
+
title = 'colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO',
|
94 |
+
publisher = 'Hugging Face',
|
95 |
+
month = 'dec',
|
96 |
+
year = '2023',
|
97 |
+
url = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
|
98 |
+
}
|
99 |
+
```
|