antoinelouis/colbertv1-camembert-base-mmarcoFR

@@ -7,75 +7,99 @@ datasets:
 metrics:
 - recall
 tags:
-- sentence-similarity
 - colbert
 base_model: camembert-base
 library_name: RAGatouille
 inference: false
 ---
-# 🇫🇷 colbertv1-camembert-base-mmarcoFR
-This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
 ## Usage
-Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
-### Using ColBERT-AI
 First, you will need to install the following libraries:
 ```bash
-pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
 ```
 Then, you can use the model like this:
 ```python
-from colbert import Indexer, Searcher
-from colbert.infra import Run, RunConfig
-n_gpu: int = 1 # Set your number of available GPUs
-experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
 index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
 documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
-# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
-with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
-    indexer.index(name=index_name, collection=documents)
-# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
-with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
-    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
-    # results: tuple of three parallel lists: (passage_ids, passage_ranks, passage_scores)
 ```
-### Using RAGatouille
 First, you will need to install the following libraries:
 ```bash
-pip install -U ragatouille
 ```
 Then, you can use the model like this:
 ```python
-from ragatouille import RAGPretrainedModel
 index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
 documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
-# Step 1: Indexing.
-RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
-RAG.index(name=index_name, collection=documents)
-# Step 2: Searching.
-RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
-RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
 ```
 ***
@@ -107,12 +131,14 @@ and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://d
 with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
 to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
 ## Citation
 ```bibtex
 @online{louis2023,
    author    = {Antoine Louis},
-   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model for French},
    publisher = {Hugging Face},
    month     = {dec},
    year      = {2023},

 metrics:
 - recall
 tags:
 - colbert
+- passage-retrieval
 base_model: camembert-base
 library_name: RAGatouille
 inference: false
+model-index:
+- name: colbertv1-camembert-base-mmarcoFR
+  results:
+    - task:
+        type: sentence-similarity
+        name: Passage Retrieval
+      dataset:
+        type: unicamp-dl/mmarco
+        name: mMARCO-fr
+        config: french
+        split: validation
+      metrics:
+        - type: recall_at_500
+          name: Recall@500
+          value: 88.40
+        - type: recall_at_100
+          name: Recall@100
+          value: 80.00
+        - type: recall_at_10
+          name: Recall@10
+          value: 54.21
+        - type: mrr_at_10
+          name: MRR@10
+          value: 29.51
 ---
+# colbertv1-camembert-base-mmarcoFR
+This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
 ## Usage
+Here are some examples for using the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).
+### Using RAGatouille
 First, you will need to install the following libraries:
 ```bash
+pip install -U ragatouille
 ```
 Then, you can use the model like this:
 ```python
+from ragatouille import RAGPretrainedModel
 index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
 documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
+# Step 1: Indexing.
+RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
+RAG.index(name=index_name, collection=documents)
+# Step 2: Searching.
+RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
+RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
 ```
+### Using ColBERT-AI
 First, you will need to install the following libraries:
 ```bash
+pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
 ```
 Then, you can use the model like this:
 ```python
+from colbert import Indexer, Searcher
+from colbert.infra import Run, RunConfig
+n_gpu: int = 1 # Set your number of available GPUs
+experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
 index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
 documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
+# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
+with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
+    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
+    indexer.index(name=index_name, collection=documents)
+# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
+with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
+    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
+    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
+    # results: tuple of three parallel lists: (passage_ids, passage_ranks, passage_scores)
 ```
 ***
 with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
 to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
+***
 ## Citation
 ```bibtex
 @online{louis2023,
    author    = {Antoine Louis},
+   title     = {colbertv1-camembert-base-mmarcoFR: The 1st ColBERT Model for French},
    publisher = {Hugging Face},
    month     = {dec},
    year      = {2023},