antoinelouis committed
Commit: b463025
1 Parent(s): 32a3049

Update README.md

Files changed (1): README.md +11 -12
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 pipeline_tag: sentence-similarity
 language: fr
-license: apache-2.0
+license: mit
 datasets:
 - unicamp-dl/mmarco
 metrics:
@@ -24,7 +24,6 @@ To use this model, you will need to install the following libraries:
 pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
 ```
 
-
 ## Usage
 
 **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
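
A minimal sketch of what this indexing step can look like with the ColBERT library installed above; the checkpoint identifier, collection path, and experiment/index names are illustrative placeholders rather than values taken from this repository:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # nranks = number of GPUs to use; ColBERT indexing requires at least one GPU.
    with Run().context(RunConfig(nranks=1, experiment="mmarco-fr")):  # experiment name is a placeholder
        config = ColBERTConfig(doc_maxlen=256, nbits=2)  # 256 matches the model's passage length limit
        indexer = Indexer(checkpoint="<this-model-checkpoint>", config=config)  # placeholder id/path
        indexer.index(name="mmarco-fr.nbits2", collection="collection.tsv")  # TSV of `pid \t passage` lines
```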
@@ -76,19 +75,19 @@ The model is evaluated on the smaller development set of mMARCO-fr, which consis
 
 ## Training
 
-#### Details
-
-The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query. It was trained on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
-
 #### Data
 
-The model is fine-tuned on the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multi-lingual machine-translated version of the MS MARCO dataset which comprises:
-- a corpus of 8.8M passages;
-- a training set of ~533k unique queries (with at least one relevant passage);
-- a development set of ~101k queries;
-- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
+We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
+a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
+We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
+
+#### Implementation
 
-The triples are sampled from the ~39.8M triples of [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset). In the future, better negatives could be selected by exploiting the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) dataset that contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.
+The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
+cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
+and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
+with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
+to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
 
 ## Citation
 
 
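To make the training objective added in the `#### Implementation` paragraph concrete, here is a minimal PyTorch sketch of the combined loss, assuming score tensors already produced by the model's late-interaction (MaxSim) scoring; the function name, tensor shapes, and the unweighted sum of the two terms are illustrative assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F

def combined_colbert_loss(pos_scores, neg_scores, in_batch_scores=None):
    # Pairwise softmax cross-entropy (ColBERTv1 style): each (q, p+, p-) triple
    # is a 2-way classification where the positive passage (index 0) is the target.
    pairwise = torch.stack([pos_scores, neg_scores], dim=1)    # (B, 2)
    targets = torch.zeros(pairwise.size(0), dtype=torch.long)  # positive = class 0
    loss = F.cross_entropy(pairwise, targets)

    # In-batch sampled softmax cross-entropy (ColBERTv2 style): each query must
    # score its own positive passage above the other passages in the same batch.
    if in_batch_scores is not None:                            # (B, B) score matrix
        diag = torch.arange(in_batch_scores.size(0))           # gold passage on the diagonal
        loss = loss + F.cross_entropy(in_batch_scores, diag)
    return loss

# Example with random scores for a batch of 4 queries:
loss = combined_colbert_loss(torch.randn(4), torch.randn(4), torch.randn(4, 4))
```

In this sketch, `in_batch_scores[i, j]` holds the score of query `i` against the positive passage of query `j`, so each query is trained to rank its own passage (the diagonal entry) highest.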