antoinelouis committed
Commit: b463025
1 Parent(s): 32a3049

Update README.md

Files changed (1): README.md +11 -12
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 pipeline_tag: sentence-similarity
 language: fr
-license: apache-2.0
+license: mit
 datasets:
 - unicamp-dl/mmarco
 metrics:
@@ -24,7 +24,6 @@ To use this model, you will need to install the following libraries:
 pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
 ```
 
-
 ## Usage
 
 **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
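
A minimal sketch of what this indexing step can look like with the ColBERT library installed above; the checkpoint identifier, collection path, and experiment/index names are illustrative placeholders rather than values taken from this repository:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # nranks = number of GPUs to use; ColBERT indexing requires at least one GPU.
    with Run().context(RunConfig(nranks=1, experiment="mmarco-fr")):  # experiment name is a placeholder
        config = ColBERTConfig(doc_maxlen=256, nbits=2)  # 256 matches the model's passage length limit
        indexer = Indexer(checkpoint="<this-model-checkpoint>", config=config)  # placeholder id/path
        indexer.index(name="mmarco-fr.nbits2", collection="collection.tsv")  # TSV of `pid \t passage` lines
```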
@@ -76,19 +75,19 @@ The model is evaluated on the smaller development set of mMARCO-fr, which consis
 
 ## Training
 
-#### Details
-
-The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query. It was trained on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
-
 #### Data
 
-The model is fine-tuned on the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multi-lingual machine-translated version of the MS MARCO dataset which comprises:
-- a corpus of 8.8M passages;
-- a training set of ~533k unique queries (with at least one relevant passage);
-- a development set of ~101k queries;
-- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
+We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
+a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
+We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
+
+#### Implementation
 
-The triples are sampled from the ~39.8M triples of [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset). In the future, better negatives could be selected by exploiting the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) dataset that contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.
+The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
+cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
+and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
+with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
+to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
 
 ## Citation
 
 
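To make the training objective added in the `#### Implementation` paragraph concrete, here is a minimal PyTorch sketch of the combined loss, assuming score tensors already produced by the model's late-interaction (MaxSim) scoring; the function name, tensor shapes, and the unweighted sum of the two terms are illustrative assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F

def combined_colbert_loss(pos_scores, neg_scores, in_batch_scores=None):
    # Pairwise softmax cross-entropy (ColBERTv1 style): each (q, p+, p-) triple
    # is a 2-way classification where the positive passage (index 0) is the target.
    pairwise = torch.stack([pos_scores, neg_scores], dim=1)    # (B, 2)
    targets = torch.zeros(pairwise.size(0), dtype=torch.long)  # positive = class 0
    loss = F.cross_entropy(pairwise, targets)

    # In-batch sampled softmax cross-entropy (ColBERTv2 style): each query must
    # score its own positive passage above the other passages in the same batch.
    if in_batch_scores is not None:                            # (B, B) score matrix
        diag = torch.arange(in_batch_scores.size(0))           # gold passage on the diagonal
        loss = loss + F.cross_entropy(in_batch_scores, diag)
    return loss

# Example with random scores for a batch of 4 queries:
loss = combined_colbert_loss(torch.randn(4), torch.randn(4), torch.randn(4, 4))
```

In this sketch, `in_batch_scores[i, j]` holds the score of query `i` against the positive passage of query `j`, so each query is trained to rank its own passage (the diagonal entry) highest.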