antoinelouis committed
Commit b463025
1 Parent(s): 32a3049
Update README.md

README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 pipeline_tag: sentence-similarity
 language: fr
-license:
+license: mit
 datasets:
 - unicamp-dl/mmarco
 metrics:
@@ -24,7 +24,6 @@ To use this model, you will need to install the following libraries:
 pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
 ```
 
-
 ## Usage
 
 **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
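
For reference, a minimal indexing sketch using the ColBERT library installed above; the checkpoint path, collection path, and index name are placeholders rather than values taken from this model card:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # Indexing requires a CUDA-capable GPU; nranks=1 uses a single GPU.
    with Run().context(RunConfig(nranks=1, experiment="mmarco-fr")):
        config = ColBERTConfig(
            doc_maxlen=256,   # passage length limit used by this model
            query_maxlen=32,  # query length limit used by this model
            nbits=2,          # bits per dimension for residual compression (placeholder)
        )
        indexer = Indexer(checkpoint="path/to/this/checkpoint", config=config)
        # collection.tsv holds one "pid \t passage text" pair per line.
        indexer.index(name="mmarco-fr.nbits2", collection="path/to/collection.tsv")
```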
@@ -76,19 +75,19 @@ The model is evaluated on the smaller development set of mMARCO-fr, which consists
 
 ## Training
 
-#### Details
-
-The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query. It was trained on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
-
 #### Data
 
-
-
-
-
-
+We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
+a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
+We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
+
+#### Implementation
 
-The
+The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
+cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
+and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
+with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
+to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
 
 ## Citation
 
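
A rough PyTorch sketch of the combined training objective described in the Implementation paragraph above; tensor shapes and the function name are illustrative, not taken from the actual training code:

```python
import torch
import torch.nn.functional as F

def combined_colbert_loss(pos_scores: torch.Tensor,
                          neg_scores: torch.Tensor,
                          in_batch_scores: torch.Tensor) -> torch.Tensor:
    """Illustrative combination of the two losses described above.

    pos_scores:      (B,)   score of each query against its positive passage
    neg_scores:      (B,)   score of each query against its hard negative passage
    in_batch_scores: (B, B) scores of each query against every positive passage
                            in the batch (diagonal = the query's own passage)
    """
    # Pairwise softmax cross-entropy over (positive, hard negative), as in ColBERTv1:
    # the positive passage (index 0) should win the two-way softmax.
    pairwise_logits = torch.stack([pos_scores, neg_scores], dim=1)  # (B, 2)
    pairwise_targets = torch.zeros(pos_scores.size(0), dtype=torch.long,
                                   device=pos_scores.device)
    pairwise_loss = F.cross_entropy(pairwise_logits, pairwise_targets)

    # In-batch sampled softmax cross-entropy, as in ColBERTv2: each query should
    # score its own passage above the other passages appearing in the batch.
    in_batch_targets = torch.arange(in_batch_scores.size(0),
                                    device=in_batch_scores.device)
    in_batch_loss = F.cross_entropy(in_batch_scores, in_batch_targets)

    return pairwise_loss + in_batch_loss
```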