---
pipeline_tag: feature-extraction
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
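In toy form, the MaxSim operator mentioned above can be sketched as follows. The vectors here are illustrative stand-ins, not the model's actual embeddings:

```python
# MaxSim, sketched: each query token embedding is matched against its
# best-scoring document token embedding, and those maxima are summed.
# Toy 2-d vectors only; the real model uses learned embeddings.

def maxsim_score(query_embs, doc_embs):
    """Sum over query tokens of the max dot product with any doc token."""
    score = 0.0
    for q in query_embs:
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in doc_embs)
        score += best
    return score

query = [[1.0, 0.0], [0.0, 1.0]]              # 2 query token embeddings
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # 3 passage token embeddings

score = maxsim_score(query, doc)  # best matches are 0.9 and 0.8
```

Because each query token independently picks its best match, the operator captures fine-grained token-level interactions while remaining easy to accelerate with vector search.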

## Usage
***

Using ColBERT on a dataset typically involves the following steps:

**Step 1: Preprocess your collection.** At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., `collection.tsv`) will contain all passages and another (e.g., `queries.tsv`) will contain a set of queries for searching the collection.
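As a minimal sketch of this preprocessing step, assuming the common `pid<TAB>passage` layout for the collection file (and, analogously, `qid<TAB>query` for the queries file):

```python
# Illustrative only: write a tiny collection.tsv where each line is
# "<pid>\t<passage>". The passages below are hypothetical examples.

passages = [
    "Paris est la capitale de la France.",
    "Le Louvre est un musée situé à Paris.",
]

with open("collection.tsv", "w", encoding="utf-8") as f:
    for pid, passage in enumerate(passages):
        f.write(f"{pid}\t{passage}\n")
```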

**Step 2: Index your collection.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            nbits=2,
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
```

**Step 3: Search the collection with your queries.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```python
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")
```
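The saved run file can then be post-processed with ordinary tooling. The sketch below assumes (this is not stated in the card) that each line of the ranking TSV has the form `qid<TAB>pid<TAB>rank<TAB>score`:

```python
# Parse a ranking TSV into a per-query list of (pid, score) pairs.
# The "qid\tpid\trank\tscore" layout is an assumption for illustration.
from collections import defaultdict

def read_ranking(path):
    """Map each query id to its ranked list of (pid, score) pairs."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, pid, rank, score = line.rstrip("\n").split("\t")
            run[qid].append((pid, float(score)))
    return run

# Tiny hand-written file in the assumed format:
with open("example.ranking.tsv", "w", encoding="utf-8") as f:
    f.write("0\t42\t1\t31.5\n0\t7\t2\t29.8\n")

run = read_ranking("example.ranking.tsv")
```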


## Evaluation
***

We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages.
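For reference, the recall metric listed in the card's metadata can be computed from a run as sketched below. The variable names (`run`, `qrels`) are illustrative, not part of the ColBERT API:

```python
# Recall@k: for each query, the fraction of its relevant passages found
# in the top-k retrieved passages, averaged over queries.

def recall_at_k(run, qrels, k):
    """run: qid -> ranked list of pids; qrels: qid -> set of relevant pids."""
    total = 0.0
    for qid, relevant in qrels.items():
        retrieved = set(run.get(qid, [])[:k])
        total += len(retrieved & relevant) / len(relevant)
    return total / len(qrels)

# Toy example: q1's relevant passage is ranked 2nd, q2's is missed.
run = {"q1": ["p1", "p2", "p3"], "q2": ["p9", "p8"]}
qrels = {"q1": {"p2"}, "q2": {"p5"}}
```

On MS MARCO-style data, where most queries have a single relevant passage, Recall@k reduces to the fraction of queries whose relevant passage appears in the top k.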

[...]
|
67 |
+
|
68 |
+
## Training
|
69 |
+
***
|
70 |
+
|
71 |
+
#### Background
|
72 |
+
|
73 |
+
We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence triples dataset in French via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query.
|
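The loss described above can be written out in a few lines. This is a pure-Python illustration of the formula only; actual training operates on batched tensors:

```python
# Pairwise softmax cross-entropy: treat the scores of the positive and
# negative passage as logits [s_pos, s_neg] and take the negative log
# softmax probability of the positive passage.
import math

def pairwise_softmax_ce(s_pos, s_neg):
    return -math.log(math.exp(s_pos) / (math.exp(s_pos) + math.exp(s_neg)))
```

When the positive passage scores far above the negative, the loss approaches 0; equal scores give log 2, so minimizing it pushes the positive's score above the negative's.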

#### Hyperparameters

We trained the model on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
|
78 |
+
|
79 |
+
#### Data
|
80 |
+
|
81 |
+
We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
|
82 |
+
- a corpus of 8.8M passages;
|
83 |
+
- a training set of ~533k queries (with at least one relevant passage);
|
84 |
+
- a development set of ~101k queries;
|
85 |
+
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
|
86 |
+
Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
|
87 |
+
|
88 |
+
## Citation
|
89 |
+
|
90 |
+
```bibtex
|
91 |
+
@online{louis2023,
|
92 |
+
author = 'Antoine Louis',
|
93 |
+
title = 'colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO',
|
94 |
+
publisher = 'Hugging Face',
|
95 |
+
month = 'dec',
|
96 |
+
year = '2023',
|
97 |
+
url = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
|
98 |
+
}
|
99 |
+
```
|