---
pipeline_tag: feature-extraction
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
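
To make the late-interaction scoring concrete, here is a minimal sketch of the MaxSim operator over two such embedding matrices (illustrative only; the library's actual implementation is batched and runs over a compressed index):

```python
import torch

def maxsim_score(query_embeddings: torch.Tensor, passage_embeddings: torch.Tensor) -> torch.Tensor:
    """Sum over query tokens of the maximum similarity to any passage token.

    query_embeddings:   (num_query_tokens, dim), L2-normalised
    passage_embeddings: (num_passage_tokens, dim), L2-normalised
    """
    similarities = query_embeddings @ passage_embeddings.T   # (num_query_tokens, num_passage_tokens)
    return similarities.max(dim=1).values.sum()
```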

## Usage
***

Using ColBERT on a dataset typically involves the following steps:

**Step 1: Preprocess your collection.** At its simplest, ColBERT works with tab-separated (TSV) files: one file (e.g., `collection.tsv`) contains all passages and another (e.g., `queries.tsv`) contains the set of queries used to search the collection.
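
For illustration, here is a minimal sketch of how such TSV files could be produced with plain Python; the toy passages, queries, and file names are placeholders:

```python
import csv

# Toy French passages and queries (placeholders); real data would be your full corpus.
passages = [
    "Le Louvre est le musée le plus visité au monde.",
    "La tour Eiffel se situe sur le Champ-de-Mars, à Paris.",
]
queries = ["Où se trouve la tour Eiffel ?"]

# ColBERT expects one "id <TAB> text" record per line, without a header row.
with open("collection.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for pid, passage in enumerate(passages):
        writer.writerow([pid, passage])

with open("queries.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for qid, query in enumerate(queries):
        writer.writerow([qid, query])
```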

**Step 2: Index your collection.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            nbits=2,                       # bits per dimension used for residual compression
            root="/path/to/experiments",   # where indexes and logs are stored
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
```

**Step 3: Search the collection with your queries.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```python
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)   # retrieve the top-100 passages per query
        ranking.save("msmarco.nbits=2.ranking.tsv")     # save the ranked results to disk
```
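
For interactive use, a single query can also be searched directly. The following is a hedged sketch based on the ColBERT library's `Searcher.search` API (return values and the `searcher.collection` accessor may differ across versions); it reuses the `searcher` object created above:

```python
# Hypothetical single-query lookup on the index built above (API may vary across ColBERT versions).
query = "Quelle est la capitale de la France ?"
pids, ranks, scores = searcher.search(query, k=5)
for pid, rank, score in zip(pids, ranks, scores):
    print(f"rank={rank}  pid={pid}  score={score:.2f}  passage={searcher.collection[pid]}")
```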

## Evaluation
***

We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries over a corpus of 8.8M candidate passages.

[...]
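
As an illustration of how such recall figures can be obtained from the ranking file produced in Step 3, here is a hedged sketch; the qrels path, its MS MARCO-style `qid 0 pid relevance` format, and the exact column layout of the ranking TSV (`qid`, `pid`, `rank`, optionally `score`) are assumptions to verify against your files:

```python
from collections import defaultdict

def recall_at_k(ranking_path: str, qrels_path: str, k: int = 100) -> float:
    """Average, over queries, of the fraction of relevant passages retrieved in the top-k."""
    # qrels: qid \t 0 \t pid \t relevance  (assumed MS MARCO-style format)
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for line in f:
            qid, _, pid, rel = line.split()
            if int(rel) > 0:
                relevant[qid].add(pid)

    # ranking: qid \t pid \t rank [\t score]  (assumed layout of the saved ranking TSV)
    retrieved = defaultdict(set)
    with open(ranking_path) as f:
        for line in f:
            qid, pid, rank, *_ = line.split()
            if int(rank) <= k:
                retrieved[qid].add(pid)

    recalls = [len(retrieved[qid] & rels) / len(rels) for qid, rels in relevant.items()]
    return sum(recalls) / len(recalls)

print(recall_at_k("msmarco.nbits=2.ranking.tsv", "/path/to/MSMARCO/qrels.dev.small.tsv", k=100))
```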

## Training
***

#### Background

We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence-triples dataset in French via a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with each query.
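
A minimal sketch of that pairwise softmax cross-entropy objective, for clarity (illustrative only; tensor names are placeholders and the actual training loop lives in the ColBERT codebase):

```python
import torch
import torch.nn.functional as F

def pairwise_softmax_ce(scores_pos: torch.Tensor, scores_neg: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the (positive, negative) score pair of each query.

    scores_pos / scores_neg: late-interaction (MaxSim) scores of the positive and
    negative passage for each query in the batch, shape (batch_size,).
    """
    scores = torch.stack([scores_pos, scores_neg], dim=1)                         # (batch_size, 2)
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)  # positive = class 0
    return F.cross_entropy(scores, labels)
```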

#### Hyperparameters

We trained the model on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
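
For reference, here is a hedged sketch of how these settings might map onto ColBERT's training configuration; field names follow the ColBERT codebase, but this is not the exact script used to train this model, and all paths are placeholders:

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="mmarco-fr")):

        config = ColBERTConfig(
            bsize=64,             # batch size
            lr=3e-6,              # constant AdamW learning rate
            maxsteps=200_000,     # training steps
            doc_maxlen=256,       # passage length limit (tokens)
            query_maxlen=32,      # query length limit (tokens)
            root="/path/to/experiments",
        )
        trainer = Trainer(
            triples="/path/to/triples.tsv",        # query / positive / negative triples
            queries="/path/to/queries.tsv",
            collection="/path/to/collection.tsv",
            config=config,
        )
        trainer.train(checkpoint="camembert-base")  # start from the CamemBERT base checkpoint
```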

#### Data

We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multilingual, machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
- a corpus of 8.8M passages;
- a training set of ~533k queries (with at least one relevant passage);
- a development set of ~101k queries;
- a smaller dev set of 6,980 queries (which is the one actually used for evaluation in most published works).

Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)

## Citation

```bibtex
@online{louis2023,
  author    = {Antoine Louis},
  title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
  publisher = {Hugging Face},
  month     = dec,
  year      = {2023},
  url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```