antoinelouis committed on
Commit
ee7bcda
1 Parent(s): 7627c1f

Update README.md

Files changed (1)
  1. README.md +55 -29
README.md CHANGED
@@ -7,75 +7,99 @@ datasets:
7
  metrics:
8
  - recall
9
  tags:
10
- - sentence-similarity
11
  - colbert
 
12
  base_model: camembert-base
13
  library_name: RAGatouille
14
  inference: false
15
  ---
16
 
17
- # 🇫🇷 colbertv1-camembert-base-mmarcoFR
18
 
19
- This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
20
 
21
  ## Usage
22
 
23
- Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
24
 
25
- ### Using ColBERT-AI
26
 
27
  First, you will need to install the following libraries:
28
 
29
  ```bash
30
- pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
31
  ```
32
 
33
  Then, you can use the model like this:
34
 
35
  ```python
36
- from colbert import Indexer, Searcher
37
- from colbert.infra import Run, RunConfig
38
 
39
- n_gpu: int = 1 # Set your number of available GPUs
40
- experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
41
  index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
42
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
43
 
44
- # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
45
- with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
46
-     indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
47
-     indexer.index(name=index_name, collection=documents)
48
 
49
- # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
50
- with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
51
-     searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
52
-     results = searcher.search("Comment effectuer une recherche avec ColBERT ?", k=10)
53
-     # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
54
  ```
55
 
56
- ### Using RAGatouille
57
 
58
  First, you will need to install the following libraries:
59
 
60
  ```bash
61
- pip install -U ragatouille
62
  ```
63
 
64
  Then, you can use the model like this:
65
 
66
  ```python
67
- from ragatouille import RAGPretrainedModel
 
68
 
 
 
69
  index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
70
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
71
 
72
- # Step 1: Indexing.
73
- RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
74
- RAG.index(index_name=index_name, collection=documents)
 
75
 
76
- # Step 2: Searching.
77
- RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
78
- RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
 
 
79
  ```
80
 
81
  ***
@@ -107,12 +131,14 @@ and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://d
107
  with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
108
  to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
109
 
 
 
110
  ## Citation
111
 
112
  ```bibtex
113
  @online{louis2023,
114
  author = 'Antoine Louis',
115
- title = 'colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model for French',
116
  publisher = 'Hugging Face',
117
  month = 'dec',
118
  year = '2023',
 
7
  metrics:
8
  - recall
9
  tags:
 
10
  - colbert
11
+ - passage-retrieval
12
  base_model: camembert-base
13
  library_name: RAGatouille
14
  inference: false
15
+ model-index:
16
+ - name: colbertv1-camembert-base-mmarcoFR
17
+ results:
18
+ - task:
19
+ type: sentence-similarity
20
+ name: Passage Retrieval
21
+ dataset:
22
+ type: unicamp-dl/mmarco
23
+ name: mMARCO-fr
24
+ config: french
25
+ split: validation
26
+ metrics:
27
+ - type: recall_at_500
28
+ name: Recall@500
29
+ value: 88.40
30
+ - type: recall_at_100
31
+ name: Recall@100
32
+ value: 80.00
33
+ - type: recall_at_10
34
+ name: Recall@10
35
+ value: 54.21
36
+ - type: mrr_at_10
37
+ name: MRR@10
38
+ value: 29.51
39
  ---
40
 
41
+ # colbertv1-camembert-base-mmarcoFR
42
 
43
+ This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
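The MaxSim operator mentioned in the description can be illustrated with a few lines of NumPy. This is only a minimal sketch, not the model's actual scoring code: the function names `maxsim_score` and `l2_normalize` are hypothetical, and random vectors stand in for real token embeddings. The shapes (128-dimensional embeddings, 32 query tokens, up to 256 passage tokens) follow the values stated in the model card.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one passage.

    query_emb:   (num_query_tokens, dim) unit-norm token embeddings
    passage_emb: (num_passage_tokens, dim) unit-norm token embeddings
    """
    # Similarity between every query token and every passage token.
    sim = query_emb @ passage_emb.T
    # For each query token, keep its best-matching passage token, then sum.
    return float(sim.max(axis=1).sum())


# Random stand-ins for real embeddings (assumption: illustration only).
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(32, 128)))
passages = [l2_normalize(rng.normal(size=(n, 128))) for n in (40, 256, 120)]

scores = [maxsim_score(query, p) for p in passages]
ranking = np.argsort(scores)[::-1]  # passage indices, best match first
```

Because each query token only needs its single best match in the passage, passage embeddings can be computed and indexed offline, which is what makes this kind of scoring practical at retrieval time.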
44
 
45
  ## Usage
46
 
47
+ Here are some examples for using the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).
48
 
49
+ ### Using RAGatouille
50
 
51
  First, you will need to install the following libraries:
52
 
53
  ```bash
54
+ pip install -U ragatouille
55
  ```
56
 
57
  Then, you can use the model like this:
58
 
59
  ```python
60
+ from ragatouille import RAGPretrainedModel
 
61
 
 
 
62
  index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
63
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
64
 
65
+ # Step 1: Indexing.
66
+ RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
67
+ RAG.index(index_name=index_name, collection=documents)
 
68
 
69
+ # Step 2: Searching.
70
+ RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
71
+ RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
 
 
72
  ```
73
 
74
+ ### Using ColBERT-AI
75
 
76
  First, you will need to install the following libraries:
77
 
78
  ```bash
79
+ pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
80
  ```
81
 
82
  Then, you can use the model like this:
83
 
84
  ```python
85
+ from colbert import Indexer, Searcher
86
+ from colbert.infra import Run, RunConfig
87
 
88
+ n_gpu: int = 1 # Set your number of available GPUs
89
+ experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
90
  index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
91
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
92
 
93
+ # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
94
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
95
+     indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
96
+     indexer.index(name=index_name, collection=documents)
97
 
98
+ # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
99
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
100
+     searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
101
+     results = searcher.search("Comment effectuer une recherche avec ColBERT ?", k=10)
102
+     # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
103
  ```
104
 
105
  ***
 
131
  with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
132
  to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
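The training notes name an in-batch sampled softmax cross-entropy loss. As a rough sketch of that idea (not the training code used for this model), the snippet below assumes a square score matrix in which entry `(i, j)` is the relevance score of passage `j` for query `i`, and the positive passage for query `i` sits in column `i`; every other passage in the batch acts as a negative. The function name `in_batch_softmax_loss` and the random scores are hypothetical.

```python
import numpy as np


def in_batch_softmax_loss(scores: np.ndarray) -> float:
    """Mean cross-entropy over queries, with positives on the diagonal.

    scores: (batch, batch) matrix, scores[i, j] = score of passage j for query i.
    """
    # Numerically stable log-softmax over each row (one row per query).
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The target class for query i is passage i.
    return float(-np.mean(np.diag(log_probs)))


# Assumption: random scores for a batch of 64, matching the stated batch size.
rng = np.random.default_rng(0)
batch_scores = rng.normal(size=(64, 64))
loss = in_batch_softmax_loss(batch_scores)
```

With uninformative (all-equal) scores the loss equals the log of the batch size, and it approaches zero as each query scores its own passage far above the others.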
133
 
134
+ ***
135
+
136
  ## Citation
137
 
138
  ```bibtex
139
  @online{louis2023,
140
  author = 'Antoine Louis',
141
+ title = 'colbertv1-camembert-base-mmarcoFR: The 1st ColBERT Model for French',
142
  publisher = 'Hugging Face',
143
  month = 'dec',
144
  year = '2023',