Update README.md
README.md
@@ -90,63 +90,75 @@ Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, wherea
## Installation

JaColBERT works using ColBERT+RAGatouille. You can install it and all its necessary dependencies by running:
```sh
pip install -U ragatouille
```
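
To confirm that the install worked before going further, a quick check along these lines can help. It is a minimal sketch that only uses the standard library; `is_installed` is a hypothetical helper name used for illustration, not something RAGatouille provides.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be located by the current interpreter."""
    return importlib.util.find_spec(package) is not None


if not is_installed("ragatouille"):
    print("ragatouille not found; run `pip install -U ragatouille` first.")
```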

For further examples on how to use RAGatouille with ColBERT models, you can check out the [`examples` section in the GitHub repository](https://github.com/bclavie/RAGatouille/tree/main/examples).

Specifically, example 01 shows how to build/query an index, 04 shows how you can use JaColBERT as a re-ranker, and 06 shows how to use JaColBERT for in-memory searching rather than using an index.

Notably, RAGatouille has metadata support, so check the examples out if it's something you need!

## Encoding and querying documents without an index

If you want to use JaColBERT without building an index, it's very simple: you just need to load the model, `encode()` some documents, and then call `search_encoded_docs()`:

```python
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")

# Encode the documents and keep them in memory (no index is written to disk)
RAG.encode(['document_1', 'document_2', ...])
# Query the in-memory collection
RAG.search_encoded_docs(query="your search query")
```

Subsequent calls to `encode()` will add to the existing in-memory collection. If you want to empty it, simply run `RAG.clear_encoded_docs()`.
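
To make the in-memory search less of a black box, here is a minimal sketch of the late-interaction (MaxSim) scoring that ColBERT-style models rely on: each query token embedding is matched with its most similar document token embedding, and those maxima are summed. The toy 2-dimensional vectors and the `maxsim_score` helper are illustrative assumptions, not RAGatouille's actual implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy "token embeddings"; real models use much larger vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # only matches the first query token

# Rank the documents by descending MaxSim score.
ranked = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Because every query token is compared against every document token, scoring is more expensive than a single-vector dot product, which is part of why a prebuilt index pays off for larger collections.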

## Indexing

In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
Think of it like using an embedding model, like e5, to embed all your documents and storing them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")
# Example Japanese passage (about the calories in a small order of McDonald's fries)
documents = ["マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。"]
RAG.index(index_name="My_first_index", collection=documents)
```

The index files are stored, by default, at `.ragatouille/colbert/indexes/{index_name}`.

And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will have been generated.
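
To give a feel for what a 2-bit representation means, the sketch below maps each float in a vector to one of four buckets (2 bits) and packs four codes into each byte. This is only an illustration, assuming 2 bits per vector dimension: the cutoffs are made up, the helper names are hypothetical, and ColBERT's real compression scheme involves more than this bucketing step.

```python
from typing import List, Sequence

# Three ascending cutoffs split the number line into four buckets (0..3).
# These values are invented for the example.
CUTOFFS = (-0.1, 0.0, 0.1)


def quantize_2bit(values: Sequence[float], cutoffs: Sequence[float] = CUTOFFS) -> List[int]:
    """Map each float to a bucket index 0..3 (the number of cutoffs it exceeds)."""
    return [sum(v > c for c in cutoffs) for v in values]


def pack_2bit(codes: List[int]) -> bytes:
    """Pack four 2-bit codes into each byte, first code in the highest bits."""
    padded = codes + [0] * (-len(codes) % 4)  # pad to a multiple of four
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(out)


vector = [-0.3, -0.05, 0.02, 0.4, 0.0, 0.2, -0.2, 0.1]
codes = quantize_2bit(vector)
packed = pack_2bit(codes)
print(codes, len(packed))
```

Eight values stored as 32-bit floats take 32 bytes, while the packed 2-bit codes take 2 bytes, which is where most of the storage saving comes from.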


## Searching

Once you have created an index, searching through it is just as simple! If you're in the same session and `RAG` is still loaded, you can directly search the newly created index.
Otherwise, you'll want to load it from disk:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/My_first_index")
```
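
Since loading from disk only works if the index directory is actually there, it can be worth checking the path first. The sketch below builds the default location described in the Indexing section from an index name; `default_index_path` and `index_exists` are hypothetical helpers, not part of RAGatouille, and the root directory is an assumption if you store indexes elsewhere.

```python
from pathlib import Path

# Default root taken from this README; adjust it if your indexes live elsewhere.
DEFAULT_INDEX_ROOT = Path(".ragatouille/colbert/indexes")


def default_index_path(index_name: str, root: Path = DEFAULT_INDEX_ROOT) -> Path:
    """Return the directory where an index with this name is stored by default."""
    if not index_name:
        raise ValueError("index_name must be a non-empty string")
    return root / index_name


def index_exists(index_name: str, root: Path = DEFAULT_INDEX_ROOT) -> bool:
    """True if the index directory is present on disk."""
    return default_index_path(index_name, root).is_dir()


path = default_index_path("My_first_index")
print(path.as_posix(), index_exists("My_first_index"))
```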

And then query it:

```python
RAG.search(query="What animation studio did Miyazaki found?")
> [[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
     'score': 25.90448570251465,
     'rank': 1,
     'document_id': 'miyazaki',
     'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
    {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, Japanese: [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City in the Empire of Japan, Miyazaki expressed interest in manga and animation from an early age, and he joined Toei Animation in 1963. During his early years at Toei Animation he worked as an in-between artist and later collaborated with director Isao Takahata.',
     'score': 25.572620391845703,
     'rank': 2,
     'document_id': 'miyazaki',
     'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
    [...]
```
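
The result dictionaries can be post-processed like any other Python data. As a small sketch, assuming a flat list of results with the keys shown in the output above (`content`, `score`, `rank`, `document_id`, `document_metadata`), the hypothetical helper below keeps the best-scoring passage per document and optionally filters on the metadata `source`. The sample data is a hand-abbreviated copy of that output, not a fresh model run.

```python
from typing import Any, Dict, List, Optional

Result = Dict[str, Any]

# Abbreviated copy of the example output; the contents are truncated by hand.
results: List[Result] = [
    {"content": "In April 1984, Miyazaki opened his own office in Suginami Ward, ...",
     "score": 25.90448570251465, "rank": 1, "document_id": "miyazaki",
     "document_metadata": {"entity": "person", "source": "wikipedia"}},
    {"content": "Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, ...",
     "score": 25.572620391845703, "rank": 2, "document_id": "miyazaki",
     "document_metadata": {"entity": "person", "source": "wikipedia"}},
]


def best_per_document(results: List[Result], source: Optional[str] = None) -> List[Result]:
    """Keep the highest-scoring passage per document_id, optionally filtered by source."""
    best: Dict[str, Result] = {}
    for result in results:
        metadata = result.get("document_metadata") or {}
        if source is not None and metadata.get("source") != source:
            continue  # skip passages from other sources
        doc_id = result["document_id"]
        if doc_id not in best or result["score"] > best[doc_id]["score"]:
            best[doc_id] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


top = best_per_document(results, source="wikipedia")
print([(r["document_id"], r["rank"]) for r in top])
```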