bclavie committed on
Commit
ba13dad
1 Parent(s): 1d6786b

Update README.md

Files changed (1)
  1. README.md +47 -35
README.md CHANGED
@@ -90,63 +90,75 @@ Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, wherea

  ## Installation

- **ColBERT pypy installation is temporarily broken! Please use the instructions on the [official github repo (install from Source)](https://github.com/stanford-futuredata/ColBERT) in the meantime. Sorry for the inconvenience!**

- Using this model is slightly different from using typical dense embedding models. The model relies on `faiss`, for efficient indexing, and `torch`, for NN operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the required dictionary and tokenizers:

- To use JaColBERT, you will need to install the main ColBERT and those dependencies library:

- ```
- pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite
  ```

- ColBERT looks slightly more unfriendly than a usual `transformers` model, but a lot of it is just making the config apparent so you can easily modify it! Running with all defaults work very well, so don't be anxious about trying.

- ## Indexing

- > ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.

  In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
  Think of it like using an embedding model, such as e5, to embed all your documents and store them in a vector database.
  Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

  ```python
- from colbert import Indexer
- from colbert.infra import Run, RunConfig
-
- n_gpu: int = 1 # Set your number of available GPUs
- experiment: str = "" # Name of the folder where the logs and created indices will be stored
- index_name: str = "" # The name of your index, i.e. the name of your vector database
-
- with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-     indexer = Indexer(checkpoint="bclavie/JaColBERT")
-     documents = [ "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
-         ...
-     ]
-     indexer.index(name=index_name, collection=documents)
  ```

  And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.


  ## Searching

- Once you have created an index, searching through it is just as simple, again with the Run() syntactic sugar to manage GPUs and storage:

  ```python
- from colbert import Searcher
- from colbert.infra import Run, RunConfig
-
- n_gpu: int = 0
- experiment: str = "" # Name of the folder where the logs and created indices will be stored
- index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
- k: int = 10 # how many results you want to retrieve
-
- with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-     searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
-     query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
-     results = searcher.search(query, k=k)
-     # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
  ```



  ## Installation

+ JaColBERT works through ColBERT and RAGatouille. You can install RAGatouille and all its necessary dependencies by running:
+ ```sh
+ pip install -U ragatouille
+ ```

+ For further examples on how to use RAGatouille with ColBERT models, you can check out the [`examples` section in the GitHub repository](https://github.com/bclavie/RAGatouille/tree/main/examples).

+ Specifically, example 01 shows how to build/query an index, 04 shows how you can use JaColBERT as a re-ranker, and 06 shows how to use JaColBERT for in-memory searching rather than using an index.

+ Notably, RAGatouille has metadata support, so check the examples out if it's something you need!
+
+ ## Encoding and querying documents without an index
+
+ If you want to use JaColBERT without building an index, it's very simple: load the model, `encode()` some documents, and then call `search_encoded_documents()`:
+
+ ```python
+ from ragatouille import RAGPretrainedModel
+ RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")
+
+ RAG.encode(['document_1', 'document_2', ...])
+ RAG.search_encoded_documents(query="your search query")
  ```

+ Subsequent calls to `encode()` will add to the existing in-memory collection. If you want to empty it, simply run `RAG.clear_encoded_docs()`.

+ ## Indexing

  In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
  Think of it like using an embedding model, such as e5, to embed all your documents and store them in a vector database.
  Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

  ```python
+ from ragatouille import RAGPretrainedModel
+
+ RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")
+ documents = [ "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",]
+ RAG.index(index_name="My_first_index", collection=documents)
  ```

+ The index files are stored, by default, at `.ragatouille/colbert/indexes/{index_name}`.
+
  And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.
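
For intuition about the late-interaction approach mentioned in this section, here is a minimal pure-Python sketch of ColBERT-style MaxSim scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The 2-dimensional vectors and the helper names `normalize` and `maxsim_score` are made up for illustration; they are not part of ColBERT or RAGatouille, which compute this on real model embeddings.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take the best similarity over all document tokens, then sum."""
    score = 0.0
    for q in query_tokens:
        score += max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
    return score


# Toy "token embeddings" (hypothetical values, for illustration only).
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
doc_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
doc_b = [normalize(v) for v in [[1.0, 1.0]]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

# Rank documents by descending late-interaction score.
ranked = sorted([("doc_a", score_a), ("doc_b", score_b)], key=lambda p: p[1], reverse=True)
print(ranked)
```

Because every document token keeps its own vector, `doc_a` can match both query tokens exactly, while `doc_b` only matches each of them partially. This per-token matching is why all token representations have to be stored in an index ahead of time.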


  ## Searching

+ Once you have created an index, searching through it is just as simple! If you're in the same session and `RAG` is still loaded, you can directly search the newly created index.
+ Otherwise, you'll want to load it from disk:
+
+ ```python
+ RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/My_first_index")
+ ```
+
+ And then query it:

  ```python
+ RAG.search(query="What animation studio did Miyazaki found?")
+ > [[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
+ 'score': 25.90448570251465,
+ 'rank': 1,
+ 'document_id': 'miyazaki',
+ 'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
+ {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, Japanese: [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City in the Empire of Japan, Miyazaki expressed interest in manga and animation from an early age, and he joined Toei Animation in 1963. During his early years at Toei Animation he worked as an in-between artist and later collaborated with director Isao Takahata.',
+ 'score': 25.572620391845703,
+ 'rank': 2,
+ 'document_id': 'miyazaki',
+ 'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
+ [...]
  ```

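
The search output above is a list of dictionaries with `content`, `score`, `rank`, `document_id` and `document_metadata` keys. As a rough sketch of how such results might be post-processed, the hypothetical helper below (`top_passages` is not part of RAGatouille) keeps only hits above a score threshold and returns short previews. The sample data is a hand-written, shortened stand-in for real `RAG.search()` output, and the nested-list handling is an assumption based on how the output is printed above.

```python
def top_passages(results, min_score=0.0, max_results=3):
    """Return (rank, document_id, score, preview) tuples for hits scoring at least min_score."""
    # Assumption: accept either a flat list of hits or a list wrapping one list of hits.
    if results and isinstance(results[0], list):
        results = results[0]
    hits = [r for r in results if r["score"] >= min_score]
    hits.sort(key=lambda r: r["rank"])
    return [
        (r["rank"], r["document_id"], round(r["score"], 2), r["content"][:40])
        for r in hits[:max_results]
    ]


# Hand-written stand-in for search output, with content shortened from the example above.
sample_results = [[
    {
        "content": "In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.",
        "score": 25.90448570251465,
        "rank": 1,
        "document_id": "miyazaki",
        "document_metadata": {"entity": "person", "source": "wikipedia"},
    },
    {
        "content": "Hayao Miyazaki (born January 5, 1941) is a Japanese animator, filmmaker, and manga artist.",
        "score": 25.572620391845703,
        "rank": 2,
        "document_id": "miyazaki",
        "document_metadata": {"entity": "person", "source": "wikipedia"},
    },
]]

for rank, doc_id, score, preview in top_passages(sample_results, min_score=25.0):
    print(rank, doc_id, score, preview)
```

Scores from late-interaction models are sums over query tokens rather than normalized similarities, so a threshold such as `25.0` only makes sense relative to other scores for the same query.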