bclavie/JaColBERT

@@ -90,63 +90,75 @@ Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, wherea
 ## Installation
-**ColBERT pypy installation is temporarily broken! Please use the instructions on the [official github repo (install from Source)](https://github.com/stanford-futuredata/ColBERT) in the meantime. Sorry for the inconvenience!**
-Using this model is slightly different from using typical dense embedding models. The model relies on `faiss`, for efficient indexing, and `torch`, for NN operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the required dictionary and tokenizers:
-To use JaColBERT, you will need to install the main ColBERT and those dependencies library:
-```
-pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite
 ```
-ColBERT looks slightly more unfriendly than a usual `transformers` model, but a lot of it is just making the config apparent so you can easily modify it! Running with all defaults work very well, so don't be anxious about trying.
-## Indexing
-> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.
 In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
 Think of it like using an embedding model, like e5, to embed all your documents and storing them in a vector database.
 Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:
 ```python
-from colbert import Indexer
-from colbert.infra import Run, RunConfig
-n_gpu: int = 1 # Set your number of available GPUs
-experiment: str = "" # Name of the folder where the logs and created indices will be stored
-index_name: str = "" # The name of your index, i.e. the name of your vector database
-with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-    indexer = Indexer(checkpoint="bclavie/JaColBERT")
-    documents = [ "マクドナルドのフライドポテトの少量のカロリーはいくつですか？マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
-    ...
-    ]
-    indexer.index(name=index_name, collection=documents)
 ```
 And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will have been generated.
 ## Searching
-Once you have created an index, searching through it is just as simple, again with the Run() syntactic sugar to manage GPUs and storage:
 ```python
-from colbert import Searcher
-from colbert.infra import Run, RunConfig
-n_gpu: int = 0
-experiment: str = "" # Name of the folder where the logs and created indices will be stored
-index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
-k: int = 10 # how many results you want to retrieve
-with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
-    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
-    results = searcher.search(query, k=k)
-    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```

 ## Installation
+JaColBERT works using ColBERT+RAGatouille. You can install it and all its necessary dependencies by running:
+```sh
+pip install -U ragatouille
+```
+For further examples on how to use RAGatouille with ColBERT models, you can check out the [`examples` section in the GitHub repository](https://github.com/bclavie/RAGatouille/tree/main/examples).
+Specifically, example 01 shows how to build/query an index, 04 shows how you can use JaColBERT as a re-ranker, and 06 shows how to use JaColBERT for in-memory searching rather than using an index.
+Notably, RAGatouille has metadata support, so check the examples out if it's something you need!
+## Encoding and querying documents without an index
+If you want to use JaColBERT without building an index, you only need to load the model, `encode()` some documents, and then call `search_encoded_docs()`:
+```python
+from ragatouille import RAGPretrainedModel
+RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")
+RAG.encode(['document_1', 'document_2', ...])
+RAG.search_encoded_docs(query="your search query")
 ```
+Subsequent calls to `encode()` will add to the existing in-memory collection. If you want to empty it, simply run `RAG.clear_encoded_docs()`.
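To make the scoring behind index-free search more concrete, here is a minimal pure-Python sketch of ColBERT-style late interaction (MaxSim): each query token is matched against its most similar document token, and the per-token maxima are summed. This is an illustration only, not RAGatouille's implementation. The function names and the toy 2-dimensional embeddings are hypothetical, and real token embeddings come from the model.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(
    query_tokens: List[Vector], encoded_docs: List[List[Vector]]
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs sorted by descending MaxSim score."""
    scored = [(i, maxsim_score(query_tokens, doc)) for i, doc in enumerate(encoded_docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy token embeddings (made-up values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = [
    [[1.0, 0.0], [0.0, 1.0]],  # has a good match for both query tokens
    [[1.0, 0.0], [1.0, 0.0]],  # only matches the first query token
]
ranking = rank_documents(query, docs)
print(ranking)
```

Because every query token is compared with every document token, this brute-force approach suits small in-memory collections; an index is the better fit for larger ones.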
+## Indexing
 In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
 Think of it like using an embedding model, like e5, to embed all your documents and storing them in a vector database.
 Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:
 ```python
+from ragatouille import RAGPretrainedModel
+RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")
+documents = [ "マクドナルドのフライドポテトの少量のカロリーはいくつですか？マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",]
+RAG.index(index_name="My_first_index", collection=documents)
 ```
+The index files are stored, by default, at `.ragatouille/colbert/indexes/{index_name}`.
 And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will have been generated.
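For a rough sense of what 2-bit compression means for disk usage, the sketch below does the back-of-envelope arithmetic. The numbers are assumptions, not measurements: it assumes 128-dimensional token embeddings (ColBERT's default dimension), one 4-byte centroid ID per token, and a hypothetical corpus of 100,000 documents averaging 200 tokens each. It also ignores the centroids and lookup structures the index stores alongside the embeddings.

```python
def bytes_per_token(dim: int = 128, bits: int = 2, centroid_id_bytes: int = 4) -> float:
    """Approximate storage for one token: quantized residual plus a centroid ID."""
    return dim * bits / 8 + centroid_id_bytes


def estimate_size_mb(n_docs: int, avg_tokens_per_doc: int, per_token_bytes: float) -> float:
    """Approximate embedding storage in megabytes (1 MB = 1e6 bytes)."""
    return n_docs * avg_tokens_per_doc * per_token_bytes / 1e6


# Hypothetical corpus, for illustration only.
n_docs, avg_tokens = 100_000, 200

compressed_mb = estimate_size_mb(n_docs, avg_tokens, bytes_per_token(bits=2))
# Uncompressed baseline: 128 dimensions stored as 16-bit floats, no centroid ID.
fp16_mb = estimate_size_mb(n_docs, avg_tokens, 128 * 2)

print(f"2-bit compressed: ~{compressed_mb:.0f} MB, fp16: ~{fp16_mb:.0f} MB")
```

Under these assumptions the compressed embeddings take roughly a seventh of the fp16 footprint. Actual index sizes depend on your documents and settings.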
 ## Searching
+Once you have created an index, searching through it is just as simple! If you're in the same session and `RAG` is still loaded, you can directly search the newly created index.
+Otherwise, you'll want to load it from disk:
+```python
+from ragatouille import RAGPretrainedModel
+RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/My_first_index")
+```
+And then query it:
 ```python
+RAG.search(query="What animation studio did Miyazaki found?")
+> [[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
+   'score': 25.90448570251465,
+   'rank': 1,
+   'document_id': 'miyazaki',
+   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
+  {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, Japanese: [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City in the Empire of Japan, Miyazaki expressed interest in manga and animation from an early age, and he joined Toei Animation in 1963. During his early years at Toei Animation he worked as an in-between artist and later collaborated with director Isao Takahata.',
+   'score': 25.572620391845703,
+   'rank': 2,
+   'document_id': 'miyazaki',
+   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
+  [...]
 ```
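Since each hit carries a `score`, a `document_id`, and `document_metadata`, post-processing results takes only a few lines of plain Python. The sketch below assumes a flat list of result dictionaries with the keys shown above; if your call returns one list per query, apply it to each inner list. The helper names `filter_results` and `best_per_document` are hypothetical, the sample passages are truncated, and the third entry is invented for illustration.

```python
from typing import Any, Dict, List

Result = Dict[str, Any]


def filter_results(results: List[Result], min_score: float = 0.0, **metadata_filters: Any) -> List[Result]:
    """Keep hits at or above min_score whose metadata matches every given key/value."""
    kept = []
    for result in results:
        metadata = result.get("document_metadata") or {}
        if result["score"] < min_score:
            continue
        if all(metadata.get(key) == value for key, value in metadata_filters.items()):
            kept.append(result)
    return kept


def best_per_document(results: List[Result]) -> List[Result]:
    """Keep only the highest-scoring passage for each document_id."""
    best: Dict[str, Result] = {}
    for result in results:
        doc_id = result["document_id"]
        if doc_id not in best or result["score"] > best[doc_id]["score"]:
            best[doc_id] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Sample data shaped like the output above (scores rounded, third entry made up).
sample_results = [
    {"content": "In April 1984, Miyazaki opened his own office ...", "score": 25.9, "rank": 1,
     "document_id": "miyazaki", "document_metadata": {"entity": "person", "source": "wikipedia"}},
    {"content": "Hayao Miyazaki is a Japanese animator ...", "score": 25.57, "rank": 2,
     "document_id": "miyazaki", "document_metadata": {"entity": "person", "source": "wikipedia"}},
    {"content": "A hypothetical passage from another document.", "score": 12.3, "rank": 3,
     "document_id": "other_doc", "document_metadata": {"entity": "film", "source": "blog"}},
]

wikipedia_hits = filter_results(sample_results, min_score=20.0, source="wikipedia")
top_passages = best_per_document(sample_results)
print([r["rank"] for r in wikipedia_hits])
print([r["document_id"] for r in top_passages])
```

Deduplicating by `document_id` is useful when long documents are split into several passages at indexing time, as several top hits can then come from the same source document.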