---
library_name: transformers
license: mit
datasets:
- google-research-datasets/natural_questions
base_model:
- google-bert/bert-base-uncased
---

# svdr-nq

Semi-Parametric Retrieval via Binary Token Index. Jiawei Zhou, Li Dong, Furu Wei, Lei Chen, arXiv 2024

The model is BERT-based with 12 layers and an embedding size of 29,523, derived from the BERT vocabulary of 30,522 with 999 unused tokens excluded.

## Quick Start

Download and install the `vsearch` repo:

```
git clone git@github.com:jzhoubu/vsearch.git
cd vsearch
poetry install
poetry shell
```

Below is an example of encoding queries and passages and computing their similarity scores.

```python
import torch
from src.ir import Retriever

query = "Who first proposed the theory of relativity?"
passages = [
    "Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.",
    "Sir Isaac Newton FRS (25 December 1642 – 20 March 1727) was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
    "Nikola Tesla (10 July 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is known for his contributions to the design of the modern alternating current (AC) electricity supply system."
]

ir = Retriever.from_pretrained("vsearch/svdr-nq")
ir = ir.to("cuda")

# Embed the query and passages
q_emb = ir.encoder_q.embed(query)  # Shape: [1, V]
p_emb = ir.encoder_p.embed(passages)  # Shape: [3, V]

scores = q_emb @ p_emb.t()
print(scores)
# Output: tensor([[61.5432, 10.3108, 8.6709]], device='cuda:0')
```

## Building an Embedding-based Index for Search

Below are examples of building an index for large-scale retrieval.

```python
# Build the sparse index for the passages
ir.build_index(passages, index_type="sparse")
print(ir.index)
# Output:
# Index Type : SparseIndex
# Vector Type : torch.sparse_csr
# Vector Shape : torch.Size([3, 29523])
# Vector Device : cuda:0
# Number of Texts : 3

# Save the index to disk
index_file = "/path/to/index.npz"
ir.save_index(index_file)

# Load the index from disk
index_file = "/path/to/index.npz"
data_file = "/path/to/texts.jsonl"
ir.load_index(index_file=index_file, data_file=data_file)

# Search top-k results for queries
queries = [query]
results = ir.retrieve(queries, k=3)
print(results)
# Output:
# SearchResults(
#     ids=tensor([[0, 1, 2]], device='cuda:0'),
#     scores=tensor([[61.5432, 10.3108, 8.6709]], device='cuda:0')
# )

query_id = 0
top1_psg_id = results.ids[query_id][0]
top1_psg = ir.index.get_sample(top1_psg_id)
print(top1_psg)
# Output:
# Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.
```

## Building a Bag-of-token Index for Search

Our framework supports using tokenization as an index (i.e., a bag-of-token index), which operates on CPU and reduces indexing time and storage requirements by over 90% compared to an embedding-based index.
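For intuition, the core idea of a bag-of-token index can be sketched in a few lines of plain Python: each passage is stored only as the set of its token IDs, and candidates are found by counting token overlap with the query. The whitespace `tokenize` function and the toy passages below are illustrative stand-ins, not the library's actual BERT tokenizer or sparse-matrix storage:

```python
# Toy sketch of a bag-of-token index (illustrative only; vsearch uses the
# BERT tokenizer and a binary sparse matrix, not a whitespace split).

def tokenize(text):
    # Stand-in tokenizer: lowercase whitespace split
    return text.lower().split()

toy_passages = [
    "einstein developed the theory of relativity",
    "newton formulated the laws of motion",
]

# Indexing is just tokenization: store each passage as a set of tokens.
# No neural encoder runs here, which is why indexing is fast and CPU-only.
toy_index = [set(tokenize(p)) for p in toy_passages]

def toy_retrieve(query, k=2):
    q_tokens = set(tokenize(query))
    # Score each candidate by its token overlap with the query
    scores = [(len(q_tokens & p_tokens), pid) for pid, p_tokens in enumerate(toy_index)]
    scores.sort(reverse=True)
    return [pid for _, pid in scores[:k]]

print(toy_retrieve("who proposed the theory of relativity"))  # [0, 1]
```

In the real index, these token-ID sets are stored as rows of a binary sparse matrix over the 29,523-dimensional vocabulary, and the top candidates can then be embedded and reranked on-the-fly (`rerank=True`) to recover embedding-quality scores.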
```python
# Build the bag-of-token index for the passages
ir.build_index(passages, index_type="bag_of_token")
print(ir.index)
# Output:
# Index Type : BoTIndex
# Vector Type : torch.sparse_csr
# Vector Shape : torch.Size([3, 29523])
# Vector Device : cuda:0
# Number of Texts : 3

# Search top-k results from the bag-of-token index, then embed and rerank them on-the-fly
queries = [query]
results = ir.retrieve(queries, k=3, rerank=True)
print(results)
# Output:
# SearchResults(
#     ids=tensor([0, 1, 2], device='cuda:0'),
#     scores=tensor([61.5432, 10.3108, 8.6709], device='cuda:0')
# )
```

## Training Details

Please refer to our paper at [https://arxiv.org/pdf/2405.01924](https://arxiv.org/pdf/2405.01924).

## Citation

If you find our paper or models helpful, please consider citing it as follows:

```
@article{zhou2024semi,
  title={Semi-Parametric Retrieval via Binary Token Index},
  author={Zhou, Jiawei and Dong, Li and Wei, Furu and Chen, Lei},
  journal={arXiv preprint arXiv:2405.01924},
  year={2024}
}
```