Spaces: derek-thomas/RAGDemo

derek-thomas (HF staff) committed on Oct 21, 2023

Commit 5b7578c • 1 Parent(s): 75b3ab4

Added vector db

Files changed (2)

notebooks/04_vector_db.ipynb +241 -0
requirements.txt +1 -0

notebooks/04_vector_db.ipynb ADDED

	@@ -0,0 +1,241 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6a151ade-7d86-4a2e-bfe7-462089f4e04c",
+   "metadata": {},
+   "source": [
+    "# Approach\n",
+    "Several aspects of choosing a vector DB may be unique to your situation. You should think through your hardware, utilization, latency requirements, scale, etc. before choosing.\n",
+    "\n",
+    "I'm targeting a demo (low utilization, relaxed latency) that will live on a Hugging Face Space. My data is small enough that it could even fit in memory. I like [Qdrant](https://qdrant.tech) for this."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
+   "metadata": {},
+   "source": [
+    "# Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "88408486-566a-4791-8ef2-5ee3e6941156",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.interactiveshell import InteractiveShell\n",
+    "InteractiveShell.ast_node_interactivity = 'all'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import pickle\n",
+    "\n",
+    "from tqdm.notebook import tqdm\n",
+    "from haystack.schema import Document\n",
+    "from qdrant_haystack import QdrantDocumentStore"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/home/ec2-user/RAGDemo\n"
+     ]
+    }
+   ],
+   "source": [
+    "proj_dir = Path.cwd().parent\n",
+    "print(proj_dir)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
+   "metadata": {},
+   "source": [
+    "# Config"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "file_in = proj_dir / 'data/processed/simple_wiki_embeddings.pkl'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2dd0df0-4274-45b3-9ee5-0205494e4d75",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Setup\n",
+    "Read in our list of dictionaries. This is the upper end for the machine I'm using: it takes ~10GB of RAM. We could easily do this in batches of ~100k and be fine on most machines."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "3c08e039-3686-4eca-9f87-7c469e3f19bc",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 11.6 s, sys: 2.25 s, total: 13.9 s\n",
+      "Wall time: 18.1 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "with open(file_in, 'rb') as handle:\n",
+    "    documents = pickle.load(handle)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98aec715-8d97-439e-99c0-0eff63df386b",
+   "metadata": {},
+   "source": [
+    "Convert the dictionaries to `Documents`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "4821e3c1-697d-4b69-bae3-300168755df9",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "documents = [Document.from_dict(d) for d in documents]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
+   "metadata": {},
+   "source": [
+    "Instantiate our `DocumentStore`. Note that I'm saving this to disk for portability, which is useful since I want to move from this EC2 instance into a Hugging Face Space.\n",
+    "\n",
+    "Note that if you are doing this at scale, you should use a proper Qdrant instance rather than saving to a file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "e51b6e19-3be8-4cb0-8b65-9d6f6121f660",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "document_store = QdrantDocumentStore(\n",
+    "    path=str(proj_dir/'Qdrant'),\n",
+    "    index=\"RAGDemo\",\n",
+    "    embedding_dim=768,\n",
+    "    recreate_index=True,\n",
+    "    hnsw_config={\"m\": 16, \"ef_construct\": 64}  # Optional\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "55fbcd5d-922c-4e93-a37a-974ba84464ac",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "270000it [28:43, 156.68it/s]                                                                                                          "
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 13min 23s, sys: 48.6 s, total: 14min 12s\n",
+      "Wall time: 28min 43s\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "document_store.write_documents(documents, batch_size=5_000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9a073815-0191-48f7-890f-a4e4ecc0f9f1",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
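
The Setup cell in the notebook mentions that the work could be done in batches of ~100k instead of all at once. A minimal sketch of that idea follows. It is not part of this commit: `iter_batches` is a hypothetical helper, and the tiny `records` list is stand-in data so the example runs without the real pickle. Note that a single pickled list still has to be loaded in one piece; what batching can bound is the number of converted `Document` objects held at any one time.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Copy the slice so each batch is an independent list.
        yield list(items[start:start + batch_size])


# Stand-in data (assumption): the real notebook loads ~270k dictionaries.
records = [{"content": f"doc {i}"} for i in range(10)]

# Collect the size of each batch to show how the input is split.
batch_sizes = [len(batch) for batch in iter_batches(records, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

In the notebook, each batch could then be converted with `Document.from_dict` and passed to `document_store.write_documents(...)` before moving on to the next one, so only one batch of `Document` objects is alive at a time.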

requirements.txt CHANGED

@@ -1,5 +1,6 @@
 wikiextractor==3.0.6
 farm-haystack[inference]==1.20.1
+qdrant-haystack==1.0.11
 ipywidgets==8.1.1
 tqdm==4.66.1
 aiohttp-3.8.6