{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6a151ade-7d86-4a2e-bfe7-462089f4e04c",
   "metadata": {},
   "source": [
    "# Approach\n",
    "Several aspects of choosing a vector database may be unique to your situation. Think through your hardware, utilization, latency requirements, scale, etc. before choosing.\n",
    "\n",
    "I'm targeting a demo (low utilization, relaxed latency requirements) that will live on a Hugging Face Space. My data is small enough that it could even fit in memory. I like [Qdrant](https://qdrant.tech) for this."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "88408486-566a-4791-8ef2-5ee3e6941156",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Display the value of every top-level expression in a cell, not just the last one\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = 'all'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import pickle\n",
    "\n",
    "from tqdm.notebook import tqdm\n",
    "from haystack.schema import Document\n",
    "from qdrant_haystack import QdrantDocumentStore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/RAGDemo\n"
     ]
    }
   ],
   "source": [
    "# Project root: the parent of the directory this notebook runs from\n",
    "proj_dir = Path.cwd().parent\n",
    "print(proj_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
   "metadata": {},
   "source": [
    "# Config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "file_in = proj_dir / 'data/processed/simple_wiki_embeddings.pkl'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2dd0df0-4274-45b3-9ee5-0205494e4d75",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Setup\n",
    "Read in our list of dictionaries. This is the upper end for the machine I'm using; it takes ~10GB of RAM. We could easily do this in batches of ~100k and be fine on most machines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3c08e039-3686-4eca-9f87-7c469e3f19bc",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 11.6 s, sys: 2.25 s, total: 13.9 s\n",
      "Wall time: 18.1 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "with open(file_in, 'rb') as handle:\n",
    "    documents = pickle.load(handle)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98aec715-8d97-439e-99c0-0eff63df386b",
   "metadata": {},
   "source": [
    "Convert the dictionaries to `Document` objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4821e3c1-697d-4b69-bae3-300168755df9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "documents = [Document.from_dict(d) for d in documents]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
   "metadata": {},
   "source": [
    "Instantiate our `DocumentStore`. Note that I'm saving this to disk for portability, which is useful because I want to move from this EC2 instance to a Hugging Face Space.\n",
    "\n",
    "Note that if you are doing this at scale, you should use a proper Qdrant instance rather than saving to a file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e51b6e19-3be8-4cb0-8b65-9d6f6121f660",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "document_store = QdrantDocumentStore(\n",
    "    path=str(proj_dir/'Qdrant'),\n",
    "    index=\"RAGDemo\",\n",
    "    embedding_dim=768,\n",
    "    recreate_index=True,\n",
    "    hnsw_config={\"m\": 16, \"ef_construct\": 64}  # Optional\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "55fbcd5d-922c-4e93-a37a-974ba84464ac",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "270000it [28:43, 156.68it/s]                                                                                                          "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 13min 23s, sys: 48.6 s, total: 14min 12s\n",
      "Wall time: 28min 43s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "document_store.write_documents(documents, batch_size=5_000)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}