{
"cells": [
{
"cell_type": "markdown",
"id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
"metadata": {},
"source": [
"# Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"import pickle\n",
"from tqdm.auto import tqdm\n",
"\n",
"from haystack.nodes.preprocessor import PreProcessor"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/ec2-user/RAGDemo\n"
]
}
],
"source": [
"proj_dir = Path.cwd().parent\n",
"print(proj_dir)"
]
},
{
"cell_type": "markdown",
"id": "76119e74-f601-436d-a253-63c5a19d1c83",
"metadata": {},
"source": [
"# Config"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"file_in = proj_dir / 'data/consolidated/simple_wiki.json'\n",
"file_out = proj_dir / 'data/processed/simple_wiki_processed.pkl'"
]
},
{
"cell_type": "markdown",
"id": "6a643cf2-abce-48a9-b4e0-478bcbee28c3",
"metadata": {},
"source": [
"# Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "a8f9630e-447e-423e-9f6c-e1dbc654f2dd",
"metadata": {},
"source": [
"It's important to choose good pre-processing options.\n",
"\n",
"Cleaning whitespace helps each stage of RAG. Extra whitespace adds noise to the embeddings and wastes space in the prompt.\n",
"\n",
"I chose to split by word because tokenizing here would be tedious and doesn't scale well. The context length for most embedding models ends up being 512 tokens, which is roughly 400 words.\n",
"\n",
"I like to respect sentence boundaries, which is why I left a ~50-word buffer."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "18807aea-24e4-4d74-bf10-55b24f3cb52c",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n",
"[nltk_data] Unzipping tokenizers/punkt.zip.\n"
]
}
],
"source": [
"pp = PreProcessor(\n",
"    clean_whitespace=True,\n",
"    clean_header_footer=False,\n",
"    clean_empty_lines=True,\n",
"    remove_substrings=None,\n",
"    split_by='word',\n",
"    split_length=350,\n",
"    split_overlap=50,\n",
"    split_respect_sentence_boundary=True,\n",
"    tokenizer_model_folder=None,\n",
"    language=\"en\",\n",
"    id_hash_keys=None,\n",
"    progress_bar=True,\n",
"    add_page_number=False,\n",
"    max_chars_check=10_000,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dab1658a-79a7-40f2-9a8c-1798e0d124bf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"with open(file_in, 'r', encoding='utf-8') as f:\n",
"    list_of_articles = json.load(f)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4ca6e576-4b7d-4c1a-916f-41d1b82be647",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Preprocessing: 0%|█ | 1551/332023 [00:02<09:44, 565.82docs/s]We found one or more sentences whose word count is higher than the split length.\n",
"Preprocessing: 83%|█████████████████████████████████████████████████████████████████████████████████████████████████ | 276427/332023 [02:12<00:20, 2652.57docs/s]Document 81972e5bc1997b1ed4fb86d17f061a41 is 21206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
"Document 5e63e848e42966ddc747257fb7cf4092 is 11206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
"Preprocessing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 332023/332023 [02:29<00:00, 2219.16docs/s]\n"
]
}
],
"source": [
"documents = pp.process(list_of_articles)"
]
},
{
"cell_type": "markdown",
"id": "f00dbdb2-906f-4d5a-a3f1-b0d84385d85a",
"metadata": {},
"source": [
"When we break a Wikipedia article up, we lose some of the context. The local context is somewhat preserved by the `split_overlap`. I'm trying to preserve the global context by adding a prefix with the article's title to every split after the first.\n",
"\n",
"You could enhance this with the summary as well. This is mostly to help the retrieval step of RAG. Note that the way I'm doing it alters some of `haystack`'s features like the hash and the lengths, but those aren't too necessary.\n",
"\n",
"A more advanced approach for many business applications would be to summarize the document and add that summary as a prefix for its sub-documents.\n",
"\n",
"One last thing to note is that it would be prudent (in some use cases) to preserve the original document without the summary to give to the reader (retrieve with the summary but prompt without), but since this is a simple use case I won't be doing that."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "076e115d-3e88-49d2-bc5d-f725a94e4964",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ba764e7bf29f4202a74e08576a29f4e4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/268980 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Prefix the content of every split after the first with its article title\n",
"for document in tqdm(documents):\n",
"    if document.meta['_split_id'] != 0:\n",
"        document.content = f'Title: {document.meta[\"title\"]}. ' + document.content"
]
},
{
"cell_type": "markdown",
"id": "72c1849c-1f4d-411f-b74b-6208b1e48217",
"metadata": {},
"source": [
"## Pre-processing Examples"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "02c1c6c8-6283-49a8-9d29-c355f1b08540",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"<Document: {'content': \"April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of the four months to have 30 days.\\nApril always begins on the same day of the week as July, and additionally, January in leap years. April always ends on the same day of the week as December.\\nThe Month.\\nApril comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.\\nApril begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.\\nIn common years, April starts on the same day of the week as October of the previous year, and in leap years, May of the previous year. In common years, April finishes on the same day of the week as July of the previous year, and in leap years, February and October of the previous year. In common years immediately after other common years, April starts on the same day of the week as January of the previous year, and in leap years and years immediately after that, April finishes on the same day of the week as January of the previous year.\\nIn years immediately before common years, April starts on the same day of the week as September and December of the following year, and in years immediately before leap years, June of the following year. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. \", 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 0, '_split_overlap': [{'doc_id': '79a74c1e6444dd0a1acd72840e9dd7c0', 'range': (1529, 1835)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'a1c2acf337dbc3baa6f7f58403dfb95d'}>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents[0]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b34890bf-9dba-459a-9b0d-aa4b5929cbe8",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"<Document: {'content': 'Title: April. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.\\nIt is unclear as to where April got its name. A common theory is that it comes from the Latin word \"aperire\", meaning \"to open\", referring to flowers opening in spring. Another theory is that the name could come from Aphrodite, the Greek goddess of love. It was originally the second month in the old Roman Calendar, before the start of the new year was put to January 1.\\nQuite a few festivals are held in this month. In many Southeast Asian cultures, new year is celebrated in this month (including Songkran). In Western Christianity, Easter can be celebrated on a Sunday between March 22 and April 25. In Orthodox Christianity, it can fall between April 4 and May 8. At the end of the month, Central and Northern European cultures celebrate Walpurgis Night on April 30, marking the transition from winter into summer.\\nApril in poetry.\\nPoets use \"April\" to mean the end of winter. For example: \"April showers bring May flowers.\"', 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 1, '_split_overlap': [{'doc_id': 'a1c2acf337dbc3baa6f7f58403dfb95d', 'range': (0, 306)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '79a74c1e6444dd0a1acd72840e9dd7c0'}>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents[1]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "e6f50c27-a486-47e9-ba60-d567f5e530db",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"<Document: {'content': 'Title: Chief Joseph. He knew he could not trust them anymore. He was tired of being considered a savage. He felt it was not fair for people who were born on the same land to be treated differently. He delivered a lot of speeches on this subject, which are still really good examples of eloquence. But he did not feel listened to, and when he died in his reservation in 1904, the doctor said he \"died from sadness\". He was buried in Colville Native American Burial Ground, in Washington State.', 'content_type': 'text', 'score': None, 'meta': {'id': '19310', 'revid': '16695', 'url': 'https://simple.wikipedia.org/wiki?curid=19310', 'title': 'Chief Joseph', '_split_id': 1, '_split_overlap': [{'doc_id': '4bdf9cecd46c3bfac6b225aed940e798', 'range': (0, 275)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '91bc8240c5d067ab24f35c11f8916fc6'}>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents[10102]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5485cc27-3d3f-4b96-8884-accf5324da2d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of Articles: 332023\n",
"Number of processed articles: 237724\n",
"Number of processed documents: 268980\n"
]
}
],
"source": [
"print(f'Number of Articles: {len(list_of_articles)}')\n",
"processed_articles = len([d for d in documents if d.meta['_split_id'] == 0])\n",
"print(f'Number of processed articles: {processed_articles}')\n",
"print(f'Number of processed documents: {len(documents)}')"
]
},
{
"cell_type": "markdown",
"id": "23ce57a8-d14e-426d-abc2-0ce5cdbc881a",
"metadata": {},
"source": [
"# Write to file"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0d044870-7a30-4e09-aad2-42f24a52780d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"with open(file_out, 'wb') as handle:\n",
"    pickle.dump(documents, handle, protocol=pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5833dba-1bf6-48aa-be6f-0d70c71e54aa",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}