{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "import pickle\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "from haystack.nodes.preprocessor import PreProcessor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/RAGDemo\n"
     ]
    }
   ],
   "source": [
    "proj_dir = Path.cwd().parent\n",
    "print(proj_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
   "metadata": {},
   "source": [
    "# Config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "file_in = proj_dir / 'data/consolidated/simple_wiki.json'\n",
    "file_out = proj_dir / 'data/processed/simple_wiki_processed.pkl'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a643cf2-abce-48a9-b4e0-478bcbee28c3",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8f9630e-447e-423e-9f6c-e1dbc654f2dd",
   "metadata": {},
   "source": [
    "Its important to choose good pre-processing options. \n",
    "\n",
    "Clean whitespace helps each stage of RAG. It adds noise to the embeddings, and wastes space when we prompt with it.\n",
    "\n",
    "I chose to split by word as it would be tedious to tokenize here, and that doesnt scale well. The context length for most embedding models ends up being 512 tokens. This is ~400 words. \n",
    "\n",
    "I like to respect the sentence boundary, thats why I gave a ~50 word buffer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "18807aea-24e4-4d74-bf10-55b24f3cb52c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
     ]
    }
   ],
   "source": [
    "pp = PreProcessor(clean_whitespace = True,\n",
    "             clean_header_footer = False,\n",
    "             clean_empty_lines = True,\n",
    "             remove_substrings = None,\n",
    "             split_by='word',\n",
    "             split_length = 350,\n",
    "             split_overlap = 50,\n",
    "             split_respect_sentence_boundary = True,\n",
    "             tokenizer_model_folder = None,\n",
    "             language = \"en\",\n",
    "             id_hash_keys = None,\n",
    "             progress_bar = True,\n",
    "             add_page_number = False,\n",
    "             max_chars_check = 10_000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "dab1658a-79a7-40f2-9a8c-1798e0d124bf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(file_in, 'r', encoding='utf-8') as f:\n",
    "    list_of_articles = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4ca6e576-4b7d-4c1a-916f-41d1b82be647",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Preprocessing:   0%|β–Œ                                                                                                                      | 1551/332023 [00:02<09:44, 565.82docs/s]We found one or more sentences whose word count is higher than the split length.\n",
      "Preprocessing:  83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                   | 276427/332023 [02:12<00:20, 2652.57docs/s]Document 81972e5bc1997b1ed4fb86d17f061a41 is 21206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
      "Document 5e63e848e42966ddc747257fb7cf4092 is 11206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
      "Preprocessing: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 332023/332023 [02:29<00:00, 2219.16docs/s]\n"
     ]
    }
   ],
   "source": [
    "documents = pp.process(list_of_articles)"
   ]
  },
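  {
   "cell_type": "markdown",
   "id": "chunk-length-check-md-0001",
   "metadata": {},
   "source": [
    "A quick, optional check (not part of the original run) of how long the chunks actually came out, given the ~350-word split length plus the ~50-word buffer discussed above. Sentence-boundary splitting and the hard split at 10,000 characters mean a few chunks can overshoot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "chunk-length-check-code-0001",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: inspect the longest chunk's word count.\n",
    "# A handful of chunks may exceed ~400 words (see the warnings above).\n",
    "max(len(d.content.split()) for d in documents)"
   ]
  },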
  {
   "cell_type": "markdown",
   "id": "f00dbdb2-906f-4d5a-a3f1-b0d84385d85a",
   "metadata": {},
   "source": [
    "When we break a wikipedia article up, we lose some of the context. The local context is somewhat preserved by the `split_overlap`. Im trying to preserve the global context by adding a prefix that has the article's title.\n",
    "\n",
    "You could enhance this with the summary as well. This is mostly to help the retrieval step of RAG. Note that the way Im doing it alters some of `haystack`'s features like the hash and the lengths, but those arent too necessary. \n",
    "\n",
    "A more advanced way for many business applications would be to summarize the document and add that as a prefix for sub-documents.\n",
    "\n",
    "One last thing to note, is that it would be prudent (in some use-cases) to preserve the original document without the summary to give to the reader (retrieve with the summary but prompt without), but since this is a simple use-case I wont be doing that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "076e115d-3e88-49d2-bc5d-f725a94e4964",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ba764e7bf29f4202a74e08576a29f4e4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/268980 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Prefix each document's content\n",
    "for document in tqdm(documents):\n",
    "    if document.meta['_split_id'] != 0:\n",
    "        document.content = f'Title: {document.meta[\"title\"]}. ' + document.content"
   ]
  },
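  {
   "cell_type": "markdown",
   "id": "prefix-sketch-md-0001",
   "metadata": {},
   "source": [
    "The cell below is an illustrative sketch (left commented out, not part of the original run) of the \"retrieve with the prefix, prompt without it\" idea mentioned above. The `original_content` meta key is a made-up name for illustration, not a `haystack` convention."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "prefix-sketch-code-0001",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: keep the un-prefixed text so retrieval can use the title prefix\n",
    "# while the reader/LLM prompt uses the original passage.\n",
    "# 'original_content' is a hypothetical meta key chosen for illustration.\n",
    "# Left commented out: this would replace the prefixing loop above, not run after it.\n",
    "\n",
    "# for document in tqdm(documents):\n",
    "#     document.meta['original_content'] = document.content\n",
    "#     if document.meta['_split_id'] != 0:\n",
    "#         document.content = f'Title: {document.meta[\"title\"]}. ' + document.content"
   ]
  },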
  {
   "cell_type": "markdown",
   "id": "72c1849c-1f4d-411f-b74b-6208b1e48217",
   "metadata": {},
   "source": [
    "## Pre-processing Examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "02c1c6c8-6283-49a8-9d29-c355f1b08540",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Document: {'content': \"April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of the four months to have 30 days.\\nApril always begins on the same day of the week as July, and additionally, January in leap years. April always ends on the same day of the week as December.\\nThe Month.\\nApril comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.\\nApril begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.\\nIn common years, April starts on the same day of the week as October of the previous year, and in leap years, May of the previous year. In common years, April finishes on the same day of the week as July of the previous year, and in leap years, February and October of the previous year. In common years immediately after other common years, April starts on the same day of the week as January of the previous year, and in leap years and years immediately after that, April finishes on the same day of the week as January of the previous year.\\nIn years immediately before common years, April starts on the same day of the week as September and December of the following year, and in years immediately before leap years, June of the following year. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. \", 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 0, '_split_overlap': [{'doc_id': '79a74c1e6444dd0a1acd72840e9dd7c0', 'range': (1529, 1835)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'a1c2acf337dbc3baa6f7f58403dfb95d'}>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b34890bf-9dba-459a-9b0d-aa4b5929cbe8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Document: {'content': 'Title: April. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.\\nIt is unclear as to where April got its name. A common theory is that it comes from the Latin word \"aperire\", meaning \"to open\", referring to flowers opening in spring. Another theory is that the name could come from Aphrodite, the Greek goddess of love. It was originally the second month in the old Roman Calendar, before the start of the new year was put to January 1.\\nQuite a few festivals are held in this month. In many Southeast Asian cultures, new year is celebrated in this month (including Songkran). In Western Christianity, Easter can be celebrated on a Sunday between March 22 and April 25. In Orthodox Christianity, it can fall between April 4 and May 8. At the end of the month, Central and Northern European cultures celebrate Walpurgis Night on April 30, marking the transition from winter into summer.\\nApril in poetry.\\nPoets use \"April\" to mean the end of winter. For example: \"April showers bring May flowers.\"', 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 1, '_split_overlap': [{'doc_id': 'a1c2acf337dbc3baa6f7f58403dfb95d', 'range': (0, 306)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '79a74c1e6444dd0a1acd72840e9dd7c0'}>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e6f50c27-a486-47e9-ba60-d567f5e530db",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Document: {'content': 'Title: Chief Joseph. He knew he could not trust them anymore. He was tired of being considered a savage. He felt it was not fair for people who were born on the same land to be treated differently. He delivered a lot of speeches on this subject, which are still really good examples of eloquence. But he did not feel listened to, and when he died in his reservation in 1904, the doctor said he \"died from sadness\". He was buried in Colville Native American Burial Ground, in Washington State.', 'content_type': 'text', 'score': None, 'meta': {'id': '19310', 'revid': '16695', 'url': 'https://simple.wikipedia.org/wiki?curid=19310', 'title': 'Chief Joseph', '_split_id': 1, '_split_overlap': [{'doc_id': '4bdf9cecd46c3bfac6b225aed940e798', 'range': (0, 275)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '91bc8240c5d067ab24f35c11f8916fc6'}>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[10102]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "5485cc27-3d3f-4b96-8884-accf5324da2d",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Articles: 332023\n",
      "Number of processed articles: 237724\n",
      "Number of processed documents: 268980\n"
     ]
    }
   ],
   "source": [
    "print(f'Number of Articles: {len(list_of_articles)}')\n",
    "processed_articles = len([d for d in documents if d.meta['_split_id'] == 0])\n",
    "print(f'Number of processed articles: {processed_articles}')\n",
    "print(f'Number of processed documents: {len(documents)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23ce57a8-d14e-426d-abc2-0ce5cdbc881a",
   "metadata": {},
   "source": [
    "# Write to file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0d044870-7a30-4e09-aad2-42f24a52780d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(file_out, 'wb') as handle:\n",
    "    pickle.dump(documents, handle, protocol=pickle.HIGHEST_PROTOCOL)"
   ]
  },
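  {
   "cell_type": "markdown",
   "id": "reload-check-md-0001",
   "metadata": {},
   "source": [
    "An optional sanity check (not executed in the original run): reload the pickle and confirm the document count matches what we wrote."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "reload-check-code-0001",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload the pickle and make sure nothing was lost in serialization.\n",
    "with open(file_out, 'rb') as handle:\n",
    "    reloaded_documents = pickle.load(handle)\n",
    "assert len(reloaded_documents) == len(documents)"
   ]
  },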
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5833dba-1bf6-48aa-be6f-0d70c71e54aa",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}