Commit c9d881b (1 parent: 8bb7e3f), committed by victormiller
Update common.py

common.py CHANGED
@@ -265,7 +265,7 @@ global_div = Div(
265   H3("Matching Pairs Generation"),
266   P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
267   P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
268 -
269   P(B("This step produced 9.2 TB of matching pairs from all bands.")),
270   ),
271   Section(
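Reviewer note on the banding parameters in the paragraph above (9 bands of 13 hashes): a minimal sketch of how those two numbers shape the matching behaviour, assuming the standard MinHash LSH model in which each hash agrees with probability equal to the Jaccard similarity `s`, so a pair collides in at least one band with probability `1 - (1 - s^r)^b`. The function names are illustrative and do not appear in common.py.

```python
def candidate_probability(s: float, bands: int = 9, rows: int = 13) -> float:
    """Probability that two documents with Jaccard similarity s
    share an identical signature in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approximate_threshold(bands: int = 9, rows: int = 13) -> float:
    """Common rule-of-thumb estimate of the S-curve threshold: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # Print the S-curve at a few similarity levels for the 9 x 13 setting.
    for s in (0.6, 0.7, 0.8, 0.9):
        print(f"s={s:.1f} -> P(candidate)={candidate_probability(s):.3f}")
    print(f"approximate threshold: {approximate_threshold():.3f}")
```

Running the script prints the curve, which makes it easy to see how strict or permissive a given band/row split is around the 0.8 target.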
@@ -303,7 +303,7 @@ global_div = Div(
303   H3("Removing PII"),
304   P("We have removed two types of PII from the dataset: email address and IP address. Regular expressions are used to identify and replace these PII with a generic placeholder. Below is an example of how we removed email addresses from the dataset:"),
305   P("We have used the following regular expressions to identify and replace PII:"),
306 - Ul(Li("Email:"), Li(email_code, style="list-style-type: none"), Li("IP Address:"), Li(ip_address_code, style="list-style-type: none")),
307   ),
308   Section(
309   H2("Normalization Form C (NFC)"),
@@ -313,7 +313,7 @@ global_div = Div(
313   Section(
314   H3("NFC Implementation"),
315   P("We have used the ftfy library to normalize the text to NFC. The library provides a simple API for normalizing text to NFC, which can be applied to the entire dataset one row at a time. Below is the code snippet about how we normalized text to NFC:"),
316 - Ul(Li(nfc_code, style="list-style-type: none")), #"background-color= gray" "color= blue" maybe add this later
317   ),
318   Section(
319   H3("NFC Examples"),
265   H3("Matching Pairs Generation"),
266   P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
267   P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
268 + D_code(dask_algo, block="block", language="python"),
269   P(B("This step produced 9.2 TB of matching pairs from all bands.")),
270   ),
271   Section(
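Reviewer note: the contents of `dask_algo` are not visible in this diff, so the following is only a plain-Python stand-in for the group-by step described in the paragraph above, not the Dask expression itself. It assumes each record of a band is a `(doc_id, band_hashes)` tuple; the name `match_band` and the toy three-hash signatures are hypothetical (the text uses 13 hashes per band).

```python
from collections import defaultdict
from itertools import combinations


def match_band(band_records):
    """Group the records of ONE band by their band signature.

    Returns (components, pairs): components are groups of documents that
    share an identical signature; pairs are the document pairs they imply.
    """
    groups = defaultdict(list)
    for doc_id, band_hashes in band_records:
        # The tuple of hashes in this band acts as the group-by key.
        groups[tuple(band_hashes)].append(doc_id)

    # Keep only signatures shared by at least two documents.
    components = [sorted(ids) for ids in groups.values() if len(ids) > 1]
    pairs = [pair for comp in components for pair in combinations(comp, 2)]
    return components, pairs


# Toy band with 3 hashes per signature.
band = [("a", [1, 2, 3]), ("b", [1, 2, 3]), ("c", [9, 9, 9]), ("d", [1, 2, 3])]
components, pairs = match_band(band)
print(components)
print(pairs)
```

Grouping yields the whole component at once, which is the property the paragraph credits with simplifying later stages; a distributed version would apply the same keying through a bag-style group-by.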
303   H3("Removing PII"),
304   P("We have removed two types of PII from the dataset: email address and IP address. Regular expressions are used to identify and replace these PII with a generic placeholder. Below is an example of how we removed email addresses from the dataset:"),
305   P("We have used the following regular expressions to identify and replace PII:"),
306 + Ul(Li("Email:"), Li(D_code(email_code, block="block", language="python"), style="list-style-type: none"), Li("IP Address:"), Li(D_code(ip_address_code, block="block", language="python"), style="list-style-type: none")),
307   ),
308   Section(
309   H2("Normalization Form C (NFC)"),
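Reviewer note: `email_code` and `ip_address_code` are defined elsewhere and are not shown in this diff. As a rough illustration of the regex-and-placeholder approach the "Removing PII" text describes, here is a sketch in which the patterns, the placeholder strings, and the function name `scrub_pii` are all assumptions rather than the values used in the project.

```python
import re

# Illustrative patterns only; the project's actual expressions are not shown.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"  # one IPv4 octet, 0-255
IPV4_RE = re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b")

# Hypothetical placeholder tokens.
EMAIL_PLACEHOLDER = "<EMAIL>"
IP_PLACEHOLDER = "<IP_ADDRESS>"


def scrub_pii(text: str) -> str:
    """Replace email addresses and IPv4 addresses with generic placeholders."""
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    text = IPV4_RE.sub(IP_PLACEHOLDER, text)
    return text


print(scrub_pii("Contact jane.doe@example.com from 192.168.0.1."))
```

This sketch covers IPv4 only; whether the project also handles IPv6 cannot be told from the diff.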
313   Section(
314   H3("NFC Implementation"),
315   P("We have used the ftfy library to normalize the text to NFC. The library provides a simple API for normalizing text to NFC, which can be applied to the entire dataset one row at a time. Below is the code snippet about how we normalized text to NFC:"),
316 + Ul(Li(D_code(nfc_code, block="block", language="python"), style="list-style-type: none")), #"background-color= gray" "color= blue" maybe add this later
317   ),
318   Section(
319   H3("NFC Examples"),
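Reviewer note: `nfc_code` is not shown in this diff. To illustrate the row-at-a-time NFC step described under "NFC Implementation", the sketch below swaps in the standard library's `unicodedata.normalize` instead of ftfy, so it runs without extra dependencies. The row layout (a dict with a `text` key) and the name `normalize_row` are assumptions. As far as I know, ftfy's `fix_text` also applies NFC by default, but it performs additional repairs that this stand-in does not.

```python
import unicodedata


def normalize_row(row: dict, text_key: str = "text") -> dict:
    """Return a copy of the row with its text field normalized to NFC."""
    normalized = dict(row)
    normalized[text_key] = unicodedata.normalize("NFC", row[text_key])
    return normalized


# "e" followed by a combining acute accent (two code points) ...
decomposed = "cafe\u0301"
row = normalize_row({"id": 1, "text": decomposed})
# ... becomes the single precomposed code point U+00E9 under NFC.
print(len(decomposed), len(row["text"]))
```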