victormiller committed
Commit b1b2d47
Parent(s): 0e12ce8

Update common.py

Files changed (1):
  1. common.py +7 -7
common.py CHANGED
@@ -298,7 +298,7 @@ global_div = Div(
      "Personally Identifiable Information Removal",
      style="margin-bottom: 5px",
  ),
- Li("Normailzation Form C Discussion", style="margin-bottom: 5px"),
+ Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
  ),
  id="section1",
  ),
@@ -322,7 +322,7 @@ global_div = Div(
  "We started deduplication with 61.8 TB of filtered and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."
  ),
  P(
- "For the global near-deduplication, we employed a methodology used by prior works like SlimPajama [3] but scaled it to the entire dataset which includes 99 Common Crawl dumps (also called “crawls”) and the curated data. The near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion."
+ "For the global near-deduplication, we employed a methodology used by prior works like SlimPajama [3] but scaled it to the entire dataset which includes 99 Common Crawl snapshots (also called “crawls”) and the curated data. The near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion."
  ),
  P("We applied the following inclusion criteria for all documents:"),
  Ul(
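The exact-deduplication step described in this hunk (a Bloom filter with a capacity of 1 billion and a false-positive rate of 0.001) can be sketched roughly as below. This is a minimal, self-contained illustration; the class, the sizing formulas applied at this scale, and the add_if_new helper are assumptions for illustration, not the pipeline's actual code.

import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sized from a target capacity and false-positive rate (illustrative)."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round((self.num_bits / capacity) * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted BLAKE2b digests of the document text.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                     salt=seed.to_bytes(2, "big")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_if_new(self, key: str) -> bool:
        """Record the key and return True only if it was not (probably) seen before."""
        positions = list(self._positions(key))
        seen = all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions)
        for p in positions:
            self.bits[p >> 3] |= 1 << (p & 7)
        return not seen

# The pipeline used capacity=1_000_000_000 and error_rate=0.001; a smaller
# capacity keeps this demo's memory footprint tiny.
bloom = BloomFilter(capacity=1_000_000, error_rate=0.001)
docs = ["the cat sat", "a different doc", "the cat sat"]
kept = [d for d in docs if bloom.add_if_new(d)]  # drops the exact duplicate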
@@ -337,7 +337,7 @@ global_div = Div(
  Section(
  H3("MinHash Generation"),
  P(
- "We use the datasketch library to generate MinHash signatures with the number of permutations to 128. Each signature is signature represented as a MinHash object for each document. Before caluclating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally-unique document IDs and signatures are then saved to disk. The documented ID is designed by an encoding scheme which converts file names and line numbers (there is one document per line) to unique document IDs. This also helped a lot in saving disk and memory for this stage."
+ "We use the datasketch library to generate MinHash signatures with the number of permutations to 128. Each signature is signature represented as a MinHash object for each document. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally-unique document IDs and signatures are then saved to disk. The documented ID is designed by an encoding scheme which converts file names and line numbers (there is one document per line) to unique document IDs. This also helped a lot in saving disk and memory for this stage."
  ),
  P(B("This step produced 20 TB of hashes.")),
  id="section3",
@@ -387,7 +387,7 @@ global_div = Div(
  "We needed to partition the duplicate pairs generated in the third stage into three groups to reduce memory pressure on the final stage. We observed that the second stage itself generates partial components which have some overlap. These overlapping clusters cause some documents to appear in the delete set multiple times. However, our deletion code handled this overlap."
  ),
  P(
- "Below is the distribution of duplicate documents found across different dumps of CommonCrawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
+ "Below is the distribution of duplicate documents found across different snapshots of CommonCrawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
  ),
  plotly2fasthtml(dup_docs_count_graph()),
  id="section6",
@@ -408,10 +408,10 @@ global_div = Div(
  Img(src="images/image9.png", style="max-width: 100%;"),
  ),
  Section(
- H2("Personally Identifable Information Removal"),
- H3("Motivation Behind Personally Identifable Information Removal"),
+ H2("Personally Identifiable Information Removal"),
+ H3("Motivation Behind Personally Identifiable Information Removal"),
  P(
- "Personally Identifable Information (PII) refers to any information that can be used to identify an individual, such as names, addresses, phone numbers, email addresses, and social security numbers. PII removal is essential for data privacy and security, as well as for compliance with global regulations. By removing PII from the training data, we can reduce the risk of data breaches and unauthorized access to sensitive information. Additionally, removing PII from training data prevents the models generating that specific PII during inference time."
+ "Personally Identifiable Information (PII) refers to any information that can be used to identify an individual, such as names, addresses, phone numbers, email addresses, and social security numbers. PII removal is essential for data privacy and security, as well as for compliance with global regulations. By removing PII from the training data, we can reduce the risk of data breaches and unauthorized access to sensitive information. Additionally, removing PII from training data prevents the models generating that specific PII during inference time."
  ),
  table_div_pii,
  ),
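The PII categories listed in this hunk (emails, phone numbers, social security numbers, and similar identifiers) can be illustrated with a simple regex-based redaction pass. The diff does not show the pipeline's actual PII method, so the patterns and placeholder labels below are an assumed, simplified example only.

import re

# Illustrative patterns only; not the pipeline's actual PII rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with a bracketed placeholder label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].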
 