victormiller committed c535408 (parent: db08107): Update common.py

common.py CHANGED
@@ -299,7 +299,8 @@ global_div = Div(
 style="margin-bottom: 5px",
 ),
 Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
-),
+),
+id="section1",
 ),
 Section(
 H2("Motivation Behind Global Deduplication"),
@@ -331,16 +332,18 @@ global_div = Div(
 P(
 "Additionally, we maintained statistics about each matching cluster as it was formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the on-disk size of each stage's results to give an idea of the scale:"
 ),
+id="section2",
 ),
 Section(
-H3("
+H3("MinHash Generation"),
 P(
 "We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each signature is represented as a MinHash object for each document. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally unique document IDs and signatures are then saved to disk. The document ID is derived via an encoding scheme that converts file names and line numbers (there is one document per line) to unique document IDs, which also saved a significant amount of disk space and memory in this stage."
 ),
 P(B("This step produced 20 TB of hashes.")),
+id="section3",
 ),
 Section(
-H3("
+H3("Matching Pairs Generation"),
 P(
 "We use a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."
 ),
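The MinHash step above can be sketched in a few lines. The pipeline uses the datasketch library; the version below is a dependency-free stand-in so the idea is visible end to end. The cleaning rules follow the text, while the helper names, the word-level reading of "13-grams", and the blake2b-based hash family are illustrative assumptions, not the production code.

```python
import hashlib
import re
import string

NUM_PERM = 128  # number of permutations, as stated in the text
NGRAM = 13      # 13-grams used as features

def clean_text(text: str) -> str:
    # Strip whitespace, lowercase, drop punctuation, and collapse runs of
    # spaces/newlines/tabs -- the cleaning described in the text.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

def shingles(text: str) -> set:
    words = clean_text(text).split()
    if len(words) <= NGRAM:
        return {" ".join(words)}
    return {" ".join(words[i : i + NGRAM]) for i in range(len(words) - NGRAM + 1)}

def minhash(text: str) -> list:
    # One 8-byte keyed hash per "permutation"; the signature keeps the
    # minimum hash of any shingle under each key.
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig
```

Identically cleaned texts yield identical signatures, and the fraction of matching positions between two signatures estimates their Jaccard similarity.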
@@ -349,9 +352,10 @@ global_div = Div(
 ),
 D_code(dask_algo, block="block", language="python"),
 P(B("This step produced 9.2 TB of matching pairs from all bands.")),
+id="section4",
 ),
 Section(
-H3("
+H3("Finding Duplicate Pairs"),
 P(
 "Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient at distributed distinct execution. Additionally, since we process each band separately, this approach would not remove duplicates across different bands."
 ),
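An in-memory sketch of the banding and pair-deduplication steps above. On the real corpus each band lives in its own files and is processed by Dask; here a dict plays the role of the per-band storage and a Python set plays the role of the distinct() step, removing repeated pairs both within and across bands. The function and parameter names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

BANDS = 9   # number of bands
RANGE = 13  # hashes per band

def candidate_pairs(signatures: dict) -> set:
    """signatures maps doc_id -> MinHash signature (a list of ints at
    least BANDS * RANGE long). Returns deduplicated match pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            # Documents agreeing on every hash in a band share a bucket.
            key = (b, tuple(sig[b * RANGE : (b + 1) * RANGE]))
            buckets[key].append(doc_id)
    pairs = set()  # set membership removes duplicates across bands
    for bucket in buckets.values():
        for a, b in combinations(sorted(bucket), 2):
            pairs.add((a, b))  # sorted, so (A, B) and (B, A) coincide
    return pairs
```

Two documents that agree on all 13 hashes of any one of the 9 bands become a candidate pair exactly once, no matter how many bands they collide in.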
@@ -366,9 +370,10 @@ global_div = Div(
 "The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."
 ),
 P(B("This step produced 1.9 TB of unique pairs.")),
+id="section5",
 ),
 Section(
-H3("
+H3("Finding Connected Components using MapReduce"),
 Img(src="images/findcc.svg", style="max-width: 100%;"),
 P(
 "The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."
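A single-machine sketch of the connected-component step above. The production version runs its map/reduce rounds under Dask across machines; this stand-in repeats the same kind of contraction -- every document adopts the smallest label in its neighborhood -- until an iteration changes nothing, mirroring the convergence test described in the text. The function name is an assumption.

```python
from collections import defaultdict

def smallest_in_component(pairs):
    """Map every document to the smallest document in its cluster."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    label = {n: n for n in adj}
    changed = True
    while changed:  # converged once no label moves
        changed = False
        for n in adj:
            best = min([label[n]] + [label[m] for m in adj[n]])
            if best < label[n]:
                label[n] = best
                changed = True
    return label

# (A, B), (B, C), (D, E) collapses into clusters {A, B, C} and {D, E}
clusters = defaultdict(set)
for doc, root in smallest_in_component([("A", "B"), ("B", "C"), ("D", "E")]).items():
    clusters[root].add(doc)
```

Grouping documents by their final label reproduces the example from the text: the pairs (A, B), (B, C), (D, E) merge into components (A, B, C) and (D, E), each keyed by its smallest member.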
@@ -380,6 +385,7 @@ global_div = Div(
 "Below is the distribution of duplicate documents found across different dumps of CommonCrawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
 ),
 plotly2fasthtml(dup_docs_count_graph()),
+id="section6",
 ),
 Section(
 H3("Analysis of Near-Duplicate Clusters"),
@@ -424,6 +430,7 @@ global_div = Div(
 style="list-style-type: none",
 ),
 ),
+id="section7",
 ),
 Section(
 H2("Normalization Form C"),
@@ -443,13 +450,13 @@ global_div = Div(
 style="list-style-type: none",
 )
 ), # "background-color= gray" "color= blue" maybe add this later
+id="section8",
 ),
 Section(
 H3("NFC Examples"),
 table_div_nfc_examples,
 ),
 Section(H3("Conclusion"), P("NEED TO UPDATE")),
-id="section1"
 )
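The Normalization Form C material above is still marked as draft ("NEED TO UPDATE"); for reference, this is the kind of behavior the NFC examples table illustrates. Python's standard unicodedata module performs the normalization; the variable names here are illustrative.

```python
import unicodedata

# "é" can be one precomposed code point, or "e" plus a combining accent.
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# The two render identically but compare unequal as raw strings...
assert composed != decomposed
# ...until both are normalized to Form C, which composes where possible.
assert unicodedata.normalize("NFC", decomposed) == composed
```

Normalizing all documents to NFC before hashing ensures that visually identical strings do not escape deduplication merely because of their underlying code-point encoding.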