Commit 0cc9c9f (parent a622849) by omkarenator: fix images

common.py CHANGED
@@ -298,7 +298,8 @@ global_div = Div(
         "To illustrate the need for deduplication, below is the distribution of near-duplicate clusters, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100, as found in the Common Crawl dataset. Some clusters even reach up to a million documents."
     ),
     plotly2fasthtml(dup_cluster_graph()),
-
+    P("The example below is from one such cluster. Here most of the text is repeated with just specifics changed."),
+    Img(src="images/100k.png", style="max-width: 100%;"),
     P(
         "We started deduplication with 61.8 TB of filtered and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."
     ),
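The exact-deduplication step described in this hunk can be illustrated with a minimal, self-contained sketch. Everything below is hypothetical: the class, helper names, and sizing are not the code in common.py (the actual run used a filter with a capacity of 1 billion and a 0.001 false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch; not the pipeline's implementation."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as an arbitrary-width bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Mark item as seen; return True if it was (probably) unseen before."""
        positions = list(self._positions(item))
        already_seen = all((self.bits >> p) & 1 for p in positions)
        for p in positions:
            self.bits |= 1 << p
        return not already_seen

# Keep only the first occurrence of each exact document, in constant memory.
dedup = BloomFilter(num_bits=10_000_000, num_hashes=7)
docs = ["same text", "same text", "other text"]
print([d for d in docs if dedup.add_if_new(d)])  # -> ['same text', 'other text']
```

Because membership answers can be false positives, a small fraction of unique documents (bounded by the 0.001 error rate) is dropped as if duplicated; that trade-off is what keeps memory constant.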
@@ -351,7 +352,7 @@ global_div = Div(
     ),
     Section(
         H3("Stage 4: Finding Connected Components using MapReduce"),
-        Img(src="images/cc.png",
+        Img(src="images/cc.png", style="max-width: 100%;"),
         P(
             "The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."
         ),
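The MapReduce iteration described in this hunk can be sketched on a single machine. The toy function below is a simplified min-label propagation, not the distributed Dask implementation referenced as [4], but it shares the key properties described in the text: iterate until a pass produces no changes, and end with every document pointing at the smallest document in its cluster.

```python
from collections import defaultdict

def connected_components(pairs):
    """Toy single-machine sketch of iterative connected components."""
    # Every document starts as its own representative.
    label = {}
    for a, b in pairs:
        label.setdefault(a, a)
        label.setdefault(b, b)
    # Iterate until convergence: a full pass that changes no labels.
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            low = min(label[a], label[b])
            for node in (a, b):
                if label[node] != low:
                    label[node] = low
                    changed = True
    # Group documents by the smallest member of their cluster.
    clusters = defaultdict(list)
    for node, representative in sorted(label.items()):
        clusters[representative].append(node)
    return list(clusters.values())

# The example from the paragraph above:
print(connected_components([("A", "B"), ("B", "C"), ("D", "E")]))
# -> [['A', 'B', 'C'], ['D', 'E']]
```

In the distributed version, each pass is a map (emit every edge under both endpoints) followed by a reduce (link each group to its minimum), so no single machine ever needs to hold all pairs at once.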
@@ -368,15 +369,15 @@ global_div = Div(
         P(
             "Smaller components tend to have more overlap in their MinHash bands. The smallest components are almost exact pairs but due to small differences, were not included in the local exact deduplication."
         ),
-        Img(src="images/image3.png",
+        Img(src="images/image3.png", style="max-width: 100%;"),
         P(
             "Changes in text are incremental from buckets of 3 or more documents onwards. The example below shows a personnel list that has grown over the years."
         ),
-        Img(src="images/image7.png",
+        Img(src="images/image7.png", style="max-width: 100%;"),
         P(
             "In sizable clusters comprising 1000 or more documents, we observe a trend towards templatization. This involves the recurrent use of standardized language to convey general topics such as terms and conditions, warnings, and disclaimers. Such language is prevalent on commercial websites, offering a consistent and efficient way to communicate commonly encountered information."
         ),
-        Img(src="images/image9.png",
+        Img(src="images/image9.png", style="max-width: 100%;"),
     ),
     Section(
         H2("Personally Identifable Information Removal"),