victormiller commited on
Commit
e01353c
1 Parent(s): c2b0703

Update common.py

Browse files
Files changed (1) hide show
  1. common.py +5 -3
common.py CHANGED
@@ -258,21 +258,23 @@ global_div = Div(
258
  ),
259
  Section(
260
  H3("MinHash Generation"),
261
- P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage. This step produced 20 TB of hashes."),
 
262
  ),
263
  Section(
264
  H3("Matching Pairs Generation"),
265
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
266
  P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
267
  Pre(Code(dask_algo,)),
268
- P("This step produced 9.2 TB of matching pairs from all bands."),
269
  ),
270
  Section(
271
  H3("Finding Duplicate Pairs"),
272
  P("Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands."),
273
  P("To address this, we use a Bloom filter with a capacity of 64 billion and a false positive rate of 0.001 to remove duplicates. One way we parallelize the Bloom filter execution is by partitioning pairs horizontally and running one filter per partition, as shown in the table below. There is a high chance that duplicates from different bands will have the same pairs in the same horizontal partition. This step reduces the number of pairs by nearly ninefold."),
274
  table_div_bloom_examples,
275
- P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches. This step produced 1.9 TB of unique pairs."),
 
276
  ),
277
  Section(
278
  H3("Finding Connected Components using MapReduce"),
 
258
  ),
259
  Section(
260
  H3("MinHash Generation"),
261
+ P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage.),
262
+ P(B("This step produced 20 TB of hashes.")),
263
  ),
264
  Section(
265
  H3("Matching Pairs Generation"),
266
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
267
  P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
268
  Pre(Code(dask_algo,)),
269
+ P(B("This step produced 9.2 TB of matching pairs from all bands.")),
270
  ),
271
  Section(
272
  H3("Finding Duplicate Pairs"),
273
  P("Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands."),
274
  P("To address this, we use a Bloom filter with a capacity of 64 billion and a false positive rate of 0.001 to remove duplicates. One way we parallelize the Bloom filter execution is by partitioning pairs horizontally and running one filter per partition, as shown in the table below. There is a high chance that duplicates from different bands will have the same pairs in the same horizontal partition. This step reduces the number of pairs by nearly ninefold."),
275
  table_div_bloom_examples,
276
+ P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."),
277
+ P(B("This step produced 1.9 TB of unique pairs.")),
278
  ),
279
  Section(
280
  H3("Finding Connected Components using MapReduce"),