victormiller committed on
Commit
466af30
1 Parent(s): e01353c

Update common.py

Files changed (1)
  1. common.py +1 -1
common.py CHANGED
@@ -258,7 +258,7 @@ global_div = Div(
  ),
  Section(
  H3("MinHash Generation"),
- P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage.),
+ P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage."),
  P(B("This step produced 20 TB of hashes.")),
  ),
  Section(
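
For readers of this diff, the pipeline described in the changed paragraph could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the helper names (clean_text, word_ngrams, encode_doc_id, decode_doc_id, minhash_signature) are hypothetical, the 13-grams are assumed to be word-level, and the 32-bit file/line packing is only one possible document id encoding, since the paragraph does not specify the scheme. Only the use of datasketch's MinHash with 128 permutations comes from the text.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Matches any run of whitespace (spaces, newlines, tabs).
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip, lowercase, drop punctuation, and collapse whitespace runs."""
    text = text.strip().lower().translate(_PUNCT_TABLE)
    return _WHITESPACE.sub(" ", text).strip()


def word_ngrams(text: str, n: int = 13) -> list[str]:
    """Return word-level n-grams (assumption: the 13-grams are over words).

    Documents with fewer than n tokens yield an empty list here.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def encode_doc_id(file_index: int, line_number: int) -> int:
    """Hypothetical encoding: file index in the high bits, line number in the low 32 bits."""
    if file_index < 0 or not 0 <= line_number < 2**32:
        raise ValueError("file_index must be >= 0 and line_number must fit in 32 bits")
    return (file_index << 32) | line_number


def decode_doc_id(doc_id: int) -> tuple[int, int]:
    """Invert encode_doc_id, returning (file_index, line_number)."""
    return doc_id >> 32, doc_id & 0xFFFFFFFF


def minhash_signature(text: str, num_perm: int = 128):
    """Build a datasketch MinHash over the cleaned text's 13-grams."""
    # Third-party dependency, imported here so the helpers above work without it.
    from datasketch import MinHash

    signature = MinHash(num_perm=num_perm)
    for gram in word_ngrams(clean_text(text)):
        signature.update(gram.encode("utf-8"))  # update() expects bytes
    return signature
```

Packing the file index and line number into a single integer is one way a compact id could save disk and memory compared with storing file paths alongside every signature; the real scheme may differ.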