victormiller committed on
Commit
51e13a8
1 Parent(s): 1ac16da

Update web.py

Files changed (1)
  1. web.py +27 -55
web.py CHANGED
@@ -310,8 +310,8 @@ def web_data():
310
  ),
311
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
312
 
313
- H3("1.2 Language Identification"),
314
- P("""
315
  After text extraction, non-English texts are filtered out using the fastText language identifier with a threshold of 0.65.
316
  This step removes over 60% of the data.
317
  """),
@@ -347,13 +347,13 @@ def web_data():
347
  """,
348
  ),
349
 
350
- H3("1.3 URL Filtering"),
351
- P("""
352
  The following section details the decisions behind utilizing the UT1 blocklist, which we chose as a simple method for filtering
353
  out potentially harmful content such as adult content. We also excluded URLs corresponding to the web versions of our curated data (e.g. wikipedia.org) to avoid duplication.
354
  """),
355
- H3("1.3.1 URL Blocklist"),
356
- P("""
357
  Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
358
  articles, sex education, technical blogs, etc. Specifically, we randomly sampled 903M URLs and matched them against the
359
  4.6M domain names in the UT1 blocklist. Of note, 24 domains were detected with more than 4k matches and are shown below.
@@ -407,8 +407,7 @@ def web_data():
407
  """,
408
  ),
409
 
410
- H3("1.3.2 Excluded High Quality Sources"),
411
- P("""
412
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
413
  """),
414
 
@@ -449,8 +448,7 @@ def web_data():
449
  Before filtering low-quality documents, we first perform line-level removal to strip low-quality lines.
450
  This ensures that the computed quality signals align with the text that is ultimately kept.
451
  """),
452
- H3("Terminal Punctuation"),
453
- P("""
454
  A terminal punctuation rule has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
455
  punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found that removing these lines can be too aggressive,
456
  especially when using the text extraction tool “trafilatura”.
@@ -481,8 +479,7 @@ def web_data():
481
  ),
482
 
483
 
484
- H3('2.1 Word "Javascript"'),
485
- P("""
486
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
487
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
488
  strict and would filter out many lines that genuinely discuss “Javascript”.
@@ -509,8 +506,7 @@ def web_data():
509
  margin-bottom: 15px
510
  """,
511
  ),
512
- H3("2.2 Other Rules from RefinedWeb"),
513
- P("""
514
  We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
515
  """),
516
  Ul(
@@ -536,8 +532,7 @@ def web_data():
536
  margin-bottom: 15px
537
  """,
538
  ),
539
- H3("2.3 Toxic Lines"),
540
- P("""
541
  During manual inspection of the data, we found adult ads at the beginning or end of some
542
  documents (a sample is shown below), which are hard to remove via document-level filtering strategies. Motivated
543
  by this, we developed line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
@@ -589,13 +584,11 @@ def web_data():
589
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
590
  and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
591
  """),
592
- H3("3.1 Repetition-based Heuristics"),
593
- P("""
594
  Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
595
  work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
596
  """),
597
- H3("3.1.1 Fraction of (Characters in) Repeated Lines"),
598
- P("""
599
  Following Gopher [2], we remove documents containing multiple short duplicate passages, as well as those with few
600
  but longer duplicate passages. To achieve this, we calculate over the whole document both the fraction of passages
601
  that are duplicates and the fraction of characters contained within those duplicated passages.
@@ -675,20 +668,17 @@ def web_data():
675
  After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
676
  signals), we have made the following decisions:
677
  """),
678
- H3("Passage Separation"),
679
- P("""
680
  Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
681
  symbol separating passages. Testing the splitting pattern "\\n{2,}" on 10,000 sample documents resulted in no more than
682
  one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
683
  opting instead to use a single newline symbol to segment the text into passages.
684
  """),
685
- H3("First Occurrence"),
686
- P("""
687
  In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
688
  helps retain a larger number of documents.
689
  """),
690
- H3("Character Count"),
691
- P("""
692
  We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
693
  ensures consistency with the overall document character count calculation.
694
  """),
@@ -738,8 +728,7 @@ def web_data():
738
  margin-bottom: 15px
739
  """,
740
  ),
741
- H3("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
742
- P("""
743
  Following Gopher [2], we remove documents in which a large fraction of the characters falls within repeated n-grams. For each n ∈ {2, 3, 4}, we calculate the
744
  fraction of characters contained within the most frequently occurring n-gram.
745
  """),
@@ -902,8 +891,7 @@ def web_data():
902
  margin-bottom: 15px
903
  """,
904
  ),
905
- H3("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
906
- P("""
907
  Following Gopher [2], we remove documents in which a large fraction of the characters falls within duplicated n-grams. For each n ∈ {5, ..., 10}, we calculate the
908
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
909
  overlapping n-grams more than once.
@@ -1135,8 +1123,7 @@ def web_data():
1135
  margin-bottom: 15px
1136
  """,
1137
  ),
1138
- H3("3.2 Line-wise Heuristics"),
1139
- P("""
1140
  Line-wise statistics can also help distinguish low-quality from high-quality documents. Following
1141
  RefinedWeb [3], we remove a document if its corrected lines represent more than 5% of its words. In line with previous
1142
  works ([2], [3], [6]), we also remove documents if more than 30% of the lines end with an ellipsis or more than
@@ -1243,8 +1230,9 @@ def web_data():
1243
  """,
1244
  ),
1245
 
1246
- H3("3.3 Statistics-based Heuristics"),
1247
- P("We summarize other statistics-based rules originated from Gopher [7] in this section. The statistics can be used include:"),
 
1248
  Ul(
1249
  Li("the word count in the document", style = "margin-bottom: 5px"),
1250
  Li("the mean word length", style = "margin-bottom: 5px"),
@@ -1338,8 +1326,7 @@ def web_data():
1338
  We decided to simply use `len(text.split())` to compute the word count.
1339
  """),
1340
 
1341
- H3("Mean Word Length"),
1342
- P("""
1343
  There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
1344
  """),
1345
  D_code("""
@@ -1355,8 +1342,7 @@ def web_data():
1355
  from statistics import median
1356
  median_word_length = median(len(word) for word in words)
1357
  """, block="block", language="python"),
1358
- H3("Number of Sentences"),
1359
- P("""
1360
  The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
1361
  to split text into sentences.
1362
  """),
@@ -1410,8 +1396,7 @@ def web_data():
1410
  """,
1411
  ),
1412
 
1413
- H3("Symbol to Word Ratio"),
1414
- P("""
1415
  Following RedPajama-V2 and DataTrove, we use the symbols "#", "...", and "…".
1416
  We calculate the ratio as the number of symbols divided by the total number of words.
1417
  """),
@@ -1584,8 +1569,7 @@ def web_data():
1584
  RedPajama-V2 employs regular expressions for this purpose. We opt to use regular expressions since `char.isalpha()`
1585
  can also match words in other languages as long as they are not punctuation.
1586
  """),
1587
- H5("Number of Stop Words"),
1588
- P("""
1589
  The implementations across existing pipelines are largely identical. We adopt them and apply them to our pipeline.
1590
  """),
1591
  D_code("""
@@ -1614,8 +1598,7 @@ def web_data():
1614
  margin-bottom: 15px
1615
  """,
1616
  ),
1617
- H3("3.4 Others"),
1618
- P("""
1619
  Following C4, we remove any page containing the phrase “lorem ipsum”, since some pages contained placeholder “lorem ipsum”
1620
  text.
1621
  """),
@@ -1633,15 +1616,4 @@ def web_data():
1633
  margin-bottom: 15px
1634
  """,
1635
  ),
1636
- H2("4. Deduplication"),
1637
- P("""
1638
- After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
1639
- """), # Add detailed content and images as needed
1640
- P("We perform two-level deduplication: local exact deduplication and global fuzzy deduplication"),
1641
- P(B("Local Exact Deduplication")),
1642
- P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
1643
- P(B("Global Fuzzy Deduplication")),
1644
- P("NEED TO UPDATE"),
1645
- H2("5. PII Removal"),
1646
- P("..."), # Add detailed content and images as needed
1647
  )
 
310
  ),
311
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
312
 
313
+
314
+ P(B("Language Identification: "), """
315
  After text extraction, non-English texts are filtered out using the fastText language identifier with a threshold of 0.65.
316
  This step removes over 60% of the data.
317
  """),
 
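P("""
A minimal sketch of this step (illustrative only; it assumes the pretrained fastText language-identification model "lid.176.bin" and the fasttext package, which are not pinned down by the text above):
"""),
D_code("""
import fasttext

# Assumption: the pretrained fastText language-ID model has been downloaded locally.
model = fasttext.load_model("lid.176.bin")

def is_english(text, threshold=0.65):
    # fastText expects a single line of text, so whitespace (including newlines) is normalized first.
    labels, probs = model.predict(" ".join(text.split()))
    return labels[0] == "__label__en" and probs[0] >= threshold
""", block="block", language="python"),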
347
  """,
348
  ),
349
 
350
+
351
+ P(B("URL Filtering: "), """
352
  The following section details the decisions behind utilizing the UT1 blocklist, which we chose as a simple method for filtering
353
  out potentially harmful content such as adult content. We also excluded URLs corresponding to the web versions of our curated data (e.g. wikipedia.org) to avoid duplication.
354
  """),
355
+
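P("""
A simplified sketch of this filter (illustrative only; the file name, the exact blocklist format, and the set of excluded curated domains are assumptions, not details given above):
"""),
D_code("""
from urllib.parse import urlparse

# Assumption: the UT1 blocklist is available locally as one blocked domain per line.
with open("ut1_blocklist_domains.txt") as f:
    blocked_domains = {line.strip().lower() for line in f if line.strip()}

# Domains already covered by the curated data, e.g. wikipedia.org.
excluded_curated_domains = {"wikipedia.org"}

def keep_url(url):
    # Hostname-level lookup only; registered-domain extraction
    # (e.g. with tldextract) is omitted in this sketch.
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")
    return host not in blocked_domains and host not in excluded_curated_domains
""", block="block", language="python"),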
356
+ P(B("URL Blocklist: "), """
357
  Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
358
  articles, sex education, technical blogs, etc. Specifically, we randomly sampled 903M URLs and matched them against the
359
  4.6M domain names in the UT1 blocklist. Of note, 24 domains were detected with more than 4k matches and are shown below.
 
407
  """,
408
  ),
409
 
410
+ P(B("Excluded High Quality Sources: "), """
 
411
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
412
  """),
413
 
 
448
  Before filtering low-quality documents, we first perform line-level removal to strip low-quality lines.
449
  This ensures that the computed quality signals align with the text that is ultimately kept.
450
  """),
451
+ P(B("Terminal Punctuation: "), """
 
452
  A terminal punctuation rule has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
453
  punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found that removing these lines can be too aggressive,
454
  especially when using the text extraction tool “trafilatura”.
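P("""
For reference, the C4/Dolma-style rule discussed above can be sketched as follows (illustrative only; as noted, we found this rule too aggressive on trafilatura output):
"""),
D_code("""
TERMINAL_MARKS = (".", "?", "!", '"')

def lines_with_terminal_punctuation(text):
    # C4/Dolma-style rule: keep only lines that end with a terminal punctuation mark.
    return [line for line in text.splitlines() if line.rstrip().endswith(TERMINAL_MARKS)]
""", block="block", language="python"),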
 
479
  ),
480
 
481
 
482
+ P(B('"Word "Javascript"'), """
 
483
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
484
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
485
  strict and would filter out many lines that genuinely discuss “Javascript”.
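P("""
The original C4 rule can be sketched as follows (illustrative only; it is shown to make the strictness concrete, not as the rule our pipeline ultimately applies):
"""),
D_code("""
def drop_javascript_lines(text):
    # C4-style rule: drop any line that mentions "javascript", case-insensitively.
    return [line for line in text.splitlines() if "javascript" not in line.lower()]
""", block="block", language="python"),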
 
506
  margin-bottom: 15px
507
  """,
508
  ),
509
+ P(B("Other Rules from RefinedWeb: "), """
 
510
  We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
511
  """),
512
  Ul(
 
532
  margin-bottom: 15px
533
  """,
534
  ),
535
+ P(B("Toxic Lines: "), """
 
536
  During manual inspection of the data, we found adult ads at the beginning or end of some
537
  documents (a sample is shown below), which are hard to remove via document-level filtering strategies. Motivated
538
  by this, we developed line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
 
584
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
585
  and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
586
  """),
587
+ P(B("Repetition-based Heuristics: "), """
 
588
  Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
589
  work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
590
  """),
591
+ P(B("Fraction of Characters in Repeated Lines: "), """
 
592
  Following Gopher [2], we remove documents containing multiple short duplicate passages, as well as those with few
593
  but longer duplicate passages. To achieve this, we calculate over the whole document both the fraction of passages
594
  that are duplicates and the fraction of characters contained within those duplicated passages.
 
668
  After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
669
  signals), we have made the following decisions:
670
  """),
671
+ P(B("Passage Separation: "), """
 
672
  Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
673
  symbol separating passages. Testing the splitting pattern "\\n{2,}" on 10,000 sample documents resulted in no more than
674
  one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
675
  opting instead to use a single newline symbol to segment the text into passages.
676
  """),
677
+ P(B("First Occurrence: "), """
 
678
  In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
679
  helps retain a larger number of documents.
680
  """),
681
+ P(B("Character Count: "), """
 
682
  We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
683
  ensures consistency with the overall document character count calculation.
684
  """),
 
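P("""
Putting the above decisions together, a simplified version of the two repetition signals can be sketched as follows (illustrative only):
"""),
D_code("""
from collections import Counter

def repeated_passage_fractions(text):
    # Decisions reflected here: passages are separated by a single newline,
    # the first occurrence of a passage is not counted as a duplicate,
    # and characters are counted excluding whitespace.
    passages = [p for p in text.splitlines() if p.strip()]
    if not passages:
        return 0.0, 0.0

    def n_chars(p):
        return len("".join(p.split()))

    counts = Counter(passages)
    dup_passages = sum(c - 1 for c in counts.values() if c > 1)
    dup_chars = sum((c - 1) * n_chars(p) for p, c in counts.items() if c > 1)
    total_chars = sum(n_chars(p) for p in passages)
    return dup_passages / len(passages), dup_chars / max(total_chars, 1)
""", block="block", language="python"),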
728
  margin-bottom: 15px
729
  """,
730
  ),
731
+ P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
 
732
  Following Gopher [2], we remove documents in which a large fraction of the characters falls within repeated n-grams. For each n ∈ {2, 3, 4}, we calculate the
733
  fraction of characters contained within the most frequently occurring n-gram.
734
  """),
 
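P("""
A simplified sketch of this signal (illustrative only):
"""),
D_code("""
from collections import Counter

def top_ngram_char_fraction(text, n):
    # Fraction of non-whitespace characters covered by all occurrences
    # of the single most frequent word n-gram.
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    ngram, count = Counter(ngrams).most_common(1)[0]
    covered = count * sum(len(w) for w in ngram)
    total = sum(len(w) for w in words)
    return covered / max(total, 1)
""", block="block", language="python"),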
891
  margin-bottom: 15px
892
  """,
893
  ),
894
+ P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
 
895
  Following Gopher [2], we remove documents in which a large fraction of the characters falls within duplicated n-grams. For each n ∈ {5, ..., 10}, we calculate the
896
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
897
  overlapping n-grams more than once.
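P("""
A simplified sketch that marks word positions covered by duplicated n-grams, so characters in overlapping duplicates are only counted once (illustrative only):
"""),
D_code("""
from collections import Counter

def duplicated_ngram_char_fraction(text, n):
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Mark every word position covered by an n-gram that occurs more than once.
    covered = [False] * len(words)
    for i, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            for j in range(i, i + n):
                covered[j] = True
    dup_chars = sum(len(w) for w, hit in zip(words, covered) if hit)
    total = sum(len(w) for w in words)
    return dup_chars / max(total, 1)
""", block="block", language="python"),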
 
1123
  margin-bottom: 15px
1124
  """,
1125
  ),
1126
+ P(B("Line-wise Heuristics: "), """
 
1127
  Line-wise statistics can also help distinguish low-quality from high-quality documents. Following
1128
  RefinedWeb [3], we remove a document if its corrected lines represent more than 5% of its words. In line with previous
1129
  works ([2], [3], [6]), we also remove documents if more than 30% of the lines end with an ellipsis or more than
 
1230
  """,
1231
  ),
1232
 
1233
+ P(B("Statistics-based Heuristics: "), """
1234
+ We summarize other statistics-based rules originating from Gopher [2] in this section. The statistics used include:
1235
+ """),
1236
  Ul(
1237
  Li("the word count in the document", style = "margin-bottom: 5px"),
1238
  Li("the mean word length", style = "margin-bottom: 5px"),
 
1326
  We decided to simply use `len(text.split())` to compute the word count.
1327
  """),
1328
 
1329
+ P(B("Mean Word Length: "), """
 
1330
  There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
1331
  """),
1332
  D_code("""
 
1342
  from statistics import median
1343
  median_word_length = median(len(word) for word in words)
1344
  """, block="block", language="python"),
1345
+ P(B("Number of Sentences: "), """
 
1346
  The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
1347
  to split text into sentences.
1348
  """),
 
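P("""
A simplified regex-based sentence count (illustrative only; this is not necessarily RedPajama V2's exact pattern):
"""),
D_code("""
import re

def count_sentences(text):
    # Split on runs of sentence-ending punctuation and ignore empty fragments.
    return sum(1 for s in re.split(r"[.!?]+", text) if s.strip())
""", block="block", language="python"),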
1396
  """,
1397
  ),
1398
 
1399
+ P(B("Symbol to Word Ratio: "), """
 
1400
  Following RedPajama-V2 and DataTrove, we use the symbols "#", "...", and "…".
1401
  We calculate the ratio as the number of symbols divided by the total number of words.
1402
  """),
 
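P("""
A minimal sketch of this signal (illustrative only):
"""),
D_code("""
SYMBOLS = ("#", "...", "…")

def symbol_to_word_ratio(text):
    # Kept simple for illustration; longer ellipsis runs may be counted more than once.
    n_symbols = sum(text.count(s) for s in SYMBOLS)
    return n_symbols / max(len(text.split()), 1)
""", block="block", language="python"),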
1569
  RedPajama-V2 employs regular expressions for this purpose. We opt to use regular expressions since `char.isalpha()`
1570
  can also match words in other languages as long as they are not punctuation.
1571
  """),
1572
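P("""
Assuming the signal in question is the fraction of words that contain at least one alphabetic character (as in Gopher), a regex-based sketch looks like this (illustrative only):
"""),
D_code("""
import re

ALPHA_RE = re.compile(r"[a-zA-Z]")

def fraction_of_words_with_alpha(text):
    words = text.split()
    n_alpha = sum(1 for w in words if ALPHA_RE.search(w))
    return n_alpha / max(len(words), 1)
""", block="block", language="python"),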
+ P(B("Number of Stop Words: "), """
 
1573
  The implementations across existing pipelines are largely identical. We adopt them and apply them to our pipeline.
1574
  """),
1575
  D_code("""
 
1598
  margin-bottom: 15px
1599
  """,
1600
  ),
1601
+ P(B("Additional Filters: "), """
 
1602
  Following C4, we remove any page containing the phrase “lorem ipsum”, since some pages contained placeholder “lorem ipsum”
1603
  text.
1604
  """),
 
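P("""
A one-line check is sufficient here (illustrative only):
"""),
D_code("""
def contains_lorem_ipsum(text):
    # Following the C4 rule: flag any page containing the placeholder phrase.
    return "lorem ipsum" in text.lower()
""", block="block", language="python"),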
1616
  margin-bottom: 15px
1617
  """,
1618
  ),
 
1619
  )