victormiller committed on
Commit
7b420f4
1 Parent(s): dcb73ca

Update web.py

Files changed (1)
  1. web.py +88 -64
web.py CHANGED
@@ -352,6 +352,28 @@ attrs.fraction_of_characters_in_duplicate_lines = sum(
352
 
353
  def web_data():
354
  return Div(
355
  Div(
356
  Ul(
357
  Li(
@@ -374,23 +396,17 @@ def web_data():
374
  padding: 15px 15px 0px 15px;
375
  """,
376
  ),
377
- Div(
378
- P(
379
- "To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
380
- A("Common Crawl", href="https://commoncrawl.org/"),
381
- ", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
382
- ),
383
- style="margin-top: 20px;",
384
- ),
385
- H2("Web Data Processing Summary"),
386
  P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
387
  table_div_filter_data,
388
- P("ADD EXPLAINER TEXT ABOUT THE QUALITY FILTERS"),
389
  table_div_qf_filter_data,
390
  P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
391
  Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
392
  P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
393
- P("We also adopt rules from RefinedWeb [1] to remove lines if they satisfy any of the following criteria:"),
 
 
394
  Ul(
395
  Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
396
  Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
@@ -419,9 +435,9 @@ def web_data():
419
  P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
420
 
421
 
422
- H3("1. Document Preparation"),
423
 
424
- H4("1.1 Text Extraction"),
425
  P("""
426
  Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
427
  WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
@@ -442,7 +458,7 @@ def web_data():
442
  ),
443
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
444
 
445
- H4("1.2 Language Identification"),
446
  P("""
447
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
448
  This step removes over 60% of the whole data.
@@ -461,16 +477,16 @@ def web_data():
461
  DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
462
  ),
463
 
464
- H4("1.3 URL Filtering"),
465
  P("""
466
- Following RefinedWeb [3], we use a manually inspected URL blocklist to filter fraudulent and/or adult websites.
467
- We also exclude our high-quality curated data from it to avoid duplication.
468
  """),
469
- H5("1.3.1 URL Blocklist"),
470
  P("""
471
- Following RefinedWeb [3], we applied manual inspection on the UT1 blocklist to reduce false positives like news
472
  articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
473
- 4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
474
  """),
475
 
476
  Details(
@@ -495,7 +511,7 @@ def web_data():
495
  ),
496
  ),
497
 
498
- H5("1.3.2 Excluded High Quality Sources"),
499
  P("""
500
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
501
  """),
@@ -514,20 +530,23 @@ def web_data():
514
  ),
515
 
516
 
517
- H3("2. Line-Level Removal"),
518
  P("""
519
- Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
520
- removal to remove low-quality lines so that the final quality signals align with our final kept texts.
521
  """),
522
- H4("Terminal Punctuation"),
523
  P("""
524
  The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
525
  punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
526
- lines, especially when using a better text extraction tool “trafilatura”. For instance, in the file
 
 
 
527
  CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
528
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
529
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
530
- """),
531
 
532
  Details(
533
  Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
@@ -539,14 +558,17 @@ def web_data():
539
  ),
540
 
541
 
542
- H4('2.1 Word "Javascript"'),
543
  P("""
544
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
545
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
546
- strict, which will filter out many lines that are really talking about “Javascript”. In our pipeline, we
 
 
 
547
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
548
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
549
- """),
550
  Details(
551
  Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
552
  DV(
@@ -555,14 +577,16 @@ def web_data():
555
  "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
556
  ),
557
  ),
558
- H4("2.2 Other Rules from RefinedWeb"),
559
  P("""
560
  We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
561
- - The line is only composed of uppercase characters,
562
- - The line is only composed of numerical characters,
563
- - The line matches the pattern “r'^\\d+\\s+likes$'”,
564
- - The line contains only one word.
565
  """),
 
 
 
 
 
 
566
  Details(
567
  Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
568
  DV(
@@ -571,7 +595,7 @@ def web_data():
571
  "Sample documents with lines that are removed by the RefinedWeb rules",
572
  ),
573
  ),
574
- H4("2.3 Toxic Lines"),
575
  P("""
576
  When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
577
  document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
@@ -587,10 +611,10 @@ def web_data():
587
  ),
588
  ),
589
 
590
- H3("3. Document-Level Filtering"),
591
  P("""
592
- In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
593
- Overview of all the quality signals that are used for filtering."""),
594
  Details(
595
  Summary("Overview of all the quality signals that are used for filtering"),
596
  DVS(
@@ -599,21 +623,21 @@ def web_data():
599
  ),
600
  ),
601
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
602
- Most of these quality signals were initially introduced by Gopher [2] and subsequently adopted by later
603
  studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
604
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
605
  outcomes for the same quality signals.
606
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
607
- and RedPajama V2 [7], selecting the most suitable method based on manual inspections.
608
  """),
609
- H4("3.1 Repetition-based Heuristics"),
610
  P("""
611
- Due to crawling errors or low-quality sources, many documents contain repeated sequences. In line with previous
612
  work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
613
  """),
614
- H5("3.1.1 Fraction of (Characters in) Repeated Lines"),
615
  P("""
616
- Following Gopher [2], we remove documents containing many short duplicate passages, as well as those with few,
617
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
618
  that are duplicates, and the fraction of characters contained within those duplicated passages.
619
  """),
@@ -674,24 +698,24 @@ def web_data():
674
  After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
675
  signals), we have made the following decisions:
676
  """),
677
- H5("Passage Separation"),
678
  P("""
679
  Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
680
  symbol separating passages. Testing the splitting pattern "\\n(2,)" on 10,000 sample documents resulted in no more than
681
  one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
682
  opting instead to use a single newline symbol to segment the text into passages.
683
  """),
684
- H5("First Occurrence"),
685
  P("""
686
  In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
687
  helps retain a larger number of documents.
688
  """),
689
- H5("Character Count"),
690
  P("""
691
  We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
692
  ensures consistency with the overall document character count calculation.
693
  """),
694
- H5("Our Implementation"),
695
  Details(
696
  Summary("TxT360 Implementation"),
697
  D_code("""
@@ -719,7 +743,7 @@ def web_data():
719
  "Sample documents filtered by excessive line repetitions / characters in repeated lines",
720
  ),
721
  ),
722
- H5("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
723
  P("""
724
  Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
725
  fraction of characters contained within the most frequently-occurring n-gram.
@@ -804,7 +828,7 @@ def web_data():
804
  """, block="block", language="python"),
805
  ),
806
  P("""
807
- There are almost no contradictions between above implementations of fractions of characters in the most common
808
  n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
809
  fraction is then determined by dividing the number of characters in the most common n-gram by the total number of
810
  characters. One minor difference is that Dolma and DataTrove calculate the fraction of the most common n-gram even
@@ -838,7 +862,7 @@ def web_data():
838
  "Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
839
  ),
840
  ),
841
- H5("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
842
  P("""
843
  Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
844
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
@@ -1020,7 +1044,7 @@ def web_data():
1020
  "Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
1021
  ),
1022
  ),
1023
- H4("3.2 Line-wise Heuristics"),
1024
  P("""
1025
  Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
1026
  RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
@@ -1101,7 +1125,7 @@ def web_data():
1101
  ),
1102
  ),
1103
 
1104
- H4("3.3 Statistics-based Heuristics"),
1105
  P("We summarize other statistics-based rules originated from Gopher [7] in this section. The statistics can be used include:"),
1106
  Ul(
1107
  Li("the word count in the document", style = "margin-bottom: 5px"),
@@ -1120,7 +1144,7 @@ def web_data():
1120
  Li("the words that contain at least one alphabetic character are less than 80% of the whole words", style = "margin-bottom: 5px"),
1121
  Li("it contains less than two of the stop words (the, be, to, of, and, that, have, with", style = "margin-bottom: 5px"),
1122
  ),
1123
- H5("Word Count"),
1124
  Details(
1125
  Summary("Implementations from Dolma"),
1126
  D_code("""
@@ -1178,7 +1202,7 @@ def web_data():
1178
  We decided to use simple `len(text.split())` to compute the word count.
1179
  """),
1180
 
1181
- H5("Mean Word Length"),
1182
  P("""
1183
  There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
1184
  """),
@@ -1189,13 +1213,13 @@ def web_data():
1189
  mean_word_length = character_count / word_count
1190
  """, block="block", language="python"),
1191
  P("""
1192
- It's worth noting that Dolma used the median word length instead of the mean in their codes.
1193
  """),
1194
  D_code("""
1195
  from statistics import median
1196
  median_word_length = median(len(word) for word in words)
1197
  """, block="block", language="python"),
1198
- H5("Number of Sentences"),
1199
  P("""
1200
  The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
1201
  to split text into sentences.
@@ -1232,7 +1256,7 @@ def web_data():
1232
  """, block="block", language="python"),
1233
  ),
1234
 
1235
- H5("Symbol to Word Ratio"),
1236
  P("""
1237
  Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
1238
  We calculate the ratio as the number of symbols divided by the total number of words.
@@ -1294,7 +1318,7 @@ def web_data():
1294
  """, block="block", language="python"),
1295
  ),
1296
 
1297
- H5("Fraction of Alphabetic Words"),
1298
  Details(
1299
  Summary("Implementations from Dolma"),
1300
  D_code("""
@@ -1355,7 +1379,7 @@ def web_data():
1355
  attrs.num_of_stop_words = sum(1 for word in words if stop_words_pattern.search(word))
1356
 
1357
  """, block="block", language="python"),
1358
- H5("Our Implementations"),
1359
  Details(
1360
  Summary("Sample documents that are filtered out by statistics-based heuristics"),
1361
  DV(
@@ -1364,7 +1388,7 @@ def web_data():
1364
  "Sample documents that are filtered out by statistics-based heuristics",
1365
  ),
1366
  ),
1367
- H4("3.4 Others"),
1368
  P("""
1369
  Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
1370
  text.
@@ -1374,7 +1398,7 @@ def web_data():
1374
  Summary("Sample documents containing 'lorem ipsum'"),
1375
  DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum'"),
1376
  ),
1377
- H3("4. Deduplication"),
1378
  P("""
1379
  After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
1380
  """), # Add detailed content and images as needed
@@ -1383,6 +1407,6 @@ def web_data():
1383
  P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
1384
  P(B("Global Fuzzy Deduplication")),
1385
  P("NEED TO UPDATE"),
1386
- H3("5. PII Removal"),
1387
  P("..."), # Add detailed content and images as needed
1388
  )
 
352
 
353
  def web_data():
354
  return Div(
355
+ Div(
356
+ H2("Common Crawl Snapshot Processing"),
357
+ H3("What This Section Contains"),
358
+ P("This section provides a complete discussion of the filtering applied to the 99 Common Crawl snapshots that comprise the web data portion of TxT360. The section is split into the following topic areas:"),
359
+ Ul(
360
+ Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
361
+ Li("Document Preparation", style = "margin-bottom: 5px"),
+ Li("Line-Level Removal", style = "margin-bottom: 5px"),
+ Li("Document-Level Filtering", style = "margin-bottom: 5px"),
+ Li("Deduplication", style = "margin-bottom: 5px"),
+ Li("PII Removal", style = "margin-bottom: 5px"),
+ ),
+ P("Each section is complete with code and comparisons to Dolma, DataTrove, and/or RedPajama-V2."),
366
+ ),
367
368
+ H2("Common Crawl Data Processing Summary"),
369
+ Div(
370
+ P(
371
+ "To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
372
+ A("Common Crawl", href="https://commoncrawl.org/"),
373
+ ", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
374
+ ),
375
+ style="margin-top: 20px;",
376
+ ),
377
  Div(
378
  Ul(
379
  Li(
 
396
  padding: 15px 15px 0px 15px;
397
  """,
398
  ),
399
+ H3("TxT360 Common Crawl Filtering vs Other Pretraining Datasets"),
400
  P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
401
  table_div_filter_data,
402
+ P("The table below provides a comparison of the quality filters that have been applied to each dataset."),
403
  table_div_qf_filter_data,
404
  P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
405
  Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
406
  P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
407
+ H3("TxT360 Filter Summary"),
408
+ P("This section provides high-level details on the filtering applied to Common Crawl in TxT360. Each decision listed here is discussed in detail later in this section."),
409
+ P("We adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:"),
410
  Ul(
411
  Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
412
  Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
 
435
  P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
436
 
437
 
438
+ H2("1. Document Preparation"),
439
 
440
+ H3("1.1 Text Extraction"),
441
  P("""
442
  Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
443
  WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
 
458
  ),
459
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
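+ P("""
+ For illustration, the sketch below reads WARC responses and passes them through trafilatura to obtain the main text.
+ It is a simplified example rather than the exact TxT360 code, and the use of the warcio library here is an assumption made for the sketch.
+ """),
+ Details(
+ Summary("Illustrative WARC text extraction sketch"),
+ D_code("""
+ import trafilatura
+ from warcio.archiveiterator import ArchiveIterator
+ 
+ def extract_texts(warc_path):
+     texts = []
+     with open(warc_path, "rb") as stream:
+         # ArchiveIterator walks the WARC records (gzip is handled transparently)
+         for record in ArchiveIterator(stream):
+             if record.rec_type != "response":
+                 continue
+             html = record.content_stream().read().decode("utf-8", errors="ignore")
+             # trafilatura returns None when no main content can be extracted
+             text = trafilatura.extract(html)
+             if text:
+                 texts.append(text)
+     return texts
+ """, block="block", language="python"),
+ ),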
460
 
461
+ H3("1.2 Language Identification"),
462
  P("""
463
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
464
  This step removes over 60% of the whole data.
 
477
  DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
478
  ),
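+ P("""
+ A minimal sketch of this filtering step is shown below. It assumes fastText's public language-identification model
+ (lid.176.bin); the 0.65 threshold is the one described above, but the surrounding code is illustrative rather than the exact TxT360 implementation.
+ """),
+ D_code("""
+ import fasttext
+ 
+ # assumed model file: fastText's public language-ID model
+ model = fasttext.load_model("lid.176.bin")
+ 
+ def is_english(text, threshold=0.65):
+     # fastText predicts on a single line, so newlines are replaced first
+     labels, scores = model.predict(text.replace("\\n", " "), k=1)
+     return labels[0] == "__label__en" and scores[0] >= threshold
+ """, block="block", language="python"),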
479
 
480
+ H3("1.3 URL Filtering"),
481
  P("""
482
+ This section details the decisions behind using the UT1 blocklist. We chose the UT1 blocklist as a simple way to filter
483
+ out potentially harmful content such as adult content. We also excluded URLs belonging to the web versions of our curated data sources (e.g., wikipedia.org) to avoid duplication.
484
  """),
485
+ H3("1.3.1 URL Blocklist"),
486
  P("""
487
+ Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
488
  articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
489
+ 4.6M domain names in the UT1 blocklist. Of note, 24 URL domains were detected with more than 4k matches each and are shown below.
490
  """),
491
 
492
  Details(
 
511
  ),
512
  ),
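+ P("""
+ A minimal sketch of the domain matching used for this kind of inspection is shown below; it assumes the blocklist has
+ already been loaded into a set of domain names and is illustrative rather than the exact TxT360 code.
+ """),
+ D_code("""
+ from urllib.parse import urlparse
+ 
+ def is_blocked(url, blocked_domains):
+     # blocked_domains: the UT1 blocklist domains loaded into a Python set
+     domain = urlparse(url).netloc.lower()
+     if domain.startswith("www."):
+         domain = domain[4:]
+     return domain in blocked_domains
+ """, block="block", language="python"),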
513
 
514
+ H3("1.3.2 Excluded High Quality Sources"),
515
  P("""
516
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
517
  """),
 
530
  ),
531
 
532
 
533
+ H2("2. Line-Level Removal"),
534
  P("""
535
+ Before filtering low-quality documents, we perform line-level removal to drop low-quality lines.
536
+ This ensures that the quality signals we compute afterwards align with the text we ultimately keep.
537
  """),
538
+ H3("Terminal Punctuation"),
539
  P("""
540
  The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
541
  punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
542
+ lines, especially when using a better text extraction tool, “trafilatura”.
543
+ """),
544
+ P("""
545
+ For instance, in the Common Crawl file
546
  CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
547
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
548
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
549
+ """),
550
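+ P("""
+ For reference, the rule we evaluated (and ultimately did not adopt) can be sketched as follows; this is an illustrative
+ example rather than production code.
+ """),
+ D_code("""
+ TERMINAL_PUNCTUATION = (".", "?", "!", '"')
+ 
+ def ends_with_terminal_punctuation(line):
+     return line.rstrip().endswith(TERMINAL_PUNCTUATION)
+ 
+ def remove_non_terminal_lines(text):
+     # keep only lines that end with a terminal punctuation mark
+     lines = text.split("\\n")
+     return "\\n".join(line for line in lines if ends_with_terminal_punctuation(line))
+ """, block="block", language="python"),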
 
551
  Details(
552
  Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
 
558
  ),
559
 
560
 
561
+ H3('2.1 Word "Javascript"'),
562
  P("""
563
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
564
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
565
+ strict and would filter out many lines that genuinely discuss “Javascript”.
566
+ """),
567
+ P("""
568
+ In our pipeline, we
569
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
570
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
571
+ """),
572
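+ P("""
+ A minimal sketch of the refined rule (illustrative, not the exact TxT360 implementation) is shown below.
+ """),
+ D_code("""
+ JS_CONTEXT_KEYWORDS = ("enable", "disable", "require", "activate", "browser")
+ 
+ def is_javascript_boilerplate(line):
+     # C4 drops any line containing "javascript"; we additionally require a context keyword
+     lowered = line.lower()
+     return "javascript" in lowered and any(k in lowered for k in JS_CONTEXT_KEYWORDS)
+ """, block="block", language="python"),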
  Details(
573
  Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
574
  DV(
 
577
  "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
578
  ),
579
  ),
580
+ H3("2.2 Other Rules from RefinedWeb"),
581
  P("""
582
  We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
 
 
 
 
583
  """),
584
+ Ul(
585
+ Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
586
+ Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
587
+ Li("the line matches the pattern “r'^\\d+\\s+likes$'”", style = "margin-bottom: 5px"),
588
+ Li("the line contains only one word", style = "margin-bottom: 5px"),
589
+ ),
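+ P("""
+ These four criteria can be sketched as a single line-level check (illustrative only; str.isupper() and str.isdigit()
+ are used here as simple approximations of the uppercase-only and numerical-only rules).
+ """),
+ D_code("""
+ import re
+ 
+ LIKES_PATTERN = re.compile(r"^\\d+\\s+likes$")
+ 
+ def should_remove_line(line):
+     stripped = line.strip()
+     return (
+         stripped.isupper()                            # only uppercase characters
+         or stripped.isdigit()                         # only numerical characters
+         or LIKES_PATTERN.match(stripped) is not None  # e.g. "23 likes"
+         or len(stripped.split()) == 1                 # only one word
+     )
+ """, block="block", language="python"),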
590
  Details(
591
  Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
592
  DV(
 
595
  "Sample documents with lines that are removed by the RefinedWeb rules",
596
  ),
597
  ),
598
+ H3("2.3 Toxic Lines"),
599
  P("""
600
  When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
601
  document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
 
611
  ),
612
  ),
613
 
614
+ H2("3. Document-Level Filtering"),
615
  P("""
616
+ In this section, we introduce each quality signal used to filter out low-quality documents.
617
+ """),
618
  Details(
619
  Summary("Overview of all the quality signals that are used for filtering"),
620
  DVS(
 
623
  ),
624
  ),
625
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
626
+ Most quality signals were initially introduced by Gopher [2] and subsequently adopted by later
627
  studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
628
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
629
  outcomes for the same quality signals.
630
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
631
+ and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
632
  """),
633
+ H3("3.1 Repetition-based Heuristics"),
634
  P("""
635
+ Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
636
  work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
637
  """),
638
+ H3("3.1.1 Fraction of (Characters in) Repeated Lines"),
639
  P("""
640
+ Following Gopher [2], we remove documents containing many short duplicate passages, as well as those with few,
641
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
642
  that are duplicates, and the fraction of characters contained within those duplicated passages.
643
  """),
 
698
  After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
699
  signals), we have made the following decisions:
700
  """),
701
+ H3("Passage Separation"),
702
  P("""
703
  Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
704
  symbol separating passages. Testing the splitting pattern "\\n{2,}" on 10,000 sample documents resulted in no more than
705
  one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
706
  opting instead to use a single newline symbol to segment the text into passages.
707
  """),
708
+ H3("First Occurrence"),
709
  P("""
710
  In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
711
  helps retain a larger number of documents.
712
  """),
713
+ H3("Character Count"),
714
  P("""
715
  We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
716
  ensures consistency with the overall document character count calculation.
717
  """),
718
+ H3("TxT360 Implementation"),
719
  Details(
720
  Summary("TxT360 Implementation"),
721
  D_code("""
 
743
  "Sample documents filtered by excessive line repetitions / characters in repeated lines",
744
  ),
745
  ),
746
+ H3("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
747
  P("""
748
  Following Gopher [2], we remove documents in which the most common n-grams account for a high portion of the characters. For each n ∈ (2, 3, 4), we calculate the
749
  fraction of characters contained within the most frequently-occurring n-gram.
 
828
  """, block="block", language="python"),
829
  ),
830
  P("""
831
+ There are almost no contradictions among the above implementations of the fraction of characters in the most common
832
  n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
833
  fraction is then determined by dividing the number of characters in the most common n-gram by the total number of
834
  characters. One minor difference is that Dolma and DataTrove calculate the fraction of the most common n-gram even
 
862
  "Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
863
  ),
864
  ),
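+ P("""
+ A condensed sketch of the shared logic described above is given below; it is illustrative only and omits the
+ implementation differences among Dolma, DataTrove, and RedPajama V2 discussed earlier.
+ """),
+ D_code("""
+ from collections import Counter
+ 
+ def top_ngram_char_fraction(words, n):
+     counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
+     if not counts:
+         return 0.0
+     top_ngram, occurrences = counts.most_common(1)[0]
+     total_chars = sum(len(w) for w in words)
+     # characters covered by all occurrences of the most common n-gram (whitespace excluded)
+     covered_chars = occurrences * sum(len(w) for w in top_ngram)
+     return covered_chars / total_chars if total_chars else 0.0
+ """, block="block", language="python"),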
865
+ H3("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
866
  P("""
867
  Following Gopher [2], we remove documents in which duplicated n-grams account for a high portion of the characters. For each n ∈ (5, ..., 10), we calculate the
868
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
 
1044
  "Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
1045
  ),
1046
  ),
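+ P("""
+ A condensed sketch of this signal is shown below; it marks every word position covered by a duplicated n-gram so that
+ overlapping n-grams are not counted twice. It is illustrative only, and pipelines differ on details such as whether
+ the first occurrence is counted.
+ """),
+ D_code("""
+ from collections import Counter
+ 
+ def duplicated_ngram_char_fraction(words, n):
+     ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
+     counts = Counter(ngrams)
+     covered = [False] * len(words)
+     for i, ngram in enumerate(ngrams):
+         if counts[ngram] > 1:
+             for j in range(i, i + n):
+                 covered[j] = True  # each word position is marked at most once
+     total_chars = sum(len(w) for w in words)
+     dup_chars = sum(len(w) for w, is_dup in zip(words, covered) if is_dup)
+     return dup_chars / total_chars if total_chars else 0.0
+ """, block="block", language="python"),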
1047
+ H3("3.2 Line-wise Heuristics"),
1048
  P("""
1049
  Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
1050
  RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
 
1125
  ),
1126
  ),
1127
 
1128
+ H3("3.3 Statistics-based Heuristics"),
1129
  P("We summarize other statistics-based rules originating from Gopher [2] in this section. The statistics used include:"),
1130
  Ul(
1131
  Li("the word count in the document", style = "margin-bottom: 5px"),
 
1144
  Li("the words that contain at least one alphabetic character make up less than 80% of all words", style = "margin-bottom: 5px"),
1145
  Li("it contains fewer than two of the stop words (the, be, to, of, and, that, have, with)", style = "margin-bottom: 5px"),
1146
  ),
1147
+ H3("Word Count"),
1148
  Details(
1149
  Summary("Implementations from Dolma"),
1150
  D_code("""
 
1202
  We decided to use simple `len(text.split())` to compute the word count.
1203
  """),
1204
 
1205
+ H3("Mean Word Length"),
1206
  P("""
1207
  There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
1208
  """),
 
1213
  mean_word_length = character_count / word_count
1214
  """, block="block", language="python"),
1215
  P("""
1216
+ It's worth noting that Dolma used the median word length instead of the mean:
1217
  """),
1218
  D_code("""
1219
  from statistics import median
1220
  median_word_length = median(len(word) for word in words)
1221
  """, block="block", language="python"),
1222
+ H3("Number of Sentences"),
1223
  P("""
1224
  The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
1225
  to split text into sentences.
 
1256
  """, block="block", language="python"),
1257
  ),
1258
 
1259
+ H3("Symbol to Word Ratio"),
1260
  P("""
1261
  Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
1262
  We calculate the ratio as the number of symbols divided by the total number of words.
 
1318
  """, block="block", language="python"),
1319
  ),
1320
 
1321
+ H3("Fraction of Alphabetic Words"),
1322
  Details(
1323
  Summary("Implementations from Dolma"),
1324
  D_code("""
 
1379
  attrs.num_of_stop_words = sum(1 for word in words if stop_words_pattern.search(word))
1380
 
1381
  """, block="block", language="python"),
1382
+ H3("TxT360 Implementation"),
1383
  Details(
1384
  Summary("Sample documents that are filtered out by statistics-based heuristics"),
1385
  DV(
 
1388
  "Sample documents that are filtered out by statistics-based heuristics",
1389
  ),
1390
  ),
1391
+ H3("3.4 Others"),
1392
  P("""
1393
  Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
1394
  text.
 
1398
  Summary("Sample documents containing 'lorem ipsum'"),
1399
  DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum'"),
1400
  ),
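+ P("""
+ The check itself is a simple case-insensitive substring match (illustrative sketch):
+ """),
+ D_code("""
+ def contains_lorem_ipsum(text):
+     return "lorem ipsum" in text.lower()
+ """, block="block", language="python"),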
1401
+ H2("4. Deduplication"),
1402
  P("""
1403
  After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
1404
  """), # Add detailed content and images as needed
 
1407
  P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
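+ P("""
+ A minimal sketch of exact deduplication with a Bloom filter is shown below. The filter parameters, hashing scheme, and
+ document keying here are illustrative assumptions, not the production configuration used in TxT360.
+ """),
+ Details(
+ Summary("Illustrative Bloom-filter exact deduplication sketch"),
+ D_code("""
+ import hashlib
+ 
+ class BloomFilter:
+     # minimal Bloom filter for illustration only
+     def __init__(self, size_in_bits=1 << 27, num_hashes=7):
+         self.size = size_in_bits
+         self.num_hashes = num_hashes
+         self.bits = bytearray(size_in_bits // 8)
+ 
+     def _positions(self, item):
+         for seed in range(self.num_hashes):
+             digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
+             yield int.from_bytes(digest[:8], "big") % self.size
+ 
+     def add(self, item):
+         for pos in self._positions(item):
+             self.bits[pos // 8] |= 1 << (pos % 8)
+ 
+     def __contains__(self, item):
+         return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
+ 
+ def exact_dedup(documents):
+     seen = BloomFilter()
+     for doc in documents:
+         key = hashlib.sha256(doc.encode()).hexdigest()
+         if key in seen:  # probable duplicate (Bloom filters allow rare false positives)
+             continue
+         seen.add(key)
+         yield doc
+ """, block="block", language="python"),
+ ),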
1408
  P(B("Global Fuzzy Deduplication")),
1409
  P("NEED TO UPDATE"),
1410
+ H2("5. PII Removal"),
1411
  P("..."), # Add detailed content and images as needed
1412
  )