victormiller committed on
Commit
7b420f4
1 Parent(s): dcb73ca

Update web.py

Files changed (1)
  1. web.py +88 -64
web.py CHANGED
@@ -352,6 +352,28 @@ attrs.fraction_of_characters_in_duplicate_lines = sum(
352
 
353
  def web_data():
354
  return Div(
355
  Div(
356
  Ul(
357
  Li(
@@ -374,23 +396,17 @@ def web_data():
374
  padding: 15px 15px 0px 15px;
375
  """,
376
  ),
377
- Div(
378
- P(
379
- "To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
380
- A("Common Crawl", href="https://commoncrawl.org/"),
381
- ", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
382
- ),
383
- style="margin-top: 20px;",
384
- ),
385
- H2("Web Data Processing Summary"),
386
  P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
387
  table_div_filter_data,
388
- P("ADD EXPLAINER TEXT ABOUT THE QUALITY FILTERS"),
389
  table_div_qf_filter_data,
390
  P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
391
  Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
392
  P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
393
- P("We also adopt rules from RefinedWeb [1] to remove lines if they satisfy any of the following criteria:"),
 
 
394
  Ul(
395
  Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
396
  Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
@@ -419,9 +435,9 @@ def web_data():
419
  P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
420
 
421
 
422
- H3("1. Document Preparation"),
423
 
424
- H4("1.1 Text Extraction"),
425
  P("""
426
  Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
427
  WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
@@ -442,7 +458,7 @@ def web_data():
442
  ),
443
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
444
 
445
- H4("1.2 Language Identification"),
446
  P("""
447
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
448
  This step removes over 60% of the whole data.
@@ -461,16 +477,16 @@ def web_data():
461
  DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
462
  ),
463
 
464
- H4("1.3 URL Filtering"),
465
  P("""
466
- Following RefinedWeb [3], we use a manually inspected URL blocklist to filter fraudulent and/or adult websites.
467
- We also exclude our high-quality curated data from it to avoid duplication.
468
  """),
469
- H5("1.3.1 URL Blocklist"),
470
  P("""
471
- Following RefinedWeb [3], we applied manual inspection on the UT1 blocklist to reduce false positives like news
472
  articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
473
- 4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
474
  """),
475
 
476
  Details(
@@ -495,7 +511,7 @@ def web_data():
495
  ),
496
  ),
497
 
498
- H5("1.3.2 Excluded High Quality Sources"),
499
  P("""
500
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
501
  """),
@@ -514,20 +530,23 @@ def web_data():
514
  ),
515
 
516
 
517
- H3("2. Line-Level Removal"),
518
  P("""
519
- Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
520
- removal to remove low-quality lines so that the final quality signals align with our final kept texts.
521
  """),
522
- H4("Terminal Punctuation"),
523
  P("""
524
  The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
525
  punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
526
- lines, especially when using a better text extraction tool “trafilatura”. For instance, in the file
 
 
 
527
  CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
528
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
529
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
530
- """),
531
 
532
  Details(
533
  Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
@@ -539,14 +558,17 @@ def web_data():
539
  ),
540
 
541
 
542
- H4('2.1 Word "Javascript"'),
543
  P("""
544
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
545
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
546
- strict, which will filter out many lines that are really talking about “Javascript”. In our pipeline, we
 
 
 
547
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
548
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
549
- """),
550
  Details(
551
  Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
552
  DV(
@@ -555,14 +577,16 @@ def web_data():
555
  "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
556
  ),
557
  ),
558
- H4("2.2 Other Rules from RefinedWeb"),
559
  P("""
560
  We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
561
- - The line is only composed of uppercase characters,
562
- - The line is only composed of numerical characters,
563
- - The line matches the pattern “r'^\\d+\\s+likes$'”,
564
- - The line contains only one word.
565
  """),
 
 
 
 
 
 
566
  Details(
567
  Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
568
  DV(
@@ -571,7 +595,7 @@ def web_data():
571
  "Sample documents with lines that are removed by the RefinedWeb rules",
572
  ),
573
  ),
574
- H4("2.3 Toxic Lines"),
575
  P("""
576
  When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
577
  document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
@@ -587,10 +611,10 @@ def web_data():
587
  ),
588
  ),
589
 
590
- H3("3. Document-Level Filtering"),
591
  P("""
592
- In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
593
- Overview of all the quality signals that are used for filtering."""),
594
  Details(
595
  Summary("Overview of all the quality signals that are used for filtering"),
596
  DVS(
@@ -599,21 +623,21 @@ def web_data():
599
  ),
600
  ),
601
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
602
- Most of these quality signals were initially introduced by Gopher [2] and subsequently adopted by later
603
  studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
604
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
605
  outcomes for the same quality signals.
606
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
607
- and RedPajama V2 [7], selecting the most suitable method based on manual inspections.
608
  """),
609
- H4("3.1 Repetition-based Heuristics"),
610
  P("""
611
- Due to crawling errors or low-quality sources, many documents contain repeated sequences. In line with previous
612
  work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
613
  """),
614
- H5("3.1.1 Fraction of (Characters in) Repeated Lines"),
615
  P("""
616
- Following Gopher [2], we remove documents containing many short duplicate passages, as well as those with few,
617
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
618
  that are duplicates, and the fraction of characters contained within those duplicated passages.
619
  """),
@@ -674,24 +698,24 @@ def web_data():
674
  After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
675
  signals), we have made the following decisions:
676
  """),
677
- H5("Passage Separation"),
678
  P("""
679
  Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
680
  symbol separating passages. Testing the splitting pattern "\\n(2,)" on 10,000 sample documents resulted in no more than
681
  one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
682
  opting instead to use a single newline symbol to segment the text into passages.
683
  """),
684
- H5("First Occurrence"),
685
  P("""
686
  In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
687
  helps retain a larger number of documents.
688
  """),
689
- H5("Character Count"),
690
  P("""
691
  We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
692
  ensures consistency with the overall document character count calculation.
693
  """),
694
- H5("Our Implementation"),
695
  Details(
696
  Summary("TxT360 Implementation"),
697
  D_code("""
@@ -719,7 +743,7 @@ def web_data():
719
  "Sample documents filtered by excessive line repetitions / characters in repeated lines",
720
  ),
721
  ),
722
- H5("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
723
  P("""
724
  Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
725
  fraction of characters contained within the most frequently-occurring n-gram.
@@ -804,7 +828,7 @@ def web_data():
804
  """, block="block", language="python"),
805
  ),
806
  P("""
807
- There are almost no contradictions between above implementations of fractions of characters in the most common
808
  n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
809
  fraction is then determined by dividing the number of characters in the most common n-gram by the total number of
810
  characters. One minor difference is that Dolma and DataTrove calculate the fraction of the most common n-gram even
@@ -838,7 +862,7 @@ def web_data():
838
  "Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
839
  ),
840
  ),
841
- H5("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
842
  P("""
843
  Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
844
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
@@ -1020,7 +1044,7 @@ def web_data():
1020
  "Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
1021
  ),
1022
  ),
1023
- H4("3.2 Line-wise Heuristics"),
1024
  P("""
1025
  Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
1026
  RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
@@ -1101,7 +1125,7 @@ def web_data():
1101
  ),
1102
  ),
1103
 
1104
- H4("3.3 Statistics-based Heuristics"),
1105
  P("We summarize other statistics-based rules originated from Gopher [7] in this section. The statistics can be used include:"),
1106
  Ul(
1107
  Li("the word count in the document", style = "margin-bottom: 5px"),
@@ -1120,7 +1144,7 @@ def web_data():
1120
  Li("the words that contain at least one alphabetic character are less than 80% of the whole words", style = "margin-bottom: 5px"),
1121
  Li("it contains less than two of the stop words (the, be, to, of, and, that, have, with", style = "margin-bottom: 5px"),
1122
  ),
1123
- H5("Word Count"),
1124
  Details(
1125
  Summary("Implementations from Dolma"),
1126
  D_code("""
@@ -1178,7 +1202,7 @@ def web_data():
1178
  We decided to use simple `len(text.split())` to compute the word count.
1179
  """),
1180
 
1181
- H5("Mean Word Length"),
1182
  P("""
1183
  There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
1184
  """),
@@ -1189,13 +1213,13 @@ def web_data():
1189
  mean_word_length = character_count / word_count
1190
  """, block="block", language="python"),
1191
  P("""
1192
- It's worth noting that Dolma used the median word length instead of the mean in their codes.
1193
  """),
1194
  D_code("""
1195
  from statistics import median
1196
  median_word_length = median(len(word) for word in words)
1197
  """, block="block", language="python"),
1198
- H5("Number of Sentences"),
1199
  P("""
1200
  The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
1201
  to split text into sentences.
@@ -1232,7 +1256,7 @@ def web_data():
1232
  """, block="block", language="python"),
1233
  ),
1234
 
1235
- H5("Symbol to Word Ratio"),
1236
  P("""
1237
  Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
1238
  We calculate the ratio as the number of symbols divided by the total number of words.
@@ -1294,7 +1318,7 @@ def web_data():
1294
  """, block="block", language="python"),
1295
  ),
1296
 
1297
- H5("Fraction of Alphabetic Words"),
1298
  Details(
1299
  Summary("Implementations from Dolma"),
1300
  D_code("""
@@ -1355,7 +1379,7 @@ def web_data():
1355
  attrs.num_of_stop_words = sum(1 for word in words if stop_words_pattern.search(word))
1356
 
1357
  """, block="block", language="python"),
1358
- H5("Our Implementations"),
1359
  Details(
1360
  Summary("Sample documents that are filtered out by statistics-based heuristics"),
1361
  DV(
@@ -1364,7 +1388,7 @@ def web_data():
1364
  "Sample documents that are filtered out by statistics-based heuristics",
1365
  ),
1366
  ),
1367
- H4("3.4 Others"),
1368
  P("""
1369
  Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
1370
  text.
@@ -1374,7 +1398,7 @@ def web_data():
1374
  Summary("Sample documents containing 'lorem ipsum'"),
1375
  DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum'"),
1376
  ),
1377
- H3("4. Deduplication"),
1378
  P("""
1379
  After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
1380
  """), # Add detailed content and images as needed
@@ -1383,6 +1407,6 @@ def web_data():
1383
  P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
1384
  P(B("Global Fuzzy Deduplication")),
1385
  P("NEED TO UPDATE"),
1386
- H3("5. PII Removal"),
1387
  P("..."), # Add detailed content and images as needed
1388
  )
 
352
 
353
  def web_data():
354
  return Div(
355
+ Div(
356
+ H2("Common Crawl Snapshot Processing"),
357
+ H3("What This Section Contains"),
358
+ P("This section provides a complete discussion of the filtering applied to the 99 Common Crawl snapshots that comprise the web data portion of TxT360. The section is split into the following topic areas:"),
359
+ Ul(
360
+ Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
361
+ Li("Document Preparation", style = "margin-bottom: 5px"),
+ Li("Line-Level Removal", style = "margin-bottom: 5px"),
+ Li("Document-Level Filtering", style = "margin-bottom: 5px"),
+ Li("Deduplication", style = "margin-bottom: 5px"),
+ Li("PII Removal", style = "margin-bottom: 5px"),
+ ),
+ P("Each section is complete with code and comparisons to Dolma, DataTrove, and/or RedPajama-V2."),
366
+ ),
367
368
+ H2("Common Crawl Data Processing Summary"),
369
+ Div(
370
+ P(
371
+ "To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
372
+ A("Common Crawl", href="https://commoncrawl.org/"),
373
+ ", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
374
+ ),
375
+ style="margin-top: 20px;",
376
+ ),
377
  Div(
378
  Ul(
379
  Li(
 
396
  padding: 15px 15px 0px 15px;
397
  """,
398
  ),
399
+ H3("TxT360 Common Crawl Filtering vs Other Pretraining Datasets"),
400
  P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
401
  table_div_filter_data,
402
+ P("The table below provides a comparison of the quality filters that have been applied to each dataset."),
403
  table_div_qf_filter_data,
404
  P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
405
  Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
406
  P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
407
+ H3("TxT360 Filter Summary"),
408
+ P("This section provides high-level details on the filtering applied to Common Crawl in TxT360. Each decision listed here is discussed in detail later in this section."),
409
+ P("We adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:"),
410
  Ul(
411
  Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
412
  Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
 
435
  P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
436
 
437
 
438
+ H2("1. Document Preparation"),
439
 
440
+ H3("1.1 Text Extraction"),
441
  P("""
442
  Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
443
  WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
 
458
  ),
459
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
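+ P("""
+ For illustration, the sketch below reads WARC responses and passes them through trafilatura to obtain the main text.
+ It is a simplified example rather than the exact TxT360 code, and the use of the warcio library here is an assumption made for the sketch.
+ """),
+ Details(
+ Summary("Illustrative WARC text extraction sketch"),
+ D_code("""
+ import trafilatura
+ from warcio.archiveiterator import ArchiveIterator
+ 
+ def extract_texts(warc_path):
+     texts = []
+     with open(warc_path, "rb") as stream:
+         # ArchiveIterator walks the WARC records (gzip is handled transparently)
+         for record in ArchiveIterator(stream):
+             if record.rec_type != "response":
+                 continue
+             html = record.content_stream().read().decode("utf-8", errors="ignore")
+             # trafilatura returns None when no main content can be extracted
+             text = trafilatura.extract(html)
+             if text:
+                 texts.append(text)
+     return texts
+ """, block="block", language="python"),
+ ),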
460
 
461
+ H3("1.2 Language Identification"),
462
  P("""
463
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
464
  This step removes over 60% of the whole data.
 
477
  DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
478
  ),
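+ P("""
+ A minimal sketch of this filtering step is shown below. It assumes fastText's public language-identification model
+ (lid.176.bin); the 0.65 threshold is the one described above, but the surrounding code is illustrative rather than the exact TxT360 implementation.
+ """),
+ D_code("""
+ import fasttext
+ 
+ # assumed model file: fastText's public language-ID model
+ model = fasttext.load_model("lid.176.bin")
+ 
+ def is_english(text, threshold=0.65):
+     # fastText predicts on a single line, so newlines are replaced first
+     labels, scores = model.predict(text.replace("\\n", " "), k=1)
+     return labels[0] == "__label__en" and scores[0] >= threshold
+ """, block="block", language="python"),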
479
 
480
+ H3("1.3 URL Filtering"),
481
  P("""
482
+ This section details the decisions behind using the UT1 blocklist. We chose the UT1 blocklist as a simple way to filter
483
+ out potentially harmful content such as adult content. We also excluded URLs belonging to the web versions of our curated data sources (e.g., wikipedia.org) to avoid duplication.
484
  """),
485
+ H3("1.3.1 URL Blocklist"),
486
  P("""
487
+ Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
488
  articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
489
+ 4.6M domain names in the UT1 blocklist. Of note, 24 URL domains were detected with more than 4k matches each and are shown below.
490
  """),
491
 
492
  Details(
 
511
  ),
512
  ),
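+ P("""
+ A minimal sketch of the domain matching used for this kind of inspection is shown below; it assumes the blocklist has
+ already been loaded into a set of domain names and is illustrative rather than the exact TxT360 code.
+ """),
+ D_code("""
+ from urllib.parse import urlparse
+ 
+ def is_blocked(url, blocked_domains):
+     # blocked_domains: the UT1 blocklist domains loaded into a Python set
+     domain = urlparse(url).netloc.lower()
+     if domain.startswith("www."):
+         domain = domain[4:]
+     return domain in blocked_domains
+ """, block="block", language="python"),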
513
 
514
+ H3("1.3.2 Excluded High Quality Sources"),
515
  P("""
516
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
517
  """),
 
530
  ),
531
 
532
 
533
+ H2("2. Line-Level Removal"),
534
  P("""
535
+ Before filtering low-quality documents, we perform line-level removal to drop low-quality lines.
536
+ This ensures that the quality signals we compute afterwards align with the text we ultimately keep.
537
  """),
538
+ H3("Terminal Punctuation"),
539
  P("""
540
  The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
541
  punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
542
+ lines, especially when using a better text extraction tool, “trafilatura”.
543
+ """),
544
+ P("""
545
+ For instance, in the Common Crawl file
546
  CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
547
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
548
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
549
+ """),
550
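+ P("""
+ For reference, the rule we evaluated (and ultimately did not adopt) can be sketched as follows; this is an illustrative
+ example rather than production code.
+ """),
+ D_code("""
+ TERMINAL_PUNCTUATION = (".", "?", "!", '"')
+ 
+ def ends_with_terminal_punctuation(line):
+     return line.rstrip().endswith(TERMINAL_PUNCTUATION)
+ 
+ def remove_non_terminal_lines(text):
+     # keep only lines that end with a terminal punctuation mark
+     lines = text.split("\\n")
+     return "\\n".join(line for line in lines if ends_with_terminal_punctuation(line))
+ """, block="block", language="python"),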
 
551
  Details(
552
  Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
 
558
  ),
559
 
560
 
561
+ H3('2.1 Word "Javascript"'),
562
  P("""
563
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
564
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
565
+ strict and would filter out many lines that genuinely discuss “Javascript”.
566
+ """),
567
+ P("""
568
+ In our pipeline, we
569
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
570
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
571
+ """),
572
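+ P("""
+ A minimal sketch of the refined rule (illustrative, not the exact TxT360 implementation) is shown below.
+ """),
+ D_code("""
+ JS_CONTEXT_KEYWORDS = ("enable", "disable", "require", "activate", "browser")
+ 
+ def is_javascript_boilerplate(line):
+     # C4 drops any line containing "javascript"; we additionally require a context keyword
+     lowered = line.lower()
+     return "javascript" in lowered and any(k in lowered for k in JS_CONTEXT_KEYWORDS)
+ """, block="block", language="python"),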
  Details(
573
  Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
574
  DV(
 
577
  "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
578
  ),
579
  ),
580
+ H3("2.2 Other Rules from RefinedWeb"),
581
  P("""
582
  We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
 
 
 
 
583
  """),
584
+ Ul(
585
+ Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
586
+ Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
587
+ Li("the line matches the pattern “r'^\\d+\\s+likes$'”", style = "margin-bottom: 5px"),
588
+ Li("the line contains only one word", style = "margin-bottom: 5px"),
589
+ ),
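+ P("""
+ These four criteria can be sketched as a single line-level check (illustrative only; str.isupper() and str.isdigit()
+ are used here as simple approximations of the uppercase-only and numerical-only rules).
+ """),
+ D_code("""
+ import re
+ 
+ LIKES_PATTERN = re.compile(r"^\\d+\\s+likes$")
+ 
+ def should_remove_line(line):
+     stripped = line.strip()
+     return (
+         stripped.isupper()                            # only uppercase characters
+         or stripped.isdigit()                         # only numerical characters
+         or LIKES_PATTERN.match(stripped) is not None  # e.g. "23 likes"
+         or len(stripped.split()) == 1                 # only one word
+     )
+ """, block="block", language="python"),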
590
  Details(
591
  Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
592
  DV(
 
595
  "Sample documents with lines that are removed by the RefinedWeb rules",
596
  ),
597
  ),
598
+ H3("2.3 Toxic Lines"),
599
  P("""
600
  When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
601
  document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
 
611
  ),
612
  ),
613
 
614
+ H2("3. Document-Level Filtering"),
615
  P("""
616
+ In this section, we introduce each quality signal used to filter out low-quality documents.
617
+ """),
618
  Details(
619
  Summary("Overview of all the quality signals that are used for filtering"),
620
  DVS(
 
623
  ),
624
  ),
625
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
626
+ Most quality signals were initially introduced by Gopher [2] and subsequently adopted by later
627
  studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
628
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
629
  outcomes for the same quality signals.
630
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
631
+ and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
632
  """),
633
+ H3("3.1 Repetition-based Heuristics"),
634
  P("""
635
+ Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
636
  work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
637
  """),
638
+ H3("3.1.1 Fraction of (Characters in) Repeated Lines"),
639
  P("""
640
+ Following Gopher [2], we remove documents containing many short duplicate passages, as well as those with few,
641
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
642
  that are duplicates, and the fraction of characters contained within those duplicated passages.
643
  """),
 
698
  After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
699
  signals), we have made the following decisions:
700
  """),
701
+ H3("Passage Separation"),
702
  P("""
703
  Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
704
  symbol separating passages. Testing the splitting pattern "\\n{2,}" on 10,000 sample documents resulted in no more than
705
  one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
706
  opting instead to use a single newline symbol to segment the text into passages.
707
  """),
708
+ H3("First Occurrence"),
709
  P("""
710
  In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
711
  helps retain a larger number of documents.
712
  """),
713
+ H3("Character Count"),
714
  P("""
715
  We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
716
  ensures consistency with the overall document character count calculation.
717
  """),
718
+ H3("TxT360 Implementation"),
719
  Details(
720
  Summary("TxT360 Implementation"),
721
  D_code("""
 
743
  "Sample documents filtered by excessive line repetitions / characters in repeated lines",
744
  ),
745
  ),
746
+ H3("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
747
  P("""
748
  Following Gopher [2], we remove documents in which the most common n-grams account for a high portion of the characters. For each n ∈ (2, 3, 4), we calculate the
749
  fraction of characters contained within the most frequently-occurring n-gram.
 
828
  """, block="block", language="python"),
829
  ),
830
  P("""
831
+ There are almost no contradictions among the above implementations of the fraction of characters in the most common
832
  n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
833
  fraction is then determined by dividing the number of characters in the most common n-gram by the total number of
834
  characters. One minor difference is that Dolma and DataTrove calculate the fraction of the most common n-gram even
 
862
  "Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
863
  ),
864
  ),
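+ P("""
+ A condensed sketch of the shared logic described above is given below; it is illustrative only and omits the
+ implementation differences among Dolma, DataTrove, and RedPajama V2 discussed earlier.
+ """),
+ D_code("""
+ from collections import Counter
+ 
+ def top_ngram_char_fraction(words, n):
+     counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
+     if not counts:
+         return 0.0
+     top_ngram, occurrences = counts.most_common(1)[0]
+     total_chars = sum(len(w) for w in words)
+     # characters covered by all occurrences of the most common n-gram (whitespace excluded)
+     covered_chars = occurrences * sum(len(w) for w in top_ngram)
+     return covered_chars / total_chars if total_chars else 0.0
+ """, block="block", language="python"),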
865
+ H3("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
866
  P("""
867
  Following Gopher [2], we remove documents in which duplicated n-grams account for a high portion of the characters. For each n ∈ (5, ..., 10), we calculate the
868
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
 
1044
  "Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
1045
  ),
1046
  ),
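+ P("""
+ A condensed sketch of this signal is shown below; it marks every word position covered by a duplicated n-gram so that
+ overlapping n-grams are not counted twice. It is illustrative only, and pipelines differ on details such as whether
+ the first occurrence is counted.
+ """),
+ D_code("""
+ from collections import Counter
+ 
+ def duplicated_ngram_char_fraction(words, n):
+     ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
+     counts = Counter(ngrams)
+     covered = [False] * len(words)
+     for i, ngram in enumerate(ngrams):
+         if counts[ngram] > 1:
+             for j in range(i, i + n):
+                 covered[j] = True  # each word position is marked at most once
+     total_chars = sum(len(w) for w in words)
+     dup_chars = sum(len(w) for w, is_dup in zip(words, covered) if is_dup)
+     return dup_chars / total_chars if total_chars else 0.0
+ """, block="block", language="python"),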
1047
+ H3("3.2 Line-wise Heuristics"),
1048
  P("""
1049
  Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
1050
  RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
 
1125
  ),
1126
  ),
1127
 
1128
+ H3("3.3 Statistics-based Heuristics"),
1129
  P("We summarize other statistics-based rules originating from Gopher [2] in this section. The statistics used include:"),
1130
  Ul(
1131
  Li("the word count in the document", style = "margin-bottom: 5px"),
 
1144
  Li("the words that contain at least one alphabetic character make up less than 80% of all words", style = "margin-bottom: 5px"),
1145
  Li("it contains fewer than two of the stop words (the, be, to, of, and, that, have, with)", style = "margin-bottom: 5px"),
1146
  ),
1147
+ H3("Word Count"),
1148
  Details(
1149
  Summary("Implementations from Dolma"),
1150
  D_code("""
 
1202
  We decided to use simple `len(text.split())` to compute the word count.
1203
  """),
1204
 
1205
+ H3("Mean Word Length"),
1206
  P("""
1207
  There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
1208
  """),
 
1213
  mean_word_length = character_count / word_count
1214
  """, block="block", language="python"),
1215
  P("""
1216
+ It's worth noting that Dolma used the median word length instead of the mean:
1217
  """),
1218
  D_code("""
1219
  from statistics import median
1220
  median_word_length = median(len(word) for word in words)
1221
  """, block="block", language="python"),
1222
+ H3("Number of Sentences"),
1223
  P("""
1224
  The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
1225
  to split text into sentences.
 
1256
  """, block="block", language="python"),
1257
  ),
1258
 
1259
+ H3("Symbol to Word Ratio"),
1260
  P("""
1261
  Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
1262
  We calculate the ratio as the number of symbols divided by the total number of words.
 
1318
  """, block="block", language="python"),
1319
  ),
1320
 
1321
+ H3("Fraction of Alphabetic Words"),
1322
  Details(
1323
  Summary("Implementations from Dolma"),
1324
  D_code("""
 
1379
  attrs.num_of_stop_words = sum(1 for word in words if stop_words_pattern.search(word))
1380
 
1381
  """, block="block", language="python"),
1382
+ H3("TxT360 Implementation"),
1383
  Details(
1384
  Summary("Sample documents that are filtered out by statistics-based heuristics"),
1385
  DV(
 
1388
  "Sample documents that are filtered out by statistics-based heuristics",
1389
  ),
1390
  ),
1391
+ H3("3.4 Others"),
1392
  P("""
1393
  Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
1394
  text.
 
1398
  Summary("Sample documents containing 'lorem ipsum'"),
1399
  DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum'"),
1400
  ),
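+ P("""
+ The check itself is a simple case-insensitive substring match (illustrative sketch):
+ """),
+ D_code("""
+ def contains_lorem_ipsum(text):
+     return "lorem ipsum" in text.lower()
+ """, block="block", language="python"),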
1401
+ H2("4. Deduplication"),
1402
  P("""
1403
  After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
1404
  """), # Add detailed content and images as needed
 
1407
  P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
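+ P("""
+ A minimal sketch of exact deduplication with a Bloom filter is shown below. The filter parameters, hashing scheme, and
+ document keying here are illustrative assumptions, not the production configuration used in TxT360.
+ """),
+ Details(
+ Summary("Illustrative Bloom-filter exact deduplication sketch"),
+ D_code("""
+ import hashlib
+ 
+ class BloomFilter:
+     # minimal Bloom filter for illustration only
+     def __init__(self, size_in_bits=1 << 27, num_hashes=7):
+         self.size = size_in_bits
+         self.num_hashes = num_hashes
+         self.bits = bytearray(size_in_bits // 8)
+ 
+     def _positions(self, item):
+         for seed in range(self.num_hashes):
+             digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
+             yield int.from_bytes(digest[:8], "big") % self.size
+ 
+     def add(self, item):
+         for pos in self._positions(item):
+             self.bits[pos // 8] |= 1 << (pos % 8)
+ 
+     def __contains__(self, item):
+         return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
+ 
+ def exact_dedup(documents):
+     seen = BloomFilter()
+     for doc in documents:
+         key = hashlib.sha256(doc.encode()).hexdigest()
+         if key in seen:  # probable duplicate (Bloom filters allow rare false positives)
+             continue
+         seen.add(key)
+         yield doc
+ """, block="block", language="python"),
+ ),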
1408
  P(B("Global Fuzzy Deduplication")),
1409
  P("NEED TO UPDATE"),
1410
+ H2("5. PII Removal"),
1411
  P("..."), # Add detailed content and images as needed
1412
  )