victormiller
committed on
Commit
•
ad46307
1
Parent(s):
4d5ad99
Update web.py
web.py
CHANGED
@@ -499,7 +499,7 @@ def web_data():
|
|
499 |
|
500 |
|
501 |
P(B('"Word "Javascript"'), """
|
502 |
-
In C4
|
503 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
504 |
strict, which will filter out many lines that are really talking about “Javascript”.
|
505 |
"""),
|
@@ -526,7 +526,7 @@ def web_data():
|
|
526 |
""",
|
527 |
),
|
528 |
P(B("Other Rules from RefinedWeb: "), """
|
529 |
-
We also adopt rules from RefinedWeb
|
530 |
"""),
|
531 |
Ul(
|
532 |
Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
|
@@ -597,19 +597,19 @@ def web_data():
|
|
597 |
""",
|
598 |
),
|
599 |
P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
|
600 |
-
Most quality signals were initially introduced by Gopher
|
601 |
-
studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
|
602 |
of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
|
603 |
outcomes for the same quality signals.
|
604 |
-
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma
|
605 |
-
and RedPajama V2
|
606 |
"""),
|
607 |
P(B("Repetition-based Heuristics: "), """
|
608 |
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
609 |
-
work (
|
610 |
"""),
|
611 |
P(B("Fraction of Characters in Repeated Lines: "), """
|
612 |
-
Following Gopher
|
613 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
614 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
615 |
"""),
|
@@ -748,7 +748,7 @@ def web_data():
|
|
748 |
""",
|
749 |
),
|
750 |
P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
|
751 |
-
Following Gopher
|
752 |
fraction of characters contained within the most frequently-occurring n-gram.
|
753 |
"""),
|
754 |
Details(
|
@@ -911,7 +911,7 @@ def web_data():
|
|
911 |
""",
|
912 |
),
|
913 |
P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
|
914 |
-
Following Gopher
|
915 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
916 |
overlapping n-grams more than once.
|
917 |
"""),
|
@@ -1141,8 +1141,8 @@ def web_data():
|
|
1141 |
),
|
1142 |
P(B("Line-wise Heuristics: "), """
|
1143 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
1144 |
-
RefinedWeb
|
1145 |
-
works
|
1146 |
90% of lines start with a bullet point.
|
1147 |
"""),
|
1148 |
Details(
|
@@ -1247,7 +1247,7 @@ def web_data():
|
|
1247 |
),
|
1248 |
|
1249 |
P(B("Statistics-based Heuristics: "), """
|
1250 |
-
We summarize other statistics-based rules originated from Gopher
|
1251 |
"""),
|
1252 |
Ul(
|
1253 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|
|
|
499 |
|
500 |
|
501 |
P(B('"Word "Javascript"'), """
|
502 |
+
In C4,""", D_cite(bibtex_key="c4"), """the authors remove any line with the word "Javascript" since they found that many of the scraped
|
503 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
504 |
strict, which will filter out many lines that are really talking about “Javascript”.
|
505 |
"""),
|
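Reviewer note: the C4-style rule described in this hunk can be sketched as below, which also shows why the text calls it too strict. This is a minimal illustration, not the pipeline's code; the function name `drop_javascript_lines` and the case-insensitive match are assumptions.

```python
def drop_javascript_lines(text: str) -> str:
    """Drop every line that mentions 'javascript' (case-insensitive match is an assumption)."""
    kept = [line for line in text.splitlines() if "javascript" not in line.lower()]
    return "\n".join(kept)


sample = (
    "Please enable Javascript to view this page.\n"
    "Javascript closures capture variables from the enclosing scope.\n"
    "Hello world."
)
# The first line is boilerplate, but the second is legitimate content;
# the rule removes both.
cleaned = drop_javascript_lines(sample)
```

The second sample line is a genuine sentence about the language, yet it is removed together with the warning banner.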
|
|
526 |
""",
|
527 |
),
|
528 |
P(B("Other Rules from RefinedWeb: "), """
|
529 |
+
We also adopt rules from RefinedWeb """, D_cite(bibtex_key="refinedweb"), """ to remove lines if they satisfy any of the following criteria:
|
530 |
"""),
|
531 |
Ul(
|
532 |
Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
|
|
|
597 |
""",
|
598 |
),
|
599 |
P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
|
600 |
+
Most quality signals were initially introduced by Gopher """, D_cite(bibtex_key="gopher"), """ and subsequently adopted by later
|
601 |
+
studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
|
602 |
of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
|
603 |
outcomes for the same quality signals.
|
604 |
+
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma,""", D_cite(bibtex_key="dolma"), """ DataTrove, """, D_cite(bibtex_key="penedo2024datatrove"), """
|
605 |
+
and RedPajama V2, """, D_cite(bibtex_key="redpajama-v2"), """ and selected the most suitable method based on manual inspections.
|
606 |
"""),
|
607 |
P(B("Repetition-based Heuristics: "), """
|
608 |
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
609 |
+
work, """, D_cite(bibtex_key="gopher"), D_cite(bibtex_key="refinedweb"), D_cite(bibtex_key="dolma"), """ we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
|
610 |
"""),
|
611 |
P(B("Fraction of Characters in Repeated Lines: "), """
|
612 |
+
Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing mupltiple, short duplicate passages, as well as those with few,
|
613 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
614 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
615 |
"""),
|
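Reviewer note: one way to compute the two ratios described in this hunk is sketched below. It is an illustration under stated assumptions: passages are taken to be non-empty lines, and only the second and later occurrences of a line count as duplicates. The name `duplicate_line_fractions` is hypothetical, and other pipelines may count duplicates differently, which is the variation the surrounding text points out.

```python
def duplicate_line_fractions(text: str) -> tuple[float, float]:
    """Return (fraction of duplicate lines, fraction of characters in duplicate lines).

    Assumption: a line is a duplicate only from its second occurrence onward.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0, 0.0

    seen = set()
    dup_lines = 0
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            dup_lines += 1
            dup_chars += len(ln)
        else:
            seen.add(ln)

    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars


line_frac, char_frac = duplicate_line_fractions("a b\nc d\na b\na b")
```

Tracking both ratios catches documents with many short repeated lines as well as documents with a few long repeated ones.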
|
|
748 |
""",
|
749 |
),
|
750 |
P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
|
751 |
+
Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
|
752 |
fraction of characters contained within the most frequently-occurring n-gram.
|
753 |
"""),
|
754 |
Details(
|
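Reviewer note: the most-common-n-gram signal in this hunk could be computed roughly as follows. This is a sketch assuming whitespace tokenization and character counts that exclude spaces; `top_ngram_char_fraction` is a hypothetical name. Overlapping occurrences are each counted in full, so highly repetitive inputs can yield values above 1.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters covered by the single most frequent word n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram, count = Counter(ngrams).most_common(1)[0]

    # Characters in one occurrence of the top n-gram, times its frequency.
    chars_in_top = sum(len(w) for w in top_ngram) * count
    total_chars = sum(len(w) for w in words)
    return chars_in_top / total_chars


fractions = {n: top_ngram_char_fraction("the cat the cat sat", n) for n in (2, 3, 4)}
```

A document would then be dropped when the fraction for some n exceeds a per-n threshold; the thresholds themselves are not shown in this hunk.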
|
|
911 |
""",
|
912 |
),
|
913 |
P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
|
914 |
+
Following Gopher, we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
|
915 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
916 |
overlapping n-grams more than once.
|
917 |
"""),
|
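Reviewer note: the "do not count overlapping characters twice" requirement in this hunk can be met by marking word positions rather than summing per n-gram. The sketch below assumes whitespace tokenization and marks every occurrence of a repeated n-gram, including the first; `duplicate_ngram_char_fraction` is a hypothetical name and not the pipeline's implementation.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters inside word n-grams that occur more than once."""
    words = text.split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)

    # Mark each word position covered by a repeated n-gram; a position shared
    # by overlapping n-grams is still marked only once.
    covered = [False] * len(words)
    for i, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            for j in range(i, i + n):
                covered[j] = True

    dup_chars = sum(len(w) for w, is_dup in zip(words, covered) if is_dup)
    total_chars = sum(len(w) for w in words)
    return dup_chars / total_chars


fraction = duplicate_ngram_char_fraction("a b c d e x a b c d e y", 5)
```

Because coverage is tracked per position, the result never exceeds 1 even when the repeated n-grams overlap heavily.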
|
|
1141 |
),
|
1142 |
P(B("Line-wise Heuristics: "), """
|
1143 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
1144 |
+
RefinedWeb, we remove the document if the corrected lines represent more than 5% of words. In line with previous
|
1145 |
+
works, we remove the documents if more than 30% of the lines end with an ellipsis or more than
|
1146 |
90% of lines start with a bullet point.
|
1147 |
"""),
|
1148 |
Details(
|
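Reviewer note: the ellipsis and bullet-point thresholds stated in this hunk (30% and 90%) can be checked as sketched below. The bullet and ellipsis character sets and the name `fails_linewise_checks` are assumptions; the 5% corrected-lines rule is omitted because it depends on the line-correction step, which this hunk does not show.

```python
BULLET_PREFIXES = ("•", "-", "*")   # assumed set of bullet markers
ELLIPSIS_SUFFIXES = ("...", "…")    # assumed set of ellipsis forms


def fails_linewise_checks(text: str,
                          max_ellipsis_frac: float = 0.30,
                          max_bullet_frac: float = 0.90) -> bool:
    """Return True if the document should be removed under the two line-wise rules."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False

    # str.endswith / str.startswith accept a tuple of candidates.
    ellipsis_frac = sum(ln.endswith(ELLIPSIS_SUFFIXES) for ln in lines) / len(lines)
    bullet_frac = sum(ln.startswith(BULLET_PREFIXES) for ln in lines) / len(lines)
    return ellipsis_frac > max_ellipsis_frac or bullet_frac > max_bullet_frac


is_removed = fails_linewise_checks("- item one\n- item two\n- item three")
```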
|
|
1247 |
),
|
1248 |
|
1249 |
P(B("Statistics-based Heuristics: "), """
|
1250 |
+
We summarize other statistics-based rules originated from Gopher in this section. The statistics can be used include:
|
1251 |
"""),
|
1252 |
Ul(
|
1253 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|