victormiller
committed on
Commit
•
7602a54
1
Parent(s):
b143332
Update web.py
web.py
CHANGED
@@ -598,7 +598,7 @@ def web_data():
 ),
 P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
 Most quality signals were initially introduced by Gopher """, D_cite(bibtex_key="gopher"), """ and subsequently adopted by later
-studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """(
+studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """(. However, we observed that, despite following the same descriptions, the implementation
 of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
 outcomes for the same quality signals.
 In our pipeline, we referenced earlier implementations that were publicly available such as Dolma,""", D_cite(bibtex_key="dolma"), """ DataTrove, """, D_cite(bibtex_key="penedo2024datatrove"), """