victormiller committed on
Commit
7602a54
1 Parent(s): b143332

Update web.py

Files changed (1)
  1. web.py +1 -1
web.py CHANGED
@@ -598,7 +598,7 @@ def web_data():
  ),
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
  Most quality signals were initially introduced by Gopher """, D_cite(bibtex_key="gopher"), """ and subsequently adopted by later
- studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
+ studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """(. However, we observed that, despite following the same descriptions, the implementation
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
  outcomes for the same quality signals.
  In our pipeline, we referenced earlier implementations that were publicly available such as Dolma,""", D_cite(bibtex_key="dolma"), """ DataTrove, """, D_cite(bibtex_key="penedo2024datatrove"), """