victormiller committed
Commit 03ad039
1 parent: 6d51d72

Update web.py

Files changed (1)
  1. web.py +5 -2
web.py CHANGED
@@ -431,8 +431,11 @@ def web_data():
  """),
  P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),
 
-
- DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+ Details(
+ Summary("Open Me - WARC TEST"),
+ DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+ ),
+ #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
 
  H4("1.2 Language Identification"),
  P("""