victormiller committed on
Commit
c2f326c
1 Parent(s): ed640d3

Update web.py

  1. web.py +2 -2
web.py CHANGED
@@ -546,7 +546,7 @@ def web_data():
  CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
- """)
+ """),
 
  Details(
  Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
@@ -568,7 +568,7 @@ def web_data():
  In our pipeline, we
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
- """)
+ """),
  Details(
  Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
  DV(
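
Note on the second hunk: the docstring it touches describes refining the C4 "javascript" rule by requiring a second keyword. The following is a minimal sketch of how such a check might look, not the pipeline's actual code. The names `COMPANION_KEYWORDS`, `is_javascript_notice`, and `filter_lines` are hypothetical, and the choice to apply the check per line with case-insensitive substring matching is an assumption; the docstring only gives the keyword list.

```python
# Keywords listed in the docstring; the constant name is hypothetical.
COMPANION_KEYWORDS = ("enable", "disable", "require", "activate", "browser")


def is_javascript_notice(line: str) -> bool:
    """Return True if a line mentions "javascript" AND at least one companion keyword.

    Assumption: matching is case-insensitive and substring-based, so
    "enabled" or "browsers" also count as companion keywords.
    """
    lowered = line.lower()
    if "javascript" not in lowered:
        return False
    return any(keyword in lowered for keyword in COMPANION_KEYWORDS)


def filter_lines(text: str) -> str:
    """Drop lines flagged by is_javascript_notice and keep everything else."""
    kept = [line for line in text.splitlines() if not is_javascript_notice(line)]
    return "\n".join(kept)
```

Under these assumptions, a line such as "Please enable JavaScript in your browser." is dropped, while "JavaScript is a programming language." is kept, which is the false positive the refinement is meant to avoid.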