victormiller committed
Commit c2f326c • 1 Parent(s): ed640d3
Update web.py
web.py CHANGED
@@ -546,7 +546,7 @@ def web_data():
 CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
 of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
 documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
-""")
+"""),
 
 Details(
 Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
@@ -568,7 +568,7 @@ def web_data():
 In our pipeline, we
 propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
 The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
-""")
+"""),
 Details(
 Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
 DV(
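
The second hunk's text describes refining the C4 "javascript" rule so that a line is only dropped when "javascript" appears together with one more keyword. The following is a minimal sketch of that idea, not the pipeline's actual code: the function names are hypothetical, and both the line-level filtering and the case-insensitive substring matching (so "enable" also matches "enabled") are assumptions.

```python
# Companion keywords quoted in the text above; the constant name is hypothetical.
JS_COMPANION_KEYWORDS = ("enable", "disable", "require", "activate", "browser")


def is_javascript_warning(line: str) -> bool:
    """Return True if the line mentions "javascript" plus a companion keyword.

    Assumption: matching is case-insensitive and substring-based.
    """
    lowered = line.lower()
    if "javascript" not in lowered:
        return False
    return any(keyword in lowered for keyword in JS_COMPANION_KEYWORDS)


def drop_javascript_warnings(text: str) -> str:
    """Remove lines flagged by is_javascript_warning and keep all others.

    Assumption: the rule is applied line by line.
    """
    kept = [line for line in text.split("\n") if not is_javascript_warning(line)]
    return "\n".join(kept)
```

Under these assumptions, a line such as "Please enable JavaScript in your browser." is dropped, while "JavaScript is a programming language." is kept, which is the false positive the refinement is meant to avoid.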