victormiller committed
Commit • 073687e
1 Parent(s): 1e4c4a2
Update web.py
web.py CHANGED
@@ -442,15 +442,13 @@ def web_data():
 After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
 This step removes over 60% of the whole data.
 """),
-Details(
-Summary("Sample documents that are classified as non-English"),
-DV("data/sample_non_en.json", 3),
-),
 
-
-
-
-
+
+DV("data/sample_non_en.json", 3, "Sample documents that are classified as non-English"),
+
+
+DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
+
 
 H4("1.3 URL Filtering"),
 P("""
@@ -483,10 +481,9 @@ def web_data():
 "curated url domains that are excluded from our dataset",
 ),
 
-
-
-
-),
+
+DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+
 H3("2. Line-Level Removal"),
 P("""
 Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
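
Note on the language filter referenced in this diff: the two sample files (`sample_non_en.json` and `sample_en_low.json`) suggest that a document is dropped either because its predicted language is not English or because it is predicted as English with a score below 0.65. The sketch below illustrates that bucketing under stated assumptions. It is not the code in `web.py` or in the data pipeline. The names `bucket_document` and `stub_predict`, the `(language, score)` predictor shape, and the choice to keep a score exactly equal to the threshold are all hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical predictor shape: text -> (language code, confidence in [0, 1]).
# A real fastText language-ID model returns labels such as "__label__en"
# together with a probability, so a thin adapter would be needed in practice.
Predictor = Callable[[str], Tuple[str, float]]


def bucket_document(text: str, predict: Predictor, threshold: float = 0.65) -> str:
    """Return "keep", "non_en", or "en_low" for a single document.

    Assumption: a document is kept only when the predicted language is
    English and the score is at least `threshold`.
    """
    lang, score = predict(text)
    if lang != "en":
        return "non_en"   # would appear in a sample like sample_non_en.json
    if score < threshold:
        return "en_low"   # would appear in a sample like sample_en_low.json
    return "keep"


def stub_predict(text: str) -> Tuple[str, float]:
    """Stand-in predictor with canned answers, used only for illustration."""
    canned = {
        "The quick brown fox jumps over the lazy dog.": ("en", 0.98),
        "ok lol thx": ("en", 0.41),
        "Der schnelle braune Fuchs": ("de", 0.95),
    }
    return canned[text]


if __name__ == "__main__":
    for doc in (
        "The quick brown fox jumps over the lazy dog.",
        "ok lol thx",
        "Der schnelle braune Fuchs",
    ):
        print(bucket_document(doc, stub_predict), "|", doc)
```

Passing the predictor in as a callable keeps the thresholding logic testable without downloading a model. Swapping in a real language identifier only requires an adapter that returns the same `(language, score)` pair.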