victormiller committed
Commit 146aa07 • 1 Parent(s): 466af30
Update web.py
web.py
CHANGED
@@ -216,6 +216,8 @@ def web_data():
             style="margin-top: 20px;",
         ),
         H3("1. Document Preparation"),
+
+        button( Div(
         H4("1.1 Text Extraction"),
         P("""
            Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -224,7 +226,8 @@ def web_data():
            we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
            Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
         """),
-        DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+        DV2("data/sample_wet.json", "data/sample_warc.json", 3),), cls="collapsible"),
+
         H4("1.2 Language Identification"),
         P("""
            After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
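For context, the text-extraction step described in the diff above (reading raw WARC files with the warcio library and extracting text with trafilatura) could look roughly like the sketch below. The WARC path and the helper name are illustrative assumptions, not part of the commit.

# Sketch of the extraction step: iterate over response records in a raw WARC
# file with warcio and pull out the main text with trafilatura. The file path
# and function name are hypothetical.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_texts(warc_path="CC-MAIN-example.warc.gz"):
    texts = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # returns None when no main content is found
            if text:
                texts.append(text)
    return texts

Starting from WARC rather than WET is what the prose in the diff motivates: WET files already contain extracted text, but with navigation menus, ads, and other boilerplate left in.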