victormiller committed on
Commit b488013 (1 parent: 146aa07)

Update web.py

  1. web.py +1 -2
web.py CHANGED
@@ -217,7 +217,6 @@ def web_data():
   ),
   H3("1. Document Preparation"),
 
- button( Div(
   H4("1.1 Text Extraction"),
   P("""
   Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -226,7 +225,7 @@ def web_data():
   we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
   Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
   """),
- DV2("data/sample_wet.json", "data/sample_warc.json", 3),), cls="collapsible"),
+ DV2("data/sample_wet.json", "data/sample_warc.json", 3),
 
   H4("1.2 Language Identification"),
   P("""
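
The Text Extraction paragraph touched by this diff describes reading raw WARC files with warcio and extracting text with trafilatura. A minimal sketch of that flow might look like the following. The function names (`iter_html_responses`, `trafilatura_extract`, `extract_texts`) and the UTF-8 decoding choice are assumptions made for illustration and are not taken from this repository; the actual pipeline may filter records and configure trafilatura differently.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_html_responses(warc_path: str) -> Iterator[Tuple[Optional[str], bytes]]:
    """Yield (url, raw_body_bytes) for each HTTP response record in a WARC file."""
    # Third-party dependency, imported lazily: pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WARC files also hold request and metadata records; keep responses only.
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()


def trafilatura_extract(html: bytes) -> Optional[str]:
    """Extract main text from HTML bytes; returns None when nothing is found."""
    # Third-party dependency, imported lazily: pip install trafilatura
    import trafilatura

    # Assumption: decode as UTF-8 and replace undecodable bytes.
    return trafilatura.extract(html.decode("utf-8", errors="replace"))


def extract_texts(
    pages: Iterable[Tuple[Optional[str], bytes]],
    extract: Callable[[bytes], Optional[str]],
) -> Iterator[Tuple[Optional[str], str]]:
    """Apply an extractor to each page and drop pages with no extracted text."""
    for url, html in pages:
        text = extract(html)
        if text:
            yield url, text
```

Under these assumptions, usage would be `for url, text in extract_texts(iter_html_responses(path), trafilatura_extract): ...`, where `path` points to a local WARC file. Passing the extractor as a parameter keeps the filtering loop testable without either library installed.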