victormiller committed
Commit: 66a1161
Parent(s): 081c95c
Update web.py

web.py CHANGED
@@ -249,7 +249,7 @@ def web_data():
     P("This section provides a complete discussion on the filtering applied to the 99 Common Crawl snapshots that comprise the web data section of TxT360. The section is split into the following topic areas: "),
     Ul(
         Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
-        Li("Document
+        Li("Document Preparation", style = "margin-bottom: 5px"),
         Li("Line-Level Filtering", style = "margin-bottom: 5px"),
         Li("Local Deduplication", style = "margin-bottom: 5px"),
         Li("Each section is complete with code and comparisons to Dolma,", D_cite(bibtex_key="soldaini2024dolma"),
@@ -263,9 +263,9 @@ def web_data():
     H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
     P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
     table_div_filter_data,
-    P("The table below provides a comparison of the quality filters that have been applied to each dataset. Of note, TxT360 does not use any machine learning (ML) based filters. ML filters are a useful and
+    P("The table below provides a comparison of the quality filters that have been applied to each dataset. Of note, TxT360 does not use any machine learning (ML) based filters. ML filters are a useful and efficient filtering approach that should be considered for any filtering project; however, we leave this to future work."),
     table_div_qf_filter_data,
-    P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across
+    P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, a significantly higher rate than previous works, indicating a large number of duplicates across snapshots."),
     Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
     P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
     id="section2",),
@@ -278,9 +278,8 @@ def web_data():
     WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
     WET files contain plaintexts extracted by Common Crawl. In line with previous works""",D_cite(bibtex_key="thepile"),D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="gopher"),D_cite(bibtex_key="fineweb") ,""" ,
     we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
-    Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
     """),
-    P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),
+    P("We directly read WARC files with the warcio library instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),

     Details(
         Summary("Text Extraction Examples"),
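For reference, the paragraph updated in this hunk names warcio and trafilatura but does not show them in use. Below is a minimal sketch of that WARC-to-text step, not the TxT360 pipeline itself; the file path, the response-record filter, and the decoding choice are illustrative assumptions.

# Sketch: read a WARC file with warcio and extract main text with trafilatura.
# The warc_path value is a placeholder, not an actual TxT360 input.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def iter_extracted_texts(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # returns None when no main content is found
            if text:
                yield record.rec_headers.get_header("WARC-Target-URI"), text

# Example usage (placeholder file name):
# for url, text in iter_extracted_texts("CC-MAIN-example.warc.gz"):
#     print(url, len(text))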
@@ -338,7 +337,7 @@ def web_data():

     P(B("URL Filtering: "), """
     The following section details the decisions behind utilizing the UT1 blocklist. We chose to use the UT1 blocklist as a simple method for filtering
-    out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated
+    out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated data (e.g. wikipedia.org) to avoid duplication.
     """),

     P(B("URL Blocklist: "), """
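As a rough illustration of the blocklist-style URL filtering described in this hunk: a domain blocklist is loaded into a set and each document's hostname is checked against it, with curated sources (such as Wikipedia) excluded to avoid duplication. The file name, the curated-source set, and the suffix-matching rule below are assumptions, not the exact UT1/TxT360 configuration.

# Hedged sketch of domain-blocklist URL filtering.
from urllib.parse import urlparse

def load_blocklist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

CURATED_SOURCES = {"wikipedia.org"}  # curated data already ingested elsewhere

def keep_url(url, blocked_domains):
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # every suffix of the hostname, e.g. "a.b.com" -> {"a.b.com", "b.com", "com"}
    suffixes = {".".join(parts[i:]) for i in range(len(parts))}
    return not (suffixes & (blocked_domains | CURATED_SOURCES))

# blocked = load_blocklist("ut1_adult_domains.txt")   # placeholder path
# keep_url("https://en.wikipedia.org/wiki/Web_crawler", blocked)  # -> False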
@@ -579,7 +578,7 @@ def web_data():
     work, """, D_cite(bibtex_key="gopher"), D_cite(bibtex_key="refinedweb"), D_cite(bibtex_key="dolma"), """ we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
     """),
     P(B("Fraction of Characters in Repeated Lines: "), """
-    Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing
+    Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing multiple, short duplicate passages, as well as those with few,
     but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
     that are duplicates, and the fraction of characters contained within those duplicated passages.
     """),
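The repetition filter completed in this hunk is defined by two document-level statistics: the fraction of passages that are duplicated and the fraction of characters contained in those duplicated passages. The sketch below computes both; splitting passages on newlines is an assumption, and the actual cutoffs follow Gopher rather than anything shown here.

# Sketch of the two duplicate-passage statistics described above.
from collections import Counter

def duplicate_passage_fractions(text):
    passages = [p.strip() for p in text.split("\n") if p.strip()]
    if not passages:
        return 0.0, 0.0
    counts = Counter(passages)
    duplicated = [p for p in passages if counts[p] > 1]
    frac_passages = len(duplicated) / len(passages)
    frac_chars = sum(len(p) for p in duplicated) / sum(len(p) for p in passages)
    return frac_passages, frac_chars

A document would then be removed when either fraction exceeds the chosen threshold.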