omkarenator committed on
Commit
370837e
1 Parent(s): 2783986
Files changed (1)
  1. web.py +21 -14
web.py CHANGED
@@ -272,9 +272,10 @@ def web_data():
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
  """),
- Img(
- src="path/to/sample_terminal_punctuation_removed.png",
- alt="Sample documents with lines that are removed by the rule of terminal punctuation",
  ),
  H4('2.1 Word "Javascript"'),
  P("""
@@ -284,9 +285,10 @@ def web_data():
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
  """),
- Img(
- src="path/to/sample_javascript_removed_kept.png",
- alt="Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
  ),
  H4("2.2 Other Rules from RefinedWeb"),
  P("""
@@ -296,9 +298,10 @@ def web_data():
  - The line matches the pattern “r'^\\d+\\s+likes$'”,
  - The line contains only one word.
  """),
- Img(
- src="path/to/sample_refinedweb_rules_removed.png",
- alt="Sample documents with lines that are removed by the RefinedWeb rules",
  ),
  H4("2.3 Toxic Lines"),
  P("""
@@ -308,15 +311,19 @@ def web_data():
  line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
  the bad words from English but also consider the bad words from other languages.
  """),
- Img(
- src="path/to/sample_toxic_lines_removed.png",
- alt="Sample documents with toxic lines",
  ),
  H3("3. Document-Level Filtering"),
  P("""
  In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
- Overview of all the quality signals that are used for filtering.
- Similar to previous sections, we will present sample documents filtered out by the given quality signals.
  Most of these quality signals were initially introduced by Gopher [2] and subsequently adopted by later
  studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
 
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
  """),
+ view_data(
+ "data/sample_terminal_punc.json",
+ 0,
+ "Sample documents with lines that are removed by the rule of terminal punctuation",
  ),
  H4('2.1 Word "Javascript"'),
  P("""
 
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
  """),
+ view_data(
+ "data/sample_java.jsonl",
+ 0,
+ "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
  ),
  H4("2.2 Other Rules from RefinedWeb"),
  P("""
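Note on the refined "javascript" rule described in the paragraph of this hunk: the idea of requiring a second keyword next to "javascript" can be sketched as below. This is a minimal illustration only, not the pipeline's actual code. The function names, the case-insensitive substring matching, and the line-by-line removal are all assumptions; only the keyword list comes from the text.

```python
# Companion keywords listed in the paragraph; everything else here is assumed.
JS_COMPANION_KEYWORDS = ("enable", "disable", "require", "activate", "browser")


def is_javascript_notice(line: str) -> bool:
    """Return True when a line mentions "javascript" plus a companion keyword.

    Matching is a case-insensitive substring check (an assumption), so
    "enabled" also counts as a hit for "enable".
    """
    lowered = line.lower()
    if "javascript" not in lowered:
        return False
    return any(keyword in lowered for keyword in JS_COMPANION_KEYWORDS)


def drop_javascript_notices(text: str) -> str:
    """Remove flagged lines and keep every other line unchanged."""
    kept = [line for line in text.splitlines() if not is_javascript_notice(line)]
    return "\n".join(kept)
```

Under this sketch, a line such as "Please enable JavaScript to view this page." is removed, while "JavaScript is a programming language." is kept, which is the false positive the refinement is meant to avoid.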
 
  - The line matches the pattern “r'^\\d+\\s+likes$'”,
  - The line contains only one word.
  """),
+ view_data(
+ "data/sample_refinedweb_line.json",
+ 0,
+ "Sample documents with lines that are removed by the RefinedWeb rules",
  ),
  H4("2.3 Toxic Lines"),
  P("""
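Note on the two RefinedWeb line rules visible in this hunk: they could be expressed roughly as follows. This is a hedged sketch. The function name `should_drop_line`, the whitespace stripping, and the case-sensitive match are assumptions, and the rules listed earlier in the paragraph fall outside this hunk, so they are not covered here.

```python
import re

# Pattern quoted in the paragraph: a like counter such as "12 likes".
LIKES_COUNTER = re.compile(r"^\d+\s+likes$")


def should_drop_line(line: str) -> bool:
    """Return True when a line hits one of the two rules shown in the hunk."""
    stripped = line.strip()  # assumption: surrounding whitespace is ignored
    if LIKES_COUNTER.match(stripped):
        return True
    # A line made of exactly one whitespace-delimited word is dropped;
    # empty lines have zero words and are left alone in this sketch.
    return len(stripped.split()) == 1
```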
 
  line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
  the bad words from English but also consider the bad words from other languages.
  """),
+ view_data_static(
+ json.load(open("data/toxic_lines.json")),
+ "Sample documents with toxic lines",
  ),
  H3("3. Document-Level Filtering"),
  P("""
  In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
+ Overview of all the quality signals that are used for filtering."""),
+ view_data_static(
+ json.load(open("data/all_signals.json")),
+ "Overview of all the quality signals that are used for filtering",
+ ),
+ P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
  Most of these quality signals were initially introduced by Gopher [2] and subsequently adopted by later
  studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
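Note on the added `json.load(open(...))` calls: they leave closing the file handle to the garbage collector. One possible alternative, shown only as a sketch, is a small helper that reads the file inside a `with` block. The helper name `load_json` and the UTF-8 encoding are assumptions and are not part of this commit.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON file and close the handle as soon as parsing finishes."""
    # Assumption: the sample files are UTF-8 encoded.
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

With such a helper, the call in the diff could read `view_data_static(load_json("data/toxic_lines.json"), "Sample documents with toxic lines")`, assuming `view_data_static` accepts the parsed object exactly as it does in the committed code.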