victormiller committed on
Commit
860a948
1 Parent(s): 0f2f2a7

Update curated.py

Files changed (1)
  1. curated.py +3 -3
curated.py CHANGED
@@ -33,7 +33,7 @@ curated_sources_intro = Div(
  P(
  "Curated sources comprise high-quality datasets that contain domain-specificity.",
  B(
- " TxT360 was strongly influenced by The Pile regarding both inclusion of the dataset and filtering techniques."
+ " TxT360 was strongly influenced by The Pile", D_cite(bibtex_key="thepile"), " regarding both inclusion of the dataset and filtering techniques."
  ),
  " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
  ),
@@ -685,7 +685,7 @@ filtering_process = Div(
  ),
  P(
  B("Download and Extraction: "),
- "All the data was downloaded in original latex format from Arxiv official S3 dump ",
+ "All the data was downloaded in original latex format from ArXiv official S3 repo: ",
  A("s3://arxic/src", href="s3://arxic/src"),
  ". We try to encode the downloaded data into utf-8 or guess encoding using chardet library. After that pandoc was used to extract information from the latex files and saved as markdown format",
  D_code(
@@ -703,7 +703,7 @@ filtering_process = Div(
  ),
  P(
  B(" Filters Applied: "),
- "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset (citation needed)",
+ "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset", D_cite(bibtex_key="peS2o"),
  ),
  Ul(
  Li(
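
The "Download and Extraction" string in this diff describes decoding each downloaded file as utf-8 and otherwise guessing the encoding with chardet. The following is a minimal sketch of that decoding step only, not the project's actual code. The names `decode_latex_bytes` and `_chardet_guess` are hypothetical, the final `errors="replace"` fallback is an assumption not stated in the source, and the pandoc conversion step is left out.

```python
from typing import Callable, Optional


def _chardet_guess(raw: bytes) -> Optional[str]:
    """Guess an encoding with chardet, if the third-party package is installed."""
    try:
        import chardet  # assumed dependency: pip install chardet
    except ImportError:
        return None
    return chardet.detect(raw).get("encoding")


def decode_latex_bytes(
    raw: bytes,
    guess_encoding: Callable[[bytes], Optional[str]] = _chardet_guess,
) -> str:
    """Decode raw bytes as UTF-8, falling back to a guessed encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    encoding = guess_encoding(raw)
    if encoding:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # The guess was wrong or names a codec Python does not know.
            pass

    # Last resort (an assumption, not from the source): keep the text
    # and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

The guesser is passed in as a parameter so the fallback path can be exercised without chardet installed, for example `decode_latex_bytes(data, guess_encoding=lambda b: "latin-1")`.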