victormiller committed
Commit 4f82979
1 Parent(s): 600ab03

Update main.py

Files changed (1)
  1. main.py +12 -40
main.py CHANGED
@@ -34,7 +34,7 @@ def main():
     D_front_matter(),
     D_title(
         H1(
-            "TxT360: fully open and transparent fusion of web and curated corpora for pre-training large language models",
+            "TxT360: the most comprehensive, highest quality, and production ready pretraining dataset",
             cls="l-body",
             style="text-align: center;",
         ),
@@ -123,45 +123,17 @@ def intro():
     return Div(
         Section(
            H2("Introduction"),
-            P("""We are excited to introduce TxT360, a
-            large-scale, comprehensive, and fully transparent
-            dataset designed for Large Language Model (LLM)
-            pre-training. TxT360 is engineered to strike a
-            balance between the quantity and quality of
-            pre-training data, pushing the limit on both
-            fronts. This comprehensive dataset encompasses both
-            expansive web-based data and highly curated data
-            sources, making it one of the most robust LLM
-            pre-training corpora available today. Our web data
-            component includes 99 snapshots from Common Crawl,
-            amassing 5.7 trillion tokens and occupying 11 TB of
-            disk space in jsonl.gz format. On the curated side,
-            TxT360 integrates one of the most extensive
-            collections of high-quality sources across multiple
-            domains, ensuring diverse and rich content referred
-            to as curated sources, 14 sources across 10
-            domains. To maintain the highest quality, we
-            meticulously pre-processed the web data to filter
-            out low-quality content and conducted thorough
-            reviews of the curated sources. This process not
-            only unified their formats but also identified and
-            rectified any anomalies. Not only do we 100%
-            open-source our processing scripts, but we also
-            release the details of our data reviews, revealing
-            the decision-making processes behind data selection
-            and quality assurance. This level of transparency
-            allows researchers and practitioners to fully
-            understand the dataset’s composition and make
-            informed decisions when using TxT360 for training.
-            Additionally, TxT360 includes detailed
-            documentation and analysis of the data, covering
-            distribution statistics, domain coverage, and
-            processing pipeline, which helps users navigate and
-            utilize the dataset effectively. Overall, TxT360
-            represents a significant step forward in the
-            availability and transparency of large-scale
-            training data for language models, setting a new
-            standard for dataset quality and openness."""),
+            P("""Pretraining performant large language models (LLMs) requires trillions of tokens of high-quality data. Many prior works, including our previous pretraining projects Amber-7B, Crystal-7B, and K2-65B, have demonstrated how data curation is a ‘make-or-break’ decision for model quality and capability.
+
+            We present TxT360, the Trillion eXtracted Text corpus, a 5.7T-token dataset for pretraining projects that:
+
+
+            1. Curates commonly used pretraining datasets, including all CommonCrawl snapshots
+            2. Employs carefully selected filters designed for each data source
+            3. Provides only unique data elements via global deduplication across all datasets
+            4. Retains all deduplication metadata for custom upweighting
+            5. Is production ready! Download here [link to HF repo]
+            """),
             id="section1",
         ),
         Section(
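
Editor's note on points 3-5 of the new intro: the retained deduplication metadata is what enables custom upweighting after download. The following is a minimal sketch (not part of this commit) of what that workflow could look like; the repo id, the "train" split name, and the "dedup_count" field are placeholders and assumptions, so check the dataset card at the HF repo linked above for the actual names.

    # Hypothetical sketch: upweight documents using deduplication metadata.
    # Repo id, split name, and the "dedup_count" field are assumptions.
    from datasets import load_dataset

    TXT360_REPO = "org/TxT360"  # placeholder: use the HF repo linked in the intro

    ds = load_dataset(TXT360_REPO, split="train", streaming=True)

    def repeat_factor(example, cap=5):
        # Repeat a document in proportion to how many copies were collapsed
        # during global deduplication, capped to avoid over-sampling.
        return min(int(example.get("dedup_count", 1)), cap)

    for example in ds:
        for _ in range(repeat_factor(example)):
            pass  # yield example["text"] to the tokenizer / training loop here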