victormiller committed
Commit fc84524
1 Parent(s): 600ab03

Update main.py


Updated the wording and styling.
Still need to work on the ordered list for items 1-5.
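For that ordered-list TODO, one option is to replace the five numbered P() elements with a single ordered list. A minimal sketch, assuming FastHTML's standard Ol and Li components (exported from fasthtml.common alongside the Div, Section, H2/H3, and P components main.py already uses):

from fasthtml.common import Ol, Li

# A single Ol could replace intro_1 through intro_5; the browser renders the
# 1-5 numbering automatically, so the literal "1." ... "5." prefixes are no
# longer needed in the strings.
intro_items = Ol(
    Li("Curates commonly used pretraining datasets, including all of CommonCrawl"),
    Li("Employs carefully selected filters designed for each data source"),
    Li("Provides only unique data elements via global deduplication across all datasets"),
    Li("Retains all deduplication metadata for custom upweighting"),
    Li("Is production ready! Download here [link to HF repo]"),
)

In the Introduction Section, intro_1 through intro_5 would then be replaced by the single intro_items element.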

Files changed (1)
main.py  +44 -61
main.py CHANGED
@@ -118,55 +118,18 @@ def main():
     )
 
 
-@app.get("/intro")
-def intro():
-    return Div(
-        Section(
-            H2("Introduction"),
-            P("""We are excited to introduce TxT360, a
-            large-scale, comprehensive, and fully transparent
-            dataset designed for Large Language Model (LLM)
-            pre-training. TxT360 is engineered to strike a
-            balance between the quantity and quality of
-            pre-training data, pushing the limit on both
-            fronts. This comprehensive dataset encompasses both
-            expansive web-based data and highly curated data
-            sources, making it one of the most robust LLM
-            pre-training corpora available today. Our web data
-            component includes 99 snapshots from Common Crawl,
-            amassing 5.7 trillion tokens and occupying 11 TB of
-            disk space in jsonl.gz format. On the curated side,
-            TxT360 integrates one of the most extensive
-            collections of high-quality sources across multiple
-            domains, ensuring diverse and rich content referred
-            to as curated sources, 14 sources across 10
-            domains. To maintain the highest quality, we
-            meticulously pre-processed the web data to filter
-            out low-quality content and conducted thorough
-            reviews of the curated sources. This process not
-            only unified their formats but also identified and
-            rectified any anomalies. Not only do we 100%
-            open-source our processing scripts, but we also
-            release the details of our data reviews, revealing
-            the decision-making processes behind data selection
-            and quality assurance. This level of transparency
-            allows researchers and practitioners to fully
-            understand the dataset’s composition and make
-            informed decisions when using TxT360 for training.
-            Additionally, TxT360 includes detailed
-            documentation and analysis of the data, covering
-            distribution statistics, domain coverage, and
-            processing pipeline, which helps users navigate and
-            utilize the dataset effectively. Overall, TxT360
-            represents a significant step forward in the
-            availability and transparency of large-scale
-            training data for language models, setting a new
-            standard for dataset quality and openness."""),
-            id="section1",
-        ),
-        Section(
-            H2("Background"),
-            P(
+intro_text = P(
+    """Pretraining performant large language models (LLMs) requires trillions of tokens of high-quality data. Many prior works, including our previous pretraining projects Amber-7B, Crystal-7B, and K2-65B, have demonstrated how data curation is a ‘make-or-break’ decision for model quality and capability.""")
+
+intro_list = P("""We present TxT360, the Trillion eXtracted Text corpus, a 5.7T-token dataset for pretraining projects that:""")
+
+intro_1 = P("1. Curates commonly used pretraining datasets, including all of CommonCrawl")
+intro_2 = P("2. Employs carefully selected filters designed for each data source")
+intro_3 = P("3. Provides only unique data elements via global deduplication across all datasets")
+intro_4 = P("4. Retains all deduplication metadata for custom upweighting")
+intro_5 = P("5. Is production ready! Download here [link to HF repo]")
+
+previous_background = P(
     """ The quality and size of a pre-training dataset
     play a crucial role in the performance of large
     language models (LLMs). The community has
@@ -196,12 +159,8 @@ def intro():
     sources, TxT360 is crafted to meet and surpass the
     rigorous standards required for state-of-the-art
     LLM pre-training. """
-            ),
-            id="section2",
-        ),
-        Section(
-            H2("Main Content"),
-            P("""The performance of a large language model (LLM)
+)
+previous_content = P("""The performance of a large language model (LLM)
     depends heavily on the quality and size of its
     pretraining dataset. However, the pretraining
     datasets for state-of-the-art open LLMs like Llama
@@ -245,20 +204,44 @@
     data quality at scale, the 🍷 FineWeb recipe
     (listing and explaining all of our design choices),
     and the process followed to create its 📚
-    FineWeb-Edu subset."""),
+    FineWeb-Edu subset.""")
+
+@app.get("/intro")
+def intro():
+    return Div(
+        Section(
+            H2("Introduction"),
+            intro_text,
+            intro_list,
+            intro_1,
+            intro_2,
+            intro_3,
+            intro_4,
+            intro_5,
+            id="section1",
+        ),
+        Section(
+            H3("Global Deduplication"),
+            P("TxT360 curated a wide range of datasets, including a whopping 99 Common Crawl dumps and a list of high-quality datasets: StackExchange, Wikipedia, Arxiv, USPTO, DM Math, HackerNews, Ubuntu IRC, Europarl, FreeLaw, PG19, S2ORC, PhilPapers, PubMed Abstracts, and PubMed Central. For the first time in a released dataset, we deduplicated the data both locally within each dataset and globally across all datasets, creating the highest-quality data available."),
+            id="section2",
+        ),
+        Section(
+            H3("Main Content"),
+            P("In large-scale corpora like CommonCrawl, text duplication is a frequent occurrence. Duplication can be considered a natural upsampling of some data points. Recent studies have highlighted the potential drawbacks of oversampling specific data points, which can negatively impact pretraining performance [2205.10487]. However, when samples are repeated appropriately, performance can actually improve [2306.01116, 2305.16264, 2406.11794, FineWeb]. Despite this, there is currently no widely accepted best practice for data sampling, and it’s unlikely that a one-size-fits-all approach will emerge given the scale of these datasets. Previous work either leaves the deduplication process to the user (as seen in RedPajama V2 and DCLM-Pool) or provides a corpus that has been downsampled in a specific manner (such as in FineWeb and RefinedWeb)."),
+            P("Given the high cost of deduplication, TxT360 offers complete deduplication across all datasets (so you don’t have to). Additionally, TxT360 maintains detailed metadata for each sample, including the frequency and location of duplicates. This metadata gives pretrainers the flexibility to adjust the weight of samples as needed. In principle, one can recover the original dataset distribution (footnote: this approach also means a smaller size on disk). We will demonstrate a simple upsampling strategy that results in an effective pretraining dataset."),
             id="section3",
         ),
         Section(
-            H2("Conclusion"),
-            P("""This is the conclusion section where we
-            summarize the key points discussed in the blog post
-            and provide final thoughts."""),
+            H3("Full and Openly Documented Production-Ready Pretraining Corpus"),
+            P("We cover every aspect of the decisions made to produce the dataset, including document selection, filtering, quality assurance, deduplication, standardization, and PII. Our reasoning is thoroughly explained, ensuring transparency and replicability."),
+            P("Our code is open sourced here [link to github]."),
+            P("The dataset is ready for immediate download directly from Hugging Face [link]."),
+            P("In the remainder of this blog post, we will walk you through the entire process and the rationale behind each decision. Enjoy!"),
             id="section4",
         ),
         id="inner-text",
     )
 
-
 rt("/curated")(curated.curated)
 
 rt("/webdata")(web.web_data)