victormiller committed
Commit 4f82979
1 Parent(s): 600ab03

Update main.py

Files changed (1)
  1. main.py +12 -40
main.py CHANGED
@@ -34,7 +34,7 @@ def main():
     D_front_matter(),
     D_title(
         H1(
-            "TxT360: fully open and transparent fusion of web and curated corpora for pre-training large language models",
+            "TxT360: the most comprehensive, highest quality, and production ready pretraining dataset",
             cls="l-body",
             style="text-align: center;",
         ),
@@ -123,45 +123,17 @@ def intro():
     return Div(
         Section(
            H2("Introduction"),
-            P("""We are excited to introduce TxT360, a
-            large-scale, comprehensive, and fully transparent
-            dataset designed for Large Language Model (LLM)
-            pre-training. TxT360 is engineered to strike a
-            balance between the quantity and quality of
-            pre-training data, pushing the limit on both
-            fronts. This comprehensive dataset encompasses both
-            expansive web-based data and highly curated data
-            sources, making it one of the most robust LLM
-            pre-training corpora available today. Our web data
-            component includes 99 snapshots from Common Crawl,
-            amassing 5.7 trillion tokens and occupying 11 TB of
-            disk space in jsonl.gz format. On the curated side,
-            TxT360 integrates one of the most extensive
-            collections of high-quality sources across multiple
-            domains, ensuring diverse and rich content referred
-            to as curated sources, 14 sources across 10
-            domains. To maintain the highest quality, we
-            meticulously pre-processed the web data to filter
-            out low-quality content and conducted thorough
-            reviews of the curated sources. This process not
-            only unified their formats but also identified and
-            rectified any anomalies. Not only do we 100%
-            open-source our processing scripts, but we also
-            release the details of our data reviews, revealing
-            the decision-making processes behind data selection
-            and quality assurance. This level of transparency
-            allows researchers and practitioners to fully
-            understand the dataset’s composition and make
-            informed decisions when using TxT360 for training.
-            Additionally, TxT360 includes detailed
-            documentation and analysis of the data, covering
-            distribution statistics, domain coverage, and
-            processing pipeline, which helps users navigate and
-            utilize the dataset effectively. Overall, TxT360
-            represents a significant step forward in the
-            availability and transparency of large-scale
-            training data for language models, setting a new
-            standard for dataset quality and openness."""),
+            P("""Pretraining performant large language models (LLMs) requires trillions of tokens of high-quality data. Many prior works, including our previous pretraining projects Amber-7B, Crystal-7B, and K2-65B, have demonstrated how data curation is a ‘make-or-break’ decision for model quality and capability.
+
+            We present TxT360, the Trillion eXtracted Text corpus, a 5.7T-token dataset for pretraining projects that:
+
+
+            1. Curates commonly used pretraining datasets, including all CommonCrawl snapshots
+            2. Employs carefully selected filters designed for each data source
+            3. Provides only unique data elements via global deduplication across all datasets
+            4. Retains all deduplication metadata for custom upweighting
+            5. Is production ready! Download here [link to HF repo]
+            """),
             id="section1",
         ),
         Section(
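
Editor's note on points 3-5 of the new intro: the retained deduplication metadata is what enables custom upweighting after download. The following is a minimal sketch (not part of this commit) of what that workflow could look like; the repo id, the "train" split name, and the "dedup_count" field are placeholders and assumptions, so check the dataset card at the HF repo linked above for the actual names.

    # Hypothetical sketch: upweight documents using deduplication metadata.
    # Repo id, split name, and the "dedup_count" field are assumptions.
    from datasets import load_dataset

    TXT360_REPO = "org/TxT360"  # placeholder: use the HF repo linked in the intro

    ds = load_dataset(TXT360_REPO, split="train", streaming=True)

    def repeat_factor(example, cap=5):
        # Repeat a document in proportion to how many copies were collapsed
        # during global deduplication, capped to avoid over-sampling.
        return min(int(example.get("dedup_count", 1)), cap)

    for example in ds:
        for _ in range(repeat_factor(example)):
            pass  # yield example["text"] to the tokenizer / training loop here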