victormiller committed
Commit • 4f82979
1 Parent(s): 600ab03
Update main.py
main.py CHANGED
@@ -34,7 +34,7 @@ def main():
         D_front_matter(),
         D_title(
             H1(
-                "TxT360:
+                "TxT360: the most comprehensive, highest quality, and production ready pretraining dataset",
                 cls="l-body",
                 style="text-align: center;",
             ),
@@ -123,45 +123,17 @@ def intro():
     return Div(
         Section(
             H2("Introduction"),
-            P("""
-
-
-
-
-
-
-
-
-
-
-                amassing 5.7 trillion tokens and occupying 11 TB of
-                disk space in jsonl.gz format. On the curated side,
-                TxT360 integrates one of the most extensive
-                collections of high-quality sources across multiple
-                domains, ensuring diverse and rich content referred
-                to as curated sources, 14 sources across 10
-                domains. To maintain the highest quality, we
-                meticulously pre-processed the web data to filter
-                out low-quality content and conducted thorough
-                reviews of the curated sources. This process not
-                only unified their formats but also identified and
-                rectified any anomalies. Not only do we 100%
-                open-source our processing scripts, but we also
-                release the details of our data reviews, revealing
-                the decision-making processes behind data selection
-                and quality assurance. This level of transparency
-                allows researchers and practitioners to fully
-                understand the dataset’s composition and make
-                informed decisions when using TxT360 for training.
-                Additionally, TxT360 includes detailed
-                documentation and analysis of the data, covering
-                distribution statistics, domain coverage, and
-                processing pipeline, which helps users navigate and
-                utilize the dataset effectively. Overall, TxT360
-                represents a significant step forward in the
-                availability and transparency of large-scale
-                training data for language models, setting a new
-                standard for dataset quality and openness."""),
+            P("""Pretraining performant large language models (LLMs) requires trillions of tokens of high-quality data. Much prior work, including our previous pretraining projects Amber-7B, Crystal-7B, and K2-65B, has demonstrated that data curation is a ‘make-or-break’ decision for model quality and capability.
+
+            We present TxT360, the Trillion eXtracted Text corpus, a 5.7T-token dataset for pretraining projects that:
+
+
+            1. Curates commonly used pretraining datasets, including all of CommonCrawl
+            2. Employs carefully selected filters designed for each data source
+            3. Provides only unique data elements via global deduplication across all datasets
+            4. Retains all deduplication metadata for custom upweighting
+            5. Is production ready! Download here: [link to HF repo]
+            """),
             id="section1",
         ),
         Section(
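
Items 3 and 4 of the new introduction describe global deduplication across all sources, with the deduplication metadata retained so users can do custom upweighting. The commit does not show that pipeline, so the following is only a minimal sketch of the bookkeeping, assuming exact-match hashing over normalized text; the record layout and every name in it (global_dedup, dup_count, dup_locations) are hypothetical illustrations, not TxT360's actual schema.

import hashlib
from collections import defaultdict

def normalize(text: str) -> str:
    # Cheap normalization before hashing: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def global_dedup(records):
    """Keep one copy of each document across all sources, but remember
    every (source, doc_id) that carried it, so downstream users can
    upweight a document by its original duplicate count.
    `records` is an iterable of dicts with hypothetical keys
    'source', 'id', and 'text'."""
    seen = {}                      # content hash -> first record kept
    sightings = defaultdict(list)  # content hash -> all (source, id) pairs
    for rec in records:
        h = hashlib.sha256(normalize(rec["text"]).encode()).hexdigest()
        sightings[h].append((rec["source"], rec["id"]))
        if h not in seen:
            seen[h] = rec          # first sighting wins; later copies dropped
    for h, rec in seen.items():
        yield {
            **rec,
            "dup_count": len(sightings[h]),   # metadata for custom upweighting
            "dup_locations": sightings[h],    # where the duplicates lived
        }

# Two sources share one document; the surviving copy carries dup_count=2.
corpus = [
    {"source": "common_crawl", "id": "cc-1", "text": "The cat sat."},
    {"source": "wikipedia",    "id": "wk-9", "text": "the  cat sat."},
    {"source": "wikipedia",    "id": "wk-3", "text": "Something else."},
]
for doc in global_dedup(corpus):
    print(doc["id"], doc["dup_count"])

Retaining dup_count instead of silently dropping copies is what makes the custom upweighting of item 4 possible: at training time a user can, for example, sample each surviving document in proportion to how often it appeared across sources. At trillion-token scale an in-memory dict is of course not viable; this only illustrates the idea.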