StarCoder2 Data
The Stack v2 Training Data
This organization contains the full datasets used to train StarCoder2:
the-stack-v2-train-full: contains the training data with 600+ programming languages used to train StarCoder2-15B, with the files concatenated per repository
the-stack-v2-train-full-files: same as the-stack-v2-train-full but without repository concatenation, which makes filtering files or licenses easier
the-stack-v2-train-smol: contains the training data with 17 programming languages used to train StarCoder2-3B and 7B, with the files concatenated per repository
the-stack-v2-train-smol-files: same as the-stack-v2-train-smol but without repository concatenation, which makes filtering files or licenses easier
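To illustrate why the per-file variants are easier to filter, here is a minimal sketch on toy in-memory records. The field names (`repo`, `path`, `license`, `content`) and the sample values are hypothetical and are not the actual dataset schema; check the dataset viewer or the tech report for the real columns.

```python
from collections import defaultdict

# Hypothetical per-file records; field names and values are illustrative only
# and do not reflect the real schema of the "-files" datasets.
files = [
    {"repo": "org/a", "path": "main.py", "license": "MIT", "content": "print('a')\n"},
    {"repo": "org/a", "path": "vendor/lib.py", "license": "GPL-3.0", "content": "x = 1\n"},
    {"repo": "org/b", "path": "app.py", "license": "Apache-2.0", "content": "y = 2\n"},
]

# Assumed allow-list of licenses for this example.
ALLOWED = {"MIT", "Apache-2.0"}


def filter_files(records, allowed):
    """Keep only the records whose license is in the allow-list."""
    return [r for r in records if r["license"] in allowed]


def concatenate_per_repo(records):
    """Join file contents per repository, preserving input order."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["repo"]].append(r["content"])
    return {repo: "".join(parts) for repo, parts in grouped.items()}


# Filter first, while each file is still a separate record ...
kept = filter_files(files, ALLOWED)
# ... then build the per-repository documents from what remains.
repo_docs = concatenate_per_repo(kept)
```

With per-file records, a single file can be dropped before concatenation. Once files are joined into one document per repository, removing one file means splitting that document apart again.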
See the tech report for all the details on the dataset.