StarCoder2 Data

The Stack v2 Training Data

This organization contains the full datasets used to train StarCoder2:

  • the-stack-v2-train-full: the training data for StarCoder2-15B, covering 600+ programming languages, with files concatenated per repository
  • the-stack-v2-train-full-files: the same data as the-stack-v2-train-full, but without repository concatenation, which makes it easier to filter by file or license
  • the-stack-v2-train-smol: the training data for StarCoder2-3B and StarCoder2-7B, covering 17 programming languages, with files concatenated per repository
  • the-stack-v2-train-smol-files: the same data as the-stack-v2-train-smol, but without repository concatenation, which makes it easier to filter by file or license
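
To illustrate why the per-file variants are easier to filter, here is a minimal sketch that keeps only files matching a chosen language and license. It runs on a few in-memory records rather than the real data, and the field names `path`, `language`, and `license_type`, as well as the helper `filter_files`, are assumptions made for illustration; check the dataset viewer for the actual schema.

```python
from typing import Iterable, Iterator

# Hypothetical per-file records. The field names "path", "language", and
# "license_type" are assumptions for illustration, not the confirmed schema.
SAMPLE_FILES = [
    {"path": "a/main.py", "language": "Python", "license_type": "permissive"},
    {"path": "a/lib.rs", "language": "Rust", "license_type": "no_license"},
    {"path": "b/util.py", "language": "Python", "license_type": "no_license"},
]


def filter_files(
    records: Iterable[dict],
    languages: set[str],
    license_types: set[str],
) -> Iterator[dict]:
    """Yield records whose language and license type are both allowed."""
    for record in records:
        if (
            record["language"] in languages
            and record["license_type"] in license_types
        ):
            yield record


# Keep only permissively licensed Python files.
kept = list(filter_files(SAMPLE_FILES, {"Python"}, {"permissive"}))
print([record["path"] for record in kept])
```

When files are concatenated per repository, the same selection would first require splitting each repository back into its files, which is why the "-files" variants are the more convenient starting point for this kind of filtering.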

See the StarCoder2 tech report for full details on the datasets.
