CoLoR-filter

See accompanying code at: https://github.com/davidbrandfonbrener/color-filter-olmo

If you only want to download the filtered, untokenized data, see: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4

Usage

To download the data, we recommend using the huggingface-cli.

To download all the data, run huggingface-cli download hlzhang109/CoLoR-filter --local-dir YOUR_PATH.

This downloads the data to your Hugging Face cache and creates YOUR_PATH as a local directory containing symbolic links to the cached files. If you want the files themselves stored at YOUR_PATH, also set YOUR_PATH as the --cache-dir in the command.

WARNING: The data is large because it includes a copy of tokenized C4, which ensures that the selected data indices match the tokenized raw data. The C4 data is ~300GB and the rest of the repo is ~50GB, of which ~45GB is the 1.2B model and optimizer checkpoints.
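Given the sizes above, it may be worth checking free disk space before starting a full download. The following is a minimal sketch, not part of the released code: the helper name `has_room` and the constants are hypothetical, and the sizes are the approximate figures quoted above, treated as decimal gigabytes.

```python
import shutil

# Approximate sizes quoted in this README, in GB (assumed decimal gigabytes).
C4_GB = 300    # tokenized C4 copy
REST_GB = 50   # everything else, including ~45GB of checkpoints


def has_room(path, required_gb, free_bytes=None):
    """Return True if `path` has at least `required_gb` GB free.

    `free_bytes` can be passed explicitly (e.g. for testing); otherwise
    the free space of the filesystem containing `path` is queried.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    total_gb = C4_GB + REST_GB
    print(f"A full download needs roughly {total_gb} GB")
    print("Enough free space in the current directory:", has_room(".", total_gb))
```

Run it from the directory (or pass the path) where the data will actually be stored, since the cache and YOUR_PATH may live on different filesystems.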

If you only want to download some files (e.g. just the models), pass an --include pattern. For example, run huggingface-cli download hlzhang109/CoLoR-filter --local-dir YOUR_PATH --include "models/*".
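If you script the download, the commands above can be assembled programmatically. The sketch below only uses the flags already shown in this section (--local-dir, --include, --cache-dir); the function names `build_download_command` and `download` are hypothetical helpers, not part of this repo, and the script assumes huggingface-cli is installed and on your PATH.

```python
import subprocess

REPO_ID = "hlzhang109/CoLoR-filter"


def build_download_command(local_dir, include=None, cache_dir=None):
    """Build the huggingface-cli argument list for downloading this repo.

    include:   optional glob pattern, e.g. "models/*", to fetch a subset.
    cache_dir: optional cache location; set it to local_dir if you want
               the files stored there rather than symlinked.
    """
    cmd = ["huggingface-cli", "download", REPO_ID, "--local-dir", local_dir]
    if include:
        cmd += ["--include", include]
    if cache_dir:
        cmd += ["--cache-dir", cache_dir]
    return cmd


def download(local_dir, include=None, cache_dir=None):
    """Run the download; raises CalledProcessError if the CLI fails."""
    subprocess.run(build_download_command(local_dir, include, cache_dir), check=True)


if __name__ == "__main__":
    # Print the command for fetching only the models, without running it.
    print(" ".join(build_download_command("YOUR_PATH", include="models/*")))
```

Passing the arguments as a list avoids shell quoting issues with patterns such as "models/*", which the shell would otherwise try to expand.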

Citation

If you use this code in your research, please cite the following paper:

@article{brandfonbrener2024color,
  title={CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training},
  author={Brandfonbrener, David and Zhang, Hanlin and Kirsch, Andreas and Schwarz, Jonathan Richard and Kakade, Sham M},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}