CoLoR-filter
See accompanying code at: https://github.com/davidbrandfonbrener/color-filter-olmo
If you only want to download the filtered, untokenized data, see: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
Usage
To download the data, we recommend using the huggingface-cli.
To download all the data, run:
huggingface-cli download hlzhang109/CoLoR-filter --local-dir YOUR_PATH
This will download the data to your Hugging Face cache and create a local directory at YOUR_PATH containing symbolic links to the data. If you want the data itself stored at YOUR_PATH, pass YOUR_PATH as the --cache-dir in the command.
WARNING: The data is large because it includes a copy of tokenized C4, which ensures that the selected data indices match the tokenized raw data. The C4 data is ~300GB and the rest of the repo is ~50GB, of which ~45GB is the 1.2B model and optimizer checkpoints.
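Given the sizes quoted in the warning above, it may be worth checking free disk space before starting a full download. The following is a minimal sketch using only the Python standard library; the helper name has_enough_space and the two size constants are illustrative assumptions, not part of this repo.

```python
import shutil

# Approximate sizes quoted in the warning above, in GB.
C4_TOKENIZED_GB = 300
REST_OF_REPO_GB = 50


def has_enough_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Hypothetical helper for illustration; it is not part of the CoLoR-filter repo.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    needed_gb = C4_TOKENIZED_GB + REST_OF_REPO_GB
    ok = has_enough_space(".", needed_gb)
    print(f"Full download needs roughly {needed_gb} GB; enough space here: {ok}")
```

Replace "." with the directory you plan to pass as --local-dir or --cache-dir, since that is the filesystem that will hold the data.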
If you only want to download some files (e.g. just the models), use the --include option of the CLI. For example:
huggingface-cli download hlzhang109/CoLoR-filter --local-dir YOUR_PATH --include "models/*"
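If you would rather script the download from Python, the same selection can be sketched with snapshot_download from the huggingface_hub library, whose allow_patterns argument plays the role of --include. This is an unofficial sketch: the helper names build_download_kwargs and download are assumptions introduced here, and the repo name and pattern are copied from the commands above.

```python
from typing import Optional

REPO_ID = "hlzhang109/CoLoR-filter"  # repo name taken from the CLI commands above


def build_download_kwargs(local_dir: str, include: Optional[str] = None) -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    Hypothetical helper: `include` mirrors the CLI's --include glob pattern.
    """
    kwargs = {"repo_id": REPO_ID, "local_dir": local_dir}
    if include is not None:
        # allow_patterns restricts the download to files matching the glob(s).
        kwargs["allow_patterns"] = [include]
    return kwargs


def download(local_dir: str, include: Optional[str] = None) -> str:
    """Download the repo (or a subset of it) and return the local folder path."""
    # Imported lazily so the helper above works without the library installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(**build_download_kwargs(local_dir, include))
```

For example, download("YOUR_PATH", include="models/*") would fetch only the files under models/, while download("YOUR_PATH") would fetch everything, so the size warning above applies.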
Citation
If you use this code in your research, please cite the following paper:
@article{brandfonbrener2024color,
title={CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training},
author={Brandfonbrener, David and Zhang, Hanlin and Kirsch, Andreas and Schwarz, Jonathan Richard and Kakade, Sham M},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2024}
}