metadata

datasets:
  - BAAI/CCI3-HQ
  - BAAI/CCI3-Data
  - Skywork/SkyPile-150B

CCI3-HQ-Intermediate-Checkpoints

In the CCI3.0-HQ tech report, we compared datasets directly through end-to-end pre-training experiments. Performance was evaluated on both the final and the intermediate checkpoints of two experiments: the Mixed Dataset Experiment and the Chinese Dataset Experiment.

To monitor and compare how each dataset performs throughout training, we saved an intermediate checkpoint approximately every 20 billion training tokens. This lets us track how performance changes as training progresses.
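As a rough illustration of the checkpoint interval described above, the sketch below lists the token counts at which checkpoints would fall. The function name `checkpoint_milestones` and the total budget of 100 billion tokens are assumptions made for the example, not values stated in this card, and the real checkpoints are only approximately on these boundaries.

```python
from typing import List


def checkpoint_milestones(total_tokens_b: int, interval_b: int = 20) -> List[int]:
    """Return the token counts (in billions) at which a checkpoint would be saved.

    Both arguments are in billions of tokens. The 20B default mirrors the
    approximate interval mentioned in this card; the total is supplied by the caller.
    """
    if interval_b <= 0:
        raise ValueError("interval_b must be positive")
    return list(range(interval_b, total_tokens_b + 1, interval_b))


# Hypothetical training budget of 100B tokens (an assumption for illustration).
milestones = checkpoint_milestones(100)
print(milestones)
```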

Below, we list all checkpoints from the models trained in the comparison experiments. The suffix "-zh" marks checkpoints from the Chinese Dataset Experiment, and the suffix "-mix" marks checkpoints from the Mixed Dataset Experiment.
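The suffix convention above can be applied programmatically when sorting checkpoints by experiment. The following is a minimal sketch: the helper names and the example checkpoint names are hypothetical and only illustrate the "-zh" / "-mix" rule, so substitute the actual checkpoint names from this repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Suffix-to-experiment mapping taken from the naming convention in this card.
SUFFIX_TO_EXPERIMENT = {
    "-zh": "Chinese Dataset Experiment",
    "-mix": "Mixed Dataset Experiment",
}


def experiment_for(name: str) -> Optional[str]:
    """Return the experiment a checkpoint name belongs to, or None if no suffix matches."""
    for suffix, experiment in SUFFIX_TO_EXPERIMENT.items():
        if name.endswith(suffix):
            return experiment
    return None


def group_by_experiment(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group checkpoint names by experiment, skipping names without a known suffix."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        experiment = experiment_for(name)
        if experiment is not None:
            groups[experiment].append(name)
    return dict(groups)


# Hypothetical checkpoint names, used only to demonstrate the suffix rule.
example_names = ["dataset-a-20B-zh", "dataset-a-20B-mix", "dataset-b-40B-zh"]
print(group_by_experiment(example_names))
```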

Citation Information

You can cite our paper:

@misc{wang2024cci30hqlargescalechinesedataset,
      title={CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models}, 
      author={Liangdong Wang and Bo-Wen Zhang and Chengwei Wu and Hanyu Zhao and Xiaofeng Shi and Shuhao Gu and Jijie Li and Quanyue Ma and TengFei Pan and Guang Liu},
      year={2024},
      eprint={2410.18505},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.18505}, 
}