Inquiry on Open-Sourcing Model Checkpoints

#3
by Mars2050 - opened

I hope this message finds you well. Sincerest gratitude to your team for your invaluable contributions to the open-source community.

I am particularly interested in the model training process and was curious to know if there are any plans to open-source the training checkpoints as well. The availability of such checkpoints would be a tremendous asset, allowing for a deeper understanding and further improvement of the models.

If these have already been made available, could you please guide me on how to access them? Your assistance in this regard would be highly appreciated.

LLM360 org

The intermediate checkpoints are already available. For example, you can use model = LlamaForCausalLM.from_pretrained("LLM360/Amber", revision="ckpt_100") to get the 100th checkpoint. Each checkpoint is stored in a separate branch of https://huggingface.co/LLM360/Amber/tree/main
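
In case it helps others, here is a minimal sketch (not an official LLM360 script) that enumerates the checkpoint branches with huggingface_hub and loads one of them; the "ckpt_" prefix is assumed from the branch layout described above:

```python
# Minimal sketch: list Amber's checkpoint branches and load one revision.
# Assumes `transformers` and `huggingface_hub` are installed; the "ckpt_"
# branch-name prefix is inferred from the repo layout described above.
from huggingface_hub import list_repo_refs
from transformers import LlamaForCausalLM, LlamaTokenizer

refs = list_repo_refs("LLM360/Amber")
ckpt_branches = sorted(
    (b.name for b in refs.branches if b.name.startswith("ckpt_")),
    key=lambda name: int(name.split("_")[1]),
)
print(f"found {len(ckpt_branches)} checkpoint branches, e.g. {ckpt_branches[:3]}")

# Load weights and tokenizer from the same revision (here, the 100th checkpoint).
model = LlamaForCausalLM.from_pretrained("LLM360/Amber", revision="ckpt_100")
tokenizer = LlamaTokenizer.from_pretrained("LLM360/Amber", revision="ckpt_100")
```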

@Tianhua thank you for sharing the intermediate checkpoints! I'm trying to figure out what the checkpoint integer (e.g. 100) means. Could you please clarify? Are the checkpoints equally spaced throughout pretraining, or something else?

@Tianhua Could you please clarify what the integer in an intermediate checkpoint name represents? I am trying to understand whether it can be read as "1259.13B/359", i.e. the total pretraining tokens split evenly across the checkpoint intervals. In other words, are the checkpoints equally spaced throughout the pretraining tokens? I appreciate your assistance with this matter.

LLM360 org

Hi @RylanSchaeffer , @Chasell, yes, these checkpoints are equally spaced with respect to training steps, which means they are also equally spaced with respect to training tokens. For Amber, we have 360 equally spaced checkpoints.

This might be a bit clearer if you look at our W&B page: https://wandb.ai/llm360.

For example, here is the model's performance on OpenBookQA at each checkpoint; at checkpoint 100 the score is 37.2. By specifying revision="ckpt_100", you get the model exactly as it was at that point in training.
[Screenshot: OpenBookQA score per checkpoint, from the W&B page]
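
For anyone who wants to run a sweep like this locally, here is a rough sketch that loops over a few revisions and computes a quick perplexity on a text sample. It is only an illustration: the W&B scores come from a proper benchmark harness, and the checkpoint indices below are just examples.

```python
# Rough sketch: sweep a few checkpoints and compute a toy perplexity.
# Not the evaluation setup behind the W&B curves; indices are examples only.
# device_map="auto" requires the `accelerate` package.
import math
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

sample = "The quick brown fox jumps over the lazy dog. " * 50

for i in [100, 200, 300]:
    rev = f"ckpt_{i}"
    tokenizer = LlamaTokenizer.from_pretrained("LLM360/Amber", revision=rev)
    model = LlamaForCausalLM.from_pretrained(
        "LLM360/Amber", revision=rev, torch_dtype=torch.bfloat16, device_map="auto"
    )
    ids = tokenizer(sample, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    print(f"{rev}: perplexity ~ {math.exp(loss.item()):.2f}")
    del model  # release memory before loading the next revision
```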

Hi, could you explain approximately how many sentences or tokens were trained on between each checkpoint, please? If you trained on the entire training data, I guess this would be the training data size divided by the number of checkpoints.

Thanks for a very useful resource!

LLM360 org

Yes, exactly, the training data size divided by the number of checkpoints.

So here is the full AmberDatasets used exactly in training; each jsonl file corresponds to one checkpoint. I haven't counted the tokens myself exactly, but it should be easy to do from these files:

https://huggingface.co/datasets/LLM360/AmberDatasets/tree/main/train
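
A back-of-envelope check, plus a small counting sketch. The local filename and the "token_ids" field below are assumptions; inspect one record of the actual files and adjust to the real schema.

```python
# Back-of-envelope: ~1259.13B pretraining tokens spread over 360 equally spaced
# checkpoints is roughly 1259.13e9 / 360 ~ 3.5B tokens per checkpoint interval.
#
# Exact counting sketch for one AmberDatasets train chunk. The filename and the
# "token_ids" field are assumptions: inspect one record and adjust before
# relying on the number.
import json

total_tokens = 0
with open("train_100.jsonl") as f:  # hypothetical local copy of one chunk
    for line in f:
        record = json.loads(line)
        total_tokens += len(record["token_ids"])  # adjust key to the real schema
print(f"tokens in this chunk: {total_tokens:,}")
```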
