Post
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
UCLA-AGI has proven that large language models, even weaker large language models, can improve themselves with data only produced by original model. The question they answer in their paper is:
"Can we empower a weak LLM to improve itself without acquiring additional human annotated data?"
They answer this question by the proposal and testing of a novel fine-tuning method they call Self-Play fIne-tuNing (SPIN). The process starts by applying a supervised fine-tune (SFT) to zephyr-7b using all 200k samples of HuggingfaceH4/ultrachat_200k to eliminate the need for a human annotator.
Once the model has completed SFT, the SPIN method suggests generating 50k samples of synthetic data pairs of 'chosen' and 'rejected' samples. The model will be fine tuned on those generations, and this process will repeat for another 3 iterations for a total 200k samples.
This experiment is unique because they propose that their method can yield upwards of 10% performance gains without using any additional human annotated data. The strategy was designed to improve the less strong language models, but with further experimentation could be a formidable strategy for improving language models.
If you would like to explore this strategy for yourself, here are some resources:
Colab: https://colab.research.google.com/drive/1IjDeNVBsRru2-hM_9aauD6gVI-VvJnpk?usp=sharing
Github: https://github.com/uclaml/SPIN
The product of the experiment: UCLA-AGI/zephyr-7b-sft-full-SPIN-iter3
Paper: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)
UCLA-AGI has proven that large language models, even weaker large language models, can improve themselves with data only produced by original model. The question they answer in their paper is:
"Can we empower a weak LLM to improve itself without acquiring additional human annotated data?"
They answer this question by the proposal and testing of a novel fine-tuning method they call Self-Play fIne-tuNing (SPIN). The process starts by applying a supervised fine-tune (SFT) to zephyr-7b using all 200k samples of HuggingfaceH4/ultrachat_200k to eliminate the need for a human annotator.
Once the model has completed SFT, the SPIN method suggests generating 50k samples of synthetic data pairs of 'chosen' and 'rejected' samples. The model will be fine tuned on those generations, and this process will repeat for another 3 iterations for a total 200k samples.
This experiment is unique because they propose that their method can yield upwards of 10% performance gains without using any additional human annotated data. The strategy was designed to improve the less strong language models, but with further experimentation could be a formidable strategy for improving language models.
If you would like to explore this strategy for yourself, here are some resources:
Colab: https://colab.research.google.com/drive/1IjDeNVBsRru2-hM_9aauD6gVI-VvJnpk?usp=sharing
Github: https://github.com/uclaml/SPIN
The product of the experiment: UCLA-AGI/zephyr-7b-sft-full-SPIN-iter3
Paper: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)