Curious about Fine-tuning Methods
Hello there, just a simple curiosity: what factors led you to choose (or end up with) training for 2 epochs over your data?
Regarding the number of epochs specifically, it's mostly based on my earlier training runs of other models on this same dataset. The eval loss also flatlined (though it might go lower for at least 1 more epoch), and the base model is quite amazing; I did not want to overwrite all of its other capabilities, so to speak (even though my dataset is diverse and has data beyond just writing).
So TL;DR: mostly vibes -- I can't afford to test things properly atm (do a sweep of different hyperparameters and compare based on end-to-end side-by-side evaluation).
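For what it's worth, the "eval loss flatlined" heuristic can be automated as a simple early-stopping check instead of fixing an epoch count up front. This is a minimal sketch, not the author's actual setup; the function name, thresholds, and loss values below are illustrative assumptions.

```python
# Sketch of a plateau check on eval loss (illustrative, not the author's code).
# Stops training once eval loss has failed to improve meaningfully for a
# few consecutive evaluations.

def should_stop(eval_losses, patience=2, min_delta=0.01):
    """Return True once eval loss has failed to improve by at least
    `min_delta` for `patience` consecutive evaluations."""
    if len(eval_losses) <= patience:
        return False
    best_so_far = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    # Plateau: every recent eval is within min_delta of the best earlier loss.
    return all(loss > best_so_far - min_delta for loss in recent)

# Hypothetical run where the loss flatlines after the third evaluation:
history = [1.20, 0.85, 0.70, 0.695, 0.694]
print(should_stop(history))  # prints True
```

Libraries like Hugging Face Transformers ship an equivalent `EarlyStoppingCallback`, which is usually preferable to hand-rolling this in a real training loop.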
Definitely agree on the quality of the base model; the thought had occurred to me that too much fine-tuning might lose some of its best aspects, such as instruction following. Looking forward to seeing potential improvements in the future, but I totally understand the cost of training. Thanks for getting it out so quickly!
Why does train/loss stop decreasing after reaching a certain stage?