Could you share details about the pre-trained model?
Hi, thanks for sharing this great work!
What kind of pre-trained model did you use?
I would like to know whether you used a Japanese corpus during pre-training.
If so, I would be glad if you could let me know how many tokens you trained on.
Here's the model I pre-trained: NilanE/tinyllama-relora-merge
However, it's not particularly good. You'd likely get similar results if you did an SFT run on top of base tinyllama.
It was trained for about 6 hours total on an A5000 using relora and axolotl, which is a miserably small amount.
The dataset is ~400 MB (not sure of the exact token count) of English and Japanese fanfiction. I added the English fanfiction to avoid catastrophic forgetting; it makes up about 1/8 of the total.
The dataset is based on RyokoAI/Syosetu711K for the Japanese portion and RyokoAI/ScribbleHub17K for the English. I did some quality filtering and regex stuff, but not very much overall.
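For anyone curious, a minimal sketch of that kind of filtering, plus a rough way to estimate the token count, could look like the following. The `text` column name, the filter thresholds, and the TinyLlama checkpoint ID are assumptions for illustration, not the exact pipeline I ran:

```python
# Rough sketch only: regex cleanup, simple quality filtering, and a
# bytes-per-token estimate. Column name "text" and all thresholds are
# placeholder assumptions, not the actual pipeline.
import re

from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the Japanese source dataset so nothing has to be fully downloaded.
ds = load_dataset("RyokoAI/Syosetu711K", split="train", streaming=True)

def looks_ok(example):
    # Toy heuristics: drop very short entries and ones that are mostly ASCII
    # (i.e. probably not Japanese prose).
    text = example["text"]
    if len(text) < 500:
        return False
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    return ascii_ratio < 0.5

blank_runs = re.compile(r"\n{3,}")

def clean(example):
    # Example regex cleanup: collapse long runs of blank lines.
    example["text"] = blank_runs.sub("\n\n", example["text"])
    return example

filtered = ds.filter(looks_ok).map(clean)

# Estimate tokens with the TinyLlama tokenizer on a small sample, then
# extrapolate from the bytes-per-token ratio (very approximate).
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
sample_bytes = sample_tokens = 0
for i, ex in enumerate(filtered):
    if i >= 200:
        break
    sample_bytes += len(ex["text"].encode("utf-8"))
    sample_tokens += len(tok(ex["text"])["input_ids"])

bytes_per_token = sample_bytes / max(sample_tokens, 1)
print(f"~{bytes_per_token:.2f} bytes/token -> a 400 MB corpus is roughly "
      f"{400e6 / bytes_per_token / 1e6:.0f}M tokens")
```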
The pre-training is the weakest link in the chain, though, and I believe it's holding the final model back by a lot. If I had the funds to do it again, I'd use a lot more data and mix in a lot of English literature to teach the model creative writing, which would help with the overly literal translations it produces, among other things.
Also, check out NilanE/tinyllama-en_ja-translation-v3. It's massively improved over v2 in every way (it still uses the same base model, though).
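If you want to give it a quick try, a minimal loading sketch looks something like this; the prompt string below is just a placeholder, the actual translation prompt template is whatever the model card specifies:

```python
# Minimal sketch for a quick test of the v3 model. The prompt string is a
# placeholder; use the prompt template from the model card instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NilanE/tinyllama-en_ja-translation-v3"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # fill in with the model card's template plus your Japanese text
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```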
Thanks for sharing the training details! Everything is clear to me now.
If I were in the same situation as you, with the same GPU resources and corpus, I would probably take the same approach.
I found your v3 model after leaving this message. Great work!