BramVanroy posted an update (Mar 23)
Does anyone have experience with finetuning Gemma? Even the 2B variant feels more memory-heavy than Mistral 7B. I know that its vocabulary is much larger (~250k tokens), but I'm still surprised that the maximum batch size I can fit on an A100 80GB is only 2, whereas I could fit 4 with Mistral 7B, even though Gemma is much smaller apart from the embedding layer. Both runs used Flash Attention, the same sequence length, and the same DeepSpeed ZeRO-3 settings. Oh, and yes, I'm using the most recent hotfix of transformers that solves a memory issue with Gemma and other models.
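Part of the gap is plausibly the embedding matrix and, even more so, the per-step logits tensor, which scales with the vocabulary size. A rough back-of-the-envelope sketch (the hidden sizes, batch size, and sequence length below are assumptions for illustration, not the exact run settings):

```python
# Rough back-of-the-envelope numbers, assuming Gemma 2B uses a ~256k vocabulary
# with hidden size 2048 and Mistral 7B a 32k vocabulary with hidden size 4096;
# batch size and sequence length below are illustrative.

def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in the input embedding matrix alone."""
    return vocab_size * hidden_size

def logits_bytes(batch: int, seq_len: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Size of a single logits tensor (often upcast to fp32 for the loss)."""
    return batch * seq_len * vocab_size * bytes_per_value

for name, vocab, hidden in [("gemma-2b", 256_000, 2048), ("mistral-7b", 32_000, 4096)]:
    emb = embedding_params(vocab, hidden)
    logits = logits_bytes(batch=2, seq_len=2048, vocab_size=vocab)
    print(f"{name}: {emb / 1e6:.0f}M embedding params, "
          f"{logits / 2**30:.1f} GiB per logits tensor")
```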

Any prior experience you can share, or suggestions to improve throughput?
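For concreteness, this is roughly the kind of setup being described (Flash Attention 2 via transformers, DeepSpeed ZeRO-3, batch size 2); the model id, output directory, and DeepSpeed config path are illustrative assumptions, not the exact configuration used:

```python
# Minimal sketch of the setup described above: Gemma 2B with Flash Attention
# and a DeepSpeed ZeRO-3 config. Model id, output dir, and the ZeRO-3 JSON
# path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_id = "google/gemma-2b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # "FA" in the post
)

training_args = TrainingArguments(
    output_dir="gemma-2b-sft",         # placeholder
    per_device_train_batch_size=2,     # the max that fit on an A100 80GB here
    bf16=True,
    deepspeed="ds_zero3.json",         # hypothetical ZeRO-3 config file
)
```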

I got some weird results, and since there are a lot of other models in that performance-to-parameter range, I just didn't try any further.


What kind of weird results? In terms of the loss, or in the actual qualitative output?