Does anyone have experience with finetuning Gemma? Even the 2B variant feels more memory-heavy than Mistral 7B. I know that its vocabulary is much larger (250k tokens), but I'm a bit surprised that the max batch size I can fit on an A100 80GB is only 2, whereas I could fit 4 with Mistral 7B, even though Gemma is much smaller except for the embedding layer. Both runs used flash attention, the same sequence length, and the same DeepSpeed ZeRO-3 settings. Oh, and yes, I'm using the most recent hotfix of transformers that solves a memory issue with Gemma and others.
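For reference, here's a minimal sketch of the kind of setup I'm describing. The model id, ZeRO-3 config path, and hyperparameters are placeholders, not my exact values:

```python
# Rough sketch of the finetuning setup described above.
# Assumptions: "google/gemma-2b" as the model, "ds_zero3.json" as the
# DeepSpeed ZeRO-3 config path, and placeholder hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # the "FA" mentioned above
)

args = TrainingArguments(
    output_dir="gemma-2b-finetune",
    per_device_train_batch_size=2,   # max that fits on an A100 80GB for me
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="ds_zero3.json",       # same ZeRO-3 settings as the Mistral run
)

# trainer = Trainer(model=model, args=args, train_dataset=my_dataset, tokenizer=tokenizer)
# trainer.train()
```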
Any prior experience you can share, or suggestions to improve throughput?