Post
Gemma QLoRA finetuning is now 2.4x faster and uses 58% less VRAM than FA2 through 🦥Unsloth! We had to rewrite our Cross Entropy Loss kernels to work on all vocab sizes, redesign our manual autograd engine to accept all activation functions, and more! I wrote up everything we learned in our blog post: https://unsloth.ai/blog/gemma
We also have a Colab notebook with no OOMs, which includes 2x faster inference for Gemma and shows how to merge and save to llama.cpp GGUF & vLLM: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing
And we uploaded 4bit pre-quantized versions for Gemma 2b and 7b: unsloth/gemma-2b-bnb-4bit and unsloth/gemma-7b-bnb-4bit. Getting started takes just a few lines:
from unsloth import FastLanguageModel

# Load Gemma 7b in 4bit and attach LoRA adapters for QLoRA finetuning
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-7b", load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model)
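For a rough sketch of what the Colab notebook then does with the model and tokenizer above: finetune with TRL's SFTTrainer, then merge/export for llama.cpp and vLLM. The tiny dataset, hyperparameters, and output paths below are placeholders, and the save_pretrained_gguf / save_pretrained_merged argument names are my recollection of Unsloth's saving helpers, so double-check them against the notebook:

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Placeholder dataset -- swap in your own data with a "text" column
dataset = Dataset.from_dict({"text": [
    "### Instruction:\nSay hi.\n\n### Response:\nHi!",
] * 64})

trainer = SFTTrainer(
    model=model,                  # the LoRA-wrapped model from above
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        max_steps=60,             # placeholder; tune for your data
        learning_rate=2e-4,
        logging_steps=10,
        fp16=True,                # use bf16=True on Ampere+ GPUs
    ),
)
trainer.train()

# Export: GGUF for llama.cpp, merged 16bit weights for vLLM
model.save_pretrained_gguf("gemma-7b-ft", tokenizer, quantization_method="q4_k_m")
model.save_pretrained_merged("gemma-7b-ft-merged", tokenizer, save_method="merged_16bit")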