Post
Gemma QLoRA finetuning is now 2.4x faster and uses 58% less VRAM than FA2 through 🦥Unsloth! We had to rewrite our Cross Entropy Loss kernels to work on all vocab sizes, redesign our manual autograd engine to accept all activation functions, and more! I wrote up everything we learned in our blog post: https://unsloth.ai/blog/gemma
We also have a Colab notebook with no OOMs, which includes 2x faster inference for Gemma and shows how to merge and save to llama.cpp GGUF & vLLM: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing
And we uploaded 4bit pre-quantized versions for Gemma 2b and 7b: unsloth/gemma-2b-bnb-4bit and unsloth/gemma-7b-bnb-4bit. Getting started takes just a few lines:
from unsloth import FastLanguageModel

# Load Gemma 7b in 4bit and attach LoRA adapters for QLoRA finetuning
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-7b", load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model)
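For a rough sketch of what the Colab notebook then does with the model and tokenizer above: finetune with TRL's SFTTrainer, then merge/export for llama.cpp and vLLM. The tiny dataset, hyperparameters, and output paths below are placeholders, and the save_pretrained_gguf / save_pretrained_merged argument names are my recollection of Unsloth's saving helpers, so double-check them against the notebook:

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Placeholder dataset -- swap in your own data with a "text" column
dataset = Dataset.from_dict({"text": [
    "### Instruction:\nSay hi.\n\n### Response:\nHi!",
] * 64})

trainer = SFTTrainer(
    model=model,                  # the LoRA-wrapped model from above
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        max_steps=60,             # placeholder; tune for your data
        learning_rate=2e-4,
        logging_steps=10,
        fp16=True,                # use bf16=True on Ampere+ GPUs
    ),
)
trainer.train()

# Export: GGUF for llama.cpp, merged 16bit weights for vLLM
model.save_pretrained_gguf("gemma-7b-ft", tokenizer, quantization_method="q4_k_m")
model.save_pretrained_merged("gemma-7b-ft-merged", tokenizer, save_method="merged_16bit")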