LoRA-only GGUF
Hi @mlabonne and thank you for your work.
In case you didn't know, I recently refactored llama.cpp to improve its support for LoRA adapters. A script was also added to convert a PEFT adapter to GGUF.
It would be nice if the abliterated LoRA could also have a GGUF version. The benefit would be a much smaller distributed model size: for example, a rank=32 adapter for llama-3 (in f16) weighs only 176MB (see here), and the q8_0 version is only about half of that.
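For reference, the conversion is a single command with the new script in llama.cpp. A rough sketch, assuming a local llama.cpp checkout and placeholder paths for the base model and the PEFT adapter directory:

```bash
# Sketch only: run from a llama.cpp checkout; both paths below are placeholders.
# The positional argument is the PEFT adapter directory, --base points to the
# original HF base model, and --outtype q8_0 quantizes the adapter weights.
python convert_lora_to_gguf.py ./abliterated-lora-adapter \
    --base ./Meta-Llama-3-70B-Instruct \
    --outtype q8_0
```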
I'd be happy to help if you need. Thank you.
Hey @ngxson, I've never quantized a LoRA adapter with GGUF/llama.cpp, but here is the repo with the LoRA adapter only: https://huggingface.co/mlabonne/Llama-3-70B-Instruct-abliterated-LORA
I'm a bit confused: the link above is llama-3, not llama-3.1, right?
In any case, the format looks good (although I don't have good bandwidth to download the base model right now; I'll try this later).
In the near future, I'll make something like gguf-my-repo but for converting LoRA adapters; hopefully that will simplify the conversion.
Yes, it is. I provide more details in the model card.
That'd be cool!
Hey @mlabonne, the tool I mentioned for converting a PEFT model to GGUF is here: https://huggingface.co/blog/ngxson/gguf-my-lora
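Once the adapter is converted, it can be loaded on top of the base GGUF model at runtime. A minimal sketch (both file names are placeholders for your own converted GGUFs):

```bash
# Apply the quantized LoRA adapter to the base model at inference time.
# File names below are placeholders.
./llama-cli -m Meta-Llama-3-70B-Instruct-Q4_K_M.gguf \
    --lora Llama-3-70B-Instruct-abliterated-q8_0.gguf \
    -p "Hello"
```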
Could you give it a try? Thank you!