LoRA-only GGUF

#4 by ngxson

Hi @mlabonne and thank you for your work.

In case you didn't know, I recently refactored llama.cpp to improve support for LoRA adapters. A script was also added to convert a PEFT adapter to GGUF.
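
For reference, here is a rough sketch of that conversion, assuming a local clone of llama.cpp and a locally downloaded copy of the base model in HF format. The paths are placeholders, and the exact flags can differ between llama.cpp versions, so check `python convert_lora_to_gguf.py -h` first:

```python
# Sketch: convert a PEFT LoRA adapter to GGUF with llama.cpp's convert_lora_to_gguf.py.
# All paths below are placeholders; verify the flag names against your llama.cpp checkout.
import subprocess

LLAMA_CPP_DIR = "llama.cpp"                   # local clone of llama.cpp (assumption)
BASE_MODEL_DIR = "Meta-Llama-3-8B-Instruct"   # base model in HF format (assumption)
LORA_DIR = "my-abliterated-lora"              # PEFT adapter dir with adapter_config.json (assumption)

subprocess.run(
    [
        "python", f"{LLAMA_CPP_DIR}/convert_lora_to_gguf.py",
        LORA_DIR,
        "--base", BASE_MODEL_DIR,
        "--outtype", "q8_0",                  # or f16
        "--outfile", "abliterated-lora-q8_0.gguf",
    ],
    check=True,
)
```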

It would be nice if the abliterated LoRA could also have a GGUF version. The benefit would be a much smaller distributed model size. For example, a rank-32 adapter for llama-3 in f16 weighs only 176 MB (see here), and the q8_0 version is about half of that.
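
As a quick sanity check on that number, here is a back-of-the-envelope estimate, assuming a rank-32 adapter over all attention and MLP projections of an 8B-class Llama-3 model (the shapes below are the published Llama-3-8B dimensions):

```python
# Rough estimate of the f16 size of a rank-32 LoRA for Llama-3-8B.
# A LoRA pair for a (d_out x d_in) weight adds r * (d_in + d_out) parameters.
r = 32
n_layers, d_model, d_ff, d_kv = 32, 4096, 14336, 1024  # Llama-3-8B shapes

per_layer = (
    2 * r * (d_model + d_model)   # q_proj, o_proj
    + 2 * r * (d_model + d_kv)    # k_proj, v_proj (grouped-query attention)
    + 3 * r * (d_model + d_ff)    # gate_proj, up_proj, down_proj
)
total_params = n_layers * per_layer
print(f"{total_params / 1e6:.0f}M params, ~{total_params * 2 / 1e6:.0f} MB in f16")
# ~84M params and ~168 MB in f16, roughly in line with the figure quoted above;
# q8_0 stores about one byte per weight, hence roughly half the f16 size.
```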

I'd be happy to help if needed. Thank you.

Owner

Hey @ngxson, I've never quantized a LoRA adapter with GGUF/llama.cpp, but here is the repo with the LoRA adapter only: https://huggingface.co/mlabonne/Llama-3-70B-Instruct-abliterated-LORA

I'm a bit confused: the link above is llama-3, not llama-3.1, right?

Anyway, the format looks good (although I don't have good bandwidth to download the base model right now; I'll try this later).

In the near future, I'll make something like gguf-my-repo but for converting LoRA adapters; hopefully that will simplify the conversion.

Owner

Yes, it is. I provide more details in the model card.

That'd be cool!

Hey @mlabonne, the tool I mentioned for converting a PEFT model to GGUF is here: https://huggingface.co/blog/ngxson/gguf-my-lora

Could you give it a try? Thank you!
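
Once the adapter is in GGUF, it can be applied on top of a base-model GGUF at load time. Here is a minimal sketch using llama-cli's `--lora` option; the file names are placeholders, so check `llama-cli --help` in your build for the exact flags:

```python
# Sketch: run a base-model GGUF with the converted LoRA adapter applied at load time.
# File names are placeholders (assumptions), not actual released artifacts.
import subprocess

subprocess.run(
    [
        "./llama-cli",
        "-m", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # base model GGUF (placeholder)
        "--lora", "abliterated-lora-q8_0.gguf",        # adapter GGUF from the converter (placeholder)
        "-p", "Write a short haiku about llamas.",
        "-n", "64",
    ],
    check=True,
)
```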
