Transformers

You are viewing v4.45.1 version. A newer version v4.46.2 is available.

GGUF and interaction with Transformers

The GGUF file format is used to store models for inference with GGML and other libraries that depend on it, like the very popular llama.cpp or whisper.cpp.

It is a file format supported by the Hugging Face Hub with features allowing for quick inspection of tensors and metadata within the file.

This file format is designed as a “single-file-format”, where a single file usually contains the configuration attributes, the tokenizer vocabulary and other attributes, as well as all tensors to be loaded in the model. These files come in different formats according to the quantization type of the file. We briefly go over some of them here.
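As a rough illustration of why a single file like this is easy to inspect, the sketch below parses the fixed-size header at the start of a GGUF file. It assumes the version 2/3 layout from the GGUF specification (a 4-byte magic, a little-endian uint32 version, then uint64 tensor and metadata key-value counts). `read_gguf_header` is a hypothetical helper name, not part of Transformers or the gguf package, and the header bytes in the usage lines are synthetic.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumed v2/v3 layout)."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # "<IQQ": little-endian uint32 version, uint64 tensor count,
    # uint64 metadata key-value count, starting right after the magic.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; a real file would be read with
# open(path, "rb").read(24) instead.
example = GGUF_MAGIC + struct.pack("<IQQ", 3, 201, 23)
print(read_gguf_header(example))
```

The metadata key-value pairs and tensor descriptors follow this header, which is what allows tools to list tensors and configuration without loading the weights.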

Support within Transformers

We have added the ability to load GGUF files in Transformers in order to offer further training and fine-tuning capabilities for GGUF models, before converting those models back to GGUF for use within the GGML ecosystem. When loading a model, we first dequantize it to fp32, and then load the weights for use in PyTorch.
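To make the dequantization step more concrete, here is a minimal pure-Python sketch for one of the simpler types, Q8_0. It assumes the usual Q8_0 block layout (one fp16 scale followed by 32 int8 values, 34 bytes per block). Both function names are hypothetical, the quantizer is a simplified stand-in that exists only to produce test data, and this is not how Transformers performs the conversion internally.

```python
import struct

QK8_0 = 32               # values per Q8_0 block
BLOCK_BYTES = 2 + QK8_0  # fp16 scale (2 bytes) + 32 int8 values


def quantize_q8_0_block(values):
    """Pack 32 floats into one Q8_0-style block (simplified, for test data)."""
    if len(values) != QK8_0:
        raise ValueError("a Q8_0 block holds exactly 32 values")
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 0.0
    qs = [round(v / scale) if scale else 0 for v in values]
    # "<e32b": little-endian fp16 scale followed by 32 signed bytes
    return struct.pack("<e32b", scale, *qs)


def dequantize_q8_0(data: bytes):
    """Expand Q8_0 blocks back to Python floats: value = scale * q."""
    if len(data) % BLOCK_BYTES:
        raise ValueError("data length is not a multiple of the block size")
    out = []
    for offset in range(0, len(data), BLOCK_BYTES):
        scale, *qs = struct.unpack_from("<e32b", data, offset)
        out.extend(scale * q for q in qs)
    return out


original = [i / 10 - 1.6 for i in range(QK8_0)]
restored = dequantize_q8_0(quantize_q8_0_block(original))
print(max(abs(a - b) for a, b in zip(original, restored)))  # small, non-zero error
```

The round trip is lossy: dequantizing yields fp32-sized values, but it does not recover the precision that was discarded when the file was quantized.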

[!NOTE] The support is still very exploratory and we welcome contributions in order to solidify it across quantization types and model architectures.

For now, here are the supported model architectures and quantization types:

Supported quantization types

The initially supported quantization types were chosen based on the quantized files most commonly shared on the Hub.

F32
F16
BF16
Q4_0
Q4_1
Q5_0
Q5_1
Q8_0
Q2_K
Q3_K
Q4_K
Q5_K
Q6_K
IQ1_S
IQ1_M
IQ2_XXS
IQ2_XS
IQ2_S
IQ3_XXS
IQ3_S
IQ4_XS
IQ4_NL

[!NOTE] To support gguf dequantization, gguf>=0.10.0 installation is required.
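One way to check this requirement before loading a file is sketched below, using only the standard library. The helper names `meets_minimum` and `installed_gguf_version` are hypothetical, and the version parsing is deliberately simple: it compares the first three numeric components and ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, minimum=(0, 10, 0)) -> bool:
    """Compare the leading numeric components of a version string."""
    parts = []
    for piece in installed.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad "1.2" to (1, 2, 0)
        parts.append(0)
    return tuple(parts) >= tuple(minimum)


def installed_gguf_version():
    """Return the installed gguf version string, or None if it is missing."""
    try:
        return version("gguf")
    except PackageNotFoundError:
        return None


found = installed_gguf_version()
if found is None or not meets_minimum(found):
    print("gguf>=0.10.0 is required; try: pip install -U 'gguf>=0.10.0'")
else:
    print(f"gguf {found} is recent enough")
```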

Supported model architectures

For now, the supported model architectures are those that have been most popular on the Hub, namely:

LLaMa
Mistral
Qwen2
Qwen2Moe
Phi3

Example usage

To load GGUF files in Transformers, pass the gguf_file argument to the from_pretrained methods of both tokenizers and models. Here is how one would load a tokenizer and a model, both of which can be loaded from the exact same file:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Hub repository and the GGUF checkpoint inside it
model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# The tokenizer and the weights are both read from the same GGUF file
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

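Now you have access to a full-precision (dequantized) version of the model in the PyTorch ecosystem, where you can combine it with a plethora of other tools. Note that dequantizing does not recover the precision lost when the file was originally quantized.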

To convert back to a GGUF file, we recommend using the convert_hf_to_gguf.py script from llama.cpp (named convert-hf-to-gguf.py in older checkouts).

Here’s how you would extend the script above to save the model and export it back to GGUF:

tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert_hf_to_gguf.py ${directory}
