Loading AutoTokenizer and AutoModelForCausalLM
First of all, thank you for converting this to GGUF; it has been a huge help on my LLM learning journey.
I already have this working with LlamaCpp, but I found this post, and apparently I can run these files directly with the transformers library.
But when I try to load them with the Auto classes, I run into the following issues:
AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
model_id, gguf_file=filename, low_cpu_mem_usage=True
)
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BartTokenizer'.
The class this function is called from is 'GPT2TokenizerFast'.
Traceback (most recent call last):
File "test.py", line 11, in <module>
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 899, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2110, in from_pretrained
return cls._from_pretrained(
^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2336, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 100, in __init__
super().__init__(
File "./lib/python3.12/site-packages/transformers/tokenization_utils_fast.py", line 120, in __init__
tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 81, in load_gguf_checkpoint
reader = GGUFReader(gguf_checkpoint_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/gguf/gguf_reader.py", line 85, in __init__
self.data = np.memmap(path, mode = mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/numpy/core/memmap.py", line 229, in __new__
f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
model_id, gguf_file=filename, low_cpu_mem_usage=True
)
./lib/python3.12/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Converting and de-quantizing GGUF tensors...: 0%| | 0/291 [00:00<?, ?it/s]
Converting and de-quantizing GGUF tensors...: 100%|██████████| 291/291 [00:29<00:00, 9.96it/s]
Traceback (most recent call last):
File "test.py", line 11, in <module>
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./lib/python3.12/site-packages/transformers/modeling_utils.py", line 4059, in _load_pretrained_model
raise ValueError(
ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?
It looks like the binary is missing some configuration? Am I supposed to provide it in some way, or am I just dumb and this isn't supported?
I would really appreciate any help! And again, thank you for the GGUF version!
Fascinating.. I've never heard of this before, thanks for sharing!
I assume that you're using one of the supported types?
For what it's worth, it LOOKS like the only real purpose of this is to get an "unquantized" version that you can use elsewhere, is that correct? "Now you have access to the full, unquantized version of the model in the PyTorch ecosystem, where you can combine it with a plethora of other tools."
If so, are you just doing this as an experiment for fun, or is there a reason you don't want to use the original safetensors variant?
Would the unquantized full version from the GGUF be as precise as using the full weights to begin with, though? I guess this would only be useful for stuff like the leaked Miqu that only came as GGUF, but with access to the original weights I don't really see the point.
Yeah, I think this is just for the case where you only have access to the GGUF and want the full safetensors, and it likely won't be as accurate.
If you download an f16/f32/bf16 GGUF, then you'll have the same accuracy as the full safetensors already.
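As a rough sketch of that GGUF-to-safetensors round trip (assuming a transformers version with GGUF support; the repo ID and file name below are placeholders, not this model's actual files):

# Rough sketch: load a GGUF through transformers (tensors are de-quantized
# on load) and save the result back out as regular safetensors.
# model_id and filename are placeholders, not this model's actual files.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-user/some-model-GGUF"   # placeholder repo
filename = "some-model.f16.gguf"         # placeholder GGUF file

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

# Writes full-precision weights plus the tokenizer files to a local folder.
model.save_pretrained("dequantized-model")
tokenizer.save_pretrained("dequantized-model")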
I assume that you're using one of the supported types?
Yes, I was using Q8_0.
If so, are you just doing this as an experiment for fun, or is there a reason you don't want to use the original safetensors variant?
I just wanted to do inference through the transformers library, mainly to use the tokenizer with langchain. I was just unaware of what quantization meant; now I know. Thanks!
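For anyone else who lands here, a minimal sketch of that inference path through transformers once the GGUF loads (again, the repo ID and file name are placeholders):

# Minimal sketch: plain transformers inference after a GGUF load.
# model_id and filename are placeholders, not this model's actual files.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "some-user/some-model-GGUF"   # placeholder repo
filename = "some-model.Q8_0.gguf"        # placeholder GGUF file

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Hello, my name is", max_new_tokens=20)[0]["generated_text"])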
BTW, you are missing the Mathematical Reasoning Mode template: Math Correct User: 10.3 − 7988.8133=<|end_of_turn|>Math Correct Assistant:
Not that I care about it, but it could be useful to someone else. Looking at the python-llama-cpp code, I noticed the server is ready to pull the Jinja2 template from the metadata and use it with just a command argument:
--chat_format CHAT_FORMAT
Chat format to use.
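On the transformers side, a template like that can also be attached by hand when it is missing from the metadata. Here is a minimal sketch; the base tokenizer and the Jinja2 string are illustrative, not the model's official template:

# Sketch: manually attaching a chat template when the metadata lacks one.
# The base tokenizer and the Jinja2 template below are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "Math Correct User: {{ message['content'] }}<|end_of_turn|>"
    "{% else %}"
    "Math Correct Assistant: {{ message['content'] }}<|end_of_turn|>"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Math Correct Assistant:{% endif %}"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "10.3 - 7988.8133="}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# Math Correct User: 10.3 - 7988.8133=<|end_of_turn|>Math Correct Assistant: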