Could this be further quantized to a GGUF file?
Thanks for this!
Could this be further quantized to a GGUF file? Have you tested it?
tldr: llama.cpp doesn't support it yet.
I'm just a random user, and I have no idea how LLMs work internally.
I downloaded some other, much smaller repo from them with the suffix bnb-8bit-smashed (not bnb-4bit-smashed like here). I ran convert.py and got this:
File "/home/arzeth/llama.cpp-cuda/./convert.py", line 940, in convert
data_type = SAFETENSORS_DATA_TYPES[info['dtype']]
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'I8'
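(For anyone curious where the 'I8' comes from: the bnb-8bit checkpoint stores the weights as raw int8 tensors plus separate *.SCB scale tensors, and the safetensors header labels those int8 weights with the dtype string 'I8', which convert.py's SAFETENSORS_DATA_TYPES simply doesn't know. You can check this yourself by dumping the header; a quick sketch, with the shard name as a placeholder:)

```python
# Quick sketch: dump the dtypes recorded in a .safetensors header.
# The shard file name is a placeholder for whichever shard you downloaded.
import json
import struct
from collections import Counter

path = "model-00001-of-00002.safetensors"  # placeholder

with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]  # first 8 bytes: header size (little-endian u64)
    header = json.loads(f.read(header_len))         # JSON: tensor name -> {"dtype", "shape", "data_offsets"}

tensors = {k: v for k, v in header.items() if k != "__metadata__"}
print(Counter(meta["dtype"] for meta in tensors.values()))      # expect 'I8' for the quantized weights
print([name for name in tensors if name.endswith(".SCB")][:3])  # the per-row scale tensors
```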
So I did this (I'm not sure if it's correct):
diff --git a/convert.py b/convert.py
index 24df0a4d..b9bb6bd7 100755
--- a/convert.py
+++ b/convert.py
@@ -69,6 +69,7 @@ class UnquantizedDataType(DataType):
 DT_F16 = UnquantizedDataType('F16', dtype = np.dtype(np.float16), valid_conversions = ['F32', 'Q8_0'])
 DT_F32 = UnquantizedDataType('F32', dtype = np.dtype(np.float32), valid_conversions = ['F16', 'Q8_0'])
+DT_I8 = UnquantizedDataType('I8', dtype = np.dtype(np.int8), valid_conversions = ['Q8_0'])
 DT_I32 = UnquantizedDataType('I32', dtype = np.dtype(np.int16), valid_conversions = [])
 DT_BF16 = UnquantizedDataType('BF16', dtype = np.dtype(np.uint16), valid_conversions = ['F32', 'F16', 'Q8_0'])
@@ -113,7 +114,7 @@ DT_Q8_0 = Q8_0QuantizedDataType('Q8_0',
 # Quantized types skipped here because they may also map to np.float32
 NUMPY_TYPE_TO_DATA_TYPE: dict[np.dtype[Any], DataType] = {}
-for dt in (DT_BF16, DT_F16, DT_F32, DT_I32):
+for dt in (DT_BF16, DT_F16, DT_F32, DT_I8, DT_I32):
     if dt.dtype in NUMPY_TYPE_TO_DATA_TYPE:
         raise ValueError(f'Invalid duplicate data type {dt}')
     NUMPY_TYPE_TO_DATA_TYPE[dt.dtype] = dt
@@ -122,6 +123,7 @@ SAFETENSORS_DATA_TYPES: dict[str, DataType] = {
     'BF16': DT_BF16,
     'F16': DT_F16,
     'F32': DT_F32,
+    'I8': DT_I8,
     'I32': DT_I32,
 }
@@ -1236,7 +1238,7 @@ def pick_output_type(model: LazyModel, output_type_str: str | None) -> GGMLFileType:
         return GGMLFileType.AllF32
     if output_type_str == "f16" or (output_type_str is None and wq_type == DT_F16):
         return GGMLFileType.MostlyF16
-    if output_type_str == "q8_0":
+    if output_type_str == "q8_0" or (output_type_str is None and wq_type == DT_I8):
         return GGMLFileType.MostlyQ8_0
     name_to_type = {name: lazy_tensor.data_type for (name, lazy_tensor) in model.items()}
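One thing to keep in mind with this patch (it's exactly what bites me below): convert.py now accepts the int8 tensors, but it treats the raw int8 values themselves as the weights. As far as I understand bitsandbytes' int8 format, the real weight is roughly int8 * SCB / 127, where SCB is the per-row absmax scale, so dropping SCB makes every weight orders of magnitude too large. Toy numbers (made up) to show the mismatch:

```python
import numpy as np

# Made-up numbers, just to illustrate the scale mismatch.
# My understanding of bitsandbytes' int8 format: real weight ≈ int8 * SCB / 127.
scb = 0.05                                          # a plausible per-row absmax
w_int8 = np.array([-127, -40, 3, 90], dtype=np.int8)

w_real = w_int8.astype(np.float32) * scb / 127      # ≈ [-0.05, -0.016, 0.0012, 0.035]
w_without_scb = w_int8.astype(np.float32)           # roughly what ends up in the GGUF when SCB is dropped

print(w_real)
print(w_without_scb)                                # ~2500x too large -> the model outputs gibberish
```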
Then I got FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft'], so I added --vocab-type bpe.
But then I got ValueError: Unexpected tensor name: model.layers.0.mlp.down_proj.SCB. Use --skip-unknown to ignore it (e.g. LLaVA), so I added --skip-unknown as well.
It did create a .gguf, but of course the model just wrote """""""" when I asked it 2+2=?, because there were messages like
Unexpected tensor name: model.layers.21.mlp.down_proj.SCB - skipping
Unexpected tensor name: model.layers.21.mlp.gate_proj.SCB - skipping
Unexpected tensor name: model.layers.21.mlp.up_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.k_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.o_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.q_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.v_proj.SCB - skipping
for several layers during the gguf creation.
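So presumably the cleaner route (short of llama.cpp learning about the SCB tensors) is to fold the SCB scales back into the weights first, save an ordinary fp16 checkpoint, and then run the stock, unpatched convert.py on that. A rough, untested sketch of what I mean, with placeholder paths (you would also still need to copy config.json and the tokenizer files over and drop the quantization_config from config.json):

```python
# Rough, untested sketch: fold the bitsandbytes SCB scales back into the int8
# weights to get an ordinary fp16 checkpoint that the stock convert.py understands.
# Assumption (my understanding of bnb's int8 format): weight_fp16 ≈ int8 * SCB / 127,
# with SCB holding one scale per output row. Paths are placeholders.
import glob
import torch
from safetensors.torch import load_file, save_file

tensors = {}
for shard in sorted(glob.glob("bnb-8bit-model/*.safetensors")):  # placeholder dir
    tensors.update(load_file(shard))

dequantized = {}
for name, t in tensors.items():
    if name.endswith(".SCB"):
        continue  # consumed below; must not end up in the output
    scb = tensors.get(name[:-len(".weight")] + ".SCB") if name.endswith(".weight") else None
    if scb is not None and t.dtype == torch.int8:
        # per-output-row dequantization: (out, in) * (out, 1) / 127
        dequantized[name] = (t.float() * scb.float().unsqueeze(1) / 127).to(torch.float16)
    else:
        dequantized[name] = t

save_file(dequantized, "model.safetensors")  # then point the unpatched convert.py at this
```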
This is not supported yet, but we are working on it (see: https://huggingface.co/PrunaAI/Mixtral-8x22B-v0.1-bnb-4bit-smashed/discussions/2#6617951a6b30e90346acabc2) ;)