Can you make a 2.4bpw quantization?
#1 · opened by xldistance
Thanks for quantizing the model.
I think 2.8 bpw might fit in 24 GB VRAM, but I'm not able to load 3.0 bpw.
You can change max_position_embeddings to 10000 in config.json; then the 3.0bpw quant will load, but the reply speed is only about 3 tokens/s, which is very slow!
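A minimal sketch of that edit, assuming the quant has been downloaded to a local directory (the path here is hypothetical; adjust it to wherever your model lives):

```python
import json
from pathlib import Path

# Hypothetical local path to the downloaded quant's config file.
config_path = Path("./my-model-3.0bpw/config.json")

config = json.loads(config_path.read_text())

# Shrink the advertised context window so the KV cache fits in 24 GB VRAM.
config["max_position_embeddings"] = 10000

config_path.write_text(json.dumps(config, indent=2))
```

Most loaders also let you cap the context length at load time instead, which avoids editing the file at all.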
Even with max_position_embeddings set to 10000, the 2.65bpw quantization uses more than 24 GB of VRAM, so it runs very poorly on a 4090.
I generally just take the original model's configuration. You can edit the file locally if you need it different from the base.
Extremely grateful!