Can you make a 2.4bpw quantization?
#1 · opened by xldistance
Thanks for quantizing the model.
I think 2.8 bpw might fit in 24 GB VRAM, but I'm not able to load 3.0 bpw.
You can change max_position_embeddings to 10000 in config.json; then the 3.0bpw quant will load, but the reply speed is only about 3 tokens/s, which is very slow!
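A minimal sketch of that edit, assuming the quant has been downloaded to a local directory (the path here is hypothetical; adjust it to wherever your model lives):

```python
import json
from pathlib import Path

# Hypothetical local path to the downloaded quant's config file.
config_path = Path("./my-model-3.0bpw/config.json")

config = json.loads(config_path.read_text())

# Shrink the advertised context window so the KV cache fits in 24 GB VRAM.
config["max_position_embeddings"] = 10000

config_path.write_text(json.dumps(config, indent=2))
```

Most loaders also let you cap the context length at load time instead, which avoids editing the file at all.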
Even with max_position_embeddings set to 10000, the 2.65bpw quantization uses more than 24 GB of VRAM, so it runs very poorly on a 4090.
I generally just take the original model's configuration. You can edit the file locally if you need it different from the base.
Extremely grateful!