Important Note: Inference support for this model in llama.cpp was merged in PR #8604. Please ensure you are on release b3438 or newer. Text-generation-web-ui (Ooba) also works as of 7/23. Kobold.cpp works as of v1.71.
Quantized from mini-magnum-12b-v1.1 fp16
KL-Divergence Reference Chart
Quant-specific Tips:
- If you get a `cudaMalloc failed: out of memory` error, try passing a lower context size to llama.cpp, e.g. for 8k: `-c 8192`
- If all of your cards are Ampere generation or newer, you can enable Flash Attention with: `-fa`
- Provided Flash Attention is enabled, you can also use a quantized KV cache to save VRAM, e.g. for 8-bit: `-ctk q8_0 -ctv q8_0`
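To show how the three tips above combine, here is a minimal sketch that assembles them into one llama.cpp command line and prints it instead of running it. The model file name is hypothetical (substitute the quant you actually downloaded), and `llama-cli` is assumed to be the binary name in your build. Flag spellings follow the tips above; if your build is much newer than b3438, check `llama-cli --help` in case they have changed.

```shell
# Hypothetical file name: replace with the quant you downloaded.
MODEL="mini-magnum-12b-v1.1-Q4_K_M.gguf"

# Build the argument list one tip at a time so each stays visible.
build_args() {
  set -- -m "$MODEL"
  set -- "$@" -c 8192               # lower context (8k) to reduce VRAM use
  set -- "$@" -fa                   # Flash Attention (Ampere or newer cards)
  set -- "$@" -ctk q8_0 -ctv q8_0   # 8-bit K and V cache, needs Flash Attention
  printf '%s\n' "$*"
}

LLAMA_ARGS=$(build_args)

# Print the full command; copy it or remove the echo to run it.
echo "llama-cli $LLAMA_ARGS"
```

If you still run out of memory, drop the context size further before removing the cache quantization, since the context size has the largest effect on the KV cache footprint.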
Original model card can be found here
Base model: intervitens/mini-magnum-12b-v1.1