
h2o-danube3-4b-chat-GGUF

Description

This repo contains GGUF format model files for h2o-danube3-4b-chat, quantized using the llama.cpp framework.

The table below summarizes the different quantized versions of h2o-danube3-4b-chat and shows the trade-off between size, speed, and quality of the models. A short sketch for downloading and loading one of these files follows the column descriptions.

| Name | Quant method | Model size | MT-Bench AVG | Perplexity | Tokens per second |
|------|--------------|------------|--------------|------------|-------------------|
| h2o-danube3-4b-chat-F16.gguf | F16 | 7.92 GB | 6.43 | 6.17 | 479 |
| h2o-danube3-4b-chat-Q8_0.gguf | Q8_0 | 4.21 GB | 6.49 | 6.17 | 725 |
| h2o-danube3-4b-chat-Q6_K.gguf | Q6_K | 3.25 GB | 6.37 | 6.20 | 791 |
| h2o-danube3-4b-chat-Q5_K_M.gguf | Q5_K_M | 2.81 GB | 6.25 | 6.24 | 927 |
| h2o-danube3-4b-chat-Q4_K_M.gguf | Q4_K_M | 2.39 GB | 6.31 | 6.37 | 967 |
| h2o-danube3-4b-chat-Q3_K_M.gguf | Q3_K_M | 1.94 GB | 5.87 | 6.99 | 1099 |
| h2o-danube3-4b-chat-Q2_K.gguf | Q2_K | 1.51 GB | 3.71 | 9.42 | 1299 |

Columns in the table are:

  • Name -- model file name and download link
  • Quant method -- quantization method
  • Model size -- size of the model file in gigabytes
  • MT-Bench AVG -- MT-Bench benchmark score, on a scale from 1 to 10; higher is better
  • Perplexity -- perplexity on the WikiText-2 dataset, as reported by the llama.cpp perplexity test; lower is better
  • Tokens per second -- generation speed in tokens per second, as reported by the llama.cpp perplexity test; higher is better. Speed tests were run on a single H100 GPU
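
As a rough illustration (not part of the original card), one way to fetch a quantized file and load it locally is with the huggingface_hub and llama-cpp-python packages. The Q4_K_M file name is simply one entry from the table above; the context size and GPU offload settings are assumptions, not recommendations from this card.

```python
# Minimal sketch: download one quantized file and load it with llama-cpp-python.
# Assumes `pip install huggingface_hub llama-cpp-python`.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# File name taken from the table above; pick a larger quant for higher quality.
model_path = hf_hub_download(
    repo_id="h2oai/h2o-danube3-4b-chat-GGUF",
    filename="h2o-danube3-4b-chat-Q4_K_M.gguf",
)

# n_ctx and n_gpu_layers are illustrative values, not settings from this card.
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)
```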

Prompt template

The chat model expects prompts in the following format:

<|prompt|>Why is drinking water so healthy?</s><|answer|>
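
As a minimal sketch (assuming the llm instance from the snippet above and an arbitrarily chosen max_tokens), the same template can be applied programmatically:

```python
# Format a question with the prompt template above and generate a reply.
# Stopping on </s> ends generation where the template expects it to.
question = "Why is drinking water so healthy?"
prompt = f"<|prompt|>{question}</s><|answer|>"

output = llm(prompt, max_tokens=256, stop=["</s>"])
print(output["choices"][0]["text"])
```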