license: llama2
Sample repository
Development Status :: 2 - Pre-Alpha
Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: [email protected].
What is GGML?
GGML is a tensor library for machine learning that enables large models and high performance on commodity hardware. ggml is still under active development toward a more efficient format and new k-quant methods, so the format is not yet stable. Read more in the GGUF documentation.
Model Weights Offered
Model | Size(GB) | Description | Performance |
---|---|---|---|
jindo-7b-instruct | 12.6 | original model weight | |
jindo-7b-instruct.ggmlv3.f16.bin | 12.5 | model weight converted to ggml f16 format | |
jindo-7b-instruct.ggmlv3.q4_0.bin | 3.73 | original (legacy) 4-bit quantization; weights are grouped in blocks of 32, each block sharing a single scale. | Legacy, small, very high quality loss |
jindo-7b-instruct.ggmlv3.q4_k_m.bin | 3.98 | 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. | Medium, balanced quality |
jindo-7b-instruct.ggmlv3.q5_k_m.bin | 4.67 | 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw. | Large, very low quality loss |
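If you only need one of these files, it can be fetched individually from the Hub. A minimal sketch using huggingface_hub, assuming the filenames above are hosted in this repository (pick whichever quantization you want):

```python
from huggingface_hub import hf_hub_download

# Download a single GGML weight file and get its local cache path.
model_path = hf_hub_download(
    repo_id="danielpark/ko-llama-2-jindo-7b-instruct-ggml",
    filename="jindo-7b-instruct.ggmlv3.q4_k_m.bin",
)
print(model_path)
```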
Prompt template: None
{prompt}
Inference
To run inference with the danielpark/ko-llama-2-jindo-7b-instruct-ggml weights (fine-tuned from Llama 2) on CPU or GPU, you need the appropriate installation and configuration on your system. Refer to the llama.cpp repository and the LangChain documentation, and follow the guides for whichever dependencies you need.
Using the LlamaCpp module in LangChain
$ pip install langchain ctransformers llama-cpp-python
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callback manager that streams generated tokens to stdout; it is reused in every example below.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
CPU
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,
)
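The PromptTemplate and LLMChain imports above can then be wired to this llm. Because the prompt template for this model is None, the prompt is forwarded essentially unchanged; a minimal sketch (the example prompt is only an illustration):

```python
# No prompt template: the raw prompt is passed to the model as-is.
prompt = PromptTemplate(template="{prompt}", input_variables=["prompt"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

print(llm_chain.run("Write a short story about llamas."))
```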
GPU
If the installation with a BLAS backend was successful, you will see a BLAS = 1 indicator in the model properties when the model loads.
Two of the most important parameters for use with GPU are:
- n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
- n_batch - how many tokens are processed in parallel.
n_gpu_layers = 40 # Change this value based on your model and your GPU VRAM pool.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
callback_manager=callback_manager,
verbose=True,
)
Metal
n_gpu_layers = 1 # Metal set to 1 is enough.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
callback_manager=callback_manager,
verbose=True,
)
Using the C Transformers module in LangChain
from langchain.llms import CTransformers
llm = CTransformers(model="./models/jindo-7b-instruct-ggml-model-f16.bin", model_type='llama')
print(llm('LLM Jindo is going to'))
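The LangChain CTransformers wrapper also takes a config dict that is forwarded to the ctransformers backend; a small sketch (the parameter values below are illustrative, not tuned for this model):

```python
from langchain.llms import CTransformers

# Generation settings passed through to the ctransformers backend.
config = {"max_new_tokens": 256, "temperature": 0.7, "repetition_penalty": 1.1}

llm = CTransformers(
    model="./models/jindo-7b-instruct-ggml-model-f16.bin",
    model_type="llama",
    config=config,
)
print(llm("LLM Jindo is going to"))
```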
Web Demo
I implemented the web demos using several popular tools that make it easy to build web UIs quickly.
model | web ui | quantized |
---|---|---|
danielpark/ko-llama-2-jindo-7b-instruct | using gradio on colab | - |
danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | using text-generation-webui on colab | gptq |
danielpark/ko-llama-2-jindo-7b-instruct-ggml | koboldcpp-v1.38 | ggml |
Tools
See more...
Name | Description |
---|---|
KoboldCpp | A powerful GGML web UI with full GPU acceleration out of the box. Especially good for story-telling. |
LoLLMS Web UI | A great web UI with GPU acceleration via the c_transformers backend. |
LM Studio | A fully featured local GUI. Supports full GPU acceleration on macOS. Also supports Windows, without GPU accel. |
text-generation-webui | The most popular web UI. Requires extra steps to enable GPU accel via the llama.cpp backend. |
ctransformers | A Python library with LangChain support and OpenAI-compatible AI server. |
llama-cpp-python | A Python library with OpenAI-compatible API server. |
CLI Inference Using Quantized Weights
To use the program with the desired settings, execute the following command:
./main -t <number_of_cpu_cores> -ngl <number_of_layers_to_offload> -m ko-llama-2-jindo-7b-instruct-ggml.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
Please make the following changes:
- Replace `<number_of_cpu_cores>` with the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
- Replace `<number_of_layers_to_offload>` with the number of layers to offload to the GPU. If you don't have GPU acceleration, you can remove the `-ngl` argument.
- If you want to have a chat-style conversation, replace the `-p "<PROMPT>"` argument with `-i -ins`.
See llama.cpp, llama-cpp-python, and llama2.c for more details.
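The same invocation can also be expressed in Python through llama-cpp-python instead of the ./main binary. A minimal sketch in which the thread and layer counts are placeholders to adjust for your machine, mirroring the CLI flags above:

```python
from llama_cpp import Llama

# Mirror the CLI flags: -t (threads), -ngl (GPU layers), -c (context size).
llm = Llama(
    model_path="./ko-llama-2-jindo-7b-instruct-ggml.bin",
    n_ctx=2048,
    n_threads=8,      # number of physical CPU cores
    n_gpu_layers=32,  # remove or set to 0 if you have no GPU acceleration
)

output = llm(
    "### Instruction: Write a story about llamas\n### Response:",
    max_tokens=256,
    temperature=0.7,
    repeat_penalty=1.1,
)
print(output["choices"][0]["text"])
```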
See more...
Quant Types
Quantization Type | Description | Bits per Weight (bpw) |
---|---|---|
GGML_TYPE_Q2_K | "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. | 2.5625 |
GGML_TYPE_Q3_K | "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. | 3.4375 |
GGML_TYPE_Q4_K | "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. | 4.5 |
GGML_TYPE_Q5_K | "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw. | 5.5 |
GGML_TYPE_Q6_K | "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. | 6.5625 |
GGML_TYPE_Q8_K | "type-0" 8-bit quantization. Only used for quantizing intermediate results. Block size is 256. All 2-6 bit dot products are implemented for this quantization type. | Not specified |
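As a sanity check, the bits-per-weight figures above can be reproduced from the block layouts, assuming a 256-weight super-block and fp16 super-block scale/min as in llama.cpp's k-quant implementation:

```python
# Recompute bits per weight from a super-block layout (256 weights per super-block).
def bits_per_weight(weight_bits, n_subscales, subscale_bits, n_fp16_fields, weights=256):
    total_bits = weights * weight_bits + n_subscales * subscale_bits + n_fp16_fields * 16
    return total_bits / weights

# Q4_K ("type-1"): 8 blocks of 32 weights, 8 scales + 8 mins at 6 bits, fp16 scale and fp16 min.
print(bits_per_weight(4, 16, 6, 2))  # 4.5

# Q6_K ("type-0"): 16 blocks of 16 weights, 16 scales at 8 bits, one fp16 scale.
print(bits_per_weight(6, 16, 8, 1))  # 6.5625
```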
Model | Description | Recommendation |
---|---|---|
Q4_0 | Small, very high quality loss | Legacy, prefer Q3_K_M |
Q4_1 | Small, substantial quality loss | Legacy, prefer Q3_K_L |
Q5_0 | Medium, balanced quality | Legacy, prefer Q4_K_M |
Q5_1 | Medium, low quality loss | Legacy, prefer Q5_K_M |
Q2_K | Smallest, extreme quality loss | Not recommended |
Q3_K | Alias for Q3_K_M | |
Q3_K_S | Very small, very high quality loss | |
Q3_K_M | Very small, very high quality loss | |
Q3_K_L | Small, substantial quality loss | |
Q4_K | Alias for Q4_K_M | |
Q4_K_S | Small, significant quality loss | |
Q4_K_M | Medium, balanced quality | Recommended |
Q5_K | Alias for Q5_K_M | |
Q5_K_S | Large, low quality loss | Recommended |
Q5_K_M | Large, very low quality loss | Recommended |
Q6_K | Very large, extremely low quality loss | |
Q8_0 | Very large, extremely low quality loss | Not recommended |
F16 | Extremely large, virtually no quality loss | Not recommended |
F32 | Absolutely huge, lossless | Not recommended |
Performance
LLaMA 2 / 7B
name | +ppl | +ppl as % of 13B-to-7B fp16 ppl gap | size | size as % of fp16 | +ppl per 1 GB saved |
---|---|---|---|---|---|
q2_k | 0.8698 | 133.344% | 2.67GB | 20.54% | 0.084201 |
q3_ks | 0.5505 | 84.394% | 2.75GB | 21.15% | 0.053707 |
q3_km | 0.2437 | 37.360% | 3.06GB | 23.54% | 0.024517 |
q3_kl | 0.1803 | 27.641% | 3.35GB | 25.77% | 0.018684 |
q4_0 | 0.2499 | 38.311% | 3.50GB | 26.92% | 0.026305 |
q4_1 | 0.1846 | 28.300% | 3.90GB | 30.00% | 0.020286 |
q4_ks | 0.1149 | 17.615% | 3.56GB | 27.38% | 0.012172 |
q4_km | 0.0535 | 8.202% | 3.80GB | 29.23% | 0.005815 |
q5_0 | 0.0796 | 12.203% | 4.30GB | 33.08% | 0.009149 |
q5_1 | 0.0415 | 6.362% | 4.70GB | 36.15% | 0.005000 |
q5_ks | 0.0353 | 5.412% | 4.33GB | 33.31% | 0.004072 |
q5_km | 0.0142 | 2.177% | 4.45GB | 34.23% | 0.001661 |
q6_k | 0.0044 | 0.675% | 5.15GB | 39.62% | 0.000561 |
q8_0 | 0.0004 | 0.061% | 6.70GB | 51.54% | 0.000063 |
LLaMA 2 / 13B
name | +ppl | +ppl as % of 13B-to-7B fp16 ppl gap | size | size as % of fp16 | +ppl per 1 GB saved |
---|---|---|---|---|---|
q2_k | 0.6002 | 92.013% | 5.13GB | 20.52% | 0.030206 |
q3_ks | 0.3490 | 53.503% | 5.27GB | 21.08% | 0.017689 |
q3_km | 0.1955 | 29.971% | 5.88GB | 23.52% | 0.010225 |
q3_kl | 0.1520 | 23.302% | 6.45GB | 25.80% | 0.008194 |
q4_0 | 0.1317 | 20.190% | 6.80GB | 27.20% | 0.007236 |
q4_1 | 0.1065 | 16.327% | 7.60GB | 30.40% | 0.006121 |
q4_ks | 0.0861 | 13.199% | 6.80GB | 27.20% | 0.004731 |
q4_km | 0.0459 | 7.037% | 7.32GB | 29.28% | 0.002596 |
q5_0 | 0.0313 | 4.798% | 8.30GB | 33.20% | 0.001874 |
q5_1 | 0.0163 | 2.499% | 9.10GB | 36.40% | 0.001025 |
q5_ks | 0.0242 | 3.710% | 8.36GB | 33.44% | 0.001454 |
q5_km | 0.0095 | 1.456% | 8.60GB | 34.40% | 0.000579 |
q6_k | 0.0025 | 0.383% | 9.95GB | 39.80% | 0.000166 |
q8_0 | 0.0005 | 0.077% | 13.00GB | 52.00% | 0.000042 |
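The two derived columns can be recomputed from the raw numbers in any row; for example, the 7B q4_km row (the fp16 model size is recovered from the quantized size and its percentage of fp16):

```python
# 7B q4_km row: +ppl, quantized size (GB), and size as a fraction of fp16.
ppl_increase = 0.0535
size_gb = 3.80
size_fraction_of_fp16 = 0.2923

fp16_size_gb = size_gb / size_fraction_of_fp16   # ~13.0 GB for the fp16 7B model
gb_saved = fp16_size_gb - size_gb                # ~9.2 GB
print(ppl_increase / gb_saved)                   # ~0.005815, the "+ppl per 1 GB saved" column
```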
Reference Model Cards
The model card of the TheBloke/Llama-2-13B-GGML repository, where Llama 2 has been converted to GGML.
llama.cpp
pull request #1687 for quantized weight performance.
Note
- Simply download a single GGML-format weight file; the other files are for reference purposes only during development. After conducting several experiments, we will provide the final GGML weight file separately.