---
license: llama3
base_model: BanglaLLM/BanglaLLama-3-8b-BnWiki-Base
datasets:
- wikimedia/wikipedia
language:
- bn
- en
tags:
- bangla
- large language model
- text-generation-inference
- transformers
library_name: transformers
pipeline_tag: text-generation
quantized_by: Tanvir1337
---

# Tanvir1337/BanglaLLama-3-8b-BnWiki-Base-GGUF
This model has been quantized using llama.cpp, a high-performance inference engine for large language models.
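For context, GGUF quantizations like these are typically produced with llama.cpp's llama-quantize tool. The sketch below (driven from Python) is a hypothetical illustration; the file names and quant type are assumptions, not the exact steps used for this repository:

```python
# A hypothetical sketch of producing a GGUF quantization with llama.cpp's
# llama-quantize tool; file names and the chosen quant type are
# assumptions, not the exact commands used for this repository.
import subprocess

subprocess.run(
    [
        "./llama-quantize",    # llama.cpp quantization binary
        "model-f16.gguf",      # full-precision GGUF input (hypothetical)
        "model-Q5_K_M.gguf",   # quantized output (hypothetical)
        "Q5_K_M",              # target quantization type
    ],
    check=True,  # raise if the tool exits with an error
)
```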
## System Prompt Format
To interact with the model, use the following prompt format:
```
{System}
### Prompt:
{User}
### Response:
```
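As a minimal sketch of putting this template to use with the llama-cpp-python bindings (the filename and generation settings below are assumptions, not part of this card):

```python
# A minimal sketch, assuming the llama-cpp-python bindings
# (pip install llama-cpp-python); the filename is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="BanglaLLama-3-8b-BnWiki-Base.Q5_K_M.gguf",  # hypothetical filename
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to GPU if possible; use 0 for CPU-only
)

system = "You are a helpful assistant."
user = "বাংলাদেশের রাজধানী কী?"  # "What is the capital of Bangladesh?"

# Assemble the prompt exactly as the template above specifies.
prompt = f"{system}\n### Prompt:\n{user}\n### Response:\n"

output = llm(prompt, max_tokens=128, stop=["###"])
print(output["choices"][0]["text"])
```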
## Usage Instructions
If you're new to using GGUF files, refer to TheBloke's README for detailed instructions.
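One common way to fetch a file is with the huggingface_hub library; in the sketch below the quant filename is an assumption, so check the repository's file list for the actual names:

```python
# A minimal download sketch, assuming huggingface_hub is installed
# (pip install huggingface_hub). The filename is hypothetical; pick an
# actual one from the repository's file list.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Tanvir1337/BanglaLLama-3-8b-BnWiki-Base-GGUF",
    filename="BanglaLLama-3-8b-BnWiki-Base.Q5_K_M.gguf",  # hypothetical
)
print(f"Downloaded to: {path}")
```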
## Quantization Options
For graphs comparing the various quantization types (lower is better) and more background on quantization in general, see Artefact2's notes.
## Choosing the Right Model File
To select the optimal model file, consider the following factors:
- Memory constraints: Determine how much RAM and/or VRAM you have available.
- Speed vs. quality: If you prioritize speed, choose a file that fits entirely within your GPU's VRAM. For maximum quality, choose one that fits within the combined RAM and VRAM of your system (a rough sizing sketch follows below).
Quantization formats:
- K-quants (e.g., Q5_K_M): A good starting point, offering a balance between speed and quality.
- I-quants (e.g., IQ3_M): Newer, offering better quality for their size, but they may be slower on CPU and require specific backends (e.g., cuBLAS for Nvidia or rocBLAS for AMD).
Hardware compatibility:
- I-quants: Not compatible with the Vulkan backend. If you have an AMD card, ensure you're using the rocBLAS build rather than the Vulkan build, or a compatible inference engine.
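As a rough illustration of the memory-fit guidance above, here is a sketch; the file sizes are approximate figures for an 8B model and the headroom value is an assumption:

```python
# A rough rule-of-thumb sketch, not a definitive sizing tool. File sizes
# are approximate for an 8B model; the headroom figure (for KV cache and
# runtime overhead) is an assumption and grows with context length.
def fits_in_memory(file_size_gb: float, available_gb: float,
                   headroom_gb: float = 2.0) -> bool:
    """Return True if the model file plus headroom fits in memory."""
    return file_size_gb + headroom_gb <= available_gb

for quant, size_gb in [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("IQ3_M", 3.8)]:
    verdict = "fits" if fits_in_memory(size_gb, available_gb=8.0) else "too big"
    print(f"{quant}: ~{size_gb} GB -> {verdict} with 8 GB of VRAM")
```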
For more information on the features and trade-offs of each quantization format, refer to the llama.cpp feature matrix.