	---
	license: llama3
	base_model: BanglaLLM/BanglaLLama-3-8b-BnWiki-Base
	datasets:
	- wikimedia/wikipedia
	language:
	- bn
	- en
	tags:
	- bangla
	- large language model
	- text-generation-inference
	- transformers
	library_name: transformers
	pipeline_tag: text-generation
	quantized_by: Tanvir1337
	---

	# Tanvir1337/BanglaLLama-3-8b-BnWiki-Base-GGUF

	This model has been quantized using [llama.cpp](https://github.com/ggerganov/llama.cpp/), a high-performance inference engine for large language models.

	## System Prompt Format

	To interact with the model, use the following prompt format:
	```
	{System}
	### Prompt:
	{User}
	### Response:
	```
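The template above can be filled programmatically. The following is a minimal sketch, not part of the original card: the helper name `build_prompt`, the choice to drop the system line when it is empty, and the trailing newline after `### Response:` are all assumptions made for illustration.

```python
def build_prompt(user: str, system: str = "") -> str:
    """Fill the card's template: {System}, '### Prompt:', {User}, '### Response:'."""
    lines = []
    if system:
        # Assumption: when no system text is given, the line is omitted entirely.
        lines.append(system)
    lines += ["### Prompt:", user, "### Response:"]
    # Assumption: end with a newline so the model continues on a fresh line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_prompt("What is the capital of Bangladesh?", "You are a helpful assistant."))
```

The returned string can be passed as the raw prompt to whichever GGUF-capable runtime you use.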

	## Usage Instructions

	If you're new to using GGUF files, refer to [TheBloke's README](https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF) for detailed instructions.

	## Quantization Options

	The following graph compares the perplexity of various quantization types (lower is better):

	![image.png](https://www.nethype.de/huggingface_embed/quantpplgraph.png)

	For more information on quantization, see [Artefact2's notes](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

	## Choosing the Right Model File

	To select the optimal model file, consider the following factors:

	1. Memory constraints: Determine how much RAM and/or VRAM you have available.
	2. Speed vs. quality: For the fastest inference, choose a file that fits entirely within your GPU's VRAM. For maximum quality, choose the largest file that fits within the combined RAM and VRAM of your system, accepting slower inference.
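The two factors above amount to "pick the largest file that fits the memory you are willing to use". Here is a rough sketch of that rule. The function `pick_quant`, the 20% headroom reserved for context and runtime overhead, and the file sizes in the example are all hypothetical; check the actual sizes on the repository's Files tab.

```python
from typing import Dict, Optional

GIB = 1024 ** 3


def pick_quant(file_sizes: Dict[str, int], budget_bytes: int,
               headroom: float = 0.2) -> Optional[str]:
    """Return the largest quant file that fits the budget, or None if none fits.

    Assumption: a larger file generally means higher quality.
    Assumption: `headroom` is the fraction of memory kept free for the
    KV cache and runtime overhead; 0.2 is an illustrative default.
    """
    usable = budget_bytes * (1 - headroom)
    fitting = {name: size for name, size in file_sizes.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Hypothetical file sizes, for illustration only.
EXAMPLE_SIZES = {
    "Q4_K_M": int(4.9 * GIB),
    "Q5_K_M": int(5.7 * GIB),
    "Q8_0": int(8.5 * GIB),
}

if __name__ == "__main__":
    # Speed first: budget is VRAM only (here, an 8 GiB GPU).
    print(pick_quant(EXAMPLE_SIZES, 8 * GIB))
    # Quality first: budget is RAM plus VRAM (here, 24 GiB combined).
    print(pick_quant(EXAMPLE_SIZES, 24 * GIB))
```

Passing only your VRAM as the budget favors speed. Passing combined RAM and VRAM favors quality.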

	Quantization formats:

	* K-quants (e.g., Q5_K_M): A good starting point, offering a balance between speed and quality.
	* I-quants (e.g., IQ3_M): Newer and more efficient, but may require a specific backend build (e.g., cuBLAS for NVIDIA or rocBLAS for AMD).

	Hardware compatibility:

	* I-quants: Not compatible with the Vulkan backend, which is often used on AMD cards. If you have an AMD card, ensure you're using the rocBLAS build or another compatible inference engine.
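The naming convention and the compatibility note above can be expressed as a small check. This sketch encodes only what this card states, so newer llama.cpp builds may behave differently. The function names and the backend strings (`"vulkan"`, `"rocblas"`) are illustrative assumptions, not llama.cpp identifiers.

```python
def quant_family(name: str) -> str:
    """Classify a quant type by its name, e.g. 'Q5_K_M' or 'IQ3_M'."""
    upper = name.upper()
    if upper.startswith("IQ"):
        return "I-quant"
    if "_K" in upper:
        return "K-quant"
    return "other"  # e.g. Q8_0, which is neither a K-quant nor an I-quant


def is_compatible(name: str, backend: str) -> bool:
    """Apply the card's rule: I-quants are not compatible with Vulkan.

    Assumption: every other combination is treated as compatible, which
    this card does not claim explicitly.
    """
    return not (quant_family(name) == "I-quant" and backend.lower() == "vulkan")


if __name__ == "__main__":
    for quant in ("Q5_K_M", "IQ3_M"):
        for backend in ("vulkan", "rocblas"):
            print(quant, backend, is_compatible(quant, backend))
```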

	For more information on the features and trade-offs of each quantization format, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).