|
--- |
|
language: |
|
- en |
|
license: mit |
|
tags: |
|
- meta |
|
- pytorch |
|
- llama-3.1 |
|
- llama-3.1-instruct |
|
- gguf |
|
model_name: Llama-3.1-70B-Instruct-GGUF |
|
arxiv: 2407.21783 |
|
base_model: meta-llama/Llama-3.1-70B-Instruct
|
inference: false |
|
model_creator: Meta Llama 3.1 |
|
model_type: llama |
|
pipeline_tag: text-generation |
|
prompt_template: > |
|
[INST] <<SYS>> |
|
|
|
You are a helpful, respectful and honest assistant. Always answer as helpfully |
|
as possible. If a question does not make any sense, or is not factually
|
coherent, explain why instead of answering something that is not correct. If |
|
you don't know the answer to a question, do not answer it with false |
|
information. |
|
|
|
<</SYS>> |
|
|
|
{prompt}[/INST] |
|
quantized_by: hierholzer |
|
--- |
|
|
|
[![Hierholzer Banner](https://tvtime.us/static/images/LLAMA3.1.jpg)](#) |
|
|
|
# GGUF Model |
|
----------------------------------- |
|
|
|
|
|
Here are quantized versions of Llama-3.1-70B-Instruct in GGUF format.
|
|
|
|
|
## 🤔 What Is GGUF |
|
GGUF is a file format designed for use with GGML and other executors.

GGUF was developed by @ggerganov, who is also the developer of llama.cpp, a popular C/C++ LLM inference framework.

Models initially developed in frameworks like PyTorch can be converted to GGUF format for use with those engines.
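
As a rough illustration of that conversion path, here is a sketch using llama.cpp's own tooling. Script and binary names have changed between llama.cpp releases, so treat the exact commands as an assumption and check the llama.cpp README:

```shell
# Convert the original Hugging Face / PyTorch checkpoint to a full-precision GGUF file
python convert_hf_to_gguf.py /path/to/Llama-3.1-70B-Instruct \
  --outfile Llama-3.1-70B-Instruct-F16.gguf --outtype f16

# Quantize the F16 GGUF down to a smaller type such as Q4_K_M
./llama-quantize Llama-3.1-70B-Instruct-F16.gguf Llama-3.1-70B-Instruct-Q4_K_M.gguf Q4_K_M
```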
|
|
|
|
|
## ☑️Uploaded Quantization Types |
|
|
|
Here are the quantized versions that I have available: |
|
|
|
- [ ] Q2_K |
|
- [x] Q3_K_S |
|
- [x] Q3_K_M |
|
- [x] Q3_K_L |
|
- [x] Q4_K_S |
|
- [x] Q4_K_M ~ *Recommended* |
|
- [x] Q5_K_S ~ *Recommended* |
|
- [x] Q5_K_M ~ *Recommended* |
|
- [ ] Q6_K |
|
- [ ] Q8_0 ~ *NOT Recommended* |
|
- [ ] F16 ~ *NOT Recommended* |
|
- [ ] F32 ~ *NOT Recommended* |
|
|
|
Feel free to reach out to me if you need a specific quantization type that I do not currently offer.
|
|
|
|
|
### 📈All Quantization Types Possible |
|
Below is a table of all the possible quantization types, along with short descriptions.
|
|
|
| **#** | **Type** | **Description** |
|------:|:---------|:----------------|
| 2 | Q4_0 | small, very high quality loss - legacy, prefer using Q3_K_M |
| 3 | Q4_1 | small, substantial quality loss - legacy, prefer using Q3_K_L |
| 8 | Q5_0 | medium, balanced quality - legacy, prefer using Q4_K_M |
| 9 | Q5_1 | medium, low quality loss - legacy, prefer using Q5_K_M |
| 10 | Q2_K | smallest, extreme quality loss - *NOT Recommended* |
| 12 | Q3_K | alias for Q3_K_M |
| 11 | Q3_K_S | very small, very high quality loss |
| 12 | Q3_K_M | very small, high quality loss |
| 13 | Q3_K_L | small, high quality loss |
| 15 | Q4_K | alias for Q4_K_M |
| 14 | Q4_K_S | small, some quality loss |
| 15 | Q4_K_M | medium, balanced quality - *Recommended* |
| 17 | Q5_K | alias for Q5_K_M |
| 16 | Q5_K_S | large, low quality loss - *Recommended* |
| 17 | Q5_K_M | large, very low quality loss - *Recommended* |
| 18 | Q6_K | very large, very low quality loss |
| 7 | Q8_0 | very large, extremely low quality loss |
| 1 | F16 | extremely large, virtually no quality loss - *NOT Recommended* |
| 0 | F32 | absolutely huge, lossless - *NOT Recommended* |
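
The numbers in the **#** column mirror the numeric IDs listed by llama.cpp's quantization tool, which accepts either form as the target type. A minimal sketch, assuming a locally built llama.cpp and an F16 GGUF as input:

```shell
# Quantize to Q4_K_M by name...
./llama-quantize Llama-3.1-70B-Instruct-F16.gguf Llama-3.1-70B-Instruct-Q4_K_M.gguf Q4_K_M

# ...or by its numeric ID (15), which is equivalent
./llama-quantize Llama-3.1-70B-Instruct-F16.gguf Llama-3.1-70B-Instruct-Q4_K_M.gguf 15
```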
|
|
|
## 💪 Benefits of using GGUF |
|
|
|
By using a GGUF version of Llama-3.1-70B-Instruct, you can run this LLM with significantly fewer resources than the non-quantized version requires.

This allows you to run this 70B model on a machine with far less memory than the original weights would need.
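
As a rough, back-of-the-envelope illustration (ignoring the KV cache and runtime overhead): ~70 billion parameters at 16 bits each is about 70 × 2 GB ≈ 140 GB of weights, while Q4_K_M averages roughly 4.5-5 bits per weight, bringing the file down to roughly 40-43 GB.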
|
|
|
|
|
## ⚙️ Installation
|
-------------------------------------------- |
|
Here are two different methods you can use to run the quantized versions of Llama-3.1-70B-Instruct.
|
|
|
### 1️⃣ Text-generation-webui |
|
|
|
Text-generation-webui is a web UI for Large Language Models that you can run locally. |
|
|
|
#### ☑️ How to install Text-generation-webui |
|
*If you already have Text-generation-webui installed, skip this section.*
|
|
|
| # | Download Text-generation-webui | |
|
|----|------------------------------------------------------------------------------------------------------------------| |
|
| 1. | Clone the text-generation-webui repository from GitHub by copying the git clone snippet below: |
|
```shell |
|
git clone https://github.com/oobabooga/text-generation-webui.git |
|
``` |
|
| # | Install Text-generation-webui | |
|
|----|------------------------------------------------------------------------------------------------------------------| |
|
| 1. | Run the `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat` script depending on your OS. | |
|
| 2. | Select your GPU vendor when asked. | |
|
| 3. | Once the installation script ends, browse to `http://localhost:7860`. | |
|
|
|
#### ✅Using Llama-3.1-70B-Instruct-GGUF with Text-generation-webui |
|
| # | Using Llama-3.1-70B-Instruct-GGUF with Text-generation-webui | |
|
|----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
|
| 1. | Once you are running text-generation-webui in your browser, click on the 'Model' Tab at the top of your window. | |
|
| 2. | In the Download Model section, you need to enter the model repo: *hierholzer/Llama-3.1-70B-Instruct-GGUF* and below it, the specific filename to download, such as: *Llama-3.1-70B-Instruct-Q4_K_M.gguf* | |
|
| 3. | Click Download and wait for the download to complete. NOTE: you can see the download progress back in your terminal window. | |
|
| 4. | Once the download is finished, click the blue refresh icon within the Model tab that you are in. | |
|
| 5. | Select your newly downloaded GGUF file in the Model drop-down. Once selected, adjust the settings to best match your system. |
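
If you would rather fetch the GGUF file from a terminal instead of through the web UI, here is a minimal sketch using the Hugging Face CLI (assuming you have `huggingface_hub` installed and that your text-generation-webui loads models from its `models/` directory):

```shell
# Install the Hugging Face command-line tools
pip install -U "huggingface_hub[cli]"

# Download a single quantized file from this repository into the models folder
huggingface-cli download hierholzer/Llama-3.1-70B-Instruct-GGUF \
  Llama-3.1-70B-Instruct-Q4_K_M.gguf --local-dir ./models
```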
|
|
|
### 2️⃣ Ollama |
|
Ollama runs as a local service. |
|
Although you can use it from a command-line interface, Ollama's best attribute is its REST API.

Being able to reach your locally running LLMs through this API opens up almost endless possibilities!
|
*Feel free to reach out to me if you would like to know some examples that I use this API for* |
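
As a small taste of that API, here is a minimal sketch of a generation request (assuming Ollama is listening on its default port 11434 and that you have already created the model as described below):

```shell
# One-off completion request against the local Ollama service
curl http://localhost:11434/api/generate -d '{
  "model": "Llama-3.1-70B-Instruct-Q4_K_M",
  "prompt": "Explain GGUF quantization in one paragraph.",
  "stream": false
}'
```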
|
|
|
#### ☑️ How to install Ollama |
|
Go to the URL below and select which OS you are using:
|
```shell |
|
https://ollama.com/download |
|
``` |
|
On Windows or macOS, you will download an installer and run it.

If you are using Linux, the site provides a single command to run in your terminal window.

*That's about it for installing Ollama.*
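
For reference, the Linux install currently boils down to a one-line script; treat this as a sketch and check the download page above for the current command:

```shell
# Download and run Ollama's Linux install script
curl -fsSL https://ollama.com/install.sh | sh
```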
|
#### ✅Using Llama-3.1-70B-Instruct-GGUF with Ollama |
|
Ollama does have a Model Library where you can download models: |
|
```shell |
|
https://ollama.com/library |
|
``` |
|
This model library offers all sizes of regular Llama 3.1, as well as the 8B version of Llama 3.1-Instruct.

However, if you would like to use the 70B quantized version of Llama 3.1-Instruct,

then you will have to use the following instructions.
|
| # | Running the 70B quantized version of Llama 3.1-Instruct with Ollama | |
|
|----|----------------------------------------------------------------------------------------------| |
|
| 1. | Download your desired quantized version from the Files and versions section of this model repository. |
|
| 2. | Next, create a Modelfile configuration that defines the model's behavior. For example: |
|
```shell
# Modelfile
FROM "./Llama-3.1-70B-Instruct-Q4_K_M.gguf"

# Llama 3.1 Instruct prompt format
TEMPLATE """<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
```
|
*Replace ./Llama-3.1-70B-Instruct-Q4_K_M.gguf with the actual path to the GGUF file you downloaded.

The TEMPLATE block defines the prompt format using the system, user, and assistant roles;

you can customize it based on your use case.*
|
| # | Running the 70B quantized version of Llama 3.1-Instruct with Ollama - *continued* | |
|
|----|-----------------------------------------------------------------------------------| |
|
| 3. | Now, build the Ollama model using the ollama create command: | |
|
```shell |
|
ollama create Llama-3.1-70B-Instruct-Q4_K_M -f ./Modelfile
|
``` |
|
*Once again, replace the name Llama-3.1-70B-Instruct-Q4_K_M with whatever you would like to call your model,

and ./Modelfile with the path to the Modelfile you created above.*
|
| # | Running the 70B quantized version of Llama 3.1-Instruct with Ollama - *continued* | |
|
|----|-----------------------------------------------------------------------------------| |
|
| 4. | You can then run your model using the `ollama run` command: |
|
```shell |
|
ollama run Llama-3.1-70B-Instruct-Q4_K_M |
|
``` |
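
Once the model is created, it is also reachable through the REST API mentioned earlier. A minimal sketch using the chat endpoint (assuming Ollama's default port 11434):

```shell
# Multi-turn style request against the chat endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "Llama-3.1-70B-Instruct-Q4_K_M",
  "messages": [
    {"role": "user", "content": "Summarize the benefits of GGUF quantization in two sentences."}
  ],
  "stream": false
}'
```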
|
|
|
------------------------------------------------- |
|
|
|
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?logo=huggingface&logoColor=000)](#) |
|
[![OS](https://img.shields.io/badge/OS-linux%2C%20windows%2C%20macOS-0078D4)](#)

[![CPU](https://img.shields.io/badge/CPU-x86%2C%20x64%2C%20ARM%2C%20ARM64-FF8C00)](#)
|
[![forthebadge](https://forthebadge.com/images/badges/license-mit.svg)](https://forthebadge.com) |
|
[![forthebadge](https://forthebadge.com/images/badges/made-with-python.svg)](https://forthebadge.com) |
|
[![forthebadge](https://forthebadge.com/images/badges/powered-by-electricity.svg)](https://forthebadge.com) |