metadata

language:
  - en
license: mit
tags:
  - meta
  - pytorch
  - llama-3.1
  - llama-3.1-instruct
  - gguf
model_name: Llama-3.1-70B-Instruct-GGUF
arxiv: 2407.21783
base_model: meta-llama/Llama-3.1-70b-instruct.hf
inference: false
model_creator: Meta Llama 3.1
model_type: llama
pipeline_tag: text-generation
prompt_template: >
  [INST] <<SYS>>

  You are a helpful, respectful and honest assistant. Always answer as helpfully
  as possible. If a question does not make any sense, or is not factually
  coherent, explain why instead of answering something that is not correct. If
  you don't know the answer to a question, do not answer it with false
  information.

  <</SYS>>

  {prompt}[/INST]
quantized_by: hierholzer

GGUF Model

Here are quantized versions of Llama-3.1-70B-Instruct in the GGUF format.

🤔 What Is GGUF

GGUF is a binary file format for storing models for inference with GGML and other executors. It was developed by @ggerganov, who is also the developer of llama.cpp, a popular C/C++ LLM inference framework. Models initially developed in frameworks like PyTorch can be converted to GGUF for use with those engines.

☑️Uploaded Quantization Types

Here are the quantized versions available:

Q4_K_S
Q4_K_M ~ Recommended
Q5_K_M ~ Recommended
Q8_0 ~ NOT Recommended

Feel free to reach out to me if you need a specific quantization type that I do not currently offer.

📈All Quantization Types Possible

Below is a table of all the quantization types that are possible.

#	or	Q#	:	Description Of Quantization Types
2	or	Q4_0	:	small, very high quality loss - legacy, prefer using Q3_K_M
3	or	Q4_1	:	small, substantial quality loss - legacy, prefer using Q3_K_L
8	or	Q5_0	:	medium, balanced quality - legacy, prefer using Q4_K_M
9	or	Q5_1	:	medium, low quality loss - legacy, prefer using Q5_K_M
10	or	Q2_K	:	smallest, extreme quality loss - NOT Recommended
12	or	Q3_K	:	alias for Q3_K_M
11	or	Q3_K_S	:	very small, very high quality loss
12	or	Q3_K_M	:	very small, high quality loss
13	or	Q3_K_L	:	small, high quality loss
15	or	Q4_K	:	alias for Q4_K_M
14	or	Q4_K_S	:	small, some quality loss
15	or	Q4_K_M	:	medium, balanced quality - Recommended
17	or	Q5_K	:	alias for Q5_K_M
16	or	Q5_K_S	:	large, low quality loss - Recommended
17	or	Q5_K_M	:	large, very low quality loss - Recommended
18	or	Q6_K	:	very large, very low quality loss
7	or	Q8_0	:	very large, extremely low quality loss
1	or	F16	:	extremely large, virtually no quality loss - NOT Recommended
0	or	F32	:	absolutely huge, lossless - NOT Recommended

💪 Benefits of using GGUF

By using a GGUF version of Llama-3.1-70B-Instruct, you can run this LLM with significantly fewer resources than the non-quantized version requires. In particular, the quantized weights are much smaller, so this 70B model fits on a machine with less memory than the non-quantized version would need.
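As a rough illustration of the memory savings, the sketch below estimates the size of the weights from the parameter count and the bits stored per weight. The round figure of 70e9 parameters and the bits-per-weight values for the K-quant types are approximations assumed here, not measurements of the files in this repository. Real files also carry metadata, and inference needs extra memory for the context.

```python
# Rough size estimate: parameters * bits-per-weight / 8 bytes.
# Assumptions (not taken from this repository): a round 70e9 parameters,
# and approximate bits-per-weight values for the K-quant types.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # approximate
    "Q5_K_M": 5.70,  # approximate
    "Q8_0": 8.5,     # 8-bit weights plus one 16-bit scale per block of 32
    "F16": 16.0,
}


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Print one estimate per quantization type for a nominal 70B model.
for name, bpw in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{estimate_size_gb(70e9, bpw):6.1f} GB")
```

Treat the printed numbers as a lower bound on the memory you need, since the context window adds to it at run time.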

⚙️️Installation

Here are two methods you can use to run the quantized versions of Llama-3.1-70B-Instruct:

1️⃣ Text-generation-webui

Text-generation-webui is a web UI for Large Language Models that you can run locally.

☑️ How to install Text-generation-webui

If you already have Text-generation-webui installed, skip this section.

#	Download Text-generation-webui
1.	Clone the text-generation-webui repository from GitHub by copying the git clone snippet below:

git clone https://github.com/oobabooga/text-generation-webui.git

#	Install Text-generation-webui
1.	Run the `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat` script depending on your OS.
2.	Select your GPU vendor when asked.
3.	Once the installation script ends, browse to `http://localhost:7860`.

✅Using Llama-3.1-70B-Instruct-GGUF with Text-generation-webui

#	Using Llama-3.1-70B-Instruct-GGUF with Text-generation-webui
1.	Once you are running text-generation-webui in your browser, click on the 'Model' Tab at the top of your window.
2.	In the Download Model section, you need to enter the model repo: hierholzer/Llama-3.1-70B-Instruct-GGUF and below it, the specific filename to download, such as: Llama-3.1-70B-Instruct-Q4_K_M.gguf
3.	Click Download and wait for the download to complete. NOTE: you can see the download progress back in your terminal window.
4.	Once the download is finished, click the blue refresh icon within the Model tab that you are in.
5.	Select your newly downloaded GGUF file in the Model drop-down. Once selected, change the settings to best match your system.
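If you prefer to fetch the file outside the web UI, the download step can be sketched in Python with only the standard library. This is a minimal sketch under two assumptions: that the file lives on the repository's `main` branch, and that it is a single file with exactly the name shown above. The helper names `gguf_url` and `download` are hypothetical, introduced only for this example.

```python
import shutil
import urllib.request

# Repository and filename as given in the steps above.
REPO_ID = "hierholzer/Llama-3.1-70B-Instruct-GGUF"
FILENAME = "Llama-3.1-70B-Instruct-Q4_K_M.gguf"


def gguf_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hugging Face repo.

    The "main" revision is an assumption; change it if the repo differs.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(url: str, dest: str) -> None:
    """Stream the file to disk in chunks so it never sits fully in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
```

Calling `download(gguf_url(REPO_ID, FILENAME), FILENAME)` would save the file to the current directory. The file is tens of gigabytes, so check your free disk space first.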

2️⃣ Ollama

Ollama runs as a local service. Although it can be used from a command-line interface, Ollama's best attribute is its REST API. Being able to use your locally run LLMs through this API opens up almost endless possibilities! Feel free to reach out to me if you would like some examples of what I use this API for.
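As an illustration of the REST API, here is a minimal sketch that sends one prompt to a local Ollama service using only the Python standard library. It assumes Ollama is listening on its default address, `http://localhost:11434`, and that the model name you pass has already been created or pulled. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local address (assumed unchanged on your machine).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request asking for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt)) as response:
        return json.loads(response.read())["response"]
```

With the service running, `generate("Llama-3.1-70B-Instruct-Q4_K_M", "Why is the sky blue?")` would return the model's reply as a string.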

☑️ How to install Ollama

Go to the URL below and select the OS you are using:

https://ollama.com/download

On Windows or Mac, you will download an installer file and run it. On Linux, the page provides a single command to run in your terminal window. That's about it for installing Ollama.

✅Using Llama-3.1-70B-Instruct-GGUF with Ollama

Ollama does have a Model Library where you can download models:

https://ollama.com/library

This Model Library offers all sizes of regular Llama 3.1, as well as the 8B version of Llama 3.1-Instruct. However, if you would like to use the 70B quantized version of Llama 3.1-Instruct, you will have to follow the instructions below.

#	Running the 70B quantized version of Llama 3.1-Instruct with Ollama
1.	Download your desired quantized version from the Files and Versions section of this Model Repository.
2.	Next, create a Modelfile configuration that defines the model's behavior. For example:

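# Modelfile
FROM "./Llama-3.1-70B-Instruct-Q4_K_M.gguf"
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""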

Replace ./Llama-3.1-70B-Instruct-Q4_K_M.gguf with the correct version and actual path to the GGUF file you downloaded. The TEMPLATE line defines the prompt format using system, user, and assistant roles. You can customize this based on your use case.

#	Running the 70B quantized version of Llama 3.1-Instruct with Ollama - continued
3.	Now, build the Ollama model using the ollama create command:

ollama create "Llama-3.1-70B-Instruct-Q4_K_M" -f ./Modelfile

Once again, replace the name Llama-3.1-70B-Instruct-Q4_K_M with one that matches the quantized model you are using. The -f option must point at the Modelfile you created in the previous step (saved here as ./Modelfile), not at the GGUF file itself.

#	Running the 70B quantized version of Llama 3.1-Instruct with Ollama - continued
4.	You can then run your model using the ollama run command:

ollama run Llama-3.1-70B-Instruct-Q4_K_M
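If you keep several quantized versions around, the create and run steps can be scripted. The sketch below builds the two `ollama` command lines and runs them with `subprocess`. It assumes the `ollama` executable is on your PATH and that a Modelfile already exists at the path you pass in. The helper names `ollama_commands` and `create_and_run` are hypothetical.

```python
import subprocess


def ollama_commands(model_name: str, modelfile: str, prompt: str) -> list[list[str]]:
    """Return the argv lists for creating a model and running one prompt."""
    return [
        # -f points at the Modelfile, which in turn references the GGUF file.
        ["ollama", "create", model_name, "-f", modelfile],
        # Passing a prompt makes `ollama run` answer once and exit.
        ["ollama", "run", model_name, prompt],
    ]


def create_and_run(model_name: str, modelfile: str, prompt: str) -> None:
    """Run both commands in order, stopping if either one fails."""
    for argv in ollama_commands(model_name, modelfile, prompt):
        subprocess.run(argv, check=True)
```

For example, `create_and_run("Llama-3.1-70B-Instruct-Q4_K_M", "./Modelfile", "Hello!")` would register the model and print a single reply.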