---
base_model: [meta-llama/Meta-Llama-3.1-405B-Instruct]
---


# 🚀 CPU optimized quantizations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) 🖥️

This repository contains CPU-optimized GGUF quantizations of the Meta-Llama-3.1-405B-Instruct model. These quantizations are designed to run efficiently on CPU hardware while maintaining good performance.

## Available Quantizations


1. Q4_0_4_8 (CPU FMA-Optimized): ~246 GB
2. IQ4_XS (Fastest for CPU/GPU): ~212 GB
3. Q2K-Q8 mixed quant with iMatrix: ~154 GB
4. Q2K-Q8 mixed quant without iMatrix (for testing): ~165 GB
5. 1-bit custom per-weight COHERENT quant: ~103 GB
6. BF16: ~811 GB (original model)
7. Q8_0: ~406 GB (original model)
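
As a rough sanity check on the sizes listed above, the average bits per weight can be estimated as file size in GB times 8, divided by the parameter count in billions. The sketch below assumes 1 GB = 10^9 bytes and exactly 405 billion weights; `bits_per_weight` is a hypothetical helper, not part of any tool, and the result is an average over the whole file rather than the precision of any single tensor.

```bash
# Approximate bits per weight: size_in_GB * 8 / billions_of_parameters.
# Assumptions: 1 GB = 10^9 bytes, 405 billion weights (default second argument).
bits_per_weight() {
  awk -v gb="$1" -v params_b="${2:-405}" 'BEGIN { printf "%.2f\n", gb * 8 / params_b }'
}

# Sizes copied from the list above, as name:size_in_GB pairs.
for entry in "Q4_0_4_8:246" "IQ4_XS:212" "Q2K-Q8+imatrix:154" "1-bit:103"; do
  name=${entry%%:*}   # text before the colon
  size=${entry##*:}   # text after the colon
  printf '%-16s ~%s bits/weight\n' "$name" "$(bits_per_weight "$size")"
done
```

Under these assumptions the ~246 GB Q4_0_4_8 files work out to roughly 4.86 bits per weight, which is a quick way to compare the quants against each other.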

## Use aria2 for parallelized downloads (links will download 9x faster)

>>[!TIP]🐧 On Linux `sudo apt install -y aria2`
>>
>>🍎 On Mac `brew install aria2`
>>
>>Feel free to paste these all in at once or one at a time

### Q4_0_4_8 (CPU FMA-optimized, specifically for ARM server chips; NOT TESTED on x86) (Size: ~246 GB)


```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf
```
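
The per-shard commands in this README all follow the same naming pattern, so they can also be generated in a loop. This is a minimal sketch rather than the author's method: `shard_names` and `download_shards` are hypothetical helper names, while the base URL and the `aria2c` flags are copied from the commands above.

```bash
# Hypothetical helper: print the shard file names of a split GGUF.
# $1 = file prefix, $2 = number of shards (both taken from the commands above).
shard_names() {
  local prefix=$1 count=$2 i
  for i in $(seq 1 "$count"); do
    printf '%s-%05d-of-%05d.gguf\n' "$prefix" "$i" "$count"
  done
}

# Download every shard, one after another, with the same aria2c flags as above.
download_shards() {
  local base=https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main
  local name
  shard_names "$1" "$2" | while read -r name; do
    aria2c -x 16 -s 16 -k 1M -o "$name" "$base/$name"
  done
}

# Example (not run automatically):
#   download_shards meta-405b-inst-cpu-optimized-q4048 6
```

The same two functions cover the other quants by changing the prefix and shard count, for example `download_shards meta-405b-cpu-i1-q4xs 5`.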


### IQ4_XS Version: fastest for CPU/GPU, should work everywhere (Size: ~212 GB)
```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00001-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00001-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00002-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00002-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00003-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00003-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00004-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00004-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00005-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00005-of-00005.gguf
```

### 1-bit Custom Per Weight Quantization (Size: ~103 GB)
```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00001-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00001-of-00003.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00002-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00002-of-00003.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00003-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00003-of-00003.gguf
```


### Q2K-Q8 mixed 2-bit/8-bit quant I wrote myself: the smallest coherent one I could make WITHOUT iMatrix (Size: ~165 GB)

```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00001-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00001-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00002-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00002-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00003-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00003-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00004-of-00004.gguf
```

### Q2K-Q8 mixed quant with higher-quality iMatrix (Size: ~154 GB): USE THIS ONE
```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00001-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00001-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00002-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00002-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00003-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00003-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00004-of-00004.gguf
```

<figure>
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/DD71wAB7DlQBmTG8wVaWS.png" alt="Q4_0_4_8 CPU Optimized example response">
  <figcaption><strong>Q4_0_4_8 (CPU Optimized) (246 GB):</strong> Example response to a 20,000-token prompt</figcaption>
</figure>

### BF16 Version

```bash
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00001-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00001-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00002-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00002-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00003-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00003-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00004-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00004-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00005-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00005-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00006-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00006-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00007-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00007-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00008-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00008-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00009-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00009-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00010-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00010-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00011-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00011-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00012-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00012-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00013-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00013-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00014-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00014-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00015-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00015-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00016-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00016-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00017-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00017-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00018-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00018-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00019-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00019-of-00019.gguf
```

### Q8_0 Version

```bash
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00001-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00001-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00002-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00002-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00003-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00003-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00004-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00004-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00005-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00005-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00006-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00006-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00007-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00007-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00008-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00008-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00009-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00009-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00010-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00010-of-00010.gguf
```

## Usage

After downloading, you can run these models with `llama.cpp`. Point `-m` at the first shard of a split model and keep the remaining shards in the same directory. Here's a basic example:

```bash
./llama-cli -t 32 --temp 0.4 -fa -m ~/meow/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf -b 512 -c 9000 -p "Adopt the persona of a NASA JPL mathematician and friendly helpful programmer." -cnv -co -i
```
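
Because each quant is split across several files, it can help to confirm that every shard is present before starting a long model load. The following is a sketch under the assumption that all shards sit next to the first one and follow the `-0000N-of-0000M.gguf` naming used in this repository; `missing_shards` is a hypothetical helper, not a `llama.cpp` command.

```bash
# Hypothetical pre-flight check: given the path to the first shard,
# print the path of every sibling shard that is not on disk.
missing_shards() {
  local first=$1
  local stem=${first%-00001-of-*.gguf}   # path without the shard suffix
  local total=${first##*-of-}            # e.g. "00006.gguf"
  total=${total%.gguf}                   # e.g. "00006"
  local count i name
  count=$(expr "$total" + 0)             # drop leading zeros, e.g. 6
  for i in $(seq 1 "$count"); do
    name=$(printf '%s-%05d-of-%s.gguf' "$stem" "$i" "$total")
    [ -f "$name" ] || echo "$name"
  done
}

# Example (path taken from the usage command above):
#   missing_shards ~/meow/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf
```

An empty result means all shards were found; any printed path is a file that still needs to be downloaded.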

## Model Information

This model is based on the [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) model. It's an instruction-tuned version of the 405B parameter Llama 3.1 model, designed for assistant-like chat and various natural language generation tasks.

Key features:
- 405 billion parameters
- Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
- 128k context length
- Uses Grouped-Query Attention (GQA) for improved inference scalability

For more detailed information about the base model, please refer to the [original model card](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).

## License

The use of this model is subject to the [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE). Please ensure you comply with the license terms when using this model.

## Acknowledgements

Special thanks to the Meta AI team for creating and releasing the Llama 3.1 model series.

## Enjoy; more quants and perplexity benchmarks coming.