Update README.md
README.md
CHANGED
@@ -9,93 +9,45 @@ base_model:

### Llama3-8B-1.58 Models

For a deeper dive into the methods and results, check out our [blog post](https://huggingface.co/blog/1_58_llm_extreme_quantization).

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [Model](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens)
- **Paper:** [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)

## How to Get Started with the Model

```bash
```

Then load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-100B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

input_text = "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"

# Tokenize the prompt, generate a short completion, and decode it
input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
output = model.generate(input_ids, max_new_tokens=10)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)
```

### Training Data

The model was trained on a subset of [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
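
For reference, here is a minimal way to stream FineWeb-edu with the datasets library. The exact training subset is not documented here; the `sample-10BT` config below is only an assumption for illustration.

```python
from datasets import load_dataset

# Stream a published sample of FineWeb-edu (the actual training subset is not specified here)
dataset = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT", split="train", streaming=True)
print(next(iter(dataset))["text"][:200])
```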

### Training Process

1. **Starting Point**
   - Best-performing checkpoint from the 10 billion token runs with a linear lambda scheduler (see the sketch after this list)

2. **Training Duration**
   - Fine-tuned for an additional 45,000 steps
   - Reached a total of 100 billion tokens

3. **Dataset**
   - FineWeb-edu dataset

4. **Batch Size**
   - 2 million tokens per step
   - Total per run: 45,000 steps * 2 million tokens = 90 billion tokens
   - Combined with the initial 10 billion tokens to reach 100 billion

5. **Learning Rate Experiments**
   - Tested various learning rates to find the optimal setting; according to the experiments, the best-performing peak learning rate was 1e-5

6. **Performance**
   - Close to Llama3 8B on some metrics
   - Behind Llama3 8B in overall average performance

7. **Evaluation**
   - Metrics included perplexity, MMLU scores, and other standard benchmarks
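
For readers unfamiliar with the lambda scheduler mentioned in step 1: as described in the blog post, lambda controls how much of the quantized weight is blended with the full-precision weight during warmup. A minimal sketch, assuming a simple linear ramp (the function names and warmup length are illustrative, not the actual training code):

```python
import torch

def linear_lambda(step: int, warmup_steps: int) -> float:
    # Ramp lambda linearly from 0 (full precision) to 1 (fully quantized)
    return min(1.0, step / warmup_steps)

def blended_weight(w: torch.Tensor, w_quant: torch.Tensor, lam: float) -> torch.Tensor:
    # Effective weight during warmup: lam * quantized + (1 - lam) * full precision
    return lam * w_quant + (1.0 - lam) * w
```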

These extended training runs on 100 billion tokens pushed the boundaries of highly quantized models, bringing performance closer to half-precision models like Llama3.

## Evaluation

The evaluation of the models is done on the nanotron checkpoints using LightEval:

![results](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/1.58llm_extreme_quantization/metrics_100B_table.png)
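
For intuition about the perplexity numbers in the table above, here is a rough sketch of how perplexity can be computed with transformers. This is only an illustration, not the LightEval/nanotron pipeline actually used for the reported results:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-100B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

text = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # With labels=input_ids, the model returns the mean token-level cross-entropy
    loss = model(input_ids, labels=input_ids).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```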

### Llama3-8B-1.58 Models

This model was converted to GGUF format from [Model](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens) using llama.cpp.

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:

```bash
llama-cli --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -p "The meaning to life and the universe is"
```

### Server:

```bash
llama-server --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -c 2048
```
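
Once the server is up, you can query it over HTTP. A minimal sketch using Python and the server's `/completion` endpoint, assuming the default `127.0.0.1:8080` address:

```python
import requests

# Ask the running llama.cpp server for a short completion
response = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "The meaning to life and the universe is", "n_predict": 64},
)
print(response.json()["content"])
```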

Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

```bash
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with any hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).

```bash
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.

```bash
./llama-cli --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -p "The meaning to life and the universe is"
```

or

```bash
./llama-server --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -c 2048
```