brunopio committed
Commit 0eb709a
1 parent: a2d2f90

Update README.md

Files changed (1)
  1. README.md +28 -76
README.md CHANGED
@@ -9,93 +9,45 @@ base_model:
  ### Llama3-8B-1.58 Models
- The **Llama3-8B-1.58** models are large language models fine-tuned on the **BitNet 1.58b architecture**, starting from the base model **Llama-3-8B-Instruct**.
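In BitNet b1.58, the linear-layer weights are constrained to the ternary values {-1, 0, 1} (log2(3) ≈ 1.58 bits per weight). The snippet below is only a rough illustration of the absmean ternary quantization described in the linked paper, not the actual training or inference code for this checkpoint; the function name and toy tensor are made up for the example:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Scale a weight tensor by its mean absolute value, then round
    every entry to the nearest value in {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # ternary weights
    return w_ternary, scale                        # w is approximated by w_ternary * scale

# toy usage
w = torch.randn(4, 4)
w_q, s = absmean_ternary_quantize(w)
print(w_q, s)
```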

- For a deeper dive into the methods and results, check out our [blog post](https://huggingface.co/blog/1_58_llm_extreme_quantization).

- ## Model Details
-
- ### Model Sources
-
- - **Repository:** [HF1BitLLM/Llama3-8B-1.58-100B-tokens](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens)
- - **Paper:** [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
-
- ## How to Get Started with the Model

- You can easily load and test our model in Transformers. Just follow the code below:

- Start by installing a transformers build with the configuration needed to load BitNet models:
  ```bash
- pip install git+https://github.com/huggingface/transformers.git@refs/pull/33410/head
  ```
- And then load the model:
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- # Load the 1.58-bit checkpoint and the matching Llama-3 tokenizer
- model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-100B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
- tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
-
- input_text = "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"
-
- input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
- output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
- generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
- print(generated_text)
  ```
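Since the base checkpoint is Llama-3-8B-Instruct, you may get better results by formatting prompts with the tokenizer's chat template. A minimal sketch continuing from the snippet above (the example question is arbitrary):

```python
messages = [{"role": "user", "content": "Mary went back to the garden. Where is Mary?"}]
# Build a Llama-3 chat prompt and move it to the model's device
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```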

- ## Training Details
-
- ### Training Data
-
- The model was trained on a subset of [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
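For reference, FineWeb-edu can be streamed from the Hub with the `datasets` library; a minimal sketch (the exact subset and filtering used for this training run are not specified here):

```python
from datasets import load_dataset

# Stream FineWeb-edu from the Hub rather than downloading it in full
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
for example in fineweb_edu.take(3):
    print(example["text"][:200])
```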
-
- ### Training Process
-
- 1. **Starting Point**
-    - Best-performing checkpoint from the 10 billion token runs with a linear lambda scheduler
-
- 2. **Training Duration**
-    - Fine-tuned for an additional 45,000 steps
-    - Reached a total of 100 billion tokens
-
- 3. **Dataset**
-    - FineWeb-edu dataset
-
- 4. **Batch Size**
-    - 2 million tokens per step
-    - Total per run: 45,000 steps * 2 million tokens = 90 billion tokens (see the token-count sketch below)
-    - Combined with the initial 10 billion tokens to reach 100 billion
-
- 5. **Learning Rate Experiments**
-    - Tested various learning rates to find the optimal setting; according to the experiments, the best-performing peak learning rate is 1e-5
-
- 6. **Performance**
-    - Close to Llama3 8B on some metrics
-    - Behind Llama3 8B in overall average performance
-
- 7. **Evaluation**
-    - Metrics included perplexity, MMLU scores, and other standard benchmarks
-
- These extended training runs on 100 billion tokens pushed the boundaries of highly quantized models, bringing performance closer to half-precision models like Llama3.
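As a quick sanity check of the token budget in item 4 (purely illustrative arithmetic):

```python
steps = 45_000
tokens_per_step = 2_000_000           # 2 million-token batches
initial_tokens = 10_000_000_000       # 10 billion tokens from the starting checkpoint

total_tokens = steps * tokens_per_step + initial_tokens
print(f"{total_tokens:,}")            # 100,000,000,000 tokens, i.e. 100 billion
```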
-
- ## Evaluation
-
- The evaluation of the models is done on the nanotron checkpoints using LightEval:
-
- ![results](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/1.58llm_extreme_quantization/metrics_100B_table.png)
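For reference, a LightEval run generally looks like the sketch below; this is an assumption about the setup rather than the exact command behind the table above, and the model argument, task string, and flags may differ across LightEval versions:

```bash
lighteval accelerate \
    --model_args "pretrained=HF1BitLLM/Llama3-8B-1.58-100B-tokens" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --override_batch_size 1 \
    --output_dir ./evals
```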
-
- ## Citation
-
- ```bibtex
- @misc{mekkouri2024extremequantization,
-   title={1.58-Bit LLM: A New Era of Extreme Quantization},
-   author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
-   year={2024},
- }
  ```
+ This model was converted to GGUF format from [HF1BitLLM/Llama3-8B-1.58-100B-tokens](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens) using llama.cpp.
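For reference, Hugging Face checkpoints are typically converted to GGUF with llama.cpp's converter script; this is a generic sketch rather than the exact command used for this file, and the local path, output name, and dtype are assumptions:

```bash
# run from a llama.cpp checkout, with the original checkpoint downloaded locally
python convert_hf_to_gguf.py ./Llama3-8B-1.58-100B-tokens \
    --outfile Llama3-8B-1.58-100B-tokens.gguf \
    --outtype f16
```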
+ ## Use with llama.cpp
+ Install llama.cpp through brew (works on Mac and Linux):
+ ```bash
+ brew install llama.cpp
+ ```
+ Invoke the llama.cpp server or the CLI.

+ ### CLI:
  ```bash
+ llama-cli --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -p "The meaning to life and the universe is"
  ```

+ ### Server:
+ ```bash
+ llama-server --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -c 2048
  ```
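Once the server is running, it exposes an OpenAI-compatible HTTP API (port 8080 by default); a minimal example request (the prompt is just an illustration):

```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Where did Mary go?"}], "max_tokens": 64}'
```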

+ Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo.

+ Step 1: Clone llama.cpp from GitHub.
+ ```bash
+ git clone https://github.com/ggerganov/llama.cpp
+ ```

+ Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with any hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
+ ```bash
+ cd llama.cpp && LLAMA_CURL=1 make
+ ```
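Note that recent llama.cpp releases have moved from the Makefile to CMake; if `make` is not available in your checkout, the equivalent build is roughly as follows (the CUDA flag shown is optional and just one example):

```bash
cmake -B build -DLLAMA_CURL=ON -DGGML_CUDA=ON
cmake --build build --config Release
```

With a CMake build the binaries land under `build/bin/` (e.g. `./build/bin/llama-cli`).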

+ Step 3: Run inference through the main binary.
+ ```bash
+ ./llama-cli --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -p "The meaning to life and the universe is"
+ ```
+ or
+ ```bash
+ ./llama-server --hf-repo brunopio/Llama3-8B-1.58-100B-tokens-GGUF --hf-file Llama3-8B-1.58-100B-tokens-GGUF -c 2048
  ```