leo-pekelis-gradient committed df6e96a (parent 848a82c): Update README.md
# Llama-3 8B Instruct 1048k

Gradient incorporates your data to deploy autonomous assistants that power critical operations across your business. To learn more or collaborate on a custom model, drop us a message at [email protected].

This model extends Llama-3 8B's context length from 8k to over 1040k tokens. It was developed by Gradient, with compute sponsored by [Crusoe Energy](https://huggingface.co/crusoeai). It demonstrates that SOTA LLMs can learn to operate on long context with minimal training by appropriately adjusting RoPE theta. We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585dc9be92bc5f258156bd6/6MKLoX2ruLIaREiyb6coO.png)

**Approach:**

- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
- NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by empirical RoPE theta optimization (see the sketch after this list)
- Progressive training on increasing context lengths, similar to [Large World Model](https://huggingface.co/LargeWorldModel) [2] (see details below)
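
The NTK-aware initialization in the second bullet can be sketched as follows. This is a minimal illustration, not Gradient's exact recipe: the base theta, head dimension, and original context length are Llama-3 8B's published config values, the d/(d-2) exponent is the commonly used NTK-aware scaling rule, and the resulting thetas were subsequently optimized empirically, so they intentionally differ from the RoPE theta row in the table below.

```python
# Illustrative sketch of NTK-aware RoPE theta initialization (assumptions:
# Llama-3 8B's published rope_theta, head_dim, and 8k original context; the
# d/(d-2) exponent is the standard NTK-aware scaling rule, not Gradient's
# final, empirically tuned values).

BASE_THETA = 500_000.0   # Llama-3 8B default rope_theta
HEAD_DIM = 128           # 4096 hidden size / 32 attention heads
ORIG_CTX = 8_192         # Llama-3 8B original context length

def ntk_theta(target_ctx: int) -> float:
    """Scale RoPE theta so rotary frequencies span the longer context."""
    scale = target_ctx / ORIG_CTX
    return BASE_THETA * scale ** (HEAD_DIM / (HEAD_DIM - 2))

for n in (16, 18, 19, 20):   # context-length stages used in training
    print(f"2^{n} = {2**n:>9} tokens -> initial RoPE theta ~ {ntk_theta(2**n):.3e}")
```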
**Infra:**

We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on [Crusoe Energy](https://huggingface.co/crusoeai)'s high-performance L40S cluster. Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare the 524k and 1048k stages to the 65k and 262k stages in the table below).
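
As a rough picture of how this layered parallelism maps onto the cluster, the sketch below splits the GPUs into data-parallel replicas and gives each replica one Ring Attention ring that shards the sequence. The numbers come from the training table below; the `ring_layout` helper is hypothetical and is not the EasyContext API or the actual training code.

```python
# Rough sketch of the layered layout described above (illustrative only):
# GPUs are first divided into data-parallel replicas, and each replica forms
# one Ring Attention ring whose members hold consecutive shards of the
# sequence, passing KV blocks around the ring.

def ring_layout(n_gpus: int, dp_degree: int, seq_len: int):
    assert n_gpus % dp_degree == 0 and seq_len % (n_gpus // dp_degree) == 0
    ring_size = n_gpus // dp_degree        # GPUs per Ring Attention ring
    tokens_per_gpu = seq_len // ring_size  # sequence shard held by each GPU
    return ring_size, tokens_per_gpu

print(ring_layout(8, 1, 2 ** 16))      # 65k stage   -> (8, 8192)
print(ring_layout(512, 8, 2 ** 19))    # 524k stage  -> (64, 8192)
print(ring_layout(512, 8, 2 ** 20))    # 1048k stage -> (64, 16384)
```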
**Data:**

For training data, we generate long contexts by augmenting SlimPajama.
**Progressive Training Details:**

|                              | 65k         | 262k        | 524k        | 1048k       |
|------------------------------|-------------|-------------|-------------|-------------|
| Initialize From              | Llama-3 8B  | 65k         | 262k        | 524k        |
| Sequence Length 2^N (N)      | 16          | 18          | 19          | 20          |
| RoPE theta                   | 15.3M       | 207.1M      | 1.06B       | 2.80B       |
| batch_size                   | 1           | 1           | 2           | 2           |
| gradient_accumulation_steps  | 32          | 16          | 1           | 1           |
| Steps                        | 30          | 24          | 50          | 50          |
| Total Tokens                 | 62,914,560  | 100,663,296 | 419,430,400 | 838,860,800 |
| learning_rate                | 2.00E-05    | 2.00E-05    | 2.00E-05    | 2.00E-05    |
| # GPUs                       | 8           | 32          | 512         | 512         |
| Ring or Data Parallelism     | 1           | 1           | 8           | 8           |
| GPU Type                     | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S |
| Minutes to Train (Wall)      | 202         | 555         | 61          | 87          |
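
The Total Tokens row follows from the other rows; below is a minimal check, assuming tokens per optimizer step = 2^N × batch_size × data-parallel degree, with the "Ring or Data Parallelism" row read as the number of data-parallel rings.

```python
# Check of the Total Tokens row: tokens per optimizer step are
# 2**N * batch_size * data_parallel_degree, accumulated over
# gradient_accumulation_steps and Steps.

stages = {
    # name: (N, batch_size, grad_accum, steps, parallelism)
    "65k":   (16, 1, 32, 30, 1),
    "262k":  (18, 1, 16, 24, 1),
    "524k":  (19, 2,  1, 50, 8),
    "1048k": (20, 2,  1, 50, 8),
}

for name, (n, batch, accum, steps, dp) in stages.items():
    total = (2 ** n) * batch * accum * steps * dp
    print(f"{name:>6}: {total:,} tokens")
# 65k: 62,914,560  262k: 100,663,296  524k: 419,430,400  1048k: 838,860,800
```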
## The Gradient AI Team