Create README.md #1
by smejak - opened

README.md (ADDED)
---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---

### GELU_2L512W_C4_Code Model Card

**Model Overview**
- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

**Model Specifications**
- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456
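The stated parameter count is consistent with counting only the weight matrices of the two transformer blocks; the sketch below assumes embeddings, unembedding, biases, and LayerNorm parameters are excluded from the figure:

```python
# Dimensions from the specifications above.
n_layers, d_model, d_mlp, n_heads, d_head = 2, 512, 2048, 8, 64

# Per layer: Q, K, V, and O projections (note n_heads * d_head == d_model
# here), plus the MLP input and output projections.
attn_params = 4 * d_model * (n_heads * d_head)   # 1,048,576
mlp_params = 2 * d_model * d_mlp                 # 2,097,152
total = n_layers * (attn_params + mlp_params)

print(total)  # 6291456, matching the stated parameter count
```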

**Training Configurations**
- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine Warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000
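These figures are mutually consistent: a total batch of 256 sequences of 1,024 tokens gives 262,144 tokens per step, and the step counts appear to be the token budgets divided by that figure (a quick sanity check on the numbers above, not part of the original configuration):

```python
total_batch_size, n_ctx = 256, 1024

tokens_per_step = total_batch_size * n_ctx     # 262,144, as listed below
max_steps = 22_000_000_000 // tokens_per_step  # 83,923  ("Max Steps")
warmup_steps = 300_000_000 // tokens_per_step  # 1,144   ("Warmup Steps")

print(tokens_per_step, max_steps, warmup_steps)
```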

**Technical Specifications**
- **Number of Devices:** 8
- **Seed:** 259123
- **Use of bfloat16 for MatMul:** True
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

**Tokenizer**
- **Name:** NeelNanda/gpt-neox-tokenizer-digits
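The `-digits` suffix in the tokenizer name refers to tokenizing numbers digit by digit rather than as multi-digit chunks. The toy pre-tokenizer below only illustrates that idea; it is not the actual tokenizer (the splitting rule here is an assumption, and the real tokenizer applies BPE on top — consult the tokenizer repository for its true behavior):

```python
import re

def split_digits(text):
    # Toy illustration only: each digit becomes its own token, so a number
    # like "2048" yields four tokens instead of one. Non-digit runs are
    # kept whole here for simplicity.
    return re.findall(r"\d|[^\d\s]+", text)

print(split_digits("d_mlp is 2048"))  # ['d_mlp', 'is', '2', '0', '4', '8']
```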

**Miscellaneous**
- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05
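The ε above is the small constant added to the variance for numerical stability in LayerNorm. A minimal sketch of the normalization step (the learned scale and shift that follow it are omitted):

```python
def layer_norm(x, eps=1e-05):
    # Normalize a vector to zero mean and (near-)unit variance; eps guards
    # against division by zero when the variance is tiny.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
```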

**Model Limitations & Ethical Considerations**
- Because this model was trained specifically on a code dataset, it is optimized for code-related tasks and may perform poorly on non-code text.
- As with any AI model, results may vary with the complexity and specificity of the task.
- Ethical considerations should be taken into account when deploying this model, especially in contexts where automation could significantly affect human labor or decision-making.

**Notes for Users**
- The model's performance can be influenced by hyperparameter tuning and the specific nature of the dataset.
- Users are encouraged to familiarize themselves with the model's specifications and training configurations to optimize its use for their specific needs.

---

*This model card is intended to provide a detailed overview of the GELU_2L512W_C4_Code model. Users should refer to additional documentation and resources for more comprehensive guidelines and best practices on deploying and utilizing this model.*