---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---
### GELU_2L512W_C4_Code Model Card

**Model Overview**
- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

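This checkpoint is one of the toy language models released for mechanistic interpretability research, and is most conveniently loaded through the TransformerLens library. A minimal sketch, assuming the `gelu-2l` alias (the name under which this checkpoint is commonly registered in TransformerLens):

```python
# A minimal loading sketch via TransformerLens; the "gelu-2l" alias is
# assumed to resolve to this checkpoint.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")
print(model.cfg.n_layers, model.cfg.d_model, model.cfg.n_heads)  # 2 512 8

# Run a prompt and cache every intermediate activation for inspection.
logits, cache = model.run_with_cache("def add(a, b):\n    return a + b")
print(logits.shape)  # (batch=1, seq_len, d_vocab)
```
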
**Model Specifications**
- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456

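The quoted parameter count matches the non-embedding weights exactly, which suggests it follows the usual convention for these toy models of excluding the embedding and unembedding matrices. A quick check under that assumption (note that d_head × n_heads = 64 × 8 = 512 = d_model, so the attention projections are square):

```python
# Sanity check of the listed parameter count, assuming it covers
# non-embedding weights only (attention QKVO plus the two MLP matrices).
n_layers, d_model, d_mlp = 2, 512, 2048
attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
mlp_params = 2 * d_model * d_mlp      # W_in, W_out
print(n_layers * (attn_params + mlp_params))  # 6291456
```
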
**Training Configurations**
- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine Warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000

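The schedule values above are mutually consistent: tokens per step is the total batch size times the context size, and the warmup and max step counts follow from the token budgets up to rounding. A quick check:

```python
# How the training-schedule numbers fit together (rounding assumed).
per_device, n_devices, n_ctx = 32, 8, 1024   # 8 devices, per the section below
total_batch_size = per_device * n_devices    # 256, as listed
tokens_per_step = total_batch_size * n_ctx
print(tokens_per_step)                       # 262144, "Tokens per Step"
print(83_923 * tokens_per_step)              # ~2.2e10, the 22B max-token budget
print(300_000_000 // tokens_per_step)        # 1144, "Warmup Steps"
```
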
**Technical Specifications**
- **Number of Devices:** 8
- **Seed:** 259123
- **Use of bfloat16 for MatMul:** True
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

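The initializer scales above are most naturally read as per-matrix standard deviations (times the global scale) under the GPT-2-style scheme listed above; a sketch of that reading, which the original training code may compose differently:

```python
# One plausible reading of the initializer scales, assuming a GPT-2-style
# zero-mean normal init with a per-matrix std; this is a sketch, not the
# original training code.
import torch

d_model, d_vocab, scale_global = 512, 48262, 1.0
W_hidden = torch.randn(d_model, d_model) * (0.02 * scale_global)   # hidden
W_embed = torch.randn(d_vocab, d_model) * (0.1 * scale_global)     # embed
W_unembed = torch.randn(d_model, d_vocab) * (0.02 * scale_global)  # unembed
```
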
**Tokenizer**
- **Name:** NeelNanda/gpt-neox-tokenizer-digits

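Judging from its name, this is a GPT-NeoX tokenizer variant in which each digit is its own token. A minimal sketch of loading it directly from the Hugging Face Hub:

```python
# Load the tokenizer from the Hub; the per-digit tokenization is an
# assumption based on the repository name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")
print(len(tokenizer))                   # expected to line up with d_vocab = 48,262
print(tokenizer.tokenize("x = 12345"))  # digits should come out as separate tokens
```
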
**Miscellaneous**
- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05

**Model Limitations & Ethical Considerations**
- Trained on the c4_code mix (web text plus code), this model is geared toward code-heavy text and may underperform on other domains.
- With only two layers and roughly 6M non-embedding parameters, it is a small research model intended for interpretability work rather than production use.
- As with any language model, output quality varies with the complexity and specificity of the task.
- Ethical considerations apply when deploying this model, especially where automation could significantly affect human labor or decision-making.

**Notes for Users**
- Performance depends on hyperparameter choices and on the nature of the data the model is applied to.
- Users should review the specifications and training configuration above to adapt the model to their needs.

---

*This model card provides a detailed overview of the GELU_2L512W_C4_Code model. Refer to the accompanying documentation and resources for more comprehensive guidance on deploying and using it.*