---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---
# GELU_2L512W_C4_Code Model Card
## Model Overview
- Model Name: GELU_2L512W_C4_Code
- Version: 201
- Primary Application: Code-related tasks
- Model Architecture: Transformer-based
- Activation Function: GELU (Gaussian Error Linear Unit)
- Normalization: Layer Normalization (LN)
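For reference, a minimal sketch of loading this model with TransformerLens. The alias `gelu-2l` is an assumption about how the model is registered there and should be checked against the model table of your installed version:

```python
from transformer_lens import HookedTransformer

# "gelu-2l" is assumed to be the TransformerLens alias for
# NeelNanda/GELU_2L512W_C4_Code -- verify against your installed version.
model = HookedTransformer.from_pretrained("gelu-2l")
print(model.cfg.n_layers, model.cfg.d_model, model.cfg.act_fn)  # expect: 2 512 gelu

# Run the model on a code snippet; returns next-token logits.
logits = model("def add(a, b):\n    return a + b")
```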
## Model Specifications
- Number of Layers: 2
- Model Dimension (d_model): 512
- MLP Dimension (d_mlp): 2048
- Head Dimension (d_head): 64
- Number of Heads (n_heads): 8
- Context Size (n_ctx): 1024
- Vocabulary Size (d_vocab): 48,262
- Number of Parameters: 6,291,456
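The stated parameter count is consistent with counting only the transformer-block weight matrices (attention and MLP), excluding embedding, unembedding, and bias parameters; a quick arithmetic check:

```python
n_layers, d_model, d_mlp = 2, 512, 2048

attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O (n_heads * d_head == d_model)
mlp_params = 2 * d_model * d_mlp      # W_in, W_out
per_layer = attn_params + mlp_params  # 3,145,728
print(n_layers * per_layer)           # 6,291,456 -- matches the figure above
```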
## Training Configurations
- Dataset: c4_code
- Batch Size per Device: 32
- Total Batch Size: 256
- Batches per Step: 1
- Max Steps: 83,923
- Warmup Steps: 1,144
- Learning Rate Schedule: Cosine Warmup
- Learning Rate (Hidden Layers): 0.002
- Learning Rate (Vector): 0.001
- Optimizer Betas: [0.9, 0.99]
- Weight Decay: 0.05
- Gradient Norm Clipping: 1.0
- Max Tokens: 22,000,000,000
- Warmup Tokens: 300,000,000
- Truncate Tokens: 1,000,000,000,000
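Below is a hedged sketch of the optimizer setup these hyperparameters imply: AdamW with the listed betas and weight decay, and a linear warmup into cosine decay. How parameters are split between the "hidden" and "vector" learning-rate groups is an assumption, not taken from the original training code:

```python
import math
import torch

# Illustrative only: reconstructs a plausible optimizer/schedule from the
# hyperparameters above. The hidden/vector grouping is an assumption.
def make_optimizer(model):
    hidden, vector = [], []
    for name, p in model.named_parameters():
        (vector if p.ndim <= 1 or "embed" in name else hidden).append(p)
    return torch.optim.AdamW(
        [{"params": hidden, "lr": 2e-3}, {"params": vector, "lr": 1e-3}],
        betas=(0.9, 0.99), weight_decay=0.05,
    )

def lr_lambda(step, warmup=1_144, max_steps=83_923):
    if step < warmup:
        return step / warmup                         # linear warmup
    progress = (step - warmup) / (max_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Each step, clip gradients: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```

Note that the figures are mutually consistent: 262,144 tokens per step is the total batch size of 256 sequences times the 1,024-token context, and 83,923 steps at that rate gives roughly the 22B-token budget.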
## Technical Specifications
- Number of Devices: 8
- Seed: 259123
- Use of bfloat16 for MatMul: True
- Debug Options: Disabled
- Save Checkpoints: Enabled
- Tokens per Step: 262,144
- Initializer Scales:
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- Neuron Scale: 1.0
- Neuron Temperature: 1.0
- Weight Initialization Scheme: GPT-2
- Fixed Initialization: 2L512W_init
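As a rough illustration of the initializer scales above, here is one plausible reading under a GPT-2-style scheme: normal initialization with the listed values as standard deviations, multiplied by the global scale. The exact formula used in training is not specified here, so treat this as an assumption:

```python
import torch
import torch.nn as nn

# Assumed interpretation of the initializer scales; the original training
# code may combine them differently (e.g. scaling a base std instead).
INIT = {"global": 1.0, "hidden": 0.02, "embed": 0.1, "unembed": 0.02}

def init_weight(weight: torch.Tensor, kind: str) -> None:
    nn.init.normal_(weight, mean=0.0, std=INIT["global"] * INIT[kind])

W_E = torch.empty(48_262, 512); init_weight(W_E, "embed")    # embedding
W_in = torch.empty(512, 2048);  init_weight(W_in, "hidden")  # MLP input
```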
## Tokenizer
- Name: NeelNanda/gpt-neox-tokenizer-digits
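The tokenizer can be loaded from the Hugging Face Hub with `transformers`; per its name, it is expected to tokenize digits individually:

```python
from transformers import AutoTokenizer

# Load the tokenizer named above from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")

ids = tokenizer("x = 12345")["input_ids"]
# Digits should appear as separate tokens, per the tokenizer's name.
print(tokenizer.convert_ids_to_tokens(ids))
```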
## Miscellaneous
- Layer-wise Learning Rate Decay: 0.99
- Log Interval: 50
- Control Parameter: 1.0
- Shortformer Positional Embedding: Disabled
- Attention Only: False
- Use Accelerated Computation: False
- Layer Normalization Epsilon: 1e-05
## Model Limitations & Ethical Considerations
- Because it was trained on the c4_code corpus, this model is tuned for code-heavy text and may not perform well on other domains.
- As with any language model, results vary with the complexity and specificity of the task.
- Ethical considerations should be weighed when deploying this model, especially in contexts where automation could significantly affect human labor or decision-making.
## Notes for Users
- The model's performance can be influenced by hyperparameter tuning and the specific nature of the dataset.
- Users are encouraged to familiarize themselves with the model's specifications and training configurations to optimize its use for their specific needs.
This model card provides a detailed overview of the GELU_2L512W_C4_Code model. Users should refer to additional documentation and resources for more comprehensive guidelines and best practices on deploying and using it.