---
datasets:
  - NeelNanda/c4-code-20k
tags:
  - mechanistic_interpretability
---

# GELU_2L512W_C4_Code Model Card

## Model Overview

- Model Name: GELU_2L512W_C4_Code
- Version: 201
- Primary Application: Code-related tasks
- Model Architecture: Transformer-based
- Activation Function: GELU (Gaussian Error Linear Unit)
- Normalization: Layer Normalization (LN)

## Model Specifications

- Number of Layers: 2
- Model Dimension (d_model): 512
- MLP Dimension (d_mlp): 2048
- Head Dimension (d_head): 64
- Number of Heads (n_heads): 8
- Context Size (n_ctx): 1024
- Vocabulary Size (d_vocab): 48,262
- Number of Parameters: 6,291,456
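
For convenience, the architecture above can be expressed as a TransformerLens config. This is a minimal, illustrative sketch rather than an official loading recipe; the `gelu-2l` alias mentioned in the final comment is an assumption and should be checked against the TransformerLens model registry.

```python
# Minimal sketch: build a model with the architecture listed above using
# TransformerLens. This constructs a randomly initialised model; it is not
# the official loading path for the trained checkpoint.
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=2,               # Number of Layers
    d_model=512,              # Model Dimension
    d_mlp=2048,               # MLP Dimension
    d_head=64,                # Head Dimension
    n_heads=8,                # Number of Heads
    n_ctx=1024,               # Context Size
    d_vocab=48262,            # Vocabulary Size
    act_fn="gelu",            # GELU activation
    normalization_type="LN",  # Layer Normalization
)
model = HookedTransformer(cfg)

# Loading the trained weights via the "gelu-2l" alias is an assumption;
# verify the name in the TransformerLens model table before relying on it.
# model = HookedTransformer.from_pretrained("gelu-2l")
```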

## Training Configurations

- Dataset: c4_code
- Batch Size per Device: 32
- Total Batch Size: 256
- Batches per Step: 1
- Max Steps: 83,923
- Warmup Steps: 1,144
- Learning Rate Schedule: Cosine Warmup
- Learning Rate (Hidden Layers): 0.002
- Learning Rate (Vector): 0.001
- Optimizer Betas: [0.9, 0.99]
- Weight Decay: 0.05
- Gradient Norm Clipping: 1.0
- Max Tokens: 22,000,000,000
- Warmup Tokens: 300,000,000
- Truncate Tokens: 1,000,000,000,000
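
To make the schedule settings above concrete, here is an illustrative sketch of a warmup-then-cosine learning-rate schedule using the listed values. The linear warmup shape and the decay floor of zero are assumptions; the original training code may differ in detail.

```python
import math

def lr_at_step(step: int, max_lr: float = 0.002,
               warmup_steps: int = 1_144, max_steps: int = 83_923) -> float:
    """Sketch of a warmup + cosine decay schedule with the settings above.

    Assumptions: linear warmup and decay to zero; the exact schedule used in
    the original run may differ.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps                      # warmup phase
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

# With a total batch size of 256 sequences and n_ctx = 1024 tokens each,
# one step consumes 256 * 1024 = 262,144 tokens ("Tokens per Step" below).
```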

## Technical Specifications

- Number of Devices: 8
- Seed: 259123
- Use of bfloat16 for MatMul: True
- Debug Options: Disabled
- Save Checkpoints: Enabled
- Tokens per Step: 262,144
- Initializer Scales:
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- Neuron Scale: 1.0
- Neuron Temperature: 1.0
- Weight Initialization Scheme: GPT-2
- Fixed Initialization: 2L512W_init
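
The initializer scales above belong to a GPT-2-style scheme (zero-mean normal initialization with small, per-group standard deviations). The sketch below is illustrative only; how the global scale is combined with the per-group scales in the original training code is an assumption.

```python
import torch
import torch.nn as nn

# Per-group standard deviations taken from the Initializer Scales listed above.
INIT_SCALES = {"hidden": 0.02, "embed": 0.1, "unembed": 0.02}

def gpt2_style_init(weight: torch.Tensor, group: str,
                    scale_global: float = 1.0) -> torch.Tensor:
    """Illustrative GPT-2-style init: zero-mean normal with a per-group std.

    Assumption: the global scale simply multiplies the per-group scale.
    """
    nn.init.normal_(weight, mean=0.0, std=scale_global * INIT_SCALES[group])
    return weight

# Example: initialise an embedding matrix of the listed vocab/model sizes.
W_E = gpt2_style_init(torch.empty(48262, 512), group="embed")
```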

## Tokenizer

- Name: NeelNanda/gpt-neox-tokenizer-digits
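
The tokenizer is available on the Hugging Face Hub under the name above; a minimal loading example with the `transformers` library (assuming it is installed) is:

```python
from transformers import AutoTokenizer

# Load the tokenizer listed above from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")

# Example: tokenize a short code snippet.
print(tokenizer("def add(a, b):\n    return a + b")["input_ids"])
```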

## Miscellaneous

- Layer-wise Learning Rate Decay: 0.99
- Log Interval: 50
- Control Parameter: 1.0
- Shortformer Positional Embedding: Disabled
- Attention Only: False
- Use Accelerated Computation: False
- Layer Normalization Epsilon: 1e-05

## Model Limitations & Ethical Considerations

- Because this model was trained on a code-oriented dataset, it is optimized for code-related tasks and may not perform as well on non-code data.
- As with any AI model, results may vary with the complexity and specificity of the task.
- Ethical considerations should be taken into account when deploying this model, especially in contexts where automation could significantly affect human labor or decision-making.

## Notes for Users

- The model's performance can be influenced by hyperparameter tuning and the specific nature of the dataset.
- Users are encouraged to familiarize themselves with the model's specifications and training configurations to optimize its use for their specific needs.

This model card is intended to provide a detailed overview of the GELU_2L512W_C4_Code model. Users should refer to additional documentation and resources for more comprehensive guidelines and best practices on deploying and utilizing this model.