---
datasets:
  - NeelNanda/c4-code-20k
tags:
  - mechanistic_interpretability
---

# GELU_2L512W_C4_Code Model Card

## Model Overview

- Model Name: GELU_2L512W_C4_Code
- Version: 201
- Primary Application: Code-related tasks
- Model Architecture: Transformer-based
- Activation Function: GELU (Gaussian Error Linear Unit)
- Normalization: Layer Normalization (LN)

## Model Specifications

- Number of Layers: 2
- Model Dimension (d_model): 512
- MLP Dimension (d_mlp): 2048
- Head Dimension (d_head): 64
- Number of Heads (n_heads): 8
- Context Size (n_ctx): 1024
- Vocabulary Size (d_vocab): 48,262
- Number of Parameters: 6,291,456
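
For convenience, the architecture above can be expressed as a TransformerLens config. This is a minimal, illustrative sketch rather than an official loading recipe; the `gelu-2l` alias mentioned in the final comment is an assumption and should be checked against the TransformerLens model registry.

```python
# Minimal sketch: build a model with the architecture listed above using
# TransformerLens. This constructs a randomly initialised model; it is not
# the official loading path for the trained checkpoint.
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=2,               # Number of Layers
    d_model=512,              # Model Dimension
    d_mlp=2048,               # MLP Dimension
    d_head=64,                # Head Dimension
    n_heads=8,                # Number of Heads
    n_ctx=1024,               # Context Size
    d_vocab=48262,            # Vocabulary Size
    act_fn="gelu",            # GELU activation
    normalization_type="LN",  # Layer Normalization
)
model = HookedTransformer(cfg)

# Loading the trained weights via the "gelu-2l" alias is an assumption;
# verify the name in the TransformerLens model table before relying on it.
# model = HookedTransformer.from_pretrained("gelu-2l")
```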

## Training Configurations

- Dataset: c4_code
- Batch Size per Device: 32
- Total Batch Size: 256
- Batches per Step: 1
- Max Steps: 83,923
- Warmup Steps: 1,144
- Learning Rate Schedule: Cosine Warmup
- Learning Rate (Hidden Layers): 0.002
- Learning Rate (Vector): 0.001
- Optimizer Betas: [0.9, 0.99]
- Weight Decay: 0.05
- Gradient Norm Clipping: 1.0
- Max Tokens: 22,000,000,000
- Warmup Tokens: 300,000,000
- Truncate Tokens: 1,000,000,000,000
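
To make the schedule settings above concrete, here is an illustrative sketch of a warmup-then-cosine learning-rate schedule using the listed values. The linear warmup shape and the decay floor of zero are assumptions; the original training code may differ in detail.

```python
import math

def lr_at_step(step: int, max_lr: float = 0.002,
               warmup_steps: int = 1_144, max_steps: int = 83_923) -> float:
    """Sketch of a warmup + cosine decay schedule with the settings above.

    Assumptions: linear warmup and decay to zero; the exact schedule used in
    the original run may differ.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps                      # warmup phase
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

# With a total batch size of 256 sequences and n_ctx = 1024 tokens each,
# one step consumes 256 * 1024 = 262,144 tokens ("Tokens per Step" below).
```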

## Technical Specifications

- Number of Devices: 8
- Seed: 259123
- Use of bfloat16 for MatMul: True
- Debug Options: Disabled
- Save Checkpoints: Enabled
- Tokens per Step: 262,144
- Initializer Scales:
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- Neuron Scale: 1.0
- Neuron Temperature: 1.0
- Weight Initialization Scheme: GPT-2
- Fixed Initialization: 2L512W_init
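
The initializer scales above belong to a GPT-2-style scheme (zero-mean normal initialization with small, per-group standard deviations). The sketch below is illustrative only; how the global scale is combined with the per-group scales in the original training code is an assumption.

```python
import torch
import torch.nn as nn

# Per-group standard deviations taken from the Initializer Scales listed above.
INIT_SCALES = {"hidden": 0.02, "embed": 0.1, "unembed": 0.02}

def gpt2_style_init(weight: torch.Tensor, group: str,
                    scale_global: float = 1.0) -> torch.Tensor:
    """Illustrative GPT-2-style init: zero-mean normal with a per-group std.

    Assumption: the global scale simply multiplies the per-group scale.
    """
    nn.init.normal_(weight, mean=0.0, std=scale_global * INIT_SCALES[group])
    return weight

# Example: initialise an embedding matrix of the listed vocab/model sizes.
W_E = gpt2_style_init(torch.empty(48262, 512), group="embed")
```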

## Tokenizer

- Name: NeelNanda/gpt-neox-tokenizer-digits
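
The tokenizer is available on the Hugging Face Hub under the name above; a minimal loading example with the `transformers` library (assuming it is installed) is:

```python
from transformers import AutoTokenizer

# Load the tokenizer listed above from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")

# Example: tokenize a short code snippet.
print(tokenizer("def add(a, b):\n    return a + b")["input_ids"])
```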

## Miscellaneous

- Layer-wise Learning Rate Decay: 0.99
- Log Interval: 50
- Control Parameter: 1.0
- Shortformer Positional Embedding: Disabled
- Attention Only: False
- Use Accelerated Computation: False
- Layer Normalization Epsilon: 1e-05

## Model Limitations & Ethical Considerations

- Because this model was trained on a code-oriented dataset, it is optimized for code-related tasks and may not perform as well on non-code data.
- As with any AI model, results may vary with the complexity and specificity of the task.
- Ethical considerations should be taken into account when deploying this model, especially in contexts where automation could significantly affect human labor or decision-making.

## Notes for Users

- The model's performance can be influenced by hyperparameter tuning and the specific nature of the dataset.
- Users are encouraged to familiarize themselves with the model's specifications and training configurations to optimize its use for their specific needs.

This model card is intended to provide a detailed overview of the GELU_2L512W_C4_Code model. Users should refer to additional documentation and resources for more comprehensive guidelines and best practices on deploying and utilizing this model.