---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---
### GELU_2L512W_C4_Code Model Card
**Model Overview**
- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)
**Model Specifications**
- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456 (transformer-block weights only, excluding embeddings; see the sanity check below)
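
The parameter count can be reproduced from the dimensions above. A minimal sanity check, assuming the figure covers only the attention and MLP weight matrices of the two transformer blocks (embedding, unembedding, and LayerNorm parameters excluded):

```python
# Reproduce the listed parameter count from the architecture dims.
# Assumption: the 6,291,456 figure counts attention + MLP weights only.
n_layers, d_model, d_mlp = 2, 512, 2048

attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
mlp = 2 * d_model * d_mlp      # W_in, W_out

print(n_layers * (attn + mlp))  # 6291456
```
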
**Training Configurations**
- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine Warmup (sketched below)
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000
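
For intuition: the Tokens per Step figure listed under Technical Specifications (262,144) is exactly the total batch size (256) times the context size (1,024), and 83,923 steps × 262,144 tokens/step ≈ 22B tokens, matching Max Tokens. Below is a minimal sketch of the cosine-warmup schedule named above, using the hidden-layer peak rate of 0.002 (the vector rate of 0.001 would follow the same shape); linear warmup and decay to zero are assumptions, as the card only states "Cosine Warmup":

```python
import math

def lr_at_step(step, peak_lr=2e-3, warmup_steps=1_144, max_steps=83_923):
    """Illustrative cosine-warmup schedule: linear warmup, then cosine decay.

    The decay floor and exact shape are assumptions; the card states only
    "Cosine Warmup" with 1,144 warmup steps out of 83,923 total.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(0), lr_at_step(1_144), lr_at_step(83_923))
```
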
**Technical Specifications**
- **Number of Devices:** 8
- **Seed:** 259123
- **Use of bfloat16 for MatMul:** True
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
- Global: 1.0
- Hidden: 0.02
- Embed: 0.1
- Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2 (see the sketch below)
- **Fixed Initialization:** 2L512W_init
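
A hedged sketch of what the initializer scales above might mean under a GPT-2-style scheme, assuming each weight matrix is drawn from a normal distribution whose standard deviation is the global scale times its per-role scale; the exact convention used in training is not documented here:

```python
import torch

INIT_SCALES = {"hidden": 0.02, "embed": 0.1, "unembed": 0.02}

def init_param(shape, role, global_scale=1.0):
    # Assumption: weights ~ N(0, (global_scale * role_scale)^2),
    # in the spirit of GPT-2-style initialization.
    return torch.randn(shape) * global_scale * INIT_SCALES[role]

w_embed = init_param((48_262, 512), "embed")  # d_vocab x d_model
```
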
**Tokenizer**
- **Name:** NeelNanda/gpt-neox-tokenizer-digits
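
The tokenizer is available on the Hugging Face Hub and can be loaded with `transformers`; per its name, it appears to be a GPT-NeoX-style BPE tokenizer modified to tokenize digits individually:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")

ids = tokenizer("def add(a, b):\n    return a + b")["input_ids"]
print(len(ids), tokenizer.decode(ids))
```
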
**Miscellaneous**
- **Layer-wise Learning Rate Decay:** 0.99 (illustrated below)
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05
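
The layer-wise learning-rate decay of 0.99 most likely scales the learning rate per layer. A sketch under the common convention that layers further from the output get geometrically smaller rates; both the direction and the exact formula are assumptions:

```python
def layer_lr(base_lr, layer_idx, n_layers=2, decay=0.99):
    # Assumed convention: the top layer gets base_lr, and each layer
    # below it is scaled by a further factor of `decay`.
    return base_lr * decay ** (n_layers - 1 - layer_idx)

print([layer_lr(2e-3, i) for i in range(2)])  # [0.00198, 0.002]
```
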
**Model Limitations & Ethical Considerations**
- This model was trained on the c4_code dataset (C4 web text plus code), so it is best suited to code and web-text domains and may not perform well outside that distribution.
- At two layers and roughly 6M non-embedding parameters, it is a small research model intended for mechanistic interpretability work rather than a production language model; expect limited capability on complex tasks.
- Ethical considerations should be taken into account when deploying any language model, especially in contexts where automation could significantly affect human labor or decision-making.
**Notes for Users**
- The model's performance can be influenced by hyperparameter tuning and the specific nature of the dataset.
- Users are encouraged to review the specifications and training configuration above to judge the model's fit for their use case; a minimal loading sketch follows.
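
A minimal loading sketch, assuming this checkpoint is exposed through TransformerLens under the `gelu-2l` alias (verify against the TransformerLens model registry before relying on it):

```python
from transformer_lens import HookedTransformer

# Assumption: "gelu-2l" is the TransformerLens alias for this checkpoint.
model = HookedTransformer.from_pretrained("gelu-2l")

logits = model("def fibonacci(n):")
print(logits.shape)  # [batch, seq_len, d_vocab]
```
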
---
*This model card is intended to provide a detailed overview of the GELU_2L512W_C4_Code model. Users should refer to additional documentation and resources for more comprehensive guidelines and best practices on deploying and utilizing this model.*