---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---
### GELU_2L512W_C4_Code Model Card

**Model Overview**
- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

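This checkpoint is one of the toy language models released for mechanistic interpretability research, and is most conveniently loaded through the TransformerLens library. A minimal sketch, assuming the `gelu-2l` alias (the name under which this checkpoint is commonly registered in TransformerLens):

```python
# A minimal loading sketch via TransformerLens; the "gelu-2l" alias is
# assumed to resolve to this checkpoint.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")
print(model.cfg.n_layers, model.cfg.d_model, model.cfg.n_heads)  # 2 512 8

# Run a prompt and cache every intermediate activation for inspection.
logits, cache = model.run_with_cache("def add(a, b):\n    return a + b")
print(logits.shape)  # (batch=1, seq_len, d_vocab)
```
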
**Model Specifications**
- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456

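The quoted parameter count matches the non-embedding weights exactly, which suggests it follows the usual convention for these toy models of excluding the embedding and unembedding matrices. A quick check under that assumption (note that d_head × n_heads = 64 × 8 = 512 = d_model, so the attention projections are square):

```python
# Sanity check of the listed parameter count, assuming it covers
# non-embedding weights only (attention QKVO plus the two MLP matrices).
n_layers, d_model, d_mlp = 2, 512, 2048
attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
mlp_params = 2 * d_model * d_mlp      # W_in, W_out
print(n_layers * (attn_params + mlp_params))  # 6291456
```
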
**Training Configurations**
- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine Warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000

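The schedule values above are mutually consistent: tokens per step is the total batch size times the context size, and the warmup and max step counts follow from the token budgets up to rounding. A quick check:

```python
# How the training-schedule numbers fit together (rounding assumed).
per_device, n_devices, n_ctx = 32, 8, 1024   # 8 devices, per the section below
total_batch_size = per_device * n_devices    # 256, as listed
tokens_per_step = total_batch_size * n_ctx
print(tokens_per_step)                       # 262144, "Tokens per Step"
print(83_923 * tokens_per_step)              # ~2.2e10, the 22B max-token budget
print(300_000_000 // tokens_per_step)        # 1144, "Warmup Steps"
```
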
**Technical Specifications**
- **Number of Devices:** 8
- **Seed:** 259123
- **Use of bfloat16 for MatMul:** True
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

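The initializer scales above are most naturally read as per-matrix standard deviations (times the global scale) under the GPT-2-style scheme listed above; a sketch of that reading, which the original training code may compose differently:

```python
# One plausible reading of the initializer scales, assuming a GPT-2-style
# zero-mean normal init with a per-matrix std; this is a sketch, not the
# original training code.
import torch

d_model, d_vocab, scale_global = 512, 48262, 1.0
W_hidden = torch.randn(d_model, d_model) * (0.02 * scale_global)   # hidden
W_embed = torch.randn(d_vocab, d_model) * (0.1 * scale_global)     # embed
W_unembed = torch.randn(d_model, d_vocab) * (0.02 * scale_global)  # unembed
```
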
**Tokenizer**
- **Name:** NeelNanda/gpt-neox-tokenizer-digits

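Judging from its name, this is a GPT-NeoX tokenizer variant in which each digit is its own token. A minimal sketch of loading it directly from the Hugging Face Hub:

```python
# Load the tokenizer from the Hub; the per-digit tokenization is an
# assumption based on the repository name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")
print(len(tokenizer))                   # expected to line up with d_vocab = 48,262
print(tokenizer.tokenize("x = 12345"))  # digits should come out as separate tokens
```
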
**Miscellaneous**
- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05

**Model Limitations & Ethical Considerations**
- Trained on the c4_code mix (web text plus code), this model is geared toward code-heavy text and may underperform on other domains.
- With only two layers and roughly 6M non-embedding parameters, it is a small research model intended for interpretability work rather than production use.
- As with any language model, output quality varies with the complexity and specificity of the task.
- Ethical considerations apply when deploying this model, especially where automation could significantly affect human labor or decision-making.

**Notes for Users**
- Performance depends on hyperparameter choices and on the nature of the data the model is applied to.
- Users should review the specifications and training configuration above to adapt the model to their needs.

---

*This model card provides a detailed overview of the GELU_2L512W_C4_Code model. Refer to the accompanying documentation and resources for more comprehensive guidance on deploying and using it.*