---
license: apache-2.0
language:
- en
widget:
- text: It is raining and my family
example_title: Example 1
- text: We entered into the forest and
example_title: Example 2
- text: I sat down to do my homework
example_title: Example 3
---
# Custom GPT Model
## Model Description
This model was designed and pretrained from scratch, without using the Hugging Face Transformers library.
---
## Model Parameters
- **Block Size**: `256` (Maximum sequence length)
- **Vocab Size**: `50257` (Includes 50,000 BPE merges, 256 byte-level tokens, and 1 special token)
- **Number of Layers**: `8`
- **Number of Heads**: `8`
- **Embedding Dimension**: `768`
- **Max Learning Rate**: `0.0006`
- **Min Learning Rate**: `0.00006` (10% of max_lr)
- **Warmup Steps**: `715`
- **Max Steps**: `52000`
- **Total Batch Size**: `524288` (Number of tokens per batch)
- **Micro Batch Size**: `128`
- **Sequence Length**: `256`
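For reference, the hyperparameters above can be collected into a single configuration object. The sketch below is purely illustrative: `TrainingConfig` and its field names are hypothetical and need not match the actual `GPTConfig` in `gpt_class.py`.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    block_size: int = 256            # maximum sequence length
    vocab_size: int = 50257          # 50,000 BPE merges + 256 byte tokens + 1 special token
    n_layer: int = 8
    n_head: int = 8
    n_embd: int = 768
    max_lr: float = 6e-4
    min_lr: float = 6e-5             # 10% of max_lr
    warmup_steps: int = 715
    max_steps: int = 52000
    total_batch_size: int = 524288   # tokens processed per optimizer step
    micro_batch_size: int = 128
    seq_len: int = 256

# With these values, each optimizer step accumulates gradients over
# 524288 / (128 * 256) = 16 micro-batches.
```

The max/min learning rate and warmup values are a common fit for linear warmup followed by cosine decay. The card does not state the exact schedule, so the function below is only a sketch of that assumption:

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=715, max_steps=52000):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```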
## Model Parameters Details
### Decayed Parameters
- **Total Decayed Parameters**: 95,453,184
Decayed parameters are the weight matrices of the model's layers (e.g., the transformer blocks), which are subject to weight decay during optimization. Weight decay regularizes the model and can reduce overfitting by penalizing large weights.
### Non-Decayed Parameters
- **Total Non-Decayed Parameters**: 81,408
Non-decayed parameters are the biases and layer-normalization parameters. These are excluded from weight decay because applying decay to them provides little regularization benefit and can destabilize the learning dynamics.
### Total Parameters
- **Overall Total Parameters**: 95,534,592
The total parameter count includes both decayed and non-decayed tensors, roughly 95.5 million parameters in all.
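A common way to implement this split in PyTorch is to apply weight decay to tensors with two or more dimensions (embeddings and linear weights) and exempt one-dimensional tensors (biases and LayerNorm gains). The sketch below only illustrates that pattern and is not the exact optimizer setup used for this model; the `weight_decay=0.1` and `betas` values are assumptions.

```python
import torch

def configure_optimizer(model, weight_decay=0.1, learning_rate=6e-4):
    # Split parameters: tensors with >= 2 dimensions (embeddings, linear weights)
    # receive weight decay; 1-D tensors (biases, LayerNorm gains) do not.
    params = [p for p in model.parameters() if p.requires_grad]
    decay_params = [p for p in params if p.dim() >= 2]
    nodecay_params = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": nodecay_params, "weight_decay": 0.0},
    ]
    print(f"decayed parameters: {sum(p.numel() for p in decay_params):,}")
    print(f"non-decayed parameters: {sum(p.numel() for p in nodecay_params):,}")
    return torch.optim.AdamW(groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8)
```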
---
## Dataset Description
### Overview
The model was pretrained on a significant subset of the **HuggingFaceFW/fineweb-edu** dataset: 3 billion tokens selected from its "Sample 10B" segment. This corpus is compiled from educational and academic web sources, making it a strong foundation for language models aimed at academic and formal text.
### Dataset Source
The dataset is hosted and maintained on Hugging Face's dataset repository. More detailed information and access to the dataset can be found through its dedicated page:
[HuggingFaceFW/fineweb-edu Sample 10B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/10BT)
### Training Details
- **Total Tokens Used for Training**: 3 billion tokens
- **Training Duration**: 3 epochs over this subset, giving the model repeated exposure to the data (see the data-loading sketch below).
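A minimal sketch of how this sample can be streamed and tokenized with the `datasets` library is shown below. The config name `sample-10BT` and the streaming/tokenization details are assumptions and may differ from the preprocessing actually used for this model.

```python
import tiktoken
from datasets import load_dataset

# Stream the 10B-token educational sample without downloading it all at once.
# The config name "sample-10BT" is taken from the dataset page and may change.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # end-of-text token id, used as a document separator

def tokenize(doc):
    # Prepend the EOT token so documents stay delimited in the token stream.
    return [eot] + enc.encode_ordinary(doc["text"])

token_stream = (tokenize(doc) for doc in ds)
```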
---
## Model Evaluation on HellaSwag Dataset
### Performance Overview
The evaluation of our model, "orator", on the HellaSwag dataset shows steady improvement in context-based sentence completion. Below, we detail the performance through loss and accuracy graphs, accompanied by the key metrics.
### Graph Analysis
#### Loss Graph
![Loss Graph](output1.png)
- **Blue Line (Train Loss)**: Represents the model's loss on the training set over the number of training steps. It shows a sharp decline initially, indicating rapid learning, followed by fluctuations that gradually stabilize.
- **Orange Line (Validation Loss)**: Shows the loss on the validation set. This line is smoother than the training loss, indicating general stability and effectiveness of the model against unseen data.
- **Red Dashed Line**: Marks the validation loss of a baseline model, OpenAI's GPT-2 (124M), for comparison. Our model achieves lower validation loss, indicating improved performance.
#### Accuracy Graph (HellaSwag Eval)
![Accuracy Graph](output2.png)
- **Blue Line**: This line represents the accuracy of the "orator" model on the HellaSwag evaluation set. It shows a steady increase in accuracy, reflecting the model's improving capability to correctly predict or complete new scenarios.
- **Red Dashed Line**: This is the accuracy of the baseline OpenAI GPT-2 (124M) model. Our model consistently surpasses this benchmark after initial training phases.
### Key Metrics
- **Minimum Training Loss**: `2.883471`
- **Minimum Validation Loss**: `3.1989`
- **Maximum HellaSwag Evaluation Accuracy**: `0.3054`
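For context, HellaSwag accuracy is typically computed as a multiple-choice score: each context comes with four candidate endings, and the model's pick is the ending with the lowest average per-token loss. The snippet below sketches that scoring rule, assuming the model returns `(logits, loss)` as in the generation example further down; it is not the exact evaluation script used for this model.

```python
import torch
from torch.nn import functional as F

@torch.no_grad()
def pick_ending(model, ctx_tokens, ending_token_lists, device="cpu"):
    """Return the index of the candidate ending with the lowest average per-token loss."""
    avg_losses = []
    for ending in ending_token_lists:
        tokens = torch.tensor(ctx_tokens + ending, dtype=torch.long, device=device).unsqueeze(0)
        logits, _ = model(tokens)             # (1, T, vocab_size)
        shift_logits = logits[:, :-1, :]      # predict token t+1 from position t
        shift_targets = tokens[:, 1:]
        losses = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
            reduction="none",
        )
        # Score only the ending tokens (the last len(ending) predictions).
        avg_losses.append(losses[-len(ending):].mean().item())
    return min(range(len(avg_losses)), key=avg_losses.__getitem__)
```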
---
### Tokenization
For tokenization, this model uses:
```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
```
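As a quick, illustrative round-trip check of the GPT-2 BPE encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("We entered into the forest and")
print(ids)              # token ids drawn from the 50,257-entry GPT-2 vocabulary
print(enc.decode(ids))  # "We entered into the forest and"
```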
---
## How to Use the Model
### Load and Generate Text
Below is a Python example of how to load the model and generate text:
```python
import torch
from torch.nn import functional as F
from gpt_class import GPTConfig, GPT
import tiktoken

# Set up the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint and rebuild the model
state_dict = torch.load('model_51999.pt', map_location=device)
config = state_dict['config']
model = GPT(config)
model.load_state_dict(state_dict['model'])
model.to(device)
model.eval()

# Seed for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Tokenizer (GPT-2 BPE)
tokenizer = tiktoken.get_encoding("gpt2")

def Generate(model, tokenizer, example, num_return_sequences, max_length):
    model.eval()
    # Encode the prompt and replicate it once per requested sequence
    tokens = tokenizer.encode(example)
    tokens = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).repeat(num_return_sequences, 1)
    tokens = tokens.to(device)
    sample_rng = torch.Generator(device=device)
    sample_rng.manual_seed(42)  # seed the sampling generator so output is reproducible
    xgen = tokens
    while xgen.size(1) < max_length:
        with torch.no_grad():
            with torch.autocast(device_type=device):
                logits, _ = model(xgen)
            # Take the logits at the last position and convert to probabilities
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            # Top-k sampling with k=50
            topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
            ix = torch.multinomial(topk_probs, 1, generator=sample_rng)
            xcol = torch.gather(topk_indices, -1, ix)
            # Append the sampled token and continue
            xgen = torch.cat((xgen, xcol), dim=1)
    # Decode and print each generated sequence
    for i in range(num_return_sequences):
        tokens = xgen[i, :max_length].tolist()
        decoded = tokenizer.decode(tokens)
        print(f"Sample {i+1}: {decoded}")

# Example usage
Generate(model, tokenizer, example="As we entered the forest we saw", num_return_sequences=4, max_length=32)
```
### Sample Output
```
Sample 1: As we entered the forest we saw huge white pine fells at the tops of the high plateaus (the great peaks) and trees standing at ground level.
Sample 2: As we entered the forest we saw a few trees that were too large. We realized they were not going to be very big. There was one tree that was
Sample 3: As we entered the forest we saw a group of small, wood-dwelling bees who had managed to escape a predator. A farmer was holding a handful
Sample 4: As we entered the forest we saw giant, blue-eyed, spotted beetles on the ground, a grayling beetle in my lawn next to the pond, an
```
## Accessing the Original Code
The original code for the Orator model (architecture, pretraining, evaluation, and text generation) is available in the GitHub repository below.
**GitHub Repository**: [my-temporary-name/my_gpt2](https://github.com/my-temporary-name/my_gpt2)