---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# GPT-α

A pretrained GPT model with 124M parameters trained on 40B tokens of educational content. The full implementation of the model can be found on GitHub [here](https://github.com/fraserlove/gpt-alpha).

The model was trained for 4 epochs on the 10B token subset of [fineweb-edu](https://arxiv.org/pdf/2406.17557), a large-scale dataset of educational content.

Here are some example completions from the model after training on 40B tokens. The context is *`Once upon a time,`*. The completions are generated using top-k sampling with a maximum length of 64 tokens, a temperature of 1.0 and a k value of 50.

```
Once upon a time, people were going to buy the “cork” that was used to wrap and hang the wine. However, what began to be called “cork” as soon as the time rolled around was probably an artificial wine. This is how we know cork as the “cork”

Once upon a time, there was a time in the history of India when the great religion of India was worshipped by only two people… the Hindus and the Jains. This is the story of how the story of India was created. India’s story begins with a very ancient Vedic religion. They were the ancient Indus valley

Once upon a time, the King of Italy, who was to govern what would become the world, thought that it would be a great and noble undertaking to introduce the Roman Senate into the country in order to defend Rome — to defend her own capital in a very civilized manner, to promote the arts and promote the Roman religion. Accordingly, Rome,
```

## Training

The exact model architecture and training script can be found on [GitHub](https://github.com/fraserlove/gpt-alpha). GPT-α uses the GPT-2 tokeniser.

The model was trained on 40B tokens over 76,296 iterations using a cosine learning rate schedule with a warmup period of 375M tokens. A maximum learning rate of 18e-4 (3x that of GPT-3) is decayed over the training period. Overall, training lasted a continuous 11.5 hours on 8× A100-SXM4 40GB GPUs, running at 1.07M tokens per second with a batch size of 16.

The model surpassed GPT-3 124M on [HellaSwag](https://arxiv.org/pdf/1905.07830) after just 38B tokens, a 7.8x improvement in token efficiency over GPT-3, which was trained on 300B tokens. The final model at 40B tokens achieved a HellaSwag score of 0.339.
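As a rough illustration, the following is a minimal sketch of a warmup-plus-cosine-decay schedule matching the figures above. The linear warmup shape, the decay floor of 10% of the maximum learning rate and the tokens-per-iteration value are assumptions for illustration, not the exact training configuration.

```python
import math

max_lr = 18e-4                        # maximum learning rate (3x GPT-3 124M)
min_lr = max_lr * 0.1                 # assumed decay floor (10% of max)
max_steps = 76_296                    # total training iterations
tokens_per_step = 40e9 / max_steps    # ~524K tokens per iteration (assumed)
warmup_steps = int(375e6 / tokens_per_step)  # 375M-token warmup period

def lr_schedule(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)
```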
## Inference

The model can be used directly with a pipeline for text generation:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='fraserlove/gpt-alpha')
>>> set_seed(0)
>>> generator('Once upon a time,', max_length=30, num_return_sequences=5, do_sample=True)

[{'generated_text': 'Once upon a time, my father had some way that would help him win his first war. There was a man named John. He was the husband'},
 {'generated_text': 'Once upon a time, this particular breed would be considered a “chicken fan”; today, the breed is classified as a chicken.'},
 {'generated_text': 'Once upon a time, there was a famous English nobleman named King Arthur (in the Middle Ages, it was called ‘the Arthur’'},
 {'generated_text': "Once upon a time, the Christian God created the world in the manner which, under different circumstances, was true of the world's existence. The universe"},
 {'generated_text': 'Once upon a time, I wrote all of the letters of an alphabets in a single document. Then I read each letter of that alphabet'}]
```

The model can also be used directly for inference:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' # for GPU usage or 'cpu' for CPU usage

tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-alpha')

# For multi-GPU, install accelerate and use `model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-alpha', device_map='auto')`
model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-alpha').to(device)

context = tokeniser.encode('Once upon a time,', return_tensors='pt').to(device)
samples = model.generate(context, do_sample=True)
print(tokeniser.decode(samples[0]))
```

To get the features of a given text:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' # for GPU usage or 'cpu' for CPU usage

tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-alpha')
model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-alpha').to(device)

encoded = tokeniser('Once upon a time,', return_tensors='pt').to(device)
output = model(**encoded)
```

## Evaluation

| Benchmark         | GPT-α 124M | GPT-2 124M | GPT-Neo 125M | OPT 125M   | Pythia 160M |
|-------------------|:----------:|:----------:|:------------:|:----------:|:-----------:|
| PIQA              | **63.06%** | 62.51%     | 62.46%       | 62.08%     | 61.26%      |
| SIQA              | **38.18%** | 36.59%     | 37.21%       | 37.21%     | 36.69%      |
| OpenBookQA        | **29.80%** | 27.20%     | 26.20%       | 28.00%     | 27.00%      |
| TriviaQA          | **1.31%**  | 0.30%      | 0.66%        | 1.18%      | 0.41%       |
| TruthfulQA        | 33.13%     | 31.73%     | **35.70%**   | 33.50%     | 34.75%      |
| MMLU              | 23.30%     | 25.90%     | 25.58%       | **25.94%** | 25.10%      |
| WinoGrande        | 50.20%     | 50.04%     | **51.70%**   | 51.07%     | 48.78%      |
| ARC Challenge     | **29.18%** | 22.95%     | 22.87%       | 22.10%     | 22.10%      |
| HellaSwag         | **35.74%** | 31.64%     | 30.58%       | 31.69%     | 30.15%      |
| GSM-8K            | **2.27%**  | 0.68%      | 1.74%        | 1.74%      | 2.20%       |
| **Average Score** | **30.62%** | 28.95%     | 29.47%       | 29.45%     | 28.84%      |
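The harness used to produce these scores is not specified here. As a minimal sketch, comparable zero-shot numbers could be obtained with EleutherAI's lm-evaluation-harness along the lines below; the chosen tasks, zero-shot setting and batch size are assumptions rather than the exact evaluation setup used.

```python
# pip install lm-eval
import lm_eval

# Zero-shot evaluation of GPT-α on a subset of the benchmarks above (assumed settings).
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=fraserlove/gpt-alpha',
    tasks=['piqa', 'openbookqa', 'winogrande', 'arc_challenge', 'hellaswag'],
    num_fewshot=0,
    batch_size=16,
)
print(results['results'])
```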