---
language:
  - en
tags:
  - pytorch
  - causal-lm
  - pythia
license: apache-2.0
datasets:
  - EleutherAI/pile
---

Model Card for Pythia 160M

The Pythia 160M model is part of a suite of models developed to facilitate interpretability research (see the Pythia GitHub repository) and trained on the Pile. We have evaluated it on HellaSwag using the EleutherAI evaluation harness; the results are reported in the Evaluation section below.

Model Details

  • Developed by: EleutherAI
  • Model type: Transformer-based Language Model
  • Language: English
  • Learn more: Pythia's GitHub repository documents the training procedure, config files, and details on how to use the models. See the Pythia paper for more evaluations and implementation details.
  • Library: GPT-NeoX
  • License: Apache 2.0
  • Contact: to ask questions about this model, join the EleutherAI Discord and post them in #release-discussion. Please read the existing Pythia documentation before asking about the model there. For general correspondence: contact@eleuther.ai.
| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models      |
| ------------ | -------------------- | ------ | --------- | ----- | ---------- | ------------- | ---------------------- |
| 160M         | 85,056,000           | 12     | 768       | 12    | 2M         | 6.0 × 10⁻⁴    | GPT-Neo 125M, OPT-125M |

Engineering details for the Pythia Suite. Deduped and non-deduped models of a given size have the same hyperparameters. “Equivalent” models have exactly the same architecture and the same number of non-embedding parameters.
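
These hyperparameters can be checked directly against the released configuration (a minimal sketch; the field names are those of the Transformers `GPTNeoXConfig` class):

```python
from transformers import AutoConfig

# Fetch the Pythia-160M configuration from the Hugging Face Hub.
config = AutoConfig.from_pretrained("EleutherAI/pythia-160m")

# These should match the table above: 12 layers, model dim 768, 12 heads.
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```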

Model Description

This is the model card for Pythia 160M, evaluated with the EleutherAI evaluation harness.

  • Developed by: EleutherAI
  • Model type: Pythia-160M
  • Language(s) (NLP): EN
  • License: Apache 2.0

Model Sources

  • Repository: https://github.com/EleutherAI/pythia
  • Paper: https://arxiv.org/abs/2304.01373

Uses and Limitations

Intended Use

The primary intended use of Pythia is research on the behavior, functionality, and limitations of large language models. This suite is intended to provide a controlled setting for performing scientific experiments. We also provide 154 checkpoints per model: an initial checkpoint step0, 10 log-spaced checkpoints step{1,2,4...512}, and 143 evenly-spaced checkpoints from step1000 to step143000. These checkpoints are hosted on Hugging Face as branches; a loading sketch follows below. Note that branch 143000 corresponds exactly to the model checkpoint on the main branch of each model.
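
To load a particular checkpoint, pass its branch name as the `revision` argument in Transformers (a minimal sketch; any of the step branches listed above can be substituted):

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Each checkpoint lives on its own branch, e.g. "step1000" ... "step143000".
# Branch "step143000" is identical to "main".
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step1000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step1000",
)
```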

You may also further fine-tune and adapt Pythia-160M for deployment, as long as your use is in accordance with the Apache 2.0 license. Pythia models work with the Hugging Face Transformers Library. If you decide to use pre-trained Pythia-160M as a basis for your fine-tuned model, please conduct your own risk and bias assessment.
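
As a starting point for adaptation, the main-branch model loads and generates like any other causal LM in Transformers (a minimal sketch; the prompt is arbitrary):

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Tokenize a prompt and generate a short continuation.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```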

Out-of-scope use

The Pythia Suite is not intended for deployment. It is not in itself a product and cannot be used for human-facing interactions. For example, the model may generate harmful or offensive text. Please evaluate the risks associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation or generating text in other languages.

Pythia-160M has not been fine-tuned for downstream contexts in which language models are commonly deployed, such as writing genre prose or powering commercial chatbots. This means Pythia-160M will not respond to a given prompt the way a product like ChatGPT does. This is because, unlike Pythia-160M, ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human Feedback (RLHF) to better “follow” human instructions.

Limitations and biases

The core functionality of a large language model is to take a string of text and predict the next token. The token deemed statistically most likely by the model need not produce the most “accurate” text. Never rely on Pythia-160M to produce factually accurate output.

This model was trained on the Pile, a dataset known to contain profanity and texts that are lewd or otherwise offensive. See Section 6 of the Pile paper for a discussion of documented biases with regards to gender, religion, and race. Pythia-160M may produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference API, we recommend having a human curate the outputs of this language model before presenting them to other people. Please inform your audience that the text was generated by Pythia-160M.

Training

Training data

The Pile is an 825 GiB general-purpose dataset in English. It was created by EleutherAI specifically for training large language models. It contains texts from 22 diverse sources, roughly broken down into five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub, Enron Emails). See the Pile paper for a breakdown of all data sources, methodology, and a discussion of ethical implications. Consult the datasheet for more detailed documentation about the Pile and its component datasets. The Pile can be downloaded from the official website, or from a community mirror.
The Pile was not deduplicated before being used to train Pythia-160M.

Training procedure

All models were trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 tokens during training, and 143 checkpoints for each model are saved every 2,097,152,000 tokens, spaced evenly throughout training, from step1000 to step143000 (which is the same as main). In addition, we also provide frequent early checkpoints: step0 and step{1,2,4...512}. This corresponds to training for just under 1 epoch on the Pile for non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

All Pythia models trained for 143,000 steps at a batch size of 2M (2,097,152 tokens). See GitHub for more details on the training procedure, including how to reproduce it. Pythia uses the same tokenizer as GPT-NeoX-20B.
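
These figures are mutually consistent; a quick sanity check, using only numbers quoted above:

```python
# Batch size: 2,097,152 tokens per step, for 143,000 steps.
tokens_per_step = 2_097_152
total_steps = 143_000

total_tokens = tokens_per_step * total_steps
assert total_tokens == 299_892_736_000  # matches the total quoted above

# Checkpoints are saved every 1,000 steps from step1000 to step143000.
tokens_between_checkpoints = tokens_per_step * 1_000
assert tokens_between_checkpoints == 2_097_152_000
```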

Evaluation

This model has been evaluated on HellaSwag (zero-shot) using the EleutherAI evaluation harness.

| Tasks     | Version | Filter | n-shot | Metric   | Value  | Stderr   |
| --------- | ------- | ------ | ------ | -------- | ------ | -------- |
| hellaswag | 1       | none   | 0      | acc      | 0.2872 | ± 0.0045 |
| hellaswag | 1       | none   | 0      | acc_norm | 0.3082 | ± 0.0046 |

Evaluation results.
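
A run like this can be reproduced with the harness's Python API (a sketch, assuming a recent lm-evaluation-harness release in which `simple_evaluate` is exposed; exact numbers may vary slightly across harness versions):

```python
import lm_eval

# Evaluate Pythia-160M on HellaSwag, zero-shot, via the Hugging Face backend.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)

# Per-task metrics, including acc and acc_norm with their standard errors.
print(results["results"]["hellaswag"])
```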