---
language: en
tags:
- exbert
---

# OLM GPT-2 December 2022

This is a more up-to-date version of the original GPT-2, and it also tends to perform better than the original GPT-2 on standard benchmarks. It was trained on a cleaned December 2022 snapshot of Common Crawl and Wikipedia.

This model was created as part of the OLM project, which has the goal of continuously training and releasing models that are up-to-date and comparable in standard language model performance to their static counterparts. This is important because we want our models to know about events like COVID or a presidential election right after they happen.

## Intended uses

You can use the raw model for text generation or fine-tune it to a downstream task.
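For fine-tuning, here is a minimal causal-language-modeling sketch using the Hugging Face Trainer. The corpus file, sequence length, and training arguments below are illustrative assumptions, not settings from the OLM project:

```python
# Hypothetical fine-tuning sketch; the corpus file and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-dec-2022')

# Placeholder corpus: any plain-text file with one example per line.
dataset = load_dataset('text', data_files={'train': 'my_corpus.txt'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset['train'].map(tokenize, batched=True, remove_columns=['text'])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='olm-gpt2-finetuned', num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```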

## How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> # It is important to include bad_words_ids=[[0,2]] if you want this model to stay on topic.
>>> # Otherwise, the model may generate start and end tokens followed by text that is not relevant
>>> # to the previous text.
>>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
>>> set_seed(42)
>>> # This example also illustrates that sometimes our model generates
>>> # bloggy/spammy/web-y things, even though it gets higher evaluation results
>>> # than the original GPT-2 across a variety of benchmarks. See the first output.
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
TODO
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-dec-2022')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')  # tokenize to PyTorch tensors
output = model(**encoded_input)                       # forward pass
```
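The causal-LM head above returns next-token logits; if you want the hidden-state features themselves, one option (a sketch using standard transformers arguments, not anything specific to this model) is to request the hidden states explicitly:

```python
import torch

with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)

logits = output.logits               # (batch, sequence, vocab) next-token scores
features = output.hidden_states[-1]  # (batch, sequence, hidden) last-layer features
```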

## Dataset

The model and tokenizer were trained with this December 2022 cleaned Common Crawl dataset plus this December 2022 cleaned Wikipedia dataset.
The tokenized version of these concatenated datasets is here.
The datasets were created with this repo.
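If you want to inspect the training data, the snapshots can be streamed with the datasets library. The dataset id below is a placeholder, not a real id; substitute the ids from the links above:

```python
from datasets import load_dataset

# Placeholder id: replace with the linked Common Crawl or Wikipedia snapshot.
dataset_id = "olm/REPLACE_WITH_DATASET_ID"
dataset = load_dataset(dataset_id, split="train", streaming=True)  # stream to avoid a full download
print(next(iter(dataset)))
```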

## Training

The model was trained according to the OLM GPT-2 instructions at this repo.

## Evaluation results

The model achieves the following results without any fine-tuning (zero-shot):

| Task          | Metric       | Original GPT-2 | OLM GPT-2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
|---------------|--------------|----------------|---------------------------|--------------------------------------------------|
| rte           | acc          | 0.5307         | 0.5199                    |                                                  |
| piqa          | acc/acc_norm | 0.6289/0.6251  | 0.6692/0.6665             |                                                  |
| copa          | acc          | 0.6400         | 0.6800                    |                                                  |
| record        | f1/em        | 0.7094/0.7026  | 0.6884/0.6818             |                                                  |
| boolq         | acc          | 0.4872         | 0.6021                    |                                                  |
| cb            | acc/f1       | 0.4101/0.2619  | 0.3393/0.1840             | /NA                                              |
| hellaswag     | acc/acc_norm | 0.2892/0.3114  | 0.3079/0.3482             |                                                  |
| mrpc          | acc/f1       | 0.5662/0.6911  | 0.6814/0.8099             |                                                  |
| multirc       | acc          | 0.0189         | 0.0220                    |                                                  |
| lambada       | ppl/acc      | 40.0554/0.3256 | 28.3359/0.3699            |                                                  |
| wsc           | acc          | 0.4327         | 0.3654                    |                                                  |
| wic           | acc          | 0.4922         | 0.5000                    |                                                  |
| mnli          | acc          | 0.3372         | 0.3501                    |                                                  |
| qnli          | acc          | 0.5017         | 0.4946                    |                                                  |
| cola          | mcc          | 0.0126         | 0.0000                    |                                                  |
| triviaqa      | acc          | 0.0151         | 0.0181                    |                                                  |
| winogrande    | acc          | 0.5162         | 0.5051                    |                                                  |
| webqs         | acc          | 0.0030         | 0.0079                    |                                                  |
| arc_easy      | acc/acc_norm | 0.4381/0.3948  | 0.4693/0.4230             |                                                  |
| arc_challenge | acc/acc_norm | 0.1903/0.2270  | 0.2090/0.2398             |                                                  |

To get these results, we used the Eleuther AI evaluation harness here, which can produce results different from those reported in the GPT-2 paper. The p-values are computed from the standard errors reported by the evaluation harness, under a normal-distribution assumption.
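For reference, here is a small sketch of how such a two-tailed p-value can be derived from two accuracies and their standard errors under the normality assumption. The accuracies are the rte row from the table; the stderr values are illustrative placeholders (the harness reports the real ones):

```python
from math import erf, sqrt

def two_tailed_p(mean_a, stderr_a, mean_b, stderr_b):
    # Under a normal assumption, the difference of the two estimates is
    # normal with variance stderr_a**2 + stderr_b**2.
    z = (mean_a - mean_b) / sqrt(stderr_a**2 + stderr_b**2)
    # Two-tailed p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# rte accuracies from the table; 0.03 is a placeholder stderr, not a reported value.
print(two_tailed_p(0.5307, 0.03, 0.5199, 0.03))
```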