Update README.md
README.md
CHANGED
@@ -65,6 +65,17 @@ license: apache-2.0

![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)

Flan-UL2 is an encoder-decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released last year. It was fine-tuned using the "Flan" prompt tuning and dataset collection.

@@ -73,22 +84,16 @@ According to the original [blog](https://www.yitay.net/blog/flan-ul2-20b), here are the notable improvements:

- The Flan-UL2 checkpoint uses a receptive field of 2048, which makes it more usable for few-shot in-context learning (see the short sketch after this list).
- The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome, as they often required changes during inference or fine-tuning. In this update, we continued training UL2 20B for an additional 100k steps (with a small batch size) so that it forgets the “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.
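
To make the first point concrete, here is a minimal sketch that tokenizes a toy few-shot prompt and checks its length against the 2048-token receptive field; the prompt text is made up for illustration and is not part of the original card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

# A toy few-shot prompt; a 2048-token receptive field leaves room for many more in-context examples.
few_shot_prompt = (
    "Review: The movie was fantastic. Sentiment: positive\n"
    "Review: I fell asleep halfway through. Sentiment: negative\n"
    "Review: A solid, enjoyable film. Sentiment:"
)

n_tokens = tokenizer(few_shot_prompt, return_tensors="pt").input_ids.shape[-1]
print(f"{n_tokens} prompt tokens (receptive field: 2048)")
```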

## Converting from T5x to Hugging Face

You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict=False`. The final layer norm is missing from the original dictionary, which is why we pass the `strict=False` argument.

```bash
python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path ~/code/ul2/flan-ul220b-v3/ --config_file config.json --pytorch_dump_path ~/code/ul2/flan-ul2
```
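
As a quick sanity check after conversion, the sketch below loads the dumped checkpoint from the local path used in the command above; the path and dtype are only assumptions for illustration.

```python
import os

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Path passed as --pytorch_dump_path above; adjust it to your own location.
dump_path = os.path.expanduser("~/code/ul2/flan-ul2")

model = T5ForConditionalGeneration.from_pretrained(dump_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

print(model.config.d_model, model.config.num_layers)  # quick check that the converted config loaded
```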

## Performance improvement

| | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| FLAN-PaLM 62B | 59.6 | 47.5 | 56.9 | 44.9 | 49.9 |
| FLAN-PaLM 540B | 73.5 | 57.9 | 70.9 | 66.3 | 67.2 |
| FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
| FLAN-UL2 20B | 55.7 (+1.1%) | 45.9 (+1.3%) | 52.2 (+7.4%) | 42.7 (+3.1%) | 49.1 (+3.2%) |

# Using the model

For more efficient memory usage, we advise you to load the model in `8bit` using the `load_in_8bit` flag as follows:

@@ -126,9 +131,23 @@ print(tokenizer.decode(outputs[0]))

# <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
```


# Introduction to UL2

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)

@@ -143,9 +162,10 @@ Paper: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1)

Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*

# Training

The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan prompting. This means that the original training corpus is `C4`.

In “Scaling Instruction-Finetuned Language Models” (Chung et al.), sometimes also referred to as the Flan2 paper, the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions, which enables generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction tuned with Flan.

@@ -215,77 +235,10 @@ In total, the model was trained for 2.65 million steps.

**Important**: For more details, please see sections 5.2.1 and 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).

This model was originally contributed by [Yi Tay](https://www.yitay.net/?author=636616684c5e64780328eece), and added to the Hugging Face ecosystem by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

The following shows how one can predict masked passages using the different denoising strategies.
Given the size of the model, the following examples need to be run on at least a 40GB A100 GPU.

### S-Denoising

For *S-Denoising*, please make sure to prompt the text with the prefix `[S2S]` as shown below.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

model = T5ForConditionalGeneration.from_pretrained("google/ul2", low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/ul2")

input_string = "[S2S] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere <extra_id_0>"

inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(inputs, max_length=200)

print(tokenizer.decode(outputs[0]))
# -> <pad>. Dudley was a very good boy, but he was also very stupid.</s>
```

### R-Denoising

For *R-Denoising*, please make sure to prompt the text with the prefix `[NLU]` as shown below.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

model = T5ForConditionalGeneration.from_pretrained("google/ul2", low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/ul2")

input_string = "[NLU] Mr. Dursley was the director of a firm called <extra_id_0>, which made <extra_id_1>. He was a big, solid man with a bald head. Mrs. Dursley was thin and <extra_id_2> of neck, which came in very useful as she spent so much of her time <extra_id_3>. The Dursleys had a small son called Dudley and <extra_id_4>"

inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(inputs, max_length=200)

print(tokenizer.decode(outputs[0]))
# -> "<pad><extra_id_0> Burrows<extra_id_1> brooms for witches and wizards<extra_id_2> had a lot<extra_id_3> scolding Dudley<extra_id_4> a daughter called Petunia. Dudley was a nasty, spoiled little boy who was always getting into trouble. He was very fond of his pet rat, Scabbers.<extra_id_5> Burrows<extra_id_3> screaming at him<extra_id_4> a daughter called Petunia</s>"
```

### X-Denoising

For *X-Denoising*, please make sure to prompt the text with the prefix `[NLG]` as shown below.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

model = T5ForConditionalGeneration.from_pretrained("google/ul2", low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/ul2")

input_string = "[NLG] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. <extra_id_0>"

model.cuda()
inputs = tokenizer(input_string, return_tensors="pt", add_special_tokens=False).input_ids.to("cuda")

outputs = model.generate(inputs, max_length=200)

print(tokenizer.decode(outputs[0]))
# -> "<pad><extra_id_0> Burrows<extra_id_1> a lot of money from the manufacture of a product called '' Burrows'''s ''<extra_id_2> had a lot<extra_id_3> looking down people's throats<extra_id_4> a daughter called Petunia. Dudley was a very stupid boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat, ugly boy who was always getting into trouble. He was a big, fat,"
```


![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)

# Table of Contents

0. [TL;DR](#tldr)
1. [Using the model](#using-the-model)
2. [Results](#results)
3. [Introduction to UL2](#introduction-to-ul2)
4. [Contribution](#contribution)
5. [Citation](#citation)

# TL;DR

Flan-UL2 is an encoder-decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released last year. It was fine-tuned using the "Flan" prompt tuning and dataset collection.

According to the original [blog](https://www.yitay.net/blog/flan-ul2-20b), here are the notable improvements:

- The Flan-UL2 checkpoint uses a receptive field of 2048, which makes it more usable for few-shot in-context learning.
- The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome, as they often required changes during inference or fine-tuning. In this update, we continued training UL2 20B for an additional 100k steps (with a small batch size) so that it forgets the “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.

# Using the model

## Converting from T5x to Hugging Face

You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict=False`. The final layer norm is missing from the original dictionary, which is why we pass the `strict=False` argument.

```bash
python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path ~/code/ul2/flan-ul220b-v3/ --config_file config.json --pytorch_dump_path ~/code/ul2/flan-ul2
```

## Running the model

For more efficient memory usage, we advise you to load the model in `8bit` using the `load_in_8bit` flag as follows:

# <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
```
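
The body of the example is not fully visible in this view, so here is a minimal sketch of the 8-bit loading path described above. It assumes `bitsandbytes` and `accelerate` are installed, and the prompt is an illustrative chain-of-thought question matching the decoded answer shown above.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the 20B checkpoint in 8-bit to reduce memory usage (requires bitsandbytes).
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

input_string = (
    "Answer the following question by reasoning step by step. "
    "The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, how many apples do they have?"
)

inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)

print(tokenizer.decode(outputs[0]))
# Expected output along the lines of:
# <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
```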

# Results

## Performance improvement

The reported results are the following:

| | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| FLAN-PaLM 62B | 59.6 | 47.5 | 56.9 | 44.9 | 49.9 |
| FLAN-PaLM 540B | 73.5 | 57.9 | 70.9 | 66.3 | 67.2 |
| FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
| FLAN-UL2 20B | 55.7 (+1.1%) | 45.9 (+1.3%) | 52.2 (+7.4%) | 42.7 (+3.1%) | 49.1 (+3.2%) |

# Introduction to UL2

This entire section has been copied from the [`google/ul2`](https://huggingface.co/google/ul2) model card and might be subject to change with respect to `flan-ul2`.

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
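
To make the mode-switching idea concrete, the small sketch below pairs each denoiser with the paradigm prefix used by the original UL2 checkpoint (the prefixes are the ones used in the `google/ul2` denoising examples); the helper function is purely illustrative, and Flan-UL2 no longer needs these tokens.

```python
# Paradigm prefixes used by the original UL2 checkpoint (not needed for Flan-UL2).
MODE_TOKENS = {
    "[S2S]": "S-Denoising (sequential / prefix language modeling)",
    "[NLU]": "R-Denoising (regular span corruption)",
    "[NLG]": "X-Denoising (extreme denoising)",
}

def add_mode_token(prompt: str, mode: str) -> str:
    """Prepend the UL2 mode-switch token for the chosen denoising paradigm."""
    if mode not in MODE_TOKENS:
        raise ValueError(f"unknown mode token: {mode}")
    return f"{mode} {prompt}"

print(add_mode_token("Mr. Dursley was the director of a firm called Grunnings <extra_id_0>", "[S2S]"))
```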

![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)

Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*

## Training

### Flan UL2

The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan prompting. This means that the original training corpus is `C4`.

In “Scaling Instruction-Finetuned Language Models” (Chung et al.), sometimes also referred to as the Flan2 paper, the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions, which enables generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction tuned with Flan.

**Important**: For more details, please see sections 5.2.1 and 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).

# Contribution

This model was originally contributed by [Yi Tay](https://www.yitay.net/?author=636616684c5e64780328eece), and added to the Hugging Face ecosystem by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

# Citation

If you want to cite this work, please consider citing the [blogpost](https://www.yitay.net/blog/flan-ul2-20b) announcing the release of `Flan-UL2`.