Committed by ybelkada and ArthurZ (HF staff)
Commit b5c5730 (1 parent: 45377e4)

Update README.md (#2)


- Update README.md (789a1c780444af9a54cec6f3e3ac0e1e4cfb982d)


Co-authored-by: Arthur Zucker <[email protected]>

Files changed (1)
  1. README.md +42 -13
README.md CHANGED
@@ -39,6 +39,10 @@ widget:
It's not certain how many lessons you'll learn by your thirties. Does the
premise entail the hypothesis?
example_title: Premise and hypothesis
tags:
- text2text-generation
datasets:
@@ -56,17 +60,21 @@ datasets:
license: apache-2.0
---

- # TL;DR FLan-UL2 improvements over previous version
- The original UL2 model was only trained with receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
- This Flan-UL2 checkpoint uses a receptive field of 2048 which makes it more usable for few-shot in-context learning.
-
- The original UL2 model also had mode switch tokens that was rather mandatory to get good performance.
- However, they were a little cumbersome as this requires often some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.

- # Converting from T5x to huggingface
- You can use the [`convert_`]() and pass the argument `strict = False`. The final layer norm is missing from the original dictionnary, we used an identity layer.

- # Performance improvment

The reported results are the following :
| | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |
@@ -76,8 +84,26 @@ The reported results are the following :
| FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
| FLAN-UL2 20B | 55.7(+1.1%) | 45.9(+1.3%) | 52.2(+7.4%) | 42.7(+3.1%) | 49.1(+3.2%) |


- # Introduction

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

@@ -95,9 +121,12 @@ Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal

# Training

- ## Flan UL2, a 20B Flan trained UL2 model
The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan Prompting. This means that the original training corpus is `C4`.

## UL2 PreTraining

@@ -113,7 +142,7 @@ UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](http

The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:

- ## Mixture of Denoisers

To quote the paper:
> We conjecture that a strong universal model has to be exposed to solving diverse set of problems
@@ -164,7 +193,7 @@ In total, the model was trained for 2.65 million steps.

## Contribution

- This model was contributed by [Younes Belkada](https://huggingface.co/Seledorn) & [Arthur Zucker]().

## Examples

It's not certain how many lessons you'll learn by your thirties. Does the
premise entail the hypothesis?
example_title: Premise and hypothesis
+ - text: >-
+ Answer the following question by reasoning step by step.
+ The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?
+ example_title: Chain of thought
tags:
- text2text-generation
datasets:
 
license: apache-2.0
---

+ # TL;DR Flan-UL2
+ Flan-UL2 is an encoder-decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released earlier last year. It was fine-tuned using the "Flan" prompt tuning
+ and dataset collection.

+ According to the original [blog](), here are the notable improvements:
+ - The original UL2 model was only trained with a receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
+ - The Flan-UL2 checkpoint uses a receptive field of 2048, which makes it more usable for few-shot in-context learning.
+ - The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome, as they often required changes during inference or finetuning. In this update, we continued training UL2 20B for an additional 100k steps (with a small batch size) to forget the “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore (see the sketch below).
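As a minimal illustration of the last two points, the sketch below tokenizes a plain few-shot prompt with the Flan-UL2 tokenizer: no mode token (such as the `[S2S]` prefix expected by the original UL2 checkpoint) is prepended, and the 2048-token receptive field leaves room for several in-context examples. The review prompt itself is made up for illustration.

```python
# Minimal sketch: a plain few-shot prompt, no UL2 mode token needed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

# a couple of made-up in-context examples followed by the actual query
shots = [
    "Review: the movie was great. Sentiment: positive",
    "Review: the plot made no sense. Sentiment: negative",
]
prompt = "\n".join(shots) + "\nReview: I would watch it again. Sentiment:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(f"{input_ids.shape[-1]} tokens out of a 2048-token receptive field")
```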

+ ## Converting from T5X to Hugging Face
+ You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict = False`. The final layer norm is missing from the original checkpoint dictionary, which is why we pass the `strict=False` argument.
+ ```bash
+ python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path ~/code/ul2/flan-ul220b-v3/ --config_file config.json --pytorch_dump_path ~/code/ul2/flan-ul2
+ ```
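To make the `strict = False` remark concrete, here is a small self-contained illustration of the underlying PyTorch behaviour, using a tiny made-up T5 configuration rather than the real 20B checkpoint: with `strict=False`, a key absent from the state dict is simply reported as missing and the module keeps its freshly initialized weight, which is what happens for the final layer norm above.

```python
# Illustration of strict=False state-dict loading (tiny toy config, not Flan-UL2).
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(vocab_size=128, d_model=32, d_ff=64, num_layers=2, num_heads=2)
model = T5ForConditionalGeneration(config)

state_dict = model.state_dict()
state_dict.pop("decoder.final_layer_norm.weight")  # simulate the missing key

result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys)  # ['decoder.final_layer_norm.weight']
```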
+ ## Performance improvements

The reported results are the following :
| | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |

| FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
| FLAN-UL2 20B | 55.7(+1.1%) | 45.9(+1.3%) | 52.2(+7.4%) | 42.7(+3.1%) | 49.1(+3.2%) |

+ # Using the model
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ import torch
+
+ # load the 20B checkpoint in 8-bit and dispatch it across the available devices
+ model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
+
+ input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"
+
+ inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
+ outputs = model.generate(inputs, max_length=200)
+
+ print(tokenizer.decode(outputs[0]))
+ # <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
+ ```
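The 8-bit load above relies on the `bitsandbytes` and `accelerate` packages. If 8-bit quantization is not an option, a plain `bfloat16` load is an alternative, sketched below under the assumption that roughly 40 GB of memory is available for the 20B parameters:

```python
# bf16 alternative to the 8-bit load above (roughly 40 GB of weights).
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto"
)
```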

+ # Introduction to UL2

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

# Training

+ ## Flan UL2
The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan Prompting. This means that the original training corpus is `C4`.

+ In “Scaling Instruction-Finetuned Language Models” (Chung et al.), also sometimes referred to as the Flan2 paper, the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions, which enables generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction tuned with Flan.
+
+ The Flan datasets have also been open-sourced in “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning” (Longpre et al.). See the Google AI blog post “The Flan Collection: Advancing Open Source Methods for Instruction Tuning”.

## UL2 PreTraining


The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:

+ ### Mixture of Denoisers

To quote the paper:
> We conjecture that a strong universal model has to be exposed to solving diverse set of problems


## Contribution

+ This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

## Examples