Update README.md (#2)
- Update README.md (789a1c780444af9a54cec6f3e3ac0e1e4cfb982d)
Co-authored-by: Arthur Zucker <[email protected]>
README.md CHANGED
    It's not certain how many lessons you'll learn by your thirties. Does the
    premise entail the hypothesis?
  example_title: Premise and hypothesis
- text: >-
    Answer the following question by reasoning step by step.
    The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, how many apples do they have?
  example_title: Chain of thought
tags:
- text2text-generation
datasets:
license: apache-2.0
---

# TL;DR Flan-UL2

Flan-UL2 is an encoder-decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released earlier last year. It was fine-tuned using the "Flan" prompt tuning and dataset collection.

According to the original [blog]() here are the notable improvements:

- The original UL2 model was only trained with a receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
- The Flan-UL2 checkpoint uses a receptive field of 2048, which makes it more usable for few-shot in-context learning (see the short check after this list).
- The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome, as this often requires some changes during inference or fine-tuning. In this update, we continue training UL2 20B for an additional 100k steps (with a small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.
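As a rough, unofficial illustration of the second point, the sketch below tokenizes a small hand-written few-shot prompt and checks it against a 2048-token window. The prompt text is made up for this example and is not part of the model card.

```python
# Toy check (not from the original card): count the tokens of a hand-written
# few-shot prompt against Flan-UL2's 2048-token receptive field.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

examples = [
    "Review: I loved this movie.\nSentiment: positive",
    "Review: The plot made no sense at all.\nSentiment: negative",
    "Review: A delightful surprise from start to finish.\nSentiment:",
]
few_shot_prompt = "\n\n".join(examples)

n_tokens = len(tokenizer(few_shot_prompt).input_ids)
# With the original UL2 you would have had to keep this under 512 tokens.
print(f"{n_tokens} tokens; fits in the 2048-token window: {n_tokens <= 2048}")
```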

## Converting from T5x to huggingface

You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict=False`. The final layer norm is missing from the original dictionary, which is why we pass `strict=False`.

```bash
python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path ~/code/ul2/flan-ul220b-v3/ --config_file config.json --pytorch_dump_path ~/code/ul2/flan-ul2
```
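For context on what `strict=False` means here, this is a minimal sketch (my own, not part of the conversion script) that loads a converted state dict into a freshly initialized T5 model and prints the keys that were left at their initial values. The checkpoint filename is a placeholder, and instantiating the full 20B model this way is memory-hungry.

```python
# Minimal sketch, assuming a locally converted checkpoint exists at the
# placeholder path below. strict=False lets keys that are absent from the
# original T5X dictionary (e.g. the final layer norm) keep their fresh values.
import torch
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_pretrained("google/flan-ul2")
model = T5ForConditionalGeneration(config)  # randomly initialized 20B model (heavy!)

state_dict = torch.load("flan-ul2/pytorch_model.bin", map_location="cpu")  # placeholder path
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)        # expect the final layer norm to show up here
print("unexpected keys:", unexpected)
```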

## Performance improvement

The reported results are the following:

|                 | MMLU | BBH  | MMLU-CoT | BBH-CoT | Avg  |
|:----------------|:-----|:-----|:---------|:--------|:-----|
| FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6     | 41.4    | 47.6 |
| FLAN-UL2 20B    | 55.7 (+1.1%) | 45.9 (+1.3%) | 52.2 (+7.4%) | 42.7 (+3.1%) | 49.1 (+3.2%) |

# Using the model

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# load the 20B checkpoint in 8-bit across the available GPUs
# (requires `accelerate` and `bitsandbytes`)
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, how many apples do they have?"

inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)

print(tokenizer.decode(outputs[0]))
# <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
```

# Introduction to UL2

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
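To make the "denoising" idea concrete, here is a toy sketch of T5-style span corruption, the kind of objective MoD mixes with others. It is my own simplification for illustration, not code from the paper; the sentinel names merely mimic T5's `<extra_id_n>` tokens.

```python
# Toy span-corruption sketch: mask contiguous spans in the encoder input with
# sentinel tokens and train the decoder to reconstruct the masked spans.
def corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, length) pairs."""
    inputs, targets, cursor = [], [], 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + length]
        cursor = start + length
    inputs += tokens[cursor:]
    return inputs, targets

words = "the cafeteria had 23 apples and they used 20 of them for lunch".split()
enc_input, dec_target = corrupt(words, spans=[(2, 2), (8, 3)])
print(" ".join(enc_input))   # the cafeteria <extra_id_0> apples and they used <extra_id_1> for lunch
print(" ".join(dec_target))  # <extra_id_0> had 23 <extra_id_1> 20 of them
```

Roughly speaking, the denoisers in MoD vary the span lengths and corruption rates of this recipe, plus a sequential, prefix-LM style variant.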

# Training

## Flan UL2

The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan prompting. This means that the original training corpus is `C4`.

In “Scaling Instruction-Finetuned Language Models” (Chung et al.), sometimes also referred to as the Flan2 paper, the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions, which enables generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction-tuned with Flan.

The Flan datasets have also been open-sourced in “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning” (Longpre et al.). See the Google AI blog post “The Flan Collection: Advancing Open Source Methods for Instruction Tuning”.
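As a toy illustration of the "datasets phrased as instructions" idea (my own example, not taken from the Flan collection), the snippet below turns an NLI-style record into an instruction prompt and target, mirroring the premise/hypothesis widget example at the top of this card.

```python
# Hypothetical template: instruction tuning rewrites a supervised example as a
# natural-language instruction plus the expected answer.
nli_example = {
    "premise": "It's not certain how many lessons you'll learn by your thirties.",
    "hypothesis": "You will learn dozens of lessons by your thirties.",
    "label": "no",
}

template = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)

prompt = template.format(**nli_example)
target = nli_example["label"]
print(prompt)
print("->", target)
```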

## UL2 PreTraining

The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:

### Mixture of Denoisers

To quote the paper:
> We conjecture that a strong universal model has to be exposed to solving diverse set of problems

## Contribution

This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

## Examples