vansin committed on
Commit
a6d4e69
0 Parent(s):
.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
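The attribute rules above route matching files through Git LFS, which is why the weight files in this commit show up as three-line pointer files rather than binary content. A rough way to see which of the commit's files are affected is sketched below. It assumes that plain `fnmatch` globbing on the base name is an acceptable approximation of gitattributes matching (it is not exact, and the `saved_model/**/*` rule is left out for that reason); `is_lfs_tracked` is a hypothetical helper, not part of Git or Git LFS.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes diff above.
LFS_PATTERNS = ["*.bin", "*.h5", "*.model", "*.msgpack", "*.onnx", "*.pt", "*tfevents*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


# File names taken from this commit.
commit_files = [
    "config.json",
    "flax_model.msgpack",
    "pytorch_model.bin",
    "spiece.model",
    "tf_model.h5",
    "tokenizer.json",
]
tracked = [f for f in commit_files if is_lfs_tracked(f)]
print(tracked)
```

The four names this selects are the same four files that appear as LFS pointers further down in the commit.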
README.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ language:
+ - en
+ datasets:
+ - c4
+ tags:
+ - deep-narrow
+ inference: false
+
+ license: apache-2.0
+ ---
+
+ # T5-Efficient-TINY (Deep-Narrow version)
+
+ T5-Efficient-TINY is a variation of [Google's original T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) following the [T5 model architecture](https://huggingface.co/docs/transformers/model_doc/t5).
+ It is a *pretrained-only* checkpoint and was released with the
+ paper **[Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)**
+ by *Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler*.
+
+ In a nutshell, the paper indicates that a **Deep-Narrow** model architecture is favorable for **downstream** performance compared to other model architectures
+ of similar parameter count.
+
+ To quote the paper:
+
+ > We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased
+ > before considering any other forms of uniform scaling across other dimensions. This is largely due to
+ > how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a
+ > tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise,
+ > a tall base model might also generally more efficient compared to a large model. We generally find
+ > that, regardless of size, even if absolute performance might increase as we continue to stack layers,
+ > the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36
+ > layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e.,
+ > params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params,
+ > FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to
+ > consider.
+
+ To be more precise, *model depth* is defined as the number of transformer blocks that are stacked sequentially.
+ A sequence of word embeddings is therefore processed sequentially by each transformer block.
+
+ ## Model architecture details
+
+ This model checkpoint - **t5-efficient-tiny** - is of model type **Tiny** with no variations.
+ It has **15.58** million parameters and thus requires *ca.* **62.32 MB** of memory in full precision (*fp32*)
+ or **31.16 MB** of memory in half precision (*fp16* or *bf16*).
+
+ A summary of the *original* T5 model architectures can be seen here:
+
+ | Model | nl (el/dl) | ff | dm | kv | nh | #Params|
+ | ----| ---- | ---- | ---- | ---- | ---- | ----|
+ | Tiny | 4/4 | 1024 | 256 | 32 | 4 | 16M|
+ | Mini | 4/4 | 1536 | 384 | 32 | 8 | 31M|
+ | Small | 6/6 | 2048 | 512 | 32 | 8 | 60M|
+ | Base | 12/12 | 3072 | 768 | 64 | 12 | 220M|
+ | Large | 24/24 | 4096 | 1024 | 64 | 16 | 738M|
+ | Xl | 24/24 | 16384 | 1024 | 128 | 32 | 3B|
+ | XXl | 24/24 | 65536 | 1024 | 128 | 128 | 11B|
+
+ where the following abbreviations are used:
+
+ | Abbreviation | Definition |
+ | ----| ---- |
+ | nl | Number of transformer blocks (depth) |
+ | dm | Dimension of embedding vector (output vector of transformer block) |
+ | kv | Dimension of key/value projection matrix |
+ | nh | Number of attention heads |
+ | ff | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
+ | el | Number of transformer blocks in the encoder (encoder depth) |
+ | dl | Number of transformer blocks in the decoder (decoder depth) |
+ | sh | Signifies that attention heads are shared |
+ | skv | Signifies that key-value projection matrices are tied |
+
+ If a model checkpoint specifies no *el* or *dl*, then the number of both encoder and decoder layers corresponds to *nl*.
+
+ ## Pre-Training
+
+ The checkpoint was pretrained on the [Colossal, Cleaned version of Common Crawl (C4)](https://huggingface.co/datasets/c4) for 524288 steps using
+ the span-based masked language modeling (MLM) objective.
+
+ ## Fine-Tuning
+
+ **Note**: This model is a **pretrained** checkpoint and has to be fine-tuned for practical usage.
+ The checkpoint was pretrained in English and is therefore only useful for English NLP tasks.
+ You can follow one of the following examples on how to fine-tune the model:
+
+ *PyTorch*:
+
+ - [Summarization](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization)
+ - [Question Answering](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_seq2seq_qa.py)
+ - [Text Classification](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification) - *Note*: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.
+
+ *TensorFlow*:
+
+ - [Summarization](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/summarization)
+ - [Text Classification](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification) - *Note*: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.
+
+ *JAX/Flax*:
+
+ - [Summarization](https://github.com/huggingface/transformers/tree/master/examples/flax/summarization)
+ - [Text Classification](https://github.com/huggingface/transformers/tree/master/examples/flax/text-classification) - *Note*: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.
+
+ ## Downstream Performance
+
+ TODO: Add table if available
+
+ ## Computational Complexity
+
+ TODO: Add table if available
+
+ ## More information
+
+ We strongly recommend that the reader go carefully through the original paper **[Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)** to get a more nuanced understanding of this model checkpoint.
+ As explained in the following [issue](https://github.com/google-research/google-research/issues/986#issuecomment-1035051145), checkpoints including the *sh* or *skv*
+ model architecture variations have *not* been ported to Transformers as they are probably of limited practical usage and are lacking a more detailed description. Those checkpoints are kept [here](https://huggingface.co/NewT5SharedHeadsSharedKeyValues) as they might be ported potentially in the future.
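The memory figures quoted in the model card follow from the parameter count and the number of bytes per parameter. A small sketch of that arithmetic is given below. It assumes 1 MB = 10^6 bytes and counts weights only (activations, optimizer state, and framework overhead are not included); `weight_memory_mb` is a hypothetical helper name.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_mb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


NUM_PARAMS = 15.58e6  # parameter count stated in the README

for dtype in ("fp32", "fp16", "bf16"):
    print(f"{dtype}: {weight_memory_mb(NUM_PARAMS, dtype):.2f} MB")
```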
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "_name_or_path": "t5-efficient-tiny",
+   "architectures": [
+     "T5ForConditionalGeneration"
+   ],
+   "d_ff": 1024,
+   "d_kv": 64,
+   "d_model": 256,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "relu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "t5",
+   "n_positions": 512,
+   "num_decoder_layers": 4,
+   "num_heads": 4,
+   "num_layers": 4,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "torch_dtype": "float32",
+   "transformers_version": "4.17.0.dev0",
+   "use_cache": true,
+   "vocab_size": 32128
+ }
+ }
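The dimensions in this config are enough to estimate the parameter count by hand. The sketch below does so under stated assumptions about the T5 layout: linear layers carry no bias, each normalization layer holds a single scale vector, each stack has one relative-attention bias table, and the input embedding is shared with the output projection. These assumptions are consistent with the gin settings in this commit (`use_bias = False`, `shared_embedding_and_softmax_weights = True`), but the formula is an illustration rather than the library's own counting code, and `t5_param_count` is a hypothetical name.

```python
def t5_param_count(d_model, d_kv, d_ff, num_heads, num_layers,
                   num_decoder_layers, vocab_size, num_buckets):
    """Estimate the number of weights in a T5-style encoder-decoder."""
    inner = num_heads * d_kv             # width of the q/k/v projections
    attention = 4 * d_model * inner      # q, k, v and output matrices, no biases
    feed_forward = 2 * d_model * d_ff    # input and output matrices, no biases
    norm = d_model                       # scale vector of one normalization layer

    encoder_block = attention + feed_forward + 2 * norm
    decoder_block = 2 * attention + feed_forward + 3 * norm  # self- and cross-attention
    rel_bias = num_buckets * num_heads   # assumed: one bias table per stack

    encoder = num_layers * encoder_block + rel_bias + norm          # plus final norm
    decoder = num_decoder_layers * decoder_block + rel_bias + norm  # plus final norm
    embedding = vocab_size * d_model     # assumed shared with the output layer
    return embedding + encoder + decoder


# Values copied from config.json above.
count = t5_param_count(d_model=256, d_kv=64, d_ff=1024, num_heads=4,
                       num_layers=4, num_decoder_layers=4,
                       vocab_size=32128, num_buckets=32)
print(count, count * 4)
```

Under these assumptions the estimate is 15,570,688 weights, or 62,282,752 bytes at 4 bytes each. That is close to the 15.58 million parameters quoted in the README and a few kilobytes below the 62,286,648-byte size recorded for `flax_model.msgpack`.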
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:16e959d18596c0cdd9da07fd4fb913f5f9aa7835decfd6a4b9f9fd96e960da26
+ size 62286648
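The three lines above are a Git LFS pointer: the repository stores only the spec version, the content hash, and the byte size, while the actual weights live in LFS storage. A minimal sketch of reading such a pointer is shown below; `parse_lfs_pointer` is a hypothetical helper that handles only the three keys seen in this commit, and it does not validate the file against the LFS specification.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from the flax_model.msgpack diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:16e959d18596c0cdd9da07fd4fb913f5f9aa7835decfd6a4b9f9fd96e960da26
size 62286648
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```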
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "decoder_start_token_id": 0,
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.27.0.dev0"
+ }
operative_config.gin ADDED
@@ -0,0 +1,370 @@
+ import mesh_tensorflow.optimize
+ import mesh_tensorflow.transformer.dataset
+ import mesh_tensorflow.transformer.learning_rate_schedules
+ import mesh_tensorflow.transformer.t2t_vocabulary
+ import mesh_tensorflow.transformer.transformer
+ import mesh_tensorflow.transformer.transformer_layers
+ import mesh_tensorflow.transformer.utils
+ import t5.models.mesh_transformer
+
+ # Macros:
+ # ==============================================================================
+ d_ff = 1024
+ d_kv = 64
+ d_model = 256
+ dropout_rate = 0.0
+ inputs_length = 512
+ mean_noise_span_length = 3.0
+ MIXTURE_NAME = 'c4_v220_unsupervised'
+ noise_density = 0.15
+ num_heads = 4
+ num_layers = 4
+
+ # Parameters for adafactor_decay_rate_pow:
+ # ==============================================================================
+ adafactor_decay_rate_pow.offset = 0
+
+ # Parameters for AdafactorOptimizer:
+ # ==============================================================================
+ AdafactorOptimizer.beta1 = 0.0
+ AdafactorOptimizer.clipping_threshold = 1.0
+ AdafactorOptimizer.decay_rate = None
+ AdafactorOptimizer.epsilon1 = 1e-30
+ AdafactorOptimizer.epsilon2 = 0.001
+ AdafactorOptimizer.factored = True
+ AdafactorOptimizer.min_dim_size_to_factor = 128
+ AdafactorOptimizer.multiply_by_parameter_scale = True
+
+ # Parameters for Bitransformer:
+ # ==============================================================================
+ Bitransformer.shared_embedding = True
+
+ # Parameters for denoise:
+ # ==============================================================================
+ denoise.inputs_fn = @preprocessors.noise_span_to_unique_sentinel
+ denoise.noise_density = %noise_density
+ denoise.noise_mask_fn = @preprocessors.random_spans_noise_mask
+ denoise.targets_fn = @preprocessors.nonnoise_span_to_unique_sentinel
+
+ # Parameters for decoder/DenseReluDense:
+ # ==============================================================================
+ decoder/DenseReluDense.activation = 'relu'
+ decoder/DenseReluDense.dropout_rate = %dropout_rate
+ decoder/DenseReluDense.hidden_size = %d_ff
+ decoder/DenseReluDense.use_bias = False
+
+ # Parameters for encoder/DenseReluDense:
+ # ==============================================================================
+ encoder/DenseReluDense.activation = 'relu'
+ encoder/DenseReluDense.dropout_rate = %dropout_rate
+ encoder/DenseReluDense.hidden_size = %d_ff
+ encoder/DenseReluDense.use_bias = False
+
+ # Parameters for enc_dec_attention:
+ # ==============================================================================
+ # None.
+
+ # Parameters for enc_dec_attention_bias:
+ # ==============================================================================
+ # None.
+
+ # Parameters for decoder/EncDecAttention:
+ # ==============================================================================
+ decoder/EncDecAttention.relative_attention_type = None
+
+ # Parameters for get_variable_dtype:
+ # ==============================================================================
+ get_variable_dtype.activation_dtype = 'bfloat16'
+
+ # Parameters for get_vocab_embedding_cls:
+ # ==============================================================================
+ # None.
+
+ # Parameters for get_vocabulary:
+ # ==============================================================================
+ get_vocabulary.mixture_or_task_name = %MIXTURE_NAME
+
+ # Parameters for decoder/LayerStack:
+ # ==============================================================================
+ decoder/LayerStack.dropout_rate = None
+ decoder/LayerStack.norm_epsilon = None
+ decoder/LayerStack.recompute_grads = False
+ decoder/LayerStack.sublayers_final = \
+     [@transformer.sublayer_rms_norm, @transformer.sublayer_dropout]
+ decoder/LayerStack.sublayers_initial = [@transformer.sublayer_dropout]
+ decoder/LayerStack.sublayers_per_layer = \
+     [@transformer.sublayer_rms_norm,
+      @transformer.sublayer_call_layer,
+      @transformer.sublayer_dropout,
+      @transformer.sublayer_residual]
+
+ # Parameters for encoder/LayerStack:
+ # ==============================================================================
+ encoder/LayerStack.dropout_rate = None
+ encoder/LayerStack.norm_epsilon = None
+ encoder/LayerStack.recompute_grads = False
+ encoder/LayerStack.sublayers_final = \
+     [@transformer.sublayer_rms_norm, @transformer.sublayer_dropout]
+ encoder/LayerStack.sublayers_initial = [@transformer.sublayer_dropout]
+ encoder/LayerStack.sublayers_per_layer = \
+     [@transformer.sublayer_rms_norm,
+      @transformer.sublayer_call_layer,
+      @transformer.sublayer_dropout,
+      @transformer.sublayer_residual]
+
+ # Parameters for learning_rate_schedule_noam:
+ # ==============================================================================
+ learning_rate_schedule_noam.linear_decay_fraction = 0.0
+ learning_rate_schedule_noam.multiplier = 1.0
+ learning_rate_schedule_noam.offset = 0
+ learning_rate_schedule_noam.warmup_steps = 10000
+
+ # Parameters for make_bitransformer:
+ # ==============================================================================
+ make_bitransformer.decoder_name = 'decoder'
+ make_bitransformer.encoder_name = 'encoder'
+
+ # Parameters for decoder/make_layer_stack:
+ # ==============================================================================
+ decoder/make_layer_stack.block_scope = True
+ decoder/make_layer_stack.layers = \
+     [@mesh_tensorflow.transformer.transformer_layers.SelfAttention,
+      @mesh_tensorflow.transformer.transformer_layers.EncDecAttention,
+      @mesh_tensorflow.transformer.transformer_layers.DenseReluDense]
+ decoder/make_layer_stack.num_layers = %num_layers
+
+ # Parameters for encoder/make_layer_stack:
+ # ==============================================================================
+ encoder/make_layer_stack.block_scope = True
+ encoder/make_layer_stack.layers = \
+     [@mesh_tensorflow.transformer.transformer_layers.SelfAttention,
+      @mesh_tensorflow.transformer.transformer_layers.DenseReluDense]
+ encoder/make_layer_stack.num_layers = %num_layers
+
+ # Parameters for mesh_train_dataset_fn:
+ # ==============================================================================
+ mesh_train_dataset_fn.mixture_or_task_name = %MIXTURE_NAME
+ mesh_train_dataset_fn.pack = True
+ mesh_train_dataset_fn.seed = None
+ mesh_train_dataset_fn.use_cached = 1
+
+ # Parameters for noise_span_to_unique_sentinel:
+ # ==============================================================================
+ # None.
+
+ # Parameters for nonnoise_span_to_unique_sentinel:
+ # ==============================================================================
+ # None.
+
+ # Parameters for pack_dataset:
+ # ==============================================================================
+ pack_dataset.use_custom_ops = True
+
+ # Parameters for pack_or_pad:
+ # ==============================================================================
+ # None.
+
+ # Parameters for random_spans_helper:
+ # ==============================================================================
+ random_spans_helper.extra_tokens_per_span_inputs = 1
+ random_spans_helper.extra_tokens_per_span_targets = 1
+ random_spans_helper.inputs_length = %inputs_length
+ random_spans_helper.mean_noise_span_length = %mean_noise_span_length
+ random_spans_helper.noise_density = %noise_density
+ random_spans_helper.verbose = False
+
+ # Parameters for random_spans_noise_mask:
+ # ==============================================================================
+ random_spans_noise_mask.mean_noise_span_length = %mean_noise_span_length
+
+ # Parameters for random_spans_tokens_length:
+ # ==============================================================================
+ # None.
+
+ # Parameters for reduce_concat_tokens:
+ # ==============================================================================
+ reduce_concat_tokens.batch_size = 128
+ reduce_concat_tokens.feature_key = 'targets'
+
+ # Parameters for rewrite_stack_variables:
+ # ==============================================================================
+ rewrite_stack_variables.max_combined_variable_size = 536870912
+
+ # Parameters for run:
+ # ==============================================================================
+ run.autostack = True
+ run.batch_size = ('tokens_per_batch', 65536)
+ run.dataset_split = 'train'
+ run.ensemble_inputs = None
+ run.eval_checkpoint_step = None
+ run.eval_dataset_fn = None
+ run.eval_summary_dir = None
+ run.export_checkpoint_step = None
+ run.export_path = ''
+ run.init_checkpoint = None
+ run.iterations_per_loop = 100
+ run.keep_checkpoint_max = None
+ run.layout_rules = \
+     'ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch'
+ run.learning_rate_schedule = @learning_rate_schedules.learning_rate_schedule_noam
+ run.mesh_devices = None
+ run.mesh_shape = @mesh_tensorflow.transformer.utils.tpu_mesh_shape()
+ run.mode = 'train'
+ run.model_type = 'bitransformer'
+ run.optimizer = @optimize.AdafactorOptimizer
+ run.output_eval_examples = True
+ run.perplexity_eval_steps = 100
+ run.predict_fn = None
+ run.save_checkpoints_steps = 5000
+ run.seen_data_init_step = 0
+ run.sequence_length = {'inputs': 512, 'targets': 128}
+ run.skip_seen_data = False
+ run.total_run_steps = None
+ run.train_dataset_fn = @t5.models.mesh_transformer.mesh_train_dataset_fn
+ run.train_steps = 524288
+ run.variable_filter = None
+
+ # Parameters for select_random_chunk:
+ # ==============================================================================
+ select_random_chunk.additional_feature_keys = None
+ select_random_chunk.additional_passthrough_keys = None
+ select_random_chunk.feature_key = 'targets'
+ select_random_chunk.max_length = 65536
+ select_random_chunk.uniform_random_start = False
+
+ # Parameters for decoder/SelfAttention:
+ # ==============================================================================
+ decoder/SelfAttention.attention_func = None
+ decoder/SelfAttention.attention_kwargs = None
+ decoder/SelfAttention.combine_dims = True
+ decoder/SelfAttention.dropout_rate = %dropout_rate
+ decoder/SelfAttention.fold_scaling_into_initializer = True
+ decoder/SelfAttention.keep_query_heads_dims = False
+ decoder/SelfAttention.key_value_size = %d_kv
+ decoder/SelfAttention.num_heads = %num_heads
+ decoder/SelfAttention.num_memory_heads = 0
+ decoder/SelfAttention.relative_attention_num_buckets = 32
+ decoder/SelfAttention.relative_attention_type = 'bias_shared'
+ decoder/SelfAttention.shared_kv = False
+
+ # Parameters for encoder/SelfAttention:
+ # ==============================================================================
+ encoder/SelfAttention.attention_func = None
+ encoder/SelfAttention.attention_kwargs = None
+ encoder/SelfAttention.combine_dims = True
+ encoder/SelfAttention.dropout_rate = %dropout_rate
+ encoder/SelfAttention.fold_scaling_into_initializer = True
+ encoder/SelfAttention.keep_query_heads_dims = False
+ encoder/SelfAttention.key_value_size = %d_kv
+ encoder/SelfAttention.num_heads = %num_heads
+ encoder/SelfAttention.num_memory_heads = 0
+ encoder/SelfAttention.relative_attention_num_buckets = 32
+ encoder/SelfAttention.relative_attention_type = 'bias_shared'
+ encoder/SelfAttention.shared_kv = False
+
+ # Parameters for serialize_num_microbatches:
+ # ==============================================================================
+ serialize_num_microbatches.tokens_per_microbatch_per_replica = 8192
+
+ # Parameters for SimdMeshImpl:
+ # ==============================================================================
+ SimdMeshImpl.allreduce_in_bfloat16_max_group_size = 8
+
+ # Parameters for split_tokens:
+ # ==============================================================================
+ split_tokens.additional_feature_keys = None
+ split_tokens.feature_key = 'targets'
+ split_tokens.max_tokens_per_segment = @preprocessors.random_spans_tokens_length()
+ split_tokens.min_tokens_per_segment = None
+ split_tokens.passthrough_feature_keys = None
+
+ # Parameters for sublayer_call_layer:
+ # ==============================================================================
+ # None.
+
+ # Parameters for sublayer_dropout:
+ # ==============================================================================
+ sublayer_dropout.dropout_rate = %dropout_rate
+
+ # Parameters for sublayer_mask_padding:
+ # ==============================================================================
+ # None.
+
+ # Parameters for sublayer_residual:
+ # ==============================================================================
+ # None.
+
+ # Parameters for sublayer_rms_norm:
+ # ==============================================================================
+ sublayer_rms_norm.epsilon = 1e-06
+ sublayer_rms_norm.name = 'rms_norm'
+
+ # Parameters for tpu_estimator_model_fn:
+ # ==============================================================================
+ tpu_estimator_model_fn.hierarchical_tiling_spec = None
+ tpu_estimator_model_fn.init_variable_filter = ''
+ tpu_estimator_model_fn.model_info_file = ''
+ tpu_estimator_model_fn.outer_batch_size = 1
+ tpu_estimator_model_fn.tpu_summaries = False
+
+ # Parameters for tpu_mesh_shape:
+ # ==============================================================================
+ tpu_mesh_shape.ensemble_parallelism = None
+ tpu_mesh_shape.model_parallelism = 1
+ tpu_mesh_shape.tpu_topology = '4x4'
+
+ # Parameters for unit_scaling_convention:
+ # ==============================================================================
+ unit_scaling_convention.value = False
+
+ # Parameters for decoder/Unitransformer:
+ # ==============================================================================
+ decoder/Unitransformer.d_model = %d_model
+ decoder/Unitransformer.ensemble = None
+ decoder/Unitransformer.input_full_attention = False
+ decoder/Unitransformer.label_smoothing = 0.0
+ decoder/Unitransformer.loss_denominator = None
+ decoder/Unitransformer.loss_fn = None
+ decoder/Unitransformer.loss_on_targets_only = False
+ decoder/Unitransformer.max_length = 512
+ decoder/Unitransformer.positional_embedding = False
+ decoder/Unitransformer.shared_embedding_and_softmax_weights = True
+ decoder/Unitransformer.sinusoid_positional_embedding = False
+ decoder/Unitransformer.token_dropout_rate = 0.0
+ decoder/Unitransformer.vocab_divisor = 128
+ decoder/Unitransformer.z_loss = 0.0001
+
+ # Parameters for encoder/Unitransformer:
+ # ==============================================================================
+ encoder/Unitransformer.d_model = %d_model
+ encoder/Unitransformer.ensemble = None
+ encoder/Unitransformer.input_full_attention = False
+ encoder/Unitransformer.label_smoothing = 0.0
+ encoder/Unitransformer.loss_denominator = None
+ encoder/Unitransformer.loss_fn = None
+ encoder/Unitransformer.loss_on_targets_only = False
+ encoder/Unitransformer.max_length = 512
+ encoder/Unitransformer.positional_embedding = False
+ encoder/Unitransformer.shared_embedding_and_softmax_weights = True
+ encoder/Unitransformer.sinusoid_positional_embedding = False
+ encoder/Unitransformer.token_dropout_rate = 0.0
+ encoder/Unitransformer.vocab_divisor = 128
+ encoder/Unitransformer.z_loss = 0.0001
+
+ # Parameters for unsupervised:
+ # ==============================================================================
+ unsupervised.preprocessors = \
+     [@preprocessors.select_random_chunk,
+      @preprocessors.reduce_concat_tokens,
+      @preprocessors.split_tokens,
+      @preprocessors.denoise]
+
+ # Parameters for VarianceScalingInitializer:
+ # ==============================================================================
+ VarianceScalingInitializer.distribution = 'normal'
+ VarianceScalingInitializer.mode = 'fan_in'
+ VarianceScalingInitializer.scale = 1.0
+
+ # Parameters for VocabEmbedding:
+ # ==============================================================================
+ VocabEmbedding.scale_variable_like_classifier_weights = False
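The macros `inputs_length = 512`, `noise_density = 0.15`, and `mean_noise_span_length = 3.0` together decide how many raw tokens each training segment holds and how long the encoder and decoder sequences become after span corruption. The sketch below reproduces that calculation in plain Python. It is modelled on a reading of what the `random_spans_helper` preprocessor is meant to do, so the exact rounding rules should be treated as an assumption; `span_corruption_lengths` is a hypothetical name and is not taken from the `t5` library.

```python
def span_corruption_lengths(inputs_length=512, noise_density=0.15,
                            mean_noise_span_length=3.0,
                            extra_tokens_per_span_inputs=1,
                            extra_tokens_per_span_targets=1):
    """Return (raw tokens per segment, encoder length, decoder length)."""

    def lengths_for(tokens_length):
        num_noise = int(round(tokens_length * noise_density))
        num_nonnoise = tokens_length - num_noise
        num_spans = int(round(num_noise / mean_noise_span_length))
        # Each noise span costs one sentinel on either side; +1 is the EOS token.
        encoder_len = num_nonnoise + num_spans * extra_tokens_per_span_inputs + 1
        decoder_len = num_noise + num_spans * extra_tokens_per_span_targets + 1
        return encoder_len, decoder_len

    # Grow the raw segment until the corrupted input would exceed inputs_length.
    tokens_length = inputs_length - 1
    while lengths_for(tokens_length + 1)[0] <= inputs_length:
        tokens_length += 1
    encoder_len, decoder_len = lengths_for(tokens_length)
    return tokens_length, encoder_len, decoder_len


# Defaults mirror the gin macros in operative_config.gin.
print(span_corruption_lengths())
```

With the defaults this sketch yields 568 raw tokens per segment, an encoder length of 512, and a decoder length of 114, which fits inside `run.sequence_length = {'inputs': 512, 'targets': 128}`.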
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b840cd5afdcc806b8175fed5a8800a5aa8be1beb60aab8ab7f650728b122dac2
+ size 62321434
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "additional_special_tokens": ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>", "<extra_id_3>", "<extra_id_4>", "<extra_id_5>", "<extra_id_6>", "<extra_id_7>", "<extra_id_8>", "<extra_id_9>", "<extra_id_10>", "<extra_id_11>", "<extra_id_12>", "<extra_id_13>", "<extra_id_14>", "<extra_id_15>", "<extra_id_16>", "<extra_id_17>", "<extra_id_18>", "<extra_id_19>", "<extra_id_20>", "<extra_id_21>", "<extra_id_22>", "<extra_id_23>", "<extra_id_24>", "<extra_id_25>", "<extra_id_26>", "<extra_id_27>", "<extra_id_28>", "<extra_id_29>", "<extra_id_30>", "<extra_id_31>", "<extra_id_32>", "<extra_id_33>", "<extra_id_34>", "<extra_id_35>", "<extra_id_36>", "<extra_id_37>", "<extra_id_38>", "<extra_id_39>", "<extra_id_40>", "<extra_id_41>", "<extra_id_42>", "<extra_id_43>", "<extra_id_44>", "<extra_id_45>", "<extra_id_46>", "<extra_id_47>", "<extra_id_48>", "<extra_id_49>", "<extra_id_50>", "<extra_id_51>", "<extra_id_52>", "<extra_id_53>", "<extra_id_54>", "<extra_id_55>", "<extra_id_56>", "<extra_id_57>", "<extra_id_58>", "<extra_id_59>", "<extra_id_60>", "<extra_id_61>", "<extra_id_62>", "<extra_id_63>", "<extra_id_64>", "<extra_id_65>", "<extra_id_66>", "<extra_id_67>", "<extra_id_68>", "<extra_id_69>", "<extra_id_70>", "<extra_id_71>", "<extra_id_72>", "<extra_id_73>", "<extra_id_74>", "<extra_id_75>", "<extra_id_76>", "<extra_id_77>", "<extra_id_78>", "<extra_id_79>", "<extra_id_80>", "<extra_id_81>", "<extra_id_82>", "<extra_id_83>", "<extra_id_84>", "<extra_id_85>", "<extra_id_86>", "<extra_id_87>", "<extra_id_88>", "<extra_id_89>", "<extra_id_90>", "<extra_id_91>", "<extra_id_92>", "<extra_id_93>", "<extra_id_94>", "<extra_id_95>", "<extra_id_96>", "<extra_id_97>", "<extra_id_98>", "<extra_id_99>"]}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
+ size 791656
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:27947b8cf3fc9d1d879752dfafacda9b8bdf6700ba7059a4b946307b534919ea
+ size 62473720
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 100, "additional_special_tokens": ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>", "<extra_id_3>", "<extra_id_4>", "<extra_id_5>", "<extra_id_6>", "<extra_id_7>", "<extra_id_8>", "<extra_id_9>", "<extra_id_10>", "<extra_id_11>", "<extra_id_12>", "<extra_id_13>", "<extra_id_14>", "<extra_id_15>", "<extra_id_16>", "<extra_id_17>", "<extra_id_18>", "<extra_id_19>", "<extra_id_20>", "<extra_id_21>", "<extra_id_22>", "<extra_id_23>", "<extra_id_24>", "<extra_id_25>", "<extra_id_26>", "<extra_id_27>", "<extra_id_28>", "<extra_id_29>", "<extra_id_30>", "<extra_id_31>", "<extra_id_32>", "<extra_id_33>", "<extra_id_34>", "<extra_id_35>", "<extra_id_36>", "<extra_id_37>", "<extra_id_38>", "<extra_id_39>", "<extra_id_40>", "<extra_id_41>", "<extra_id_42>", "<extra_id_43>", "<extra_id_44>", "<extra_id_45>", "<extra_id_46>", "<extra_id_47>", "<extra_id_48>", "<extra_id_49>", "<extra_id_50>", "<extra_id_51>", "<extra_id_52>", "<extra_id_53>", "<extra_id_54>", "<extra_id_55>", "<extra_id_56>", "<extra_id_57>", "<extra_id_58>", "<extra_id_59>", "<extra_id_60>", "<extra_id_61>", "<extra_id_62>", "<extra_id_63>", "<extra_id_64>", "<extra_id_65>", "<extra_id_66>", "<extra_id_67>", "<extra_id_68>", "<extra_id_69>", "<extra_id_70>", "<extra_id_71>", "<extra_id_72>", "<extra_id_73>", "<extra_id_74>", "<extra_id_75>", "<extra_id_76>", "<extra_id_77>", "<extra_id_78>", "<extra_id_79>", "<extra_id_80>", "<extra_id_81>", "<extra_id_82>", "<extra_id_83>", "<extra_id_84>", "<extra_id_85>", "<extra_id_86>", "<extra_id_87>", "<extra_id_88>", "<extra_id_89>", "<extra_id_90>", "<extra_id_91>", "<extra_id_92>", "<extra_id_93>", "<extra_id_94>", "<extra_id_95>", "<extra_id_96>", "<extra_id_97>", "<extra_id_98>", "<extra_id_99>"], "sp_model_kwargs": {}, "name_or_path": "t5-efficient-tiny", "special_tokens_map_file": "t5-efficient-tiny/special_tokens_map.json", "tokenizer_class": "T5Tokenizer"}
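The 100 `<extra_id_N>` entries registered in the tokenizer files are the sentinel tokens used by the span-corruption objective named in the README and configured under `denoise` in `operative_config.gin`. The toy sketch below shows the idea on whitespace-split words rather than SentencePiece ids: each masked span is collapsed into one sentinel, and the targets are built by applying the same operation to the inverted mask. `replace_spans_with_sentinels` is a hypothetical helper, and the mask is hand-written rather than sampled.

```python
def replace_spans_with_sentinels(tokens, mask):
    """Collapse every run of masked tokens into a single numbered sentinel."""
    out, next_id, previous_masked = [], 0, False
    for token, masked in zip(tokens, mask):
        if masked:
            if not previous_masked:  # first token of a new masked span
                out.append(f"<extra_id_{next_id}>")
                next_id += 1
        else:
            out.append(token)
        previous_masked = masked
    return out


tokens = "Thank you for inviting me to your party last week".split()
# Hand-picked noise mask: "for inviting" and "last" are the noise spans.
noise = [False, False, True, True, False, False, False, False, True, False]

inputs = replace_spans_with_sentinels(tokens, noise)
targets = replace_spans_with_sentinels(tokens, [not m for m in noise])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder input keeps the non-noise words with one sentinel per dropped span, and the target lists the dropped words, each introduced by the matching sentinel.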