The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.

/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will then be set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884

Generating train split: 2264 examples [00:00, 12772.38 examples/s]
Generating validation split: 30 examples [00:00, 9002.58 examples/s]
Running tokenizer on train dataset: 0/2264 ...

You are adding a callback to the callbacks of this Trainer, but there is already one. The current list of callbacks is: DefaultFlowCallback, WandbCallback
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
All 270 steps, warm_up steps: 200

wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Currently logged in as: abdiharyadi. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.18.1 is available! To upgrade, please run:
wandb:   $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /kaggle/working/amr-tst-indo/AMRBART-id/fine-tune/wandb/run-20240927_080721-nmb2wrh4
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run /kaggle/working/amr-tst-indo/AMRBART-id/fine-tune/../outputs/mbart-en-id-smaller-fted-fted
wandb: ⭐️ View project at https://wandb.ai/abdiharyadi/amr-tst
wandb: 🚀 View run at https://wandb.ai/abdiharyadi/amr-tst/runs/nmb2wrh4

Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 200, 'early_stopping': True, 'num_beams': 5, 'forced_eos_token_id': 2}
/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
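The deprecation and wandb warnings above each point at a specific setting. A minimal sketch of how those settings could be set explicitly, assuming the fine-tuning script builds its arguments in Python; the checkpoint name, run name, and strategy value below are illustrative placeholders, not values taken from this run:

```python
from transformers import AutoTokenizer, Seq2SeqTrainingArguments

# `eval_strategy` replaces the deprecated `evaluation_strategy` (removed in v4.46);
# `run_name` avoids the wandb warning about it defaulting to `output_dir`;
# `optim="adamw_torch"` selects torch.optim.AdamW instead of the deprecated
# transformers AdamW implementation.
training_args = Seq2SeqTrainingArguments(
    output_dir="../outputs/mbart-en-id-smaller-fted-fted",
    run_name="mbart-en-id-amr-finetune",   # hypothetical run name
    eval_strategy="steps",                 # illustrative value
    optim="adamw_torch",
    report_to=["wandb"],
)

# Passing `clean_up_tokenization_spaces` explicitly silences the FutureWarning
# about its default changing from True to False in v4.45.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-50",             # placeholder checkpoint, not from this run
    clean_up_tokenization_spaces=True,
)
```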
[training progress, steps 91–161 of 270, ~1.4–1.5 it/s; per-step tqdm frames omitted]
{'loss': 2.9042, 'learning_rate': 5e-07, 'epoch': 1.1}    (step 100)
{'loss': 1.8419, 'learning_rate': 6e-07, 'epoch': 1.32}   (step 120)
{'loss': 1.6323, 'learning_rate': 7e-07, 'epoch': 1.55}   (step 140)
{'loss': 1.4964, 'learning_rate': 8e-07, 'epoch': 1.77}   (step 160)
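The logged learning rates (5e-07 at step 100 rising to 1e-06 at step 200, then falling), together with the earlier line "All 270 steps, warm_up steps: 200", are consistent with a linear warm-up followed by roughly linear decay. A sketch of such a schedule using the standard transformers helper; the optimizer, parameter group, and peak learning rate are assumptions for illustration, not read from the run's configuration:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# A throwaway parameter group; in the real run this would be the model's parameters.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-6)  # peak LR inferred from the log at step 200

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,    # "warm_up steps: 200"
    num_training_steps=270,  # "All 270 steps"
)

for step in range(1, 271):
    optimizer.step()
    scheduler.step()
    if step % 20 == 0:
        print(step, scheduler.get_last_lr()[0])
# Prints ~5e-07 at step 100 and ~1e-06 at step 200, matching the logged warm-up values.
```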
[training progress, steps 162–181 of 270; per-step tqdm frames omitted]
{'loss': 1.5144, 'learning_rate': 9e-07, 'epoch': 1.99}   (step 180)

Generation Kwargs: {'max_length': 1024, 'max_gen_length': 1024, 'num_beams': 5}
[mid-training evaluation: 6 generation batches; tqdm frames omitted]
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 200, 'early_stopping': True, 'num_beams': 5, 'forced_eos_token_id': 2}
/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()

[training progress resumes, steps 182–217 of 270; per-step tqdm frames omitted]
{'loss': 1.4392, 'learning_rate': 1e-06, 'epoch': 2.21}   (step 200)
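The "Generation Kwargs" line above shows the decoding settings used for the mid-training evaluation: beam search with 5 beams and a 1024-token cap (`max_gen_length` is a script-internal key, not a `generate()` argument). A rough stand-alone equivalent, assuming the checkpoint saved in this run's output directory and a purely illustrative input sentence:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path: the directory this run writes its checkpoint to.
ckpt = "../outputs/mbart-en-id-smaller-fted-fted"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

inputs = tokenizer("contoh kalimat masukan", return_tensors="pt")  # illustrative input
generated_ids = model.generate(
    **inputs,
    max_length=1024,  # cap from the logged Generation Kwargs
    num_beams=5,      # beam width from the logged Generation Kwargs
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```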
[training progress, steps 218–270 of 270; per-step tqdm frames omitted]
{'loss': 1.4428, 'learning_rate': 7.428571428571427e-07, 'epoch': 2.43}   (step 220)
{'loss': 1.377, 'learning_rate': 4.857142857142857e-07, 'epoch': 2.65}    (step 240)
{'loss': 1.3575, 'learning_rate': 2.285714285714286e-07, 'epoch': 2.87}   (step 260)

[WARNING|configuration_utils.py:448] 2024-09-27 08:13:35,534 >> Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 200, 'early_stopping': True, 'num_beams': 5, 'forced_eos_token_id': 2}
Generation Kwargs: {'max_length': 1024, 'max_gen_length': 1024, 'num_beams': 5}
/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
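The configuration_utils warning repeated above asks for the decoding defaults to be moved out of config.json into a generation_config.json. One way to do that with exactly the parameters the warning lists; whether the AMRBART-id script already does this is not visible in this log:

```python
from transformers import GenerationConfig

# Exactly the parameters the warning lists as "non-default generation parameters".
generation_config = GenerationConfig(
    max_length=200,
    early_stopping=True,
    num_beams=5,
    forced_eos_token_id=2,
)

# Writes generation_config.json next to the saved weights, so these settings no
# longer need to live in the model's config.json (the move the warning asks for).
generation_config.save_pretrained("../outputs/mbart-en-id-smaller-fted-fted")
```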
[end-of-training evaluation: 6 generation batches; tqdm frames omitted]
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 200, 'early_stopping': True, 'num_beams': 5, 'forced_eos_token_id': 2}
[WARNING|trainer.py:2764] 2024-09-27 08:14:14,430 >> There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].

{'train_runtime': 413.6641, 'train_samples_per_second': 16.419, 'train_steps_per_second': 0.653, 'train_loss': 2.365596493968257, 'epoch': 2.98}
100%|██████████| 270/270 [06:36<00:00, 1.47s/it]

[WARNING|configuration_utils.py:448] 2024-09-27 08:15:07,015 >> Some non-default generation parameters are set in the model config. (same warning as above, while saving the final model)
[WARNING|configuration_utils.py:448] 2024-09-27 08:15:11,181 >> Some non-default generation parameters are set in the model config. (same warning as above)

model.safetensors:   0%|          | 0.00/1.58G [00:00
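The summary line can be sanity-checked against figures logged earlier (2264 training examples, 270 optimizer steps, just under 3 epochs):

```python
train_runtime = 413.6641    # seconds, from the summary line
train_examples = 2264       # "Generating train split: 2264 examples"
total_steps = 270           # "All 270 steps"
epochs = 3                  # the run stops at epoch 2.98

print(epochs * train_examples / train_runtime)  # ≈ 16.4, matching train_samples_per_second
print(total_steps / train_runtime)              # ≈ 0.65, matching train_steps_per_second
```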