mikr committed on
Commit
fb1802a
1 Parent(s): 9101f9e

Training in progress, step 5000

pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:51a199ee5e10cdb49b6781596fab3d076f0c14f34e7f0a0212b16834cbb296c6
+ oid sha256:9e1062e0a39ac80eef975c1aeed734d737cb5183f243dac42da4ea28a58928cc
  size 483536061
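
The entries in this diff are Git LFS pointer files rather than the weights themselves: only the `oid` changes between the two commits, while `size` stays at 483536061 bytes. A minimal sketch, assuming a locally downloaded copy of `pytorch_model.bin` (the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of this repository), of how one could check a file against the new pointer:

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Compare a local file's size and SHA-256 digest with a parsed pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so a large checkpoint is never held in memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents copied from the "+" side of the diff above.
NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9e1062e0a39ac80eef975c1aeed734d737cb5183f243dac42da4ea28a58928cc
size 483536061
"""

pointer = parse_lfs_pointer(NEW_POINTER)
```

Under these assumptions, `matches_pointer(Path("pytorch_model.bin"), pointer)` would return `True` only when the local file has both the recorded size and the recorded SHA-256 digest.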
run.log CHANGED
@@ -1424,3 +1424,254 @@ Rank: 0 partition count [1] and sizes[(241734912, False)]
1424
  [2022-12-19 05:19:38,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
1425
  [2022-12-19 05:19:38,828] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
1426
  [2022-12-19 05:19:38,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
1427
+ [2022-12-19 05:22:56,912] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
1428
+ [2022-12-19 05:23:13,569] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
1429
+ [2022-12-19 05:23:46,881] [INFO] [logging.py:68:log_dist] [Rank 0] step=4010, skipped=8, lr=[2.2200000000000003e-06], mom=[[0.9, 0.999]]
1430
+ [2022-12-19 05:23:46,883] [INFO] [timer.py:196:stop] epoch=0/micro_step=4010/global_step=4010, RunningAvgSamplesPerSec=17.611133683583187, CurrSamplesPerSec=17.462168466721625, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1431
+ [2022-12-19 05:26:38,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=4020, skipped=8, lr=[2.197777777777778e-06], mom=[[0.9, 0.999]]
1432
+ [2022-12-19 05:26:38,750] [INFO] [timer.py:196:stop] epoch=0/micro_step=4020/global_step=4020, RunningAvgSamplesPerSec=17.611368779669082, CurrSamplesPerSec=17.378475405939998, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1433
+ {'loss': 0.0002, 'learning_rate': 2.1866666666666668e-06, 'epoch': 34.02}
1434
+ [2022-12-19 05:29:37,420] [INFO] [logging.py:68:log_dist] [Rank 0] step=4030, skipped=8, lr=[2.1755555555555556e-06], mom=[[0.9, 0.999]]
1435
+ [2022-12-19 05:29:37,422] [INFO] [timer.py:196:stop] epoch=0/micro_step=4030/global_step=4030, RunningAvgSamplesPerSec=17.61113977607008, CurrSamplesPerSec=17.737646360030492, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1436
+ [2022-12-19 05:32:38,897] [INFO] [logging.py:68:log_dist] [Rank 0] step=4040, skipped=8, lr=[2.153333333333333e-06], mom=[[0.9, 0.999]]
1437
+ [2022-12-19 05:32:38,898] [INFO] [timer.py:196:stop] epoch=0/micro_step=4040/global_step=4040, RunningAvgSamplesPerSec=17.610970462806403, CurrSamplesPerSec=17.397608183728696, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1438
+ [2022-12-19 05:35:39,166] [INFO] [logging.py:68:log_dist] [Rank 0] step=4050, skipped=8, lr=[2.1311111111111112e-06], mom=[[0.9, 0.999]]
1439
+ [2022-12-19 05:35:39,168] [INFO] [timer.py:196:stop] epoch=0/micro_step=4050/global_step=4050, RunningAvgSamplesPerSec=17.611253470031247, CurrSamplesPerSec=17.833383657490664, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1440
+ {'loss': 0.0002, 'learning_rate': 2.1311111111111112e-06, 'epoch': 34.02}
1441
+ [2022-12-19 05:36:43,776] [INFO] [logging.py:68:log_dist] [Rank 0] step=4060, skipped=8, lr=[2.108888888888889e-06], mom=[[0.9, 0.999]]
1442
+ [2022-12-19 05:36:43,777] [INFO] [timer.py:196:stop] epoch=0/micro_step=4060/global_step=4060, RunningAvgSamplesPerSec=17.612337722470095, CurrSamplesPerSec=23.447783642928687, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1443
+ [2022-12-19 05:41:45,291] [INFO] [logging.py:68:log_dist] [Rank 0] step=4070, skipped=8, lr=[2.086666666666667e-06], mom=[[0.9, 0.999]]
1444
+ [2022-12-19 05:41:45,292] [INFO] [timer.py:196:stop] epoch=0/micro_step=4070/global_step=4070, RunningAvgSamplesPerSec=17.612470816100952, CurrSamplesPerSec=17.877452530444263, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1445
+ {'loss': 0.0002, 'learning_rate': 2.0755555555555557e-06, 'epoch': 35.0}
1446
+ [2022-12-19 05:44:43,364] [INFO] [logging.py:68:log_dist] [Rank 0] step=4080, skipped=8, lr=[2.064444444444445e-06], mom=[[0.9, 0.999]]
1447
+ [2022-12-19 05:44:43,365] [INFO] [timer.py:196:stop] epoch=0/micro_step=4080/global_step=4080, RunningAvgSamplesPerSec=17.612796189774304, CurrSamplesPerSec=17.736692349537687, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1448
+ [2022-12-19 05:47:41,092] [INFO] [logging.py:68:log_dist] [Rank 0] step=4090, skipped=8, lr=[2.0422222222222225e-06], mom=[[0.9, 0.999]]
1449
+ [2022-12-19 05:47:41,093] [INFO] [timer.py:196:stop] epoch=0/micro_step=4090/global_step=4090, RunningAvgSamplesPerSec=17.612917765334718, CurrSamplesPerSec=17.74363646795449, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1450
+ [2022-12-19 05:50:36,586] [INFO] [logging.py:68:log_dist] [Rank 0] step=4100, skipped=8, lr=[2.02e-06], mom=[[0.9, 0.999]]
1451
+ [2022-12-19 05:50:36,587] [INFO] [timer.py:196:stop] epoch=0/micro_step=4100/global_step=4100, RunningAvgSamplesPerSec=17.61301264236775, CurrSamplesPerSec=17.761625214894558, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1452
+ {'loss': 0.0002, 'learning_rate': 2.02e-06, 'epoch': 35.01}
1453
+ [2022-12-19 05:53:34,916] [INFO] [logging.py:68:log_dist] [Rank 0] step=4110, skipped=8, lr=[1.9977777777777778e-06], mom=[[0.9, 0.999]]
1454
+ [2022-12-19 05:53:34,917] [INFO] [timer.py:196:stop] epoch=0/micro_step=4110/global_step=4110, RunningAvgSamplesPerSec=17.61307623650117, CurrSamplesPerSec=17.558381822907993, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1455
+ [2022-12-19 05:56:32,454] [INFO] [logging.py:68:log_dist] [Rank 0] step=4120, skipped=8, lr=[1.975555555555556e-06], mom=[[0.9, 0.999]]
1456
+ [2022-12-19 05:56:32,456] [INFO] [timer.py:196:stop] epoch=0/micro_step=4120/global_step=4120, RunningAvgSamplesPerSec=17.61328460598005, CurrSamplesPerSec=17.849423032435844, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1457
+ {'loss': 0.0002, 'learning_rate': 1.9644444444444446e-06, 'epoch': 35.01}
1458
+ [2022-12-19 05:59:28,958] [INFO] [logging.py:68:log_dist] [Rank 0] step=4130, skipped=8, lr=[1.9533333333333334e-06], mom=[[0.9, 0.999]]
1459
+ [2022-12-19 05:59:28,959] [INFO] [timer.py:196:stop] epoch=0/micro_step=4130/global_step=4130, RunningAvgSamplesPerSec=17.61368146124737, CurrSamplesPerSec=17.80531019001806, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1460
+ [2022-12-19 06:02:37,403] [INFO] [logging.py:68:log_dist] [Rank 0] step=4140, skipped=8, lr=[1.9311111111111114e-06], mom=[[0.9, 0.999]]
1461
+ [2022-12-19 06:02:37,404] [INFO] [timer.py:196:stop] epoch=0/micro_step=4140/global_step=4140, RunningAvgSamplesPerSec=17.613849828726913, CurrSamplesPerSec=17.741122217014265, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1462
+ [2022-12-19 06:05:39,451] [INFO] [logging.py:68:log_dist] [Rank 0] step=4150, skipped=8, lr=[1.908888888888889e-06], mom=[[0.9, 0.999]]
1463
+ [2022-12-19 06:05:39,452] [INFO] [timer.py:196:stop] epoch=0/micro_step=4150/global_step=4150, RunningAvgSamplesPerSec=17.61420113756101, CurrSamplesPerSec=17.687389147202236, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1464
+ {'loss': 0.0002, 'learning_rate': 1.908888888888889e-06, 'epoch': 35.02}
1465
+ [2022-12-19 06:08:34,312] [INFO] [logging.py:68:log_dist] [Rank 0] step=4160, skipped=8, lr=[1.8866666666666669e-06], mom=[[0.9, 0.999]]
1466
+ [2022-12-19 06:08:34,313] [INFO] [timer.py:196:stop] epoch=0/micro_step=4160/global_step=4160, RunningAvgSamplesPerSec=17.614407249990197, CurrSamplesPerSec=17.847925309782504, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1467
+ [2022-12-19 06:11:07,038] [INFO] [logging.py:68:log_dist] [Rank 0] step=4170, skipped=8, lr=[1.8644444444444445e-06], mom=[[0.9, 0.999]]
1468
+ [2022-12-19 06:11:07,039] [INFO] [timer.py:196:stop] epoch=0/micro_step=4170/global_step=4170, RunningAvgSamplesPerSec=17.614796295687643, CurrSamplesPerSec=17.6957562558827, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1469
+ {'loss': 0.0002, 'learning_rate': 1.8533333333333333e-06, 'epoch': 35.02}
1470
+ [2022-12-19 06:14:26,808] [INFO] [logging.py:68:log_dist] [Rank 0] step=4180, skipped=8, lr=[1.8422222222222225e-06], mom=[[0.9, 0.999]]
1471
+ [2022-12-19 06:14:26,809] [INFO] [timer.py:196:stop] epoch=0/micro_step=4180/global_step=4180, RunningAvgSamplesPerSec=17.615978855666164, CurrSamplesPerSec=17.517680699168547, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1472
+ [2022-12-19 06:17:15,114] [INFO] [logging.py:68:log_dist] [Rank 0] step=4190, skipped=8, lr=[1.8200000000000002e-06], mom=[[0.9, 0.999]]
1473
+ [2022-12-19 06:17:15,116] [INFO] [timer.py:196:stop] epoch=0/micro_step=4190/global_step=4190, RunningAvgSamplesPerSec=17.616242542847015, CurrSamplesPerSec=17.71552661785773, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1474
+ [2022-12-19 06:20:05,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=4200, skipped=8, lr=[1.797777777777778e-06], mom=[[0.9, 0.999]]
1475
+ [2022-12-19 06:20:05,756] [INFO] [timer.py:196:stop] epoch=0/micro_step=4200/global_step=4200, RunningAvgSamplesPerSec=17.61609844770078, CurrSamplesPerSec=17.366084738146398, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1476
+ {'loss': 0.0002, 'learning_rate': 1.797777777777778e-06, 'epoch': 36.0}
1477
+ [2022-12-19 06:22:59,727] [INFO] [logging.py:68:log_dist] [Rank 0] step=4210, skipped=8, lr=[1.7755555555555556e-06], mom=[[0.9, 0.999]]
1478
+ [2022-12-19 06:22:59,728] [INFO] [timer.py:196:stop] epoch=0/micro_step=4210/global_step=4210, RunningAvgSamplesPerSec=17.616157473369498, CurrSamplesPerSec=17.633430354752342, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1479
+ [2022-12-19 06:25:55,318] [INFO] [logging.py:68:log_dist] [Rank 0] step=4220, skipped=8, lr=[1.7533333333333336e-06], mom=[[0.9, 0.999]]
1480
+ [2022-12-19 06:25:55,320] [INFO] [timer.py:196:stop] epoch=0/micro_step=4220/global_step=4220, RunningAvgSamplesPerSec=17.616198217116857, CurrSamplesPerSec=17.707843993918647, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1481
+ {'loss': 0.0002, 'learning_rate': 1.7422222222222224e-06, 'epoch': 36.01}
1482
+ [2022-12-19 06:28:57,297] [INFO] [logging.py:68:log_dist] [Rank 0] step=4230, skipped=8, lr=[1.7311111111111112e-06], mom=[[0.9, 0.999]]
1483
+ [2022-12-19 06:28:57,298] [INFO] [timer.py:196:stop] epoch=0/micro_step=4230/global_step=4230, RunningAvgSamplesPerSec=17.61652951646244, CurrSamplesPerSec=17.71409803827759, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1484
+ [2022-12-19 06:31:59,467] [INFO] [logging.py:68:log_dist] [Rank 0] step=4240, skipped=8, lr=[1.708888888888889e-06], mom=[[0.9, 0.999]]
1485
+ [2022-12-19 06:31:59,468] [INFO] [timer.py:196:stop] epoch=0/micro_step=4240/global_step=4240, RunningAvgSamplesPerSec=17.617018924776456, CurrSamplesPerSec=17.79433342574028, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1486
+ [2022-12-19 06:34:52,886] [INFO] [logging.py:68:log_dist] [Rank 0] step=4250, skipped=8, lr=[1.6866666666666667e-06], mom=[[0.9, 0.999]]
1487
+ [2022-12-19 06:34:52,888] [INFO] [timer.py:196:stop] epoch=0/micro_step=4250/global_step=4250, RunningAvgSamplesPerSec=17.617133665736407, CurrSamplesPerSec=17.89488941065793, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1488
+ {'loss': 0.0002, 'learning_rate': 1.6866666666666667e-06, 'epoch': 36.01}
1489
+ [2022-12-19 06:37:46,872] [INFO] [logging.py:68:log_dist] [Rank 0] step=4260, skipped=8, lr=[1.6644444444444447e-06], mom=[[0.9, 0.999]]
1490
+ [2022-12-19 06:37:46,874] [INFO] [timer.py:196:stop] epoch=0/micro_step=4260/global_step=4260, RunningAvgSamplesPerSec=17.61709970360881, CurrSamplesPerSec=17.72886490216432, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1491
+ [2022-12-19 06:40:46,545] [INFO] [logging.py:68:log_dist] [Rank 0] step=4270, skipped=8, lr=[1.6422222222222223e-06], mom=[[0.9, 0.999]]
1492
+ [2022-12-19 06:40:46,546] [INFO] [timer.py:196:stop] epoch=0/micro_step=4270/global_step=4270, RunningAvgSamplesPerSec=17.61723647083081, CurrSamplesPerSec=17.50934176766855, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1493
+ {'loss': 0.0002, 'learning_rate': 1.6311111111111114e-06, 'epoch': 36.02}
1494
+ [2022-12-19 06:43:41,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=4280, skipped=8, lr=[1.6200000000000002e-06], mom=[[0.9, 0.999]]
1495
+ [2022-12-19 06:43:41,741] [INFO] [timer.py:196:stop] epoch=0/micro_step=4280/global_step=4280, RunningAvgSamplesPerSec=17.617458572115385, CurrSamplesPerSec=17.8233958672181, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1496
+ [2022-12-19 06:45:13,558] [INFO] [logging.py:68:log_dist] [Rank 0] step=4290, skipped=8, lr=[1.5977777777777778e-06], mom=[[0.9, 0.999]]
1497
+ [2022-12-19 06:45:13,560] [INFO] [timer.py:196:stop] epoch=0/micro_step=4290/global_step=4290, RunningAvgSamplesPerSec=17.61748707733689, CurrSamplesPerSec=17.694014788106895, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1498
+ [2022-12-19 06:49:29,081] [INFO] [logging.py:68:log_dist] [Rank 0] step=4300, skipped=8, lr=[1.5755555555555558e-06], mom=[[0.9, 0.999]]
1499
+ [2022-12-19 06:49:29,083] [INFO] [timer.py:196:stop] epoch=0/micro_step=4300/global_step=4300, RunningAvgSamplesPerSec=17.618675777869846, CurrSamplesPerSec=17.604989397739672, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1500
+ {'loss': 0.0002, 'learning_rate': 1.5755555555555558e-06, 'epoch': 37.0}
1501
+ [2022-12-19 06:52:25,502] [INFO] [logging.py:68:log_dist] [Rank 0] step=4310, skipped=8, lr=[1.5533333333333334e-06], mom=[[0.9, 0.999]]
1502
+ [2022-12-19 06:52:25,503] [INFO] [timer.py:196:stop] epoch=0/micro_step=4310/global_step=4310, RunningAvgSamplesPerSec=17.618917658378976, CurrSamplesPerSec=17.888580989149183, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1503
+ [2022-12-19 06:55:21,753] [INFO] [logging.py:68:log_dist] [Rank 0] step=4320, skipped=8, lr=[1.5311111111111113e-06], mom=[[0.9, 0.999]]
1504
+ [2022-12-19 06:55:21,754] [INFO] [timer.py:196:stop] epoch=0/micro_step=4320/global_step=4320, RunningAvgSamplesPerSec=17.619314539730464, CurrSamplesPerSec=17.772010979056823, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1505
+ {'loss': 0.0002, 'learning_rate': 1.52e-06, 'epoch': 37.01}
1506
+ [2022-12-19 06:58:18,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=4330, skipped=8, lr=[1.5088888888888889e-06], mom=[[0.9, 0.999]]
1507
+ [2022-12-19 06:58:18,144] [INFO] [timer.py:196:stop] epoch=0/micro_step=4330/global_step=4330, RunningAvgSamplesPerSec=17.619612314250787, CurrSamplesPerSec=17.79300532950759, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1508
+ [2022-12-19 07:01:11,886] [INFO] [logging.py:68:log_dist] [Rank 0] step=4340, skipped=8, lr=[1.486666666666667e-06], mom=[[0.9, 0.999]]
1509
+ [2022-12-19 07:01:11,888] [INFO] [timer.py:196:stop] epoch=0/micro_step=4340/global_step=4340, RunningAvgSamplesPerSec=17.619731554882748, CurrSamplesPerSec=17.831246620133083, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1510
+ [2022-12-19 07:04:10,104] [INFO] [logging.py:68:log_dist] [Rank 0] step=4350, skipped=8, lr=[1.4644444444444445e-06], mom=[[0.9, 0.999]]
1511
+ [2022-12-19 07:04:10,105] [INFO] [timer.py:196:stop] epoch=0/micro_step=4350/global_step=4350, RunningAvgSamplesPerSec=17.619778835395326, CurrSamplesPerSec=17.918863768337292, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1512
+ {'loss': 0.0002, 'learning_rate': 1.4644444444444445e-06, 'epoch': 37.01}
1513
+ [2022-12-19 07:07:18,279] [INFO] [logging.py:68:log_dist] [Rank 0] step=4360, skipped=8, lr=[1.4422222222222223e-06], mom=[[0.9, 0.999]]
1514
+ [2022-12-19 07:07:18,280] [INFO] [timer.py:196:stop] epoch=0/micro_step=4360/global_step=4360, RunningAvgSamplesPerSec=17.61996490248731, CurrSamplesPerSec=17.26081994397919, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1515
+ [2022-12-19 07:10:24,446] [INFO] [logging.py:68:log_dist] [Rank 0] step=4370, skipped=8, lr=[1.42e-06], mom=[[0.9, 0.999]]
1516
+ [2022-12-19 07:10:24,448] [INFO] [timer.py:196:stop] epoch=0/micro_step=4370/global_step=4370, RunningAvgSamplesPerSec=17.62001193317469, CurrSamplesPerSec=17.401952624968608, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1517
+ {'loss': 0.0002, 'learning_rate': 1.4088888888888892e-06, 'epoch': 37.02}
1518
+ [2022-12-19 07:13:17,536] [INFO] [logging.py:68:log_dist] [Rank 0] step=4380, skipped=8, lr=[1.397777777777778e-06], mom=[[0.9, 0.999]]
1519
+ [2022-12-19 07:13:17,537] [INFO] [timer.py:196:stop] epoch=0/micro_step=4380/global_step=4380, RunningAvgSamplesPerSec=17.619935746572583, CurrSamplesPerSec=17.670992723198474, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1520
+ [2022-12-19 07:16:09,586] [INFO] [logging.py:68:log_dist] [Rank 0] step=4390, skipped=8, lr=[1.3755555555555556e-06], mom=[[0.9, 0.999]]
1521
+ [2022-12-19 07:16:09,588] [INFO] [timer.py:196:stop] epoch=0/micro_step=4390/global_step=4390, RunningAvgSamplesPerSec=17.62012413538634, CurrSamplesPerSec=17.81988415153239, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1522
+ [2022-12-19 07:19:00,421] [INFO] [logging.py:68:log_dist] [Rank 0] step=4400, skipped=8, lr=[1.3533333333333334e-06], mom=[[0.9, 0.999]]
1523
+ [2022-12-19 07:19:00,423] [INFO] [timer.py:196:stop] epoch=0/micro_step=4400/global_step=4400, RunningAvgSamplesPerSec=17.620059714852356, CurrSamplesPerSec=17.51520378570747, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1524
+ {'loss': 0.0002, 'learning_rate': 1.3533333333333334e-06, 'epoch': 37.02}
1525
+ [2022-12-19 07:21:47,475] [INFO] [logging.py:68:log_dist] [Rank 0] step=4410, skipped=8, lr=[1.3311111111111113e-06], mom=[[0.9, 0.999]]
1526
+ [2022-12-19 07:21:47,477] [INFO] [timer.py:196:stop] epoch=0/micro_step=4410/global_step=4410, RunningAvgSamplesPerSec=17.621140582831963, CurrSamplesPerSec=17.79351248317827, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1527
+ [2022-12-19 07:24:37,405] [INFO] [logging.py:68:log_dist] [Rank 0] step=4420, skipped=8, lr=[1.308888888888889e-06], mom=[[0.9, 0.999]]
1528
+ [2022-12-19 07:24:37,407] [INFO] [timer.py:196:stop] epoch=0/micro_step=4420/global_step=4420, RunningAvgSamplesPerSec=17.621243226652567, CurrSamplesPerSec=17.81804129124176, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1529
+ {'loss': 0.0002, 'learning_rate': 1.2977777777777779e-06, 'epoch': 38.0}
1530
+ [2022-12-19 07:27:30,942] [INFO] [logging.py:68:log_dist] [Rank 0] step=4430, skipped=8, lr=[1.286666666666667e-06], mom=[[0.9, 0.999]]
1531
+ [2022-12-19 07:27:30,943] [INFO] [timer.py:196:stop] epoch=0/micro_step=4430/global_step=4430, RunningAvgSamplesPerSec=17.62120393290399, CurrSamplesPerSec=17.173806418163036, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1532
+ [2022-12-19 07:30:23,275] [INFO] [logging.py:68:log_dist] [Rank 0] step=4440, skipped=8, lr=[1.2644444444444445e-06], mom=[[0.9, 0.999]]
1533
+ [2022-12-19 07:30:23,277] [INFO] [timer.py:196:stop] epoch=0/micro_step=4440/global_step=4440, RunningAvgSamplesPerSec=17.621216006629336, CurrSamplesPerSec=17.66344978593792, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1534
+ [2022-12-19 07:33:14,028] [INFO] [logging.py:68:log_dist] [Rank 0] step=4450, skipped=8, lr=[1.2422222222222224e-06], mom=[[0.9, 0.999]]
1535
+ [2022-12-19 07:33:14,029] [INFO] [timer.py:196:stop] epoch=0/micro_step=4450/global_step=4450, RunningAvgSamplesPerSec=17.621087033427646, CurrSamplesPerSec=17.209054639288738, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1536
+ {'loss': 0.0002, 'learning_rate': 1.2422222222222224e-06, 'epoch': 38.01}
1537
+ [2022-12-19 07:36:06,709] [INFO] [logging.py:68:log_dist] [Rank 0] step=4460, skipped=8, lr=[1.2200000000000002e-06], mom=[[0.9, 0.999]]
1538
+ [2022-12-19 07:36:06,710] [INFO] [timer.py:196:stop] epoch=0/micro_step=4460/global_step=4460, RunningAvgSamplesPerSec=17.620917138316685, CurrSamplesPerSec=17.78626293383821, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1539
+ [2022-12-19 07:39:01,972] [INFO] [logging.py:68:log_dist] [Rank 0] step=4470, skipped=8, lr=[1.1977777777777778e-06], mom=[[0.9, 0.999]]
1540
+ [2022-12-19 07:39:01,973] [INFO] [timer.py:196:stop] epoch=0/micro_step=4470/global_step=4470, RunningAvgSamplesPerSec=17.620983810960013, CurrSamplesPerSec=17.796430947697868, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1541
+ {'loss': 0.0002, 'learning_rate': 1.1866666666666668e-06, 'epoch': 38.01}
1542
+ [2022-12-19 07:42:03,726] [INFO] [logging.py:68:log_dist] [Rank 0] step=4480, skipped=8, lr=[1.1755555555555556e-06], mom=[[0.9, 0.999]]
1543
+ [2022-12-19 07:42:03,728] [INFO] [timer.py:196:stop] epoch=0/micro_step=4480/global_step=4480, RunningAvgSamplesPerSec=17.62096470995377, CurrSamplesPerSec=17.768253674908266, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1544
+ [2022-12-19 07:44:54,699] [INFO] [logging.py:68:log_dist] [Rank 0] step=4490, skipped=8, lr=[1.1533333333333334e-06], mom=[[0.9, 0.999]]
1545
+ [2022-12-19 07:44:54,700] [INFO] [timer.py:196:stop] epoch=0/micro_step=4490/global_step=4490, RunningAvgSamplesPerSec=17.620810807024053, CurrSamplesPerSec=17.12529074539211, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1546
+ [2022-12-19 07:47:45,410] [INFO] [logging.py:68:log_dist] [Rank 0] step=4500, skipped=8, lr=[1.131111111111111e-06], mom=[[0.9, 0.999]]
1547
+ [2022-12-19 07:47:45,411] [INFO] [timer.py:196:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=17.62074008642063, CurrSamplesPerSec=17.679443294912847, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1548
+ {'loss': 0.0002, 'learning_rate': 1.131111111111111e-06, 'epoch': 38.02}
1549
+ [2022-12-19 07:50:35,915] [INFO] [logging.py:68:log_dist] [Rank 0] step=4510, skipped=8, lr=[1.1088888888888889e-06], mom=[[0.9, 0.999]]
1550
+ [2022-12-19 07:50:35,917] [INFO] [timer.py:196:stop] epoch=0/micro_step=4510/global_step=4510, RunningAvgSamplesPerSec=17.620564657864747, CurrSamplesPerSec=17.742456651398978, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1551
+ [2022-12-19 07:52:31,359] [INFO] [logging.py:68:log_dist] [Rank 0] step=4520, skipped=8, lr=[1.0866666666666667e-06], mom=[[0.9, 0.999]]
1552
+ [2022-12-19 07:52:31,361] [INFO] [timer.py:196:stop] epoch=0/micro_step=4520/global_step=4520, RunningAvgSamplesPerSec=17.62064058371775, CurrSamplesPerSec=17.814565970324615, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1553
+ {'loss': 0.0002, 'learning_rate': 1.0755555555555557e-06, 'epoch': 39.0}
1554
+ [2022-12-19 07:56:08,715] [INFO] [logging.py:68:log_dist] [Rank 0] step=4530, skipped=8, lr=[1.0644444444444445e-06], mom=[[0.9, 0.999]]
1555
+ [2022-12-19 07:56:08,716] [INFO] [timer.py:196:stop] epoch=0/micro_step=4530/global_step=4530, RunningAvgSamplesPerSec=17.62153945771267, CurrSamplesPerSec=17.538607510040734, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1556
+ [2022-12-19 07:58:58,490] [INFO] [logging.py:68:log_dist] [Rank 0] step=4540, skipped=8, lr=[1.0422222222222221e-06], mom=[[0.9, 0.999]]
1557
+ [2022-12-19 07:58:58,492] [INFO] [timer.py:196:stop] epoch=0/micro_step=4540/global_step=4540, RunningAvgSamplesPerSec=17.621602807771676, CurrSamplesPerSec=17.587700090174195, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1558
+ [2022-12-19 08:01:48,409] [INFO] [logging.py:68:log_dist] [Rank 0] step=4550, skipped=8, lr=[1.02e-06], mom=[[0.9, 0.999]]
1559
+ [2022-12-19 08:01:48,411] [INFO] [timer.py:196:stop] epoch=0/micro_step=4550/global_step=4550, RunningAvgSamplesPerSec=17.62183130814374, CurrSamplesPerSec=17.59054796755746, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1560
+ {'loss': 0.0002, 'learning_rate': 1.02e-06, 'epoch': 39.01}
1561
+ [2022-12-19 08:04:41,762] [INFO] [logging.py:68:log_dist] [Rank 0] step=4560, skipped=8, lr=[9.97777777777778e-07], mom=[[0.9, 0.999]]
1562
+ [2022-12-19 08:04:41,763] [INFO] [timer.py:196:stop] epoch=0/micro_step=4560/global_step=4560, RunningAvgSamplesPerSec=17.621998562878264, CurrSamplesPerSec=17.616155687347188, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1563
+ [2022-12-19 08:07:38,563] [INFO] [logging.py:68:log_dist] [Rank 0] step=4570, skipped=8, lr=[9.755555555555556e-07], mom=[[0.9, 0.999]]
1564
+ [2022-12-19 08:07:38,564] [INFO] [timer.py:196:stop] epoch=0/micro_step=4570/global_step=4570, RunningAvgSamplesPerSec=17.621957179552908, CurrSamplesPerSec=17.34689564215522, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1565
+ {'loss': 0.0002, 'learning_rate': 9.644444444444444e-07, 'epoch': 39.01}
1566
+ [2022-12-19 08:10:35,179] [INFO] [logging.py:68:log_dist] [Rank 0] step=4580, skipped=8, lr=[9.533333333333335e-07], mom=[[0.9, 0.999]]
1567
+ [2022-12-19 08:10:35,180] [INFO] [timer.py:196:stop] epoch=0/micro_step=4580/global_step=4580, RunningAvgSamplesPerSec=17.622211087890516, CurrSamplesPerSec=17.696736200907107, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1568
+ [2022-12-19 08:13:29,992] [INFO] [logging.py:68:log_dist] [Rank 0] step=4590, skipped=8, lr=[9.311111111111113e-07], mom=[[0.9, 0.999]]
1569
+ [2022-12-19 08:13:29,993] [INFO] [timer.py:196:stop] epoch=0/micro_step=4590/global_step=4590, RunningAvgSamplesPerSec=17.62224562573623, CurrSamplesPerSec=17.885243737876667, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1570
+ [2022-12-19 08:16:25,912] [INFO] [logging.py:68:log_dist] [Rank 0] step=4600, skipped=8, lr=[9.08888888888889e-07], mom=[[0.9, 0.999]]
1571
+ [2022-12-19 08:16:25,914] [INFO] [timer.py:196:stop] epoch=0/micro_step=4600/global_step=4600, RunningAvgSamplesPerSec=17.62219766153009, CurrSamplesPerSec=17.437179457948137, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1572
+ {'loss': 0.0002, 'learning_rate': 9.08888888888889e-07, 'epoch': 39.02}
1573
+ [2022-12-19 08:19:22,584] [INFO] [logging.py:68:log_dist] [Rank 0] step=4610, skipped=8, lr=[8.866666666666668e-07], mom=[[0.9, 0.999]]
1574
+ [2022-12-19 08:19:22,585] [INFO] [timer.py:196:stop] epoch=0/micro_step=4610/global_step=4610, RunningAvgSamplesPerSec=17.622381934268187, CurrSamplesPerSec=17.858060608665216, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1575
+ [2022-12-19 08:22:18,644] [INFO] [logging.py:68:log_dist] [Rank 0] step=4620, skipped=8, lr=[8.644444444444445e-07], mom=[[0.9, 0.999]]
1576
+ [2022-12-19 08:22:18,645] [INFO] [timer.py:196:stop] epoch=0/micro_step=4620/global_step=4620, RunningAvgSamplesPerSec=17.622457478635866, CurrSamplesPerSec=17.62142545329462, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1577
+ {'loss': 0.0002, 'learning_rate': 8.533333333333334e-07, 'epoch': 39.02}
1578
+ [2022-12-19 08:25:19,257] [INFO] [logging.py:68:log_dist] [Rank 0] step=4630, skipped=8, lr=[8.422222222222224e-07], mom=[[0.9, 0.999]]
1579
+ [2022-12-19 08:25:19,259] [INFO] [timer.py:196:stop] epoch=0/micro_step=4630/global_step=4630, RunningAvgSamplesPerSec=17.62274117333296, CurrSamplesPerSec=17.796734173722378, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1580
+ [2022-12-19 08:26:22,516] [INFO] [logging.py:68:log_dist] [Rank 0] step=4640, skipped=8, lr=[8.200000000000001e-07], mom=[[0.9, 0.999]]
1581
+ [2022-12-19 08:26:22,517] [INFO] [timer.py:196:stop] epoch=0/micro_step=4640/global_step=4640, RunningAvgSamplesPerSec=17.624083237549367, CurrSamplesPerSec=23.52461524938626, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1582
+ [2022-12-19 08:31:08,975] [INFO] [logging.py:68:log_dist] [Rank 0] step=4650, skipped=8, lr=[7.977777777777779e-07], mom=[[0.9, 0.999]]
1583
+ [2022-12-19 08:31:08,977] [INFO] [timer.py:196:stop] epoch=0/micro_step=4650/global_step=4650, RunningAvgSamplesPerSec=17.624288170024037, CurrSamplesPerSec=17.88550113865089, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1584
+ {'loss': 0.0002, 'learning_rate': 7.977777777777779e-07, 'epoch': 40.0}
1585
+ [2022-12-19 08:34:15,146] [INFO] [logging.py:68:log_dist] [Rank 0] step=4660, skipped=8, lr=[7.755555555555556e-07], mom=[[0.9, 0.999]]
1586
+ [2022-12-19 08:34:15,147] [INFO] [timer.py:196:stop] epoch=0/micro_step=4660/global_step=4660, RunningAvgSamplesPerSec=17.62444471234681, CurrSamplesPerSec=17.69476125609619, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1587
+ [2022-12-19 08:37:03,411] [INFO] [logging.py:68:log_dist] [Rank 0] step=4670, skipped=8, lr=[7.533333333333335e-07], mom=[[0.9, 0.999]]
1588
+ [2022-12-19 08:37:03,413] [INFO] [timer.py:196:stop] epoch=0/micro_step=4670/global_step=4670, RunningAvgSamplesPerSec=17.62448466929025, CurrSamplesPerSec=17.483528705282534, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1589
+ {'loss': 0.0002, 'learning_rate': 7.422222222222223e-07, 'epoch': 40.01}
1590
+ [2022-12-19 08:39:55,845] [INFO] [logging.py:68:log_dist] [Rank 0] step=4680, skipped=8, lr=[7.311111111111112e-07], mom=[[0.9, 0.999]]
1591
+ [2022-12-19 08:39:55,846] [INFO] [timer.py:196:stop] epoch=0/micro_step=4680/global_step=4680, RunningAvgSamplesPerSec=17.620159783831323, CurrSamplesPerSec=17.60117311667079, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1592
+ [2022-12-19 08:42:42,018] [INFO] [logging.py:68:log_dist] [Rank 0] step=4690, skipped=8, lr=[7.08888888888889e-07], mom=[[0.9, 0.999]]
1593
+ [2022-12-19 08:42:42,019] [INFO] [timer.py:196:stop] epoch=0/micro_step=4690/global_step=4690, RunningAvgSamplesPerSec=17.620279729725365, CurrSamplesPerSec=17.93593334380126, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1594
+ [2022-12-19 08:45:27,450] [INFO] [logging.py:68:log_dist] [Rank 0] step=4700, skipped=8, lr=[6.866666666666667e-07], mom=[[0.9, 0.999]]
1595
+ [2022-12-19 08:45:27,452] [INFO] [timer.py:196:stop] epoch=0/micro_step=4700/global_step=4700, RunningAvgSamplesPerSec=17.620331377429462, CurrSamplesPerSec=17.754938283281913, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1596
+ {'loss': 0.0002, 'learning_rate': 6.866666666666667e-07, 'epoch': 40.01}
1597
+ [2022-12-19 08:48:13,020] [INFO] [logging.py:68:log_dist] [Rank 0] step=4710, skipped=8, lr=[6.644444444444446e-07], mom=[[0.9, 0.999]]
1598
+ [2022-12-19 08:48:13,022] [INFO] [timer.py:196:stop] epoch=0/micro_step=4710/global_step=4710, RunningAvgSamplesPerSec=17.620347520092913, CurrSamplesPerSec=17.775257844597622, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1599
+ [2022-12-19 08:50:59,205] [INFO] [logging.py:68:log_dist] [Rank 0] step=4720, skipped=8, lr=[6.422222222222223e-07], mom=[[0.9, 0.999]]
1600
+ [2022-12-19 08:50:59,207] [INFO] [timer.py:196:stop] epoch=0/micro_step=4720/global_step=4720, RunningAvgSamplesPerSec=17.62027305945099, CurrSamplesPerSec=17.454496273463086, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1601
+ {'loss': 0.0002, 'learning_rate': 6.311111111111112e-07, 'epoch': 40.02}
1602
+ [2022-12-19 08:53:45,256] [INFO] [logging.py:68:log_dist] [Rank 0] step=4730, skipped=8, lr=[6.200000000000001e-07], mom=[[0.9, 0.999]]
1603
+ [2022-12-19 08:53:45,258] [INFO] [timer.py:196:stop] epoch=0/micro_step=4730/global_step=4730, RunningAvgSamplesPerSec=17.620074181001, CurrSamplesPerSec=17.436431911964664, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1604
+ [2022-12-19 08:56:31,906] [INFO] [logging.py:68:log_dist] [Rank 0] step=4740, skipped=8, lr=[5.977777777777778e-07], mom=[[0.9, 0.999]]
1605
+ [2022-12-19 08:56:31,908] [INFO] [timer.py:196:stop] epoch=0/micro_step=4740/global_step=4740, RunningAvgSamplesPerSec=17.620138380065285, CurrSamplesPerSec=17.718109635002346, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 08:58:49,423] [INFO] [logging.py:68:log_dist] [Rank 0] step=4750, skipped=8, lr=[5.755555555555555e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 08:58:49,424] [INFO] [timer.py:196:stop] epoch=0/micro_step=4750/global_step=4750, RunningAvgSamplesPerSec=17.620258476386315, CurrSamplesPerSec=17.843816764218595, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 5.755555555555555e-07, 'epoch': 40.02}
+ [2022-12-19 09:02:02,128] [INFO] [logging.py:68:log_dist] [Rank 0] step=4760, skipped=8, lr=[5.533333333333334e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:02:02,129] [INFO] [timer.py:196:stop] epoch=0/micro_step=4760/global_step=4760, RunningAvgSamplesPerSec=17.620845234938383, CurrSamplesPerSec=17.56852084543164, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:04:49,552] [INFO] [logging.py:68:log_dist] [Rank 0] step=4770, skipped=8, lr=[5.311111111111111e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:04:49,553] [INFO] [timer.py:196:stop] epoch=0/micro_step=4770/global_step=4770, RunningAvgSamplesPerSec=17.620781538403822, CurrSamplesPerSec=17.993237206011308, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 5.2e-07, 'epoch': 41.0}
+ [2022-12-19 09:07:39,410] [INFO] [logging.py:68:log_dist] [Rank 0] step=4780, skipped=8, lr=[5.088888888888889e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:07:39,411] [INFO] [timer.py:196:stop] epoch=0/micro_step=4780/global_step=4780, RunningAvgSamplesPerSec=17.620741159717415, CurrSamplesPerSec=17.62931691282607, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:10:31,165] [INFO] [logging.py:68:log_dist] [Rank 0] step=4790, skipped=8, lr=[4.866666666666666e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:10:31,166] [INFO] [timer.py:196:stop] epoch=0/micro_step=4790/global_step=4790, RunningAvgSamplesPerSec=17.621122717656146, CurrSamplesPerSec=17.9595660728393, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:13:22,998] [INFO] [logging.py:68:log_dist] [Rank 0] step=4800, skipped=8, lr=[4.6444444444444446e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:13:23,000] [INFO] [timer.py:196:stop] epoch=0/micro_step=4800/global_step=4800, RunningAvgSamplesPerSec=17.62146184710049, CurrSamplesPerSec=17.68574021221271, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 4.6444444444444446e-07, 'epoch': 41.01}
+ [2022-12-19 09:16:20,425] [INFO] [logging.py:68:log_dist] [Rank 0] step=4810, skipped=8, lr=[4.422222222222223e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:16:20,426] [INFO] [timer.py:196:stop] epoch=0/micro_step=4810/global_step=4810, RunningAvgSamplesPerSec=17.62207009255359, CurrSamplesPerSec=17.919497743569675, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:19:17,092] [INFO] [logging.py:68:log_dist] [Rank 0] step=4820, skipped=8, lr=[4.2000000000000006e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:19:17,093] [INFO] [timer.py:196:stop] epoch=0/micro_step=4820/global_step=4820, RunningAvgSamplesPerSec=17.62281812490437, CurrSamplesPerSec=18.063956802967372, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 4.0888888888888897e-07, 'epoch': 41.01}
+ [2022-12-19 09:22:11,760] [INFO] [logging.py:68:log_dist] [Rank 0] step=4830, skipped=8, lr=[3.9777777777777783e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:22:11,762] [INFO] [timer.py:196:stop] epoch=0/micro_step=4830/global_step=4830, RunningAvgSamplesPerSec=17.623217566897882, CurrSamplesPerSec=17.682097330909283, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:25:06,256] [INFO] [logging.py:68:log_dist] [Rank 0] step=4840, skipped=8, lr=[3.755555555555556e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:25:06,258] [INFO] [timer.py:196:stop] epoch=0/micro_step=4840/global_step=4840, RunningAvgSamplesPerSec=17.6239823084841, CurrSamplesPerSec=17.937164202870864, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:28:01,375] [INFO] [logging.py:68:log_dist] [Rank 0] step=4850, skipped=8, lr=[3.533333333333334e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:28:01,376] [INFO] [timer.py:196:stop] epoch=0/micro_step=4850/global_step=4850, RunningAvgSamplesPerSec=17.624502174178428, CurrSamplesPerSec=17.980924250155837, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 3.533333333333334e-07, 'epoch': 41.02}
+ [2022-12-19 09:30:55,163] [INFO] [logging.py:68:log_dist] [Rank 0] step=4860, skipped=8, lr=[3.3111111111111115e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:30:55,164] [INFO] [timer.py:196:stop] epoch=0/micro_step=4860/global_step=4860, RunningAvgSamplesPerSec=17.62510883090857, CurrSamplesPerSec=17.68973547651362, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:32:25,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=4870, skipped=8, lr=[3.088888888888889e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:32:25,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=4870/global_step=4870, RunningAvgSamplesPerSec=17.6258822208131, CurrSamplesPerSec=18.0027099754836, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.977777777777778e-07, 'epoch': 42.0}
+ [2022-12-19 09:36:36,623] [INFO] [logging.py:68:log_dist] [Rank 0] step=4880, skipped=8, lr=[2.866666666666667e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:36:36,625] [INFO] [timer.py:196:stop] epoch=0/micro_step=4880/global_step=4880, RunningAvgSamplesPerSec=17.627590163724953, CurrSamplesPerSec=17.970300168807466, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:39:29,570] [INFO] [logging.py:68:log_dist] [Rank 0] step=4890, skipped=8, lr=[2.6444444444444447e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:39:29,571] [INFO] [timer.py:196:stop] epoch=0/micro_step=4890/global_step=4890, RunningAvgSamplesPerSec=17.628267118175295, CurrSamplesPerSec=17.953751119369517, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:42:20,232] [INFO] [logging.py:68:log_dist] [Rank 0] step=4900, skipped=8, lr=[2.4222222222222224e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:42:20,233] [INFO] [timer.py:196:stop] epoch=0/micro_step=4900/global_step=4900, RunningAvgSamplesPerSec=17.628906943630938, CurrSamplesPerSec=17.925920168070032, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.4222222222222224e-07, 'epoch': 42.01}
+ [2022-12-19 09:45:10,151] [INFO] [logging.py:68:log_dist] [Rank 0] step=4910, skipped=8, lr=[2.2e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:45:10,152] [INFO] [timer.py:196:stop] epoch=0/micro_step=4910/global_step=4910, RunningAvgSamplesPerSec=17.629482874220002, CurrSamplesPerSec=17.81305045026094, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:47:55,909] [INFO] [logging.py:68:log_dist] [Rank 0] step=4920, skipped=8, lr=[1.9777777777777778e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:47:55,910] [INFO] [timer.py:196:stop] epoch=0/micro_step=4920/global_step=4920, RunningAvgSamplesPerSec=17.629964973280497, CurrSamplesPerSec=17.82848486728267, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 1.866666666666667e-07, 'epoch': 42.01}
+ [2022-12-19 09:50:45,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4930, skipped=8, lr=[1.7555555555555558e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:50:45,108] [INFO] [timer.py:196:stop] epoch=0/micro_step=4930/global_step=4930, RunningAvgSamplesPerSec=17.630395440620926, CurrSamplesPerSec=17.707026340076307, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:53:36,465] [INFO] [logging.py:68:log_dist] [Rank 0] step=4940, skipped=8, lr=[1.5333333333333333e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:53:36,466] [INFO] [timer.py:196:stop] epoch=0/micro_step=4940/global_step=4940, RunningAvgSamplesPerSec=17.631168427205882, CurrSamplesPerSec=18.003925866621273, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 09:56:31,916] [INFO] [logging.py:68:log_dist] [Rank 0] step=4950, skipped=8, lr=[1.3111111111111113e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:56:31,917] [INFO] [timer.py:196:stop] epoch=0/micro_step=4950/global_step=4950, RunningAvgSamplesPerSec=17.63176081163297, CurrSamplesPerSec=18.076746529204566, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 1.3111111111111113e-07, 'epoch': 42.02}
+ [2022-12-19 09:59:24,350] [INFO] [logging.py:68:log_dist] [Rank 0] step=4960, skipped=8, lr=[1.088888888888889e-07], mom=[[0.9, 0.999]]
+ [2022-12-19 09:59:24,352] [INFO] [timer.py:196:stop] epoch=0/micro_step=4960/global_step=4960, RunningAvgSamplesPerSec=17.632263561334792, CurrSamplesPerSec=17.887386586858412, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 10:02:21,261] [INFO] [logging.py:68:log_dist] [Rank 0] step=4970, skipped=8, lr=[8.666666666666668e-08], mom=[[0.9, 0.999]]
+ [2022-12-19 10:02:21,262] [INFO] [timer.py:196:stop] epoch=0/micro_step=4970/global_step=4970, RunningAvgSamplesPerSec=17.632836729122953, CurrSamplesPerSec=18.04230886853236, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 7.555555555555556e-08, 'epoch': 42.02}
+ [2022-12-19 10:05:14,832] [INFO] [logging.py:68:log_dist] [Rank 0] step=4980, skipped=8, lr=[6.444444444444445e-08], mom=[[0.9, 0.999]]
+ [2022-12-19 10:05:14,834] [INFO] [timer.py:196:stop] epoch=0/micro_step=4980/global_step=4980, RunningAvgSamplesPerSec=17.633470780183274, CurrSamplesPerSec=17.935921359635877, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 10:08:03,799] [INFO] [logging.py:68:log_dist] [Rank 0] step=4990, skipped=8, lr=[4.222222222222222e-08], mom=[[0.9, 0.999]]
+ [2022-12-19 10:08:03,800] [INFO] [timer.py:196:stop] epoch=0/micro_step=4990/global_step=4990, RunningAvgSamplesPerSec=17.635077496959436, CurrSamplesPerSec=18.20354202004553, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 10:10:57,893] [INFO] [logging.py:68:log_dist] [Rank 0] step=5000, skipped=8, lr=[2e-08], mom=[[0.9, 0.999]]
+ [2022-12-19 10:10:57,895] [INFO] [timer.py:196:stop] epoch=0/micro_step=5000/global_step=5000, RunningAvgSamplesPerSec=17.635593279087654, CurrSamplesPerSec=18.094030560717506, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2e-08, 'epoch': 43.0}
+ {'eval_loss': 0.3427734375, 'eval_wer': 17.804826268487723, 'eval_runtime': 1211.7822, 'eval_samples_per_second': 3.185, 'eval_steps_per_second': 0.1, 'epoch': 43.0}
+ [2022-12-19 10:31:10,745] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save!
+ [2022-12-19 10:31:10,753] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt
+ [2022-12-19 10:31:10,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt...
+ [2022-12-19 10:31:11,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt.
+ [2022-12-19 10:31:11,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+ [2022-12-19 10:31:15,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+ [2022-12-19 10:31:15,939] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+ [2022-12-19 10:31:15,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
runs/Dec18_08-41-04_fe2747a042f0/events.out.tfevents.1671381730.fe2747a042f0.46148.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f809800ca6de0a5be418d3f2830d96d4dc56d4a9ab41574b9b5ebe7730f0eee9
- size 30653
+ oid sha256:8902c3935bdd48437262f5153bb5bdab3bf7777a2e2193fb19c4db3fc98b8f31
+ size 37251