Training in progress, step 5000
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:9e1062e0a39ac80eef975c1aeed734d737cb5183f243dac42da4ea28a58928cc
 size 483536061
run.log
CHANGED
@@ -1424,3 +1424,254 @@ Rank: 0 partition count [1] and sizes[(241734912, False)]
|
|
1424 |
[2022-12-19 05:19:38,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
|
1425 |
[2022-12-19 05:19:38,828] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
1426 |
[2022-12-19 05:19:38,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
1427 |
+
[2022-12-19 05:22:56,912] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
|
1428 |
+
[2022-12-19 05:23:13,569] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
|
1429 |
+
[2022-12-19 05:23:46,881] [INFO] [logging.py:68:log_dist] [Rank 0] step=4010, skipped=8, lr=[2.2200000000000003e-06], mom=[[0.9, 0.999]]
|
1430 |
+
[2022-12-19 05:23:46,883] [INFO] [timer.py:196:stop] epoch=0/micro_step=4010/global_step=4010, RunningAvgSamplesPerSec=17.611133683583187, CurrSamplesPerSec=17.462168466721625, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1431 |
+
[2022-12-19 05:26:38,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=4020, skipped=8, lr=[2.197777777777778e-06], mom=[[0.9, 0.999]]
|
1432 |
+
[2022-12-19 05:26:38,750] [INFO] [timer.py:196:stop] epoch=0/micro_step=4020/global_step=4020, RunningAvgSamplesPerSec=17.611368779669082, CurrSamplesPerSec=17.378475405939998, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1433 |
+
{'loss': 0.0002, 'learning_rate': 2.1866666666666668e-06, 'epoch': 34.02}
|
1434 |
+
[2022-12-19 05:29:37,420] [INFO] [logging.py:68:log_dist] [Rank 0] step=4030, skipped=8, lr=[2.1755555555555556e-06], mom=[[0.9, 0.999]]
|
1435 |
+
[2022-12-19 05:29:37,422] [INFO] [timer.py:196:stop] epoch=0/micro_step=4030/global_step=4030, RunningAvgSamplesPerSec=17.61113977607008, CurrSamplesPerSec=17.737646360030492, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1436 |
+
[2022-12-19 05:32:38,897] [INFO] [logging.py:68:log_dist] [Rank 0] step=4040, skipped=8, lr=[2.153333333333333e-06], mom=[[0.9, 0.999]]
|
1437 |
+
[2022-12-19 05:32:38,898] [INFO] [timer.py:196:stop] epoch=0/micro_step=4040/global_step=4040, RunningAvgSamplesPerSec=17.610970462806403, CurrSamplesPerSec=17.397608183728696, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1438 |
+
[2022-12-19 05:35:39,166] [INFO] [logging.py:68:log_dist] [Rank 0] step=4050, skipped=8, lr=[2.1311111111111112e-06], mom=[[0.9, 0.999]]
|
1439 |
+
[2022-12-19 05:35:39,168] [INFO] [timer.py:196:stop] epoch=0/micro_step=4050/global_step=4050, RunningAvgSamplesPerSec=17.611253470031247, CurrSamplesPerSec=17.833383657490664, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1440 |
+
{'loss': 0.0002, 'learning_rate': 2.1311111111111112e-06, 'epoch': 34.02}
|
1441 |
+
[2022-12-19 05:36:43,776] [INFO] [logging.py:68:log_dist] [Rank 0] step=4060, skipped=8, lr=[2.108888888888889e-06], mom=[[0.9, 0.999]]
|
1442 |
+
[2022-12-19 05:36:43,777] [INFO] [timer.py:196:stop] epoch=0/micro_step=4060/global_step=4060, RunningAvgSamplesPerSec=17.612337722470095, CurrSamplesPerSec=23.447783642928687, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1443 |
+
[2022-12-19 05:41:45,291] [INFO] [logging.py:68:log_dist] [Rank 0] step=4070, skipped=8, lr=[2.086666666666667e-06], mom=[[0.9, 0.999]]
|
1444 |
+
[2022-12-19 05:41:45,292] [INFO] [timer.py:196:stop] epoch=0/micro_step=4070/global_step=4070, RunningAvgSamplesPerSec=17.612470816100952, CurrSamplesPerSec=17.877452530444263, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1445 |
+
{'loss': 0.0002, 'learning_rate': 2.0755555555555557e-06, 'epoch': 35.0}
|
1446 |
+
[2022-12-19 05:44:43,364] [INFO] [logging.py:68:log_dist] [Rank 0] step=4080, skipped=8, lr=[2.064444444444445e-06], mom=[[0.9, 0.999]]
|
1447 |
+
[2022-12-19 05:44:43,365] [INFO] [timer.py:196:stop] epoch=0/micro_step=4080/global_step=4080, RunningAvgSamplesPerSec=17.612796189774304, CurrSamplesPerSec=17.736692349537687, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1448 |
+
[2022-12-19 05:47:41,092] [INFO] [logging.py:68:log_dist] [Rank 0] step=4090, skipped=8, lr=[2.0422222222222225e-06], mom=[[0.9, 0.999]]
|
1449 |
+
[2022-12-19 05:47:41,093] [INFO] [timer.py:196:stop] epoch=0/micro_step=4090/global_step=4090, RunningAvgSamplesPerSec=17.612917765334718, CurrSamplesPerSec=17.74363646795449, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1450 |
+
[2022-12-19 05:50:36,586] [INFO] [logging.py:68:log_dist] [Rank 0] step=4100, skipped=8, lr=[2.02e-06], mom=[[0.9, 0.999]]
|
1451 |
+
[2022-12-19 05:50:36,587] [INFO] [timer.py:196:stop] epoch=0/micro_step=4100/global_step=4100, RunningAvgSamplesPerSec=17.61301264236775, CurrSamplesPerSec=17.761625214894558, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1452 |
+
{'loss': 0.0002, 'learning_rate': 2.02e-06, 'epoch': 35.01}
|
1453 |
+
[2022-12-19 05:53:34,916] [INFO] [logging.py:68:log_dist] [Rank 0] step=4110, skipped=8, lr=[1.9977777777777778e-06], mom=[[0.9, 0.999]]
|
1454 |
+
[2022-12-19 05:53:34,917] [INFO] [timer.py:196:stop] epoch=0/micro_step=4110/global_step=4110, RunningAvgSamplesPerSec=17.61307623650117, CurrSamplesPerSec=17.558381822907993, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1455 |
+
[2022-12-19 05:56:32,454] [INFO] [logging.py:68:log_dist] [Rank 0] step=4120, skipped=8, lr=[1.975555555555556e-06], mom=[[0.9, 0.999]]
|
1456 |
+
[2022-12-19 05:56:32,456] [INFO] [timer.py:196:stop] epoch=0/micro_step=4120/global_step=4120, RunningAvgSamplesPerSec=17.61328460598005, CurrSamplesPerSec=17.849423032435844, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1457 |
+
{'loss': 0.0002, 'learning_rate': 1.9644444444444446e-06, 'epoch': 35.01}
|
1458 |
+
[2022-12-19 05:59:28,958] [INFO] [logging.py:68:log_dist] [Rank 0] step=4130, skipped=8, lr=[1.9533333333333334e-06], mom=[[0.9, 0.999]]
|
1459 |
+
[2022-12-19 05:59:28,959] [INFO] [timer.py:196:stop] epoch=0/micro_step=4130/global_step=4130, RunningAvgSamplesPerSec=17.61368146124737, CurrSamplesPerSec=17.80531019001806, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1460 |
+
[2022-12-19 06:02:37,403] [INFO] [logging.py:68:log_dist] [Rank 0] step=4140, skipped=8, lr=[1.9311111111111114e-06], mom=[[0.9, 0.999]]
|
1461 |
+
[2022-12-19 06:02:37,404] [INFO] [timer.py:196:stop] epoch=0/micro_step=4140/global_step=4140, RunningAvgSamplesPerSec=17.613849828726913, CurrSamplesPerSec=17.741122217014265, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1462 |
+
[2022-12-19 06:05:39,451] [INFO] [logging.py:68:log_dist] [Rank 0] step=4150, skipped=8, lr=[1.908888888888889e-06], mom=[[0.9, 0.999]]
|
1463 |
+
[2022-12-19 06:05:39,452] [INFO] [timer.py:196:stop] epoch=0/micro_step=4150/global_step=4150, RunningAvgSamplesPerSec=17.61420113756101, CurrSamplesPerSec=17.687389147202236, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1464 |
+
{'loss': 0.0002, 'learning_rate': 1.908888888888889e-06, 'epoch': 35.02}
|
1465 |
+
[2022-12-19 06:08:34,312] [INFO] [logging.py:68:log_dist] [Rank 0] step=4160, skipped=8, lr=[1.8866666666666669e-06], mom=[[0.9, 0.999]]
|
1466 |
+
[2022-12-19 06:08:34,313] [INFO] [timer.py:196:stop] epoch=0/micro_step=4160/global_step=4160, RunningAvgSamplesPerSec=17.614407249990197, CurrSamplesPerSec=17.847925309782504, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1467 |
+
[2022-12-19 06:11:07,038] [INFO] [logging.py:68:log_dist] [Rank 0] step=4170, skipped=8, lr=[1.8644444444444445e-06], mom=[[0.9, 0.999]]
|
1468 |
+
[2022-12-19 06:11:07,039] [INFO] [timer.py:196:stop] epoch=0/micro_step=4170/global_step=4170, RunningAvgSamplesPerSec=17.614796295687643, CurrSamplesPerSec=17.6957562558827, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1469 |
+
{'loss': 0.0002, 'learning_rate': 1.8533333333333333e-06, 'epoch': 35.02}
|
1470 |
+
[2022-12-19 06:14:26,808] [INFO] [logging.py:68:log_dist] [Rank 0] step=4180, skipped=8, lr=[1.8422222222222225e-06], mom=[[0.9, 0.999]]
|
1471 |
+
[2022-12-19 06:14:26,809] [INFO] [timer.py:196:stop] epoch=0/micro_step=4180/global_step=4180, RunningAvgSamplesPerSec=17.615978855666164, CurrSamplesPerSec=17.517680699168547, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1472 |
+
[2022-12-19 06:17:15,114] [INFO] [logging.py:68:log_dist] [Rank 0] step=4190, skipped=8, lr=[1.8200000000000002e-06], mom=[[0.9, 0.999]]
|
1473 |
+
[2022-12-19 06:17:15,116] [INFO] [timer.py:196:stop] epoch=0/micro_step=4190/global_step=4190, RunningAvgSamplesPerSec=17.616242542847015, CurrSamplesPerSec=17.71552661785773, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1474 |
+
[2022-12-19 06:20:05,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=4200, skipped=8, lr=[1.797777777777778e-06], mom=[[0.9, 0.999]]
|
1475 |
+
[2022-12-19 06:20:05,756] [INFO] [timer.py:196:stop] epoch=0/micro_step=4200/global_step=4200, RunningAvgSamplesPerSec=17.61609844770078, CurrSamplesPerSec=17.366084738146398, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1476 |
+
{'loss': 0.0002, 'learning_rate': 1.797777777777778e-06, 'epoch': 36.0}
|
1477 |
+
[2022-12-19 06:22:59,727] [INFO] [logging.py:68:log_dist] [Rank 0] step=4210, skipped=8, lr=[1.7755555555555556e-06], mom=[[0.9, 0.999]]
|
1478 |
+
[2022-12-19 06:22:59,728] [INFO] [timer.py:196:stop] epoch=0/micro_step=4210/global_step=4210, RunningAvgSamplesPerSec=17.616157473369498, CurrSamplesPerSec=17.633430354752342, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1479 |
+
[2022-12-19 06:25:55,318] [INFO] [logging.py:68:log_dist] [Rank 0] step=4220, skipped=8, lr=[1.7533333333333336e-06], mom=[[0.9, 0.999]]
|
1480 |
+
[2022-12-19 06:25:55,320] [INFO] [timer.py:196:stop] epoch=0/micro_step=4220/global_step=4220, RunningAvgSamplesPerSec=17.616198217116857, CurrSamplesPerSec=17.707843993918647, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1481 |
+
{'loss': 0.0002, 'learning_rate': 1.7422222222222224e-06, 'epoch': 36.01}
|
1482 |
+
[2022-12-19 06:28:57,297] [INFO] [logging.py:68:log_dist] [Rank 0] step=4230, skipped=8, lr=[1.7311111111111112e-06], mom=[[0.9, 0.999]]
|
1483 |
+
[2022-12-19 06:28:57,298] [INFO] [timer.py:196:stop] epoch=0/micro_step=4230/global_step=4230, RunningAvgSamplesPerSec=17.61652951646244, CurrSamplesPerSec=17.71409803827759, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1484 |
+
[2022-12-19 06:31:59,467] [INFO] [logging.py:68:log_dist] [Rank 0] step=4240, skipped=8, lr=[1.708888888888889e-06], mom=[[0.9, 0.999]]
|
1485 |
+
[2022-12-19 06:31:59,468] [INFO] [timer.py:196:stop] epoch=0/micro_step=4240/global_step=4240, RunningAvgSamplesPerSec=17.617018924776456, CurrSamplesPerSec=17.79433342574028, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1486 |
+
[2022-12-19 06:34:52,886] [INFO] [logging.py:68:log_dist] [Rank 0] step=4250, skipped=8, lr=[1.6866666666666667e-06], mom=[[0.9, 0.999]]
|
1487 |
+
[2022-12-19 06:34:52,888] [INFO] [timer.py:196:stop] epoch=0/micro_step=4250/global_step=4250, RunningAvgSamplesPerSec=17.617133665736407, CurrSamplesPerSec=17.89488941065793, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1488 |
+
{'loss': 0.0002, 'learning_rate': 1.6866666666666667e-06, 'epoch': 36.01}
|
1489 |
+
[2022-12-19 06:37:46,872] [INFO] [logging.py:68:log_dist] [Rank 0] step=4260, skipped=8, lr=[1.6644444444444447e-06], mom=[[0.9, 0.999]]
|
1490 |
+
[2022-12-19 06:37:46,874] [INFO] [timer.py:196:stop] epoch=0/micro_step=4260/global_step=4260, RunningAvgSamplesPerSec=17.61709970360881, CurrSamplesPerSec=17.72886490216432, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1491 |
+
[2022-12-19 06:40:46,545] [INFO] [logging.py:68:log_dist] [Rank 0] step=4270, skipped=8, lr=[1.6422222222222223e-06], mom=[[0.9, 0.999]]
|
1492 |
+
[2022-12-19 06:40:46,546] [INFO] [timer.py:196:stop] epoch=0/micro_step=4270/global_step=4270, RunningAvgSamplesPerSec=17.61723647083081, CurrSamplesPerSec=17.50934176766855, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1493 |
+
{'loss': 0.0002, 'learning_rate': 1.6311111111111114e-06, 'epoch': 36.02}
|
1494 |
+
[2022-12-19 06:43:41,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=4280, skipped=8, lr=[1.6200000000000002e-06], mom=[[0.9, 0.999]]
|
1495 |
+
[2022-12-19 06:43:41,741] [INFO] [timer.py:196:stop] epoch=0/micro_step=4280/global_step=4280, RunningAvgSamplesPerSec=17.617458572115385, CurrSamplesPerSec=17.8233958672181, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1496 |
+
[2022-12-19 06:45:13,558] [INFO] [logging.py:68:log_dist] [Rank 0] step=4290, skipped=8, lr=[1.5977777777777778e-06], mom=[[0.9, 0.999]]
|
1497 |
+
[2022-12-19 06:45:13,560] [INFO] [timer.py:196:stop] epoch=0/micro_step=4290/global_step=4290, RunningAvgSamplesPerSec=17.61748707733689, CurrSamplesPerSec=17.694014788106895, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1498 |
+
[2022-12-19 06:49:29,081] [INFO] [logging.py:68:log_dist] [Rank 0] step=4300, skipped=8, lr=[1.5755555555555558e-06], mom=[[0.9, 0.999]]
|
1499 |
+
[2022-12-19 06:49:29,083] [INFO] [timer.py:196:stop] epoch=0/micro_step=4300/global_step=4300, RunningAvgSamplesPerSec=17.618675777869846, CurrSamplesPerSec=17.604989397739672, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1500 |
+
{'loss': 0.0002, 'learning_rate': 1.5755555555555558e-06, 'epoch': 37.0}
|
1501 |
+
[2022-12-19 06:52:25,502] [INFO] [logging.py:68:log_dist] [Rank 0] step=4310, skipped=8, lr=[1.5533333333333334e-06], mom=[[0.9, 0.999]]
|
1502 |
+
[2022-12-19 06:52:25,503] [INFO] [timer.py:196:stop] epoch=0/micro_step=4310/global_step=4310, RunningAvgSamplesPerSec=17.618917658378976, CurrSamplesPerSec=17.888580989149183, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1503 |
+
[2022-12-19 06:55:21,753] [INFO] [logging.py:68:log_dist] [Rank 0] step=4320, skipped=8, lr=[1.5311111111111113e-06], mom=[[0.9, 0.999]]
|
1504 |
+
[2022-12-19 06:55:21,754] [INFO] [timer.py:196:stop] epoch=0/micro_step=4320/global_step=4320, RunningAvgSamplesPerSec=17.619314539730464, CurrSamplesPerSec=17.772010979056823, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1505 |
+
{'loss': 0.0002, 'learning_rate': 1.52e-06, 'epoch': 37.01}
|
1506 |
+
[2022-12-19 06:58:18,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=4330, skipped=8, lr=[1.5088888888888889e-06], mom=[[0.9, 0.999]]
|
1507 |
+
[2022-12-19 06:58:18,144] [INFO] [timer.py:196:stop] epoch=0/micro_step=4330/global_step=4330, RunningAvgSamplesPerSec=17.619612314250787, CurrSamplesPerSec=17.79300532950759, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1508 |
+
[2022-12-19 07:01:11,886] [INFO] [logging.py:68:log_dist] [Rank 0] step=4340, skipped=8, lr=[1.486666666666667e-06], mom=[[0.9, 0.999]]
|
1509 |
+
[2022-12-19 07:01:11,888] [INFO] [timer.py:196:stop] epoch=0/micro_step=4340/global_step=4340, RunningAvgSamplesPerSec=17.619731554882748, CurrSamplesPerSec=17.831246620133083, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1510 |
+
[2022-12-19 07:04:10,104] [INFO] [logging.py:68:log_dist] [Rank 0] step=4350, skipped=8, lr=[1.4644444444444445e-06], mom=[[0.9, 0.999]]
|
1511 |
+
[2022-12-19 07:04:10,105] [INFO] [timer.py:196:stop] epoch=0/micro_step=4350/global_step=4350, RunningAvgSamplesPerSec=17.619778835395326, CurrSamplesPerSec=17.918863768337292, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1512 |
+
{'loss': 0.0002, 'learning_rate': 1.4644444444444445e-06, 'epoch': 37.01}
|
1513 |
+
[2022-12-19 07:07:18,279] [INFO] [logging.py:68:log_dist] [Rank 0] step=4360, skipped=8, lr=[1.4422222222222223e-06], mom=[[0.9, 0.999]]
|
1514 |
+
[2022-12-19 07:07:18,280] [INFO] [timer.py:196:stop] epoch=0/micro_step=4360/global_step=4360, RunningAvgSamplesPerSec=17.61996490248731, CurrSamplesPerSec=17.26081994397919, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1515 |
+
[2022-12-19 07:10:24,446] [INFO] [logging.py:68:log_dist] [Rank 0] step=4370, skipped=8, lr=[1.42e-06], mom=[[0.9, 0.999]]
|
1516 |
+
[2022-12-19 07:10:24,448] [INFO] [timer.py:196:stop] epoch=0/micro_step=4370/global_step=4370, RunningAvgSamplesPerSec=17.62001193317469, CurrSamplesPerSec=17.401952624968608, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1517 |
+
{'loss': 0.0002, 'learning_rate': 1.4088888888888892e-06, 'epoch': 37.02}
|
1518 |
+
[2022-12-19 07:13:17,536] [INFO] [logging.py:68:log_dist] [Rank 0] step=4380, skipped=8, lr=[1.397777777777778e-06], mom=[[0.9, 0.999]]
|
1519 |
+
[2022-12-19 07:13:17,537] [INFO] [timer.py:196:stop] epoch=0/micro_step=4380/global_step=4380, RunningAvgSamplesPerSec=17.619935746572583, CurrSamplesPerSec=17.670992723198474, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1520 |
+
[2022-12-19 07:16:09,586] [INFO] [logging.py:68:log_dist] [Rank 0] step=4390, skipped=8, lr=[1.3755555555555556e-06], mom=[[0.9, 0.999]]
|
1521 |
+
[2022-12-19 07:16:09,588] [INFO] [timer.py:196:stop] epoch=0/micro_step=4390/global_step=4390, RunningAvgSamplesPerSec=17.62012413538634, CurrSamplesPerSec=17.81988415153239, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1522 |
+
[2022-12-19 07:19:00,421] [INFO] [logging.py:68:log_dist] [Rank 0] step=4400, skipped=8, lr=[1.3533333333333334e-06], mom=[[0.9, 0.999]]
|
1523 |
+
[2022-12-19 07:19:00,423] [INFO] [timer.py:196:stop] epoch=0/micro_step=4400/global_step=4400, RunningAvgSamplesPerSec=17.620059714852356, CurrSamplesPerSec=17.51520378570747, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1524 |
+
{'loss': 0.0002, 'learning_rate': 1.3533333333333334e-06, 'epoch': 37.02}
|
1525 |
+
[2022-12-19 07:21:47,475] [INFO] [logging.py:68:log_dist] [Rank 0] step=4410, skipped=8, lr=[1.3311111111111113e-06], mom=[[0.9, 0.999]]
|
1526 |
+
[2022-12-19 07:21:47,477] [INFO] [timer.py:196:stop] epoch=0/micro_step=4410/global_step=4410, RunningAvgSamplesPerSec=17.621140582831963, CurrSamplesPerSec=17.79351248317827, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1527 |
+
[2022-12-19 07:24:37,405] [INFO] [logging.py:68:log_dist] [Rank 0] step=4420, skipped=8, lr=[1.308888888888889e-06], mom=[[0.9, 0.999]]
|
1528 |
+
[2022-12-19 07:24:37,407] [INFO] [timer.py:196:stop] epoch=0/micro_step=4420/global_step=4420, RunningAvgSamplesPerSec=17.621243226652567, CurrSamplesPerSec=17.81804129124176, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1529 |
+
{'loss': 0.0002, 'learning_rate': 1.2977777777777779e-06, 'epoch': 38.0}
|
1530 |
+
[2022-12-19 07:27:30,942] [INFO] [logging.py:68:log_dist] [Rank 0] step=4430, skipped=8, lr=[1.286666666666667e-06], mom=[[0.9, 0.999]]
|
1531 |
+
[2022-12-19 07:27:30,943] [INFO] [timer.py:196:stop] epoch=0/micro_step=4430/global_step=4430, RunningAvgSamplesPerSec=17.62120393290399, CurrSamplesPerSec=17.173806418163036, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1532 |
+
[2022-12-19 07:30:23,275] [INFO] [logging.py:68:log_dist] [Rank 0] step=4440, skipped=8, lr=[1.2644444444444445e-06], mom=[[0.9, 0.999]]
|
1533 |
+
[2022-12-19 07:30:23,277] [INFO] [timer.py:196:stop] epoch=0/micro_step=4440/global_step=4440, RunningAvgSamplesPerSec=17.621216006629336, CurrSamplesPerSec=17.66344978593792, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1534 |
+
[2022-12-19 07:33:14,028] [INFO] [logging.py:68:log_dist] [Rank 0] step=4450, skipped=8, lr=[1.2422222222222224e-06], mom=[[0.9, 0.999]]
|
1535 |
+
[2022-12-19 07:33:14,029] [INFO] [timer.py:196:stop] epoch=0/micro_step=4450/global_step=4450, RunningAvgSamplesPerSec=17.621087033427646, CurrSamplesPerSec=17.209054639288738, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1536 |
+
{'loss': 0.0002, 'learning_rate': 1.2422222222222224e-06, 'epoch': 38.01}
|
1537 |
+
[2022-12-19 07:36:06,709] [INFO] [logging.py:68:log_dist] [Rank 0] step=4460, skipped=8, lr=[1.2200000000000002e-06], mom=[[0.9, 0.999]]
|
1538 |
+
[2022-12-19 07:36:06,710] [INFO] [timer.py:196:stop] epoch=0/micro_step=4460/global_step=4460, RunningAvgSamplesPerSec=17.620917138316685, CurrSamplesPerSec=17.78626293383821, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1539 |
+
[2022-12-19 07:39:01,972] [INFO] [logging.py:68:log_dist] [Rank 0] step=4470, skipped=8, lr=[1.1977777777777778e-06], mom=[[0.9, 0.999]]
|
1540 |
+
[2022-12-19 07:39:01,973] [INFO] [timer.py:196:stop] epoch=0/micro_step=4470/global_step=4470, RunningAvgSamplesPerSec=17.620983810960013, CurrSamplesPerSec=17.796430947697868, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1541 |
+
{'loss': 0.0002, 'learning_rate': 1.1866666666666668e-06, 'epoch': 38.01}
|
1542 |
+
[2022-12-19 07:42:03,726] [INFO] [logging.py:68:log_dist] [Rank 0] step=4480, skipped=8, lr=[1.1755555555555556e-06], mom=[[0.9, 0.999]]
|
1543 |
+
[2022-12-19 07:42:03,728] [INFO] [timer.py:196:stop] epoch=0/micro_step=4480/global_step=4480, RunningAvgSamplesPerSec=17.62096470995377, CurrSamplesPerSec=17.768253674908266, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1544 |
+
[2022-12-19 07:44:54,699] [INFO] [logging.py:68:log_dist] [Rank 0] step=4490, skipped=8, lr=[1.1533333333333334e-06], mom=[[0.9, 0.999]]
|
1545 |
+
[2022-12-19 07:44:54,700] [INFO] [timer.py:196:stop] epoch=0/micro_step=4490/global_step=4490, RunningAvgSamplesPerSec=17.620810807024053, CurrSamplesPerSec=17.12529074539211, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1546 |
+
[2022-12-19 07:47:45,410] [INFO] [logging.py:68:log_dist] [Rank 0] step=4500, skipped=8, lr=[1.131111111111111e-06], mom=[[0.9, 0.999]]
|
1547 |
+
[2022-12-19 07:47:45,411] [INFO] [timer.py:196:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=17.62074008642063, CurrSamplesPerSec=17.679443294912847, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1548 |
+
{'loss': 0.0002, 'learning_rate': 1.131111111111111e-06, 'epoch': 38.02}
|
1549 |
+
[2022-12-19 07:50:35,915] [INFO] [logging.py:68:log_dist] [Rank 0] step=4510, skipped=8, lr=[1.1088888888888889e-06], mom=[[0.9, 0.999]]
|
1550 |
+
[2022-12-19 07:50:35,917] [INFO] [timer.py:196:stop] epoch=0/micro_step=4510/global_step=4510, RunningAvgSamplesPerSec=17.620564657864747, CurrSamplesPerSec=17.742456651398978, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1551 |
+
[2022-12-19 07:52:31,359] [INFO] [logging.py:68:log_dist] [Rank 0] step=4520, skipped=8, lr=[1.0866666666666667e-06], mom=[[0.9, 0.999]]
|
1552 |
+
[2022-12-19 07:52:31,361] [INFO] [timer.py:196:stop] epoch=0/micro_step=4520/global_step=4520, RunningAvgSamplesPerSec=17.62064058371775, CurrSamplesPerSec=17.814565970324615, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1553 |
+
{'loss': 0.0002, 'learning_rate': 1.0755555555555557e-06, 'epoch': 39.0}
|
1554 |
+
[2022-12-19 07:56:08,715] [INFO] [logging.py:68:log_dist] [Rank 0] step=4530, skipped=8, lr=[1.0644444444444445e-06], mom=[[0.9, 0.999]]
|
1555 |
+
[2022-12-19 07:56:08,716] [INFO] [timer.py:196:stop] epoch=0/micro_step=4530/global_step=4530, RunningAvgSamplesPerSec=17.62153945771267, CurrSamplesPerSec=17.538607510040734, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1556 |
+
[2022-12-19 07:58:58,490] [INFO] [logging.py:68:log_dist] [Rank 0] step=4540, skipped=8, lr=[1.0422222222222221e-06], mom=[[0.9, 0.999]]
|
1557 |
+
[2022-12-19 07:58:58,492] [INFO] [timer.py:196:stop] epoch=0/micro_step=4540/global_step=4540, RunningAvgSamplesPerSec=17.621602807771676, CurrSamplesPerSec=17.587700090174195, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1558 |
+
[2022-12-19 08:01:48,409] [INFO] [logging.py:68:log_dist] [Rank 0] step=4550, skipped=8, lr=[1.02e-06], mom=[[0.9, 0.999]]
|
1559 |
+
[2022-12-19 08:01:48,411] [INFO] [timer.py:196:stop] epoch=0/micro_step=4550/global_step=4550, RunningAvgSamplesPerSec=17.62183130814374, CurrSamplesPerSec=17.59054796755746, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1560 |
+
{'loss': 0.0002, 'learning_rate': 1.02e-06, 'epoch': 39.01}
|
1561 |
+
[2022-12-19 08:04:41,762] [INFO] [logging.py:68:log_dist] [Rank 0] step=4560, skipped=8, lr=[9.97777777777778e-07], mom=[[0.9, 0.999]]
|
1562 |
+
[2022-12-19 08:04:41,763] [INFO] [timer.py:196:stop] epoch=0/micro_step=4560/global_step=4560, RunningAvgSamplesPerSec=17.621998562878264, CurrSamplesPerSec=17.616155687347188, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1563 |
+
[2022-12-19 08:07:38,563] [INFO] [logging.py:68:log_dist] [Rank 0] step=4570, skipped=8, lr=[9.755555555555556e-07], mom=[[0.9, 0.999]]
|
1564 |
+
[2022-12-19 08:07:38,564] [INFO] [timer.py:196:stop] epoch=0/micro_step=4570/global_step=4570, RunningAvgSamplesPerSec=17.621957179552908, CurrSamplesPerSec=17.34689564215522, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1565 |
+
{'loss': 0.0002, 'learning_rate': 9.644444444444444e-07, 'epoch': 39.01}
|
1566 |
+
[2022-12-19 08:10:35,179] [INFO] [logging.py:68:log_dist] [Rank 0] step=4580, skipped=8, lr=[9.533333333333335e-07], mom=[[0.9, 0.999]]
|
1567 |
+
[2022-12-19 08:10:35,180] [INFO] [timer.py:196:stop] epoch=0/micro_step=4580/global_step=4580, RunningAvgSamplesPerSec=17.622211087890516, CurrSamplesPerSec=17.696736200907107, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1568 |
+
[2022-12-19 08:13:29,992] [INFO] [logging.py:68:log_dist] [Rank 0] step=4590, skipped=8, lr=[9.311111111111113e-07], mom=[[0.9, 0.999]]
|
1569 |
+
[2022-12-19 08:13:29,993] [INFO] [timer.py:196:stop] epoch=0/micro_step=4590/global_step=4590, RunningAvgSamplesPerSec=17.62224562573623, CurrSamplesPerSec=17.885243737876667, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1570 |
+
[2022-12-19 08:16:25,912] [INFO] [logging.py:68:log_dist] [Rank 0] step=4600, skipped=8, lr=[9.08888888888889e-07], mom=[[0.9, 0.999]]
|
1571 |
+
[2022-12-19 08:16:25,914] [INFO] [timer.py:196:stop] epoch=0/micro_step=4600/global_step=4600, RunningAvgSamplesPerSec=17.62219766153009, CurrSamplesPerSec=17.437179457948137, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1572 |
+
{'loss': 0.0002, 'learning_rate': 9.08888888888889e-07, 'epoch': 39.02}
|
1573 |
+
[2022-12-19 08:19:22,584] [INFO] [logging.py:68:log_dist] [Rank 0] step=4610, skipped=8, lr=[8.866666666666668e-07], mom=[[0.9, 0.999]]
|
1574 |
+
[2022-12-19 08:19:22,585] [INFO] [timer.py:196:stop] epoch=0/micro_step=4610/global_step=4610, RunningAvgSamplesPerSec=17.622381934268187, CurrSamplesPerSec=17.858060608665216, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1575 |
+
[2022-12-19 08:22:18,644] [INFO] [logging.py:68:log_dist] [Rank 0] step=4620, skipped=8, lr=[8.644444444444445e-07], mom=[[0.9, 0.999]]
|
1576 |
+
[2022-12-19 08:22:18,645] [INFO] [timer.py:196:stop] epoch=0/micro_step=4620/global_step=4620, RunningAvgSamplesPerSec=17.622457478635866, CurrSamplesPerSec=17.62142545329462, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1577 |
+
{'loss': 0.0002, 'learning_rate': 8.533333333333334e-07, 'epoch': 39.02}
|
1578 |
+
[2022-12-19 08:25:19,257] [INFO] [logging.py:68:log_dist] [Rank 0] step=4630, skipped=8, lr=[8.422222222222224e-07], mom=[[0.9, 0.999]]
|
1579 |
+
[2022-12-19 08:25:19,259] [INFO] [timer.py:196:stop] epoch=0/micro_step=4630/global_step=4630, RunningAvgSamplesPerSec=17.62274117333296, CurrSamplesPerSec=17.796734173722378, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1580 |
+
[2022-12-19 08:26:22,516] [INFO] [logging.py:68:log_dist] [Rank 0] step=4640, skipped=8, lr=[8.200000000000001e-07], mom=[[0.9, 0.999]]
|
1581 |
+
[2022-12-19 08:26:22,517] [INFO] [timer.py:196:stop] epoch=0/micro_step=4640/global_step=4640, RunningAvgSamplesPerSec=17.624083237549367, CurrSamplesPerSec=23.52461524938626, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1582 |
+
[2022-12-19 08:31:08,975] [INFO] [logging.py:68:log_dist] [Rank 0] step=4650, skipped=8, lr=[7.977777777777779e-07], mom=[[0.9, 0.999]]
|
1583 |
+
[2022-12-19 08:31:08,977] [INFO] [timer.py:196:stop] epoch=0/micro_step=4650/global_step=4650, RunningAvgSamplesPerSec=17.624288170024037, CurrSamplesPerSec=17.88550113865089, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1584 |
+
{'loss': 0.0002, 'learning_rate': 7.977777777777779e-07, 'epoch': 40.0}
|
1585 |
+
[2022-12-19 08:34:15,146] [INFO] [logging.py:68:log_dist] [Rank 0] step=4660, skipped=8, lr=[7.755555555555556e-07], mom=[[0.9, 0.999]]
|
1586 |
+
[2022-12-19 08:34:15,147] [INFO] [timer.py:196:stop] epoch=0/micro_step=4660/global_step=4660, RunningAvgSamplesPerSec=17.62444471234681, CurrSamplesPerSec=17.69476125609619, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1587 |
+
[2022-12-19 08:37:03,411] [INFO] [logging.py:68:log_dist] [Rank 0] step=4670, skipped=8, lr=[7.533333333333335e-07], mom=[[0.9, 0.999]]
|
1588 |
+
[2022-12-19 08:37:03,413] [INFO] [timer.py:196:stop] epoch=0/micro_step=4670/global_step=4670, RunningAvgSamplesPerSec=17.62448466929025, CurrSamplesPerSec=17.483528705282534, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 7.422222222222223e-07, 'epoch': 40.01}
+[2022-12-19 08:39:55,845] [INFO] [logging.py:68:log_dist] [Rank 0] step=4680, skipped=8, lr=[7.311111111111112e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:39:55,846] [INFO] [timer.py:196:stop] epoch=0/micro_step=4680/global_step=4680, RunningAvgSamplesPerSec=17.620159783831323, CurrSamplesPerSec=17.60117311667079, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 08:42:42,018] [INFO] [logging.py:68:log_dist] [Rank 0] step=4690, skipped=8, lr=[7.08888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:42:42,019] [INFO] [timer.py:196:stop] epoch=0/micro_step=4690/global_step=4690, RunningAvgSamplesPerSec=17.620279729725365, CurrSamplesPerSec=17.93593334380126, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 08:45:27,450] [INFO] [logging.py:68:log_dist] [Rank 0] step=4700, skipped=8, lr=[6.866666666666667e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:45:27,452] [INFO] [timer.py:196:stop] epoch=0/micro_step=4700/global_step=4700, RunningAvgSamplesPerSec=17.620331377429462, CurrSamplesPerSec=17.754938283281913, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 6.866666666666667e-07, 'epoch': 40.01}
+[2022-12-19 08:48:13,020] [INFO] [logging.py:68:log_dist] [Rank 0] step=4710, skipped=8, lr=[6.644444444444446e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:48:13,022] [INFO] [timer.py:196:stop] epoch=0/micro_step=4710/global_step=4710, RunningAvgSamplesPerSec=17.620347520092913, CurrSamplesPerSec=17.775257844597622, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 08:50:59,205] [INFO] [logging.py:68:log_dist] [Rank 0] step=4720, skipped=8, lr=[6.422222222222223e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:50:59,207] [INFO] [timer.py:196:stop] epoch=0/micro_step=4720/global_step=4720, RunningAvgSamplesPerSec=17.62027305945099, CurrSamplesPerSec=17.454496273463086, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 6.311111111111112e-07, 'epoch': 40.02}
+[2022-12-19 08:53:45,256] [INFO] [logging.py:68:log_dist] [Rank 0] step=4730, skipped=8, lr=[6.200000000000001e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:53:45,258] [INFO] [timer.py:196:stop] epoch=0/micro_step=4730/global_step=4730, RunningAvgSamplesPerSec=17.620074181001, CurrSamplesPerSec=17.436431911964664, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 08:56:31,906] [INFO] [logging.py:68:log_dist] [Rank 0] step=4740, skipped=8, lr=[5.977777777777778e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:56:31,908] [INFO] [timer.py:196:stop] epoch=0/micro_step=4740/global_step=4740, RunningAvgSamplesPerSec=17.620138380065285, CurrSamplesPerSec=17.718109635002346, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 08:58:49,423] [INFO] [logging.py:68:log_dist] [Rank 0] step=4750, skipped=8, lr=[5.755555555555555e-07], mom=[[0.9, 0.999]]
+[2022-12-19 08:58:49,424] [INFO] [timer.py:196:stop] epoch=0/micro_step=4750/global_step=4750, RunningAvgSamplesPerSec=17.620258476386315, CurrSamplesPerSec=17.843816764218595, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 5.755555555555555e-07, 'epoch': 40.02}
+[2022-12-19 09:02:02,128] [INFO] [logging.py:68:log_dist] [Rank 0] step=4760, skipped=8, lr=[5.533333333333334e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:02:02,129] [INFO] [timer.py:196:stop] epoch=0/micro_step=4760/global_step=4760, RunningAvgSamplesPerSec=17.620845234938383, CurrSamplesPerSec=17.56852084543164, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:04:49,552] [INFO] [logging.py:68:log_dist] [Rank 0] step=4770, skipped=8, lr=[5.311111111111111e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:04:49,553] [INFO] [timer.py:196:stop] epoch=0/micro_step=4770/global_step=4770, RunningAvgSamplesPerSec=17.620781538403822, CurrSamplesPerSec=17.993237206011308, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 5.2e-07, 'epoch': 41.0}
+[2022-12-19 09:07:39,410] [INFO] [logging.py:68:log_dist] [Rank 0] step=4780, skipped=8, lr=[5.088888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:07:39,411] [INFO] [timer.py:196:stop] epoch=0/micro_step=4780/global_step=4780, RunningAvgSamplesPerSec=17.620741159717415, CurrSamplesPerSec=17.62931691282607, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:10:31,165] [INFO] [logging.py:68:log_dist] [Rank 0] step=4790, skipped=8, lr=[4.866666666666666e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:10:31,166] [INFO] [timer.py:196:stop] epoch=0/micro_step=4790/global_step=4790, RunningAvgSamplesPerSec=17.621122717656146, CurrSamplesPerSec=17.9595660728393, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:13:22,998] [INFO] [logging.py:68:log_dist] [Rank 0] step=4800, skipped=8, lr=[4.6444444444444446e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:13:23,000] [INFO] [timer.py:196:stop] epoch=0/micro_step=4800/global_step=4800, RunningAvgSamplesPerSec=17.62146184710049, CurrSamplesPerSec=17.68574021221271, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 4.6444444444444446e-07, 'epoch': 41.01}
+[2022-12-19 09:16:20,425] [INFO] [logging.py:68:log_dist] [Rank 0] step=4810, skipped=8, lr=[4.422222222222223e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:16:20,426] [INFO] [timer.py:196:stop] epoch=0/micro_step=4810/global_step=4810, RunningAvgSamplesPerSec=17.62207009255359, CurrSamplesPerSec=17.919497743569675, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:19:17,092] [INFO] [logging.py:68:log_dist] [Rank 0] step=4820, skipped=8, lr=[4.2000000000000006e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:19:17,093] [INFO] [timer.py:196:stop] epoch=0/micro_step=4820/global_step=4820, RunningAvgSamplesPerSec=17.62281812490437, CurrSamplesPerSec=18.063956802967372, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 4.0888888888888897e-07, 'epoch': 41.01}
+[2022-12-19 09:22:11,760] [INFO] [logging.py:68:log_dist] [Rank 0] step=4830, skipped=8, lr=[3.9777777777777783e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:22:11,762] [INFO] [timer.py:196:stop] epoch=0/micro_step=4830/global_step=4830, RunningAvgSamplesPerSec=17.623217566897882, CurrSamplesPerSec=17.682097330909283, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:25:06,256] [INFO] [logging.py:68:log_dist] [Rank 0] step=4840, skipped=8, lr=[3.755555555555556e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:25:06,258] [INFO] [timer.py:196:stop] epoch=0/micro_step=4840/global_step=4840, RunningAvgSamplesPerSec=17.6239823084841, CurrSamplesPerSec=17.937164202870864, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:28:01,375] [INFO] [logging.py:68:log_dist] [Rank 0] step=4850, skipped=8, lr=[3.533333333333334e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:28:01,376] [INFO] [timer.py:196:stop] epoch=0/micro_step=4850/global_step=4850, RunningAvgSamplesPerSec=17.624502174178428, CurrSamplesPerSec=17.980924250155837, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 3.533333333333334e-07, 'epoch': 41.02}
+[2022-12-19 09:30:55,163] [INFO] [logging.py:68:log_dist] [Rank 0] step=4860, skipped=8, lr=[3.3111111111111115e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:30:55,164] [INFO] [timer.py:196:stop] epoch=0/micro_step=4860/global_step=4860, RunningAvgSamplesPerSec=17.62510883090857, CurrSamplesPerSec=17.68973547651362, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:32:25,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=4870, skipped=8, lr=[3.088888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:32:25,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=4870/global_step=4870, RunningAvgSamplesPerSec=17.6258822208131, CurrSamplesPerSec=18.0027099754836, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.977777777777778e-07, 'epoch': 42.0}
+[2022-12-19 09:36:36,623] [INFO] [logging.py:68:log_dist] [Rank 0] step=4880, skipped=8, lr=[2.866666666666667e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:36:36,625] [INFO] [timer.py:196:stop] epoch=0/micro_step=4880/global_step=4880, RunningAvgSamplesPerSec=17.627590163724953, CurrSamplesPerSec=17.970300168807466, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:39:29,570] [INFO] [logging.py:68:log_dist] [Rank 0] step=4890, skipped=8, lr=[2.6444444444444447e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:39:29,571] [INFO] [timer.py:196:stop] epoch=0/micro_step=4890/global_step=4890, RunningAvgSamplesPerSec=17.628267118175295, CurrSamplesPerSec=17.953751119369517, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:42:20,232] [INFO] [logging.py:68:log_dist] [Rank 0] step=4900, skipped=8, lr=[2.4222222222222224e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:42:20,233] [INFO] [timer.py:196:stop] epoch=0/micro_step=4900/global_step=4900, RunningAvgSamplesPerSec=17.628906943630938, CurrSamplesPerSec=17.925920168070032, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.4222222222222224e-07, 'epoch': 42.01}
+[2022-12-19 09:45:10,151] [INFO] [logging.py:68:log_dist] [Rank 0] step=4910, skipped=8, lr=[2.2e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:45:10,152] [INFO] [timer.py:196:stop] epoch=0/micro_step=4910/global_step=4910, RunningAvgSamplesPerSec=17.629482874220002, CurrSamplesPerSec=17.81305045026094, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:47:55,909] [INFO] [logging.py:68:log_dist] [Rank 0] step=4920, skipped=8, lr=[1.9777777777777778e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:47:55,910] [INFO] [timer.py:196:stop] epoch=0/micro_step=4920/global_step=4920, RunningAvgSamplesPerSec=17.629964973280497, CurrSamplesPerSec=17.82848486728267, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 1.866666666666667e-07, 'epoch': 42.01}
+[2022-12-19 09:50:45,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4930, skipped=8, lr=[1.7555555555555558e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:50:45,108] [INFO] [timer.py:196:stop] epoch=0/micro_step=4930/global_step=4930, RunningAvgSamplesPerSec=17.630395440620926, CurrSamplesPerSec=17.707026340076307, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:53:36,465] [INFO] [logging.py:68:log_dist] [Rank 0] step=4940, skipped=8, lr=[1.5333333333333333e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:53:36,466] [INFO] [timer.py:196:stop] epoch=0/micro_step=4940/global_step=4940, RunningAvgSamplesPerSec=17.631168427205882, CurrSamplesPerSec=18.003925866621273, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 09:56:31,916] [INFO] [logging.py:68:log_dist] [Rank 0] step=4950, skipped=8, lr=[1.3111111111111113e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:56:31,917] [INFO] [timer.py:196:stop] epoch=0/micro_step=4950/global_step=4950, RunningAvgSamplesPerSec=17.63176081163297, CurrSamplesPerSec=18.076746529204566, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 1.3111111111111113e-07, 'epoch': 42.02}
+[2022-12-19 09:59:24,350] [INFO] [logging.py:68:log_dist] [Rank 0] step=4960, skipped=8, lr=[1.088888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-19 09:59:24,352] [INFO] [timer.py:196:stop] epoch=0/micro_step=4960/global_step=4960, RunningAvgSamplesPerSec=17.632263561334792, CurrSamplesPerSec=17.887386586858412, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 10:02:21,261] [INFO] [logging.py:68:log_dist] [Rank 0] step=4970, skipped=8, lr=[8.666666666666668e-08], mom=[[0.9, 0.999]]
+[2022-12-19 10:02:21,262] [INFO] [timer.py:196:stop] epoch=0/micro_step=4970/global_step=4970, RunningAvgSamplesPerSec=17.632836729122953, CurrSamplesPerSec=18.04230886853236, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 7.555555555555556e-08, 'epoch': 42.02}
+[2022-12-19 10:05:14,832] [INFO] [logging.py:68:log_dist] [Rank 0] step=4980, skipped=8, lr=[6.444444444444445e-08], mom=[[0.9, 0.999]]
+[2022-12-19 10:05:14,834] [INFO] [timer.py:196:stop] epoch=0/micro_step=4980/global_step=4980, RunningAvgSamplesPerSec=17.633470780183274, CurrSamplesPerSec=17.935921359635877, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 10:08:03,799] [INFO] [logging.py:68:log_dist] [Rank 0] step=4990, skipped=8, lr=[4.222222222222222e-08], mom=[[0.9, 0.999]]
+[2022-12-19 10:08:03,800] [INFO] [timer.py:196:stop] epoch=0/micro_step=4990/global_step=4990, RunningAvgSamplesPerSec=17.635077496959436, CurrSamplesPerSec=18.20354202004553, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 10:10:57,893] [INFO] [logging.py:68:log_dist] [Rank 0] step=5000, skipped=8, lr=[2e-08], mom=[[0.9, 0.999]]
+[2022-12-19 10:10:57,895] [INFO] [timer.py:196:stop] epoch=0/micro_step=5000/global_step=5000, RunningAvgSamplesPerSec=17.635593279087654, CurrSamplesPerSec=18.094030560717506, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2e-08, 'epoch': 43.0}
+{'eval_loss': 0.3427734375, 'eval_wer': 17.804826268487723, 'eval_runtime': 1211.7822, 'eval_samples_per_second': 3.185, 'eval_steps_per_second': 0.1, 'epoch': 43.0}
+[2022-12-19 10:31:10,745] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save!
+[2022-12-19 10:31:10,753] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt
+[2022-12-19 10:31:10,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt...
+[2022-12-19 10:31:11,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt.
+[2022-12-19 10:31:11,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+[2022-12-19 10:31:15,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+[2022-12-19 10:31:15,939] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+[2022-12-19 10:31:15,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
runs/Dec18_08-41-04_fe2747a042f0/events.out.tfevents.1671381730.fe2747a042f0.46148.0
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:8902c3935bdd48437262f5153bb5bdab3bf7777a2e2193fb19c4db3fc98b8f31
+size 37251
|