mikr committed on
Commit
9101f9e
1 Parent(s): 76d5e87

Training in progress, step 4000

pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ea12963d9010f93d1d623d36d7d6714c0380c88c623ef986bc7350778e41f714
+ oid sha256:51a199ee5e10cdb49b6781596fab3d076f0c14f34e7f0a0212b16834cbb296c6
  size 483536061
run.log CHANGED
@@ -1173,3 +1173,254 @@ Rank: 0 partition count [1] and sizes[(241734912, False)]
1173
  [2022-12-19 00:04:18,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-3000/global_step3000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
1174
  [2022-12-19 00:04:18,797] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-3000/global_step3000/zero_pp_rank_0_mp_rank_00_optim_states.pt
1175
  [2022-12-19 00:04:18,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now!
1176
+ [2022-12-19 00:06:55,197] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
1177
+ [2022-12-19 00:07:13,410] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
1178
+ [2022-12-19 00:07:53,402] [INFO] [logging.py:68:log_dist] [Rank 0] step=3010, skipped=6, lr=[4.437777777777778e-06], mom=[[0.9, 0.999]]
1179
+ [2022-12-19 00:07:53,403] [INFO] [timer.py:196:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=17.591068861221267, CurrSamplesPerSec=17.764448583346613, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1180
+ [2022-12-19 00:11:06,978] [INFO] [logging.py:68:log_dist] [Rank 0] step=3020, skipped=6, lr=[4.415555555555556e-06], mom=[[0.9, 0.999]]
1181
+ [2022-12-19 00:11:06,980] [INFO] [timer.py:196:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=17.592794094238695, CurrSamplesPerSec=17.57114513059809, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1182
+ {'loss': 0.0003, 'learning_rate': 4.404444444444445e-06, 'epoch': 26.0}
1183
+ [2022-12-19 00:14:04,963] [INFO] [logging.py:68:log_dist] [Rank 0] step=3030, skipped=6, lr=[4.393333333333334e-06], mom=[[0.9, 0.999]]
1184
+ [2022-12-19 00:14:04,964] [INFO] [timer.py:196:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=17.593074428336145, CurrSamplesPerSec=17.549051816080418, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1185
+ [2022-12-19 00:16:56,281] [INFO] [logging.py:68:log_dist] [Rank 0] step=3040, skipped=6, lr=[4.371111111111112e-06], mom=[[0.9, 0.999]]
1186
+ [2022-12-19 00:16:56,282] [INFO] [timer.py:196:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=17.593342880204634, CurrSamplesPerSec=17.68683907958883, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1187
+ [2022-12-19 00:19:47,359] [INFO] [logging.py:68:log_dist] [Rank 0] step=3050, skipped=6, lr=[4.348888888888889e-06], mom=[[0.9, 0.999]]
1188
+ [2022-12-19 00:19:47,360] [INFO] [timer.py:196:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=17.593460947035926, CurrSamplesPerSec=17.70367941895282, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1189
+ {'loss': 0.0003, 'learning_rate': 4.348888888888889e-06, 'epoch': 26.01}
1190
+ [2022-12-19 00:22:42,032] [INFO] [logging.py:68:log_dist] [Rank 0] step=3060, skipped=6, lr=[4.326666666666667e-06], mom=[[0.9, 0.999]]
1191
+ [2022-12-19 00:22:42,034] [INFO] [timer.py:196:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=17.59375044318361, CurrSamplesPerSec=17.384547334188284, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1192
+ [2022-12-19 00:25:34,669] [INFO] [logging.py:68:log_dist] [Rank 0] step=3070, skipped=6, lr=[4.304444444444445e-06], mom=[[0.9, 0.999]]
1193
+ [2022-12-19 00:25:34,671] [INFO] [timer.py:196:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=17.5937122892984, CurrSamplesPerSec=17.613768735779285, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1194
+ {'loss': 0.0003, 'learning_rate': 4.2933333333333334e-06, 'epoch': 26.01}
1195
+ [2022-12-19 00:28:25,299] [INFO] [logging.py:68:log_dist] [Rank 0] step=3080, skipped=6, lr=[4.282222222222222e-06], mom=[[0.9, 0.999]]
1196
+ [2022-12-19 00:28:25,300] [INFO] [timer.py:196:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=17.593914468788185, CurrSamplesPerSec=17.718895564207596, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1197
+ [2022-12-19 00:31:21,700] [INFO] [logging.py:68:log_dist] [Rank 0] step=3090, skipped=6, lr=[4.26e-06], mom=[[0.9, 0.999]]
1198
+ [2022-12-19 00:31:21,701] [INFO] [timer.py:196:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=17.593445388805577, CurrSamplesPerSec=17.325692189061847, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1199
+ [2022-12-19 00:34:16,335] [INFO] [logging.py:68:log_dist] [Rank 0] step=3100, skipped=6, lr=[4.2377777777777775e-06], mom=[[0.9, 0.999]]
1200
+ [2022-12-19 00:34:16,336] [INFO] [timer.py:196:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=17.59348166098487, CurrSamplesPerSec=17.3108812570425, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1201
+ {'loss': 0.0003, 'learning_rate': 4.2377777777777775e-06, 'epoch': 26.02}
1202
+ [2022-12-19 00:37:12,056] [INFO] [logging.py:68:log_dist] [Rank 0] step=3110, skipped=6, lr=[4.215555555555556e-06], mom=[[0.9, 0.999]]
1203
+ [2022-12-19 00:37:12,058] [INFO] [timer.py:196:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=17.593623928588435, CurrSamplesPerSec=17.863522480875613, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1204
+ [2022-12-19 00:40:07,743] [INFO] [logging.py:68:log_dist] [Rank 0] step=3120, skipped=6, lr=[4.1933333333333336e-06], mom=[[0.9, 0.999]]
1205
+ [2022-12-19 00:40:07,744] [INFO] [timer.py:196:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=17.593697302631174, CurrSamplesPerSec=17.08051252585759, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1206
+ {'loss': 0.0003, 'learning_rate': 4.182222222222222e-06, 'epoch': 26.02}
1207
+ [2022-12-19 00:41:38,634] [INFO] [logging.py:68:log_dist] [Rank 0] step=3130, skipped=6, lr=[4.171111111111111e-06], mom=[[0.9, 0.999]]
1208
+ [2022-12-19 00:41:38,636] [INFO] [timer.py:196:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=17.593999749820743, CurrSamplesPerSec=17.741973511374713, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1209
+ [2022-12-19 00:45:51,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=3140, skipped=6, lr=[4.148888888888889e-06], mom=[[0.9, 0.999]]
1210
+ [2022-12-19 00:45:51,865] [INFO] [timer.py:196:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=17.595327756966622, CurrSamplesPerSec=17.445340938254706, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1211
+ [2022-12-19 00:48:49,392] [INFO] [logging.py:68:log_dist] [Rank 0] step=3150, skipped=6, lr=[4.126666666666667e-06], mom=[[0.9, 0.999]]
1212
+ [2022-12-19 00:48:49,394] [INFO] [timer.py:196:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=17.595551909670867, CurrSamplesPerSec=17.525080176118156, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1213
+ {'loss': 0.0003, 'learning_rate': 4.126666666666667e-06, 'epoch': 27.0}
1214
+ [2022-12-19 00:51:44,912] [INFO] [logging.py:68:log_dist] [Rank 0] step=3160, skipped=6, lr=[4.104444444444445e-06], mom=[[0.9, 0.999]]
1215
+ [2022-12-19 00:51:44,914] [INFO] [timer.py:196:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=17.59533705410767, CurrSamplesPerSec=17.492849569178176, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1216
+ [2022-12-19 00:54:39,253] [INFO] [logging.py:68:log_dist] [Rank 0] step=3170, skipped=6, lr=[4.0822222222222225e-06], mom=[[0.9, 0.999]]
1217
+ [2022-12-19 00:54:39,254] [INFO] [timer.py:196:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=17.595644102704394, CurrSamplesPerSec=17.775633328909876, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1218
+ {'loss': 0.0003, 'learning_rate': 4.071111111111111e-06, 'epoch': 27.01}
1219
+ [2022-12-19 00:57:32,029] [INFO] [logging.py:68:log_dist] [Rank 0] step=3180, skipped=6, lr=[4.060000000000001e-06], mom=[[0.9, 0.999]]
1220
+ [2022-12-19 00:57:32,031] [INFO] [timer.py:196:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=17.59573223021548, CurrSamplesPerSec=17.551840136671807, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1221
+ [2022-12-19 01:00:26,835] [INFO] [logging.py:68:log_dist] [Rank 0] step=3190, skipped=6, lr=[4.0377777777777786e-06], mom=[[0.9, 0.999]]
1222
+ [2022-12-19 01:00:26,836] [INFO] [timer.py:196:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=17.59571717237336, CurrSamplesPerSec=17.490995086617858, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1223
+ [2022-12-19 01:03:19,686] [INFO] [logging.py:68:log_dist] [Rank 0] step=3200, skipped=6, lr=[4.015555555555556e-06], mom=[[0.9, 0.999]]
1224
+ [2022-12-19 01:03:19,688] [INFO] [timer.py:196:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=17.59580399188835, CurrSamplesPerSec=17.34269068867683, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1225
+ {'loss': 0.0003, 'learning_rate': 4.015555555555556e-06, 'epoch': 27.01}
1226
+ [2022-12-19 01:06:14,654] [INFO] [logging.py:68:log_dist] [Rank 0] step=3210, skipped=6, lr=[3.993333333333334e-06], mom=[[0.9, 0.999]]
1227
+ [2022-12-19 01:06:14,655] [INFO] [timer.py:196:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=17.596046138360002, CurrSamplesPerSec=17.725228814430604, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1228
+ [2022-12-19 01:09:07,511] [INFO] [logging.py:68:log_dist] [Rank 0] step=3220, skipped=6, lr=[3.971111111111111e-06], mom=[[0.9, 0.999]]
1229
+ [2022-12-19 01:09:07,512] [INFO] [timer.py:196:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=17.59647348251674, CurrSamplesPerSec=17.602937912133203, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1230
+ {'loss': 0.0003, 'learning_rate': 3.96e-06, 'epoch': 27.02}
1231
+ [2022-12-19 01:12:01,533] [INFO] [logging.py:68:log_dist] [Rank 0] step=3230, skipped=6, lr=[3.948888888888889e-06], mom=[[0.9, 0.999]]
1232
+ [2022-12-19 01:12:01,535] [INFO] [timer.py:196:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=17.596052986386066, CurrSamplesPerSec=17.566659481255172, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1233
+ [2022-12-19 01:14:55,505] [INFO] [logging.py:68:log_dist] [Rank 0] step=3240, skipped=6, lr=[3.926666666666667e-06], mom=[[0.9, 0.999]]
1234
+ [2022-12-19 01:14:55,507] [INFO] [timer.py:196:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=17.596036037139132, CurrSamplesPerSec=17.498455300383174, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1235
+ [2022-12-19 01:17:42,447] [INFO] [logging.py:68:log_dist] [Rank 0] step=3250, skipped=6, lr=[3.904444444444444e-06], mom=[[0.9, 0.999]]
1236
+ [2022-12-19 01:17:42,449] [INFO] [timer.py:196:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=17.597616914434525, CurrSamplesPerSec=17.628235600063963, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1237
+ {'loss': 0.0003, 'learning_rate': 3.904444444444444e-06, 'epoch': 28.0}
1238
+ [2022-12-19 01:20:40,342] [INFO] [logging.py:68:log_dist] [Rank 0] step=3260, skipped=6, lr=[3.882222222222223e-06], mom=[[0.9, 0.999]]
1239
+ [2022-12-19 01:20:40,344] [INFO] [timer.py:196:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=17.59767515749172, CurrSamplesPerSec=17.64395546265293, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1240
+ [2022-12-19 01:23:41,372] [INFO] [logging.py:68:log_dist] [Rank 0] step=3270, skipped=6, lr=[3.86e-06], mom=[[0.9, 0.999]]
1241
+ [2022-12-19 01:23:41,373] [INFO] [timer.py:196:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=17.597920364006136, CurrSamplesPerSec=17.849487124433125, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1242
+ {'loss': 0.0003, 'learning_rate': 3.848888888888889e-06, 'epoch': 28.01}
1243
+ [2022-12-19 01:26:34,140] [INFO] [logging.py:68:log_dist] [Rank 0] step=3280, skipped=6, lr=[3.837777777777778e-06], mom=[[0.9, 0.999]]
1244
+ [2022-12-19 01:26:34,141] [INFO] [timer.py:196:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=17.597884174195094, CurrSamplesPerSec=17.131544523208934, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1245
+ [2022-12-19 01:29:24,859] [INFO] [logging.py:68:log_dist] [Rank 0] step=3290, skipped=6, lr=[3.8155555555555555e-06], mom=[[0.9, 0.999]]
1246
+ [2022-12-19 01:29:24,861] [INFO] [timer.py:196:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=17.597922188226125, CurrSamplesPerSec=17.589932443660675, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1247
+ [2022-12-19 01:32:24,242] [INFO] [logging.py:68:log_dist] [Rank 0] step=3300, skipped=6, lr=[3.793333333333334e-06], mom=[[0.9, 0.999]]
1248
+ [2022-12-19 01:32:24,244] [INFO] [timer.py:196:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=17.59799470754608, CurrSamplesPerSec=17.711194827166782, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1249
+ {'loss': 0.0003, 'learning_rate': 3.793333333333334e-06, 'epoch': 28.01}
1250
+ [2022-12-19 01:35:16,557] [INFO] [logging.py:68:log_dist] [Rank 0] step=3310, skipped=6, lr=[3.7711111111111116e-06], mom=[[0.9, 0.999]]
1251
+ [2022-12-19 01:35:16,558] [INFO] [timer.py:196:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=17.598131006457837, CurrSamplesPerSec=17.669246821453136, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1252
+ [2022-12-19 01:38:09,745] [INFO] [logging.py:68:log_dist] [Rank 0] step=3320, skipped=6, lr=[3.7488888888888892e-06], mom=[[0.9, 0.999]]
1253
+ [2022-12-19 01:38:09,747] [INFO] [timer.py:196:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=17.59810478665201, CurrSamplesPerSec=17.712917471280523, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1254
+ {'loss': 0.0003, 'learning_rate': 3.737777777777778e-06, 'epoch': 28.02}
1255
+ [2022-12-19 01:40:59,612] [INFO] [logging.py:68:log_dist] [Rank 0] step=3330, skipped=6, lr=[3.726666666666667e-06], mom=[[0.9, 0.999]]
1256
+ [2022-12-19 01:40:59,613] [INFO] [timer.py:196:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=17.5981314985267, CurrSamplesPerSec=17.433335941926863, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1257
+ [2022-12-19 01:43:50,657] [INFO] [logging.py:68:log_dist] [Rank 0] step=3340, skipped=6, lr=[3.704444444444445e-06], mom=[[0.9, 0.999]]
1258
+ [2022-12-19 01:43:50,659] [INFO] [timer.py:196:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=17.59849399704335, CurrSamplesPerSec=17.73495453276351, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1259
+ [2022-12-19 01:46:42,496] [INFO] [logging.py:68:log_dist] [Rank 0] step=3350, skipped=6, lr=[3.6822222222222225e-06], mom=[[0.9, 0.999]]
1260
+ [2022-12-19 01:46:42,497] [INFO] [timer.py:196:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=17.598006416218745, CurrSamplesPerSec=16.950657243750207, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1261
+ {'loss': 0.0003, 'learning_rate': 3.6822222222222225e-06, 'epoch': 28.02}
1262
+ [2022-12-19 01:48:39,809] [INFO] [logging.py:68:log_dist] [Rank 0] step=3360, skipped=6, lr=[3.66e-06], mom=[[0.9, 0.999]]
1263
+ [2022-12-19 01:48:39,811] [INFO] [timer.py:196:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=17.59821946036748, CurrSamplesPerSec=17.671893143144626, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1264
+ [2022-12-19 01:52:23,796] [INFO] [logging.py:68:log_dist] [Rank 0] step=3370, skipped=6, lr=[3.6377777777777777e-06], mom=[[0.9, 0.999]]
1265
+ [2022-12-19 01:52:23,797] [INFO] [timer.py:196:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=17.599458275412836, CurrSamplesPerSec=17.74689643878602, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1266
+ {'loss': 0.0003, 'learning_rate': 3.6266666666666674e-06, 'epoch': 29.0}
1267
+ [2022-12-19 01:55:17,361] [INFO] [logging.py:68:log_dist] [Rank 0] step=3380, skipped=6, lr=[3.615555555555556e-06], mom=[[0.9, 0.999]]
1268
+ [2022-12-19 01:55:17,362] [INFO] [timer.py:196:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=17.599580206175794, CurrSamplesPerSec=17.84562699614934, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1269
+ [2022-12-19 01:58:09,807] [INFO] [logging.py:68:log_dist] [Rank 0] step=3390, skipped=6, lr=[3.593333333333334e-06], mom=[[0.9, 0.999]]
1270
+ [2022-12-19 01:58:09,809] [INFO] [timer.py:196:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=17.599584006196714, CurrSamplesPerSec=17.690764887441887, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1271
+ [2022-12-19 02:01:00,280] [INFO] [logging.py:68:log_dist] [Rank 0] step=3400, skipped=6, lr=[3.5711111111111114e-06], mom=[[0.9, 0.999]]
1272
+ [2022-12-19 02:01:00,282] [INFO] [timer.py:196:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=17.599061452055388, CurrSamplesPerSec=17.68211014301976, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1273
+ {'loss': 0.0003, 'learning_rate': 3.5711111111111114e-06, 'epoch': 29.01}
1274
+ [2022-12-19 02:03:54,109] [INFO] [logging.py:68:log_dist] [Rank 0] step=3410, skipped=6, lr=[3.548888888888889e-06], mom=[[0.9, 0.999]]
1275
+ [2022-12-19 02:03:54,110] [INFO] [timer.py:196:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=17.599020297195587, CurrSamplesPerSec=17.79227531486809, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1276
+ [2022-12-19 02:06:47,910] [INFO] [logging.py:68:log_dist] [Rank 0] step=3420, skipped=6, lr=[3.526666666666667e-06], mom=[[0.9, 0.999]]
1277
+ [2022-12-19 02:06:47,911] [INFO] [timer.py:196:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=17.598935188193543, CurrSamplesPerSec=17.586873905797386, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1278
+ {'loss': 0.0003, 'learning_rate': 3.515555555555556e-06, 'epoch': 29.01}
1279
+ [2022-12-19 02:09:44,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=3430, skipped=6, lr=[3.5044444444444447e-06], mom=[[0.9, 0.999]]
1280
+ [2022-12-19 02:09:44,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=17.598879764126053, CurrSamplesPerSec=17.206409451147653, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1281
+ [2022-12-19 02:12:43,002] [INFO] [logging.py:68:log_dist] [Rank 0] step=3440, skipped=6, lr=[3.4822222222222223e-06], mom=[[0.9, 0.999]]
1282
+ [2022-12-19 02:12:43,004] [INFO] [timer.py:196:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=17.59877386122984, CurrSamplesPerSec=17.647455022848266, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1283
+ [2022-12-19 02:15:49,935] [INFO] [logging.py:68:log_dist] [Rank 0] step=3450, skipped=6, lr=[3.46e-06], mom=[[0.9, 0.999]]
1284
+ [2022-12-19 02:15:49,936] [INFO] [timer.py:196:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=17.59846592973248, CurrSamplesPerSec=17.705098144125525, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1285
+ {'loss': 0.0002, 'learning_rate': 3.46e-06, 'epoch': 29.02}
1286
+ [2022-12-19 02:19:02,277] [INFO] [logging.py:68:log_dist] [Rank 0] step=3460, skipped=6, lr=[3.4377777777777784e-06], mom=[[0.9, 0.999]]
1287
+ [2022-12-19 02:19:02,279] [INFO] [timer.py:196:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=17.59887474281011, CurrSamplesPerSec=17.339971779112254, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1288
+ [2022-12-19 02:21:59,258] [INFO] [logging.py:68:log_dist] [Rank 0] step=3470, skipped=6, lr=[3.415555555555556e-06], mom=[[0.9, 0.999]]
1289
+ [2022-12-19 02:21:59,259] [INFO] [timer.py:196:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=17.598957745973937, CurrSamplesPerSec=17.47932554863768, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1290
+ {'loss': 0.0002, 'learning_rate': 3.404444444444445e-06, 'epoch': 29.02}
1291
+ [2022-12-19 02:23:02,302] [INFO] [logging.py:68:log_dist] [Rank 0] step=3480, skipped=6, lr=[3.3933333333333336e-06], mom=[[0.9, 0.999]]
1292
+ [2022-12-19 02:23:02,304] [INFO] [timer.py:196:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=17.600449447233288, CurrSamplesPerSec=23.367460582411866, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1293
+ [2022-12-19 02:27:40,487] [INFO] [logging.py:68:log_dist] [Rank 0] step=3490, skipped=6, lr=[3.371111111111111e-06], mom=[[0.9, 0.999]]
1294
+ [2022-12-19 02:27:40,488] [INFO] [timer.py:196:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=17.600039649295883, CurrSamplesPerSec=17.403154157900058, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1295
+ [2022-12-19 02:30:34,939] [INFO] [logging.py:68:log_dist] [Rank 0] step=3500, skipped=6, lr=[3.3488888888888892e-06], mom=[[0.9, 0.999]]
1296
+ [2022-12-19 02:30:34,941] [INFO] [timer.py:196:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=17.600239248768926, CurrSamplesPerSec=17.787628920083662, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1297
+ {'loss': 0.0003, 'learning_rate': 3.3488888888888892e-06, 'epoch': 30.0}
1298
+ [2022-12-19 02:33:30,948] [INFO] [logging.py:68:log_dist] [Rank 0] step=3510, skipped=6, lr=[3.326666666666667e-06], mom=[[0.9, 0.999]]
1299
+ [2022-12-19 02:33:30,950] [INFO] [timer.py:196:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=17.60011342639632, CurrSamplesPerSec=17.535298747587174, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1300
+ [2022-12-19 02:36:27,492] [INFO] [logging.py:68:log_dist] [Rank 0] step=3520, skipped=6, lr=[3.3044444444444445e-06], mom=[[0.9, 0.999]]
1301
+ [2022-12-19 02:36:27,493] [INFO] [timer.py:196:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=17.60047227468804, CurrSamplesPerSec=17.78531311367044, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1302
+ {'loss': 0.0002, 'learning_rate': 3.2933333333333333e-06, 'epoch': 30.01}
1303
+ [2022-12-19 02:39:27,522] [INFO] [logging.py:68:log_dist] [Rank 0] step=3530, skipped=6, lr=[3.282222222222223e-06], mom=[[0.9, 0.999]]
1304
+ [2022-12-19 02:39:27,523] [INFO] [timer.py:196:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=17.600453018305693, CurrSamplesPerSec=17.688962622702615, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1305
+ [2022-12-19 02:42:17,713] [INFO] [logging.py:68:log_dist] [Rank 0] step=3540, skipped=6, lr=[3.2600000000000006e-06], mom=[[0.9, 0.999]]
1306
+ [2022-12-19 02:42:17,714] [INFO] [timer.py:196:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=17.600259041781936, CurrSamplesPerSec=17.62578056332801, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1307
+ [2022-12-19 02:45:09,551] [INFO] [logging.py:68:log_dist] [Rank 0] step=3550, skipped=6, lr=[3.237777777777778e-06], mom=[[0.9, 0.999]]
1308
+ [2022-12-19 02:45:09,553] [INFO] [timer.py:196:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=17.60029625206999, CurrSamplesPerSec=17.671377775279126, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1309
+ {'loss': 0.0002, 'learning_rate': 3.237777777777778e-06, 'epoch': 30.01}
1310
+ [2022-12-19 02:48:06,078] [INFO] [logging.py:68:log_dist] [Rank 0] step=3560, skipped=6, lr=[3.2155555555555558e-06], mom=[[0.9, 0.999]]
1311
+ [2022-12-19 02:48:06,079] [INFO] [timer.py:196:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=17.600329628889238, CurrSamplesPerSec=17.71231088469093, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1312
+ [2022-12-19 02:51:00,756] [INFO] [logging.py:68:log_dist] [Rank 0] step=3570, skipped=6, lr=[3.193333333333334e-06], mom=[[0.9, 0.999]]
1313
+ [2022-12-19 02:51:00,758] [INFO] [timer.py:196:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=17.60022112966504, CurrSamplesPerSec=17.219200593227026, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1314
+ {'loss': 0.0002, 'learning_rate': 3.1822222222222226e-06, 'epoch': 30.02}
1315
+ [2022-12-19 02:53:51,858] [INFO] [logging.py:68:log_dist] [Rank 0] step=3580, skipped=6, lr=[3.1711111111111114e-06], mom=[[0.9, 0.999]]
1316
+ [2022-12-19 02:53:51,859] [INFO] [timer.py:196:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=17.60064739187315, CurrSamplesPerSec=17.62833168565665, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1317
+ [2022-12-19 02:56:13,648] [INFO] [logging.py:68:log_dist] [Rank 0] step=3590, skipped=6, lr=[3.148888888888889e-06], mom=[[0.9, 0.999]]
1318
+ [2022-12-19 02:56:13,649] [INFO] [timer.py:196:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=17.600825925087175, CurrSamplesPerSec=17.855607656356614, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1319
+ [2022-12-19 02:59:27,493] [INFO] [logging.py:68:log_dist] [Rank 0] step=3600, skipped=6, lr=[3.1266666666666667e-06], mom=[[0.9, 0.999]]
1320
+ [2022-12-19 02:59:27,494] [INFO] [timer.py:196:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=17.601838241347792, CurrSamplesPerSec=17.581900003903662, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1321
+ {'loss': 0.0002, 'learning_rate': 3.1266666666666667e-06, 'epoch': 31.0}
1322
+ [2022-12-19 03:02:16,659] [INFO] [logging.py:68:log_dist] [Rank 0] step=3610, skipped=6, lr=[3.104444444444445e-06], mom=[[0.9, 0.999]]
1323
+ [2022-12-19 03:02:16,661] [INFO] [timer.py:196:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=17.601819994166515, CurrSamplesPerSec=17.664054192793216, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1324
+ [2022-12-19 03:05:08,497] [INFO] [logging.py:68:log_dist] [Rank 0] step=3620, skipped=6, lr=[3.0822222222222227e-06], mom=[[0.9, 0.999]]
1325
+ [2022-12-19 03:05:08,498] [INFO] [timer.py:196:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=17.60192426766313, CurrSamplesPerSec=17.875462043633274, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1326
+ {'loss': 0.0002, 'learning_rate': 3.0711111111111115e-06, 'epoch': 31.01}
1327
+ [2022-12-19 03:07:58,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=3630, skipped=6, lr=[3.0600000000000003e-06], mom=[[0.9, 0.999]]
1328
+ [2022-12-19 03:07:58,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=17.60179920974838, CurrSamplesPerSec=17.7938144302562, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1329
+ [2022-12-19 03:10:48,446] [INFO] [logging.py:68:log_dist] [Rank 0] step=3640, skipped=6, lr=[3.037777777777778e-06], mom=[[0.9, 0.999]]
1330
+ [2022-12-19 03:10:48,447] [INFO] [timer.py:196:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=17.6017878320369, CurrSamplesPerSec=17.83189928619381, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1331
+ [2022-12-19 03:13:33,096] [INFO] [logging.py:68:log_dist] [Rank 0] step=3650, skipped=6, lr=[3.015555555555556e-06], mom=[[0.9, 0.999]]
1332
+ [2022-12-19 03:13:33,098] [INFO] [timer.py:196:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=17.601565935375557, CurrSamplesPerSec=17.83511830383586, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1333
+ {'loss': 0.0002, 'learning_rate': 3.015555555555556e-06, 'epoch': 31.01}
1334
+ [2022-12-19 03:16:20,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=3660, skipped=6, lr=[2.9933333333333336e-06], mom=[[0.9, 0.999]]
1335
+ [2022-12-19 03:16:20,756] [INFO] [timer.py:196:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=17.601526408905308, CurrSamplesPerSec=17.55972566487229, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1336
+ [2022-12-19 03:19:08,899] [INFO] [logging.py:68:log_dist] [Rank 0] step=3670, skipped=6, lr=[2.9711111111111112e-06], mom=[[0.9, 0.999]]
1337
+ [2022-12-19 03:19:08,901] [INFO] [timer.py:196:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=17.601866020071387, CurrSamplesPerSec=17.7394257348548, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1338
+ {'loss': 0.0002, 'learning_rate': 2.96e-06, 'epoch': 31.02}
1339
+ [2022-12-19 03:22:05,005] [INFO] [logging.py:68:log_dist] [Rank 0] step=3680, skipped=6, lr=[2.948888888888889e-06], mom=[[0.9, 0.999]]
1340
+ [2022-12-19 03:22:05,006] [INFO] [timer.py:196:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=17.602148084364135, CurrSamplesPerSec=17.71423246925165, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1341
+ [2022-12-19 03:25:01,391] [INFO] [logging.py:68:log_dist] [Rank 0] step=3690, skipped=6, lr=[2.9266666666666673e-06], mom=[[0.9, 0.999]]
1342
+ [2022-12-19 03:25:01,392] [INFO] [timer.py:196:stop] epoch=0/micro_step=3690/global_step=3690, RunningAvgSamplesPerSec=17.602265105171185, CurrSamplesPerSec=17.839620005142503, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1343
+ [2022-12-19 03:27:58,636] [INFO] [logging.py:68:log_dist] [Rank 0] step=3700, skipped=6, lr=[2.904444444444445e-06], mom=[[0.9, 0.999]]
1344
+ [2022-12-19 03:27:58,637] [INFO] [timer.py:196:stop] epoch=0/micro_step=3700/global_step=3700, RunningAvgSamplesPerSec=17.602024317328592, CurrSamplesPerSec=17.13545189933907, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1345
+ {'loss': 0.0002, 'learning_rate': 2.904444444444445e-06, 'epoch': 31.02}
1346
+ [2022-12-19 03:29:28,476] [INFO] [logging.py:68:log_dist] [Rank 0] step=3710, skipped=6, lr=[2.8822222222222225e-06], mom=[[0.9, 0.999]]
1347
+ [2022-12-19 03:29:28,478] [INFO] [timer.py:196:stop] epoch=0/micro_step=3710/global_step=3710, RunningAvgSamplesPerSec=17.60249366241964, CurrSamplesPerSec=17.683989065307227, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1348
+ [2022-12-19 03:33:40,636] [INFO] [logging.py:68:log_dist] [Rank 0] step=3720, skipped=6, lr=[2.86e-06], mom=[[0.9, 0.999]]
1349
+ [2022-12-19 03:33:40,637] [INFO] [timer.py:196:stop] epoch=0/micro_step=3720/global_step=3720, RunningAvgSamplesPerSec=17.603601659594958, CurrSamplesPerSec=17.493026261358214, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1350
+ {'loss': 0.0002, 'learning_rate': 2.8488888888888894e-06, 'epoch': 32.0}
1351
+ [2022-12-19 03:36:44,989] [INFO] [logging.py:68:log_dist] [Rank 0] step=3730, skipped=6, lr=[2.837777777777778e-06], mom=[[0.9, 0.999]]
1352
+ [2022-12-19 03:36:44,990] [INFO] [timer.py:196:stop] epoch=0/micro_step=3730/global_step=3730, RunningAvgSamplesPerSec=17.604015035004593, CurrSamplesPerSec=17.839234699560244, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
1353
+ [2022-12-19 03:39:39,022] [INFO] [logging.py:68:log_dist] [Rank 0] step=3740, skipped=6, lr=[2.815555555555556e-06], mom=[[0.9, 0.999]]
1354
+ [2022-12-19 03:39:39,024] [INFO] [timer.py:196:stop] epoch=0/micro_step=3740/global_step=3740, RunningAvgSamplesPerSec=17.60423435088669, CurrSamplesPerSec=17.872316502218464, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 03:42:36,477] [INFO] [logging.py:68:log_dist] [Rank 0] step=3750, skipped=6, lr=[2.7933333333333334e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 03:42:36,479] [INFO] [timer.py:196:stop] epoch=0/micro_step=3750/global_step=3750, RunningAvgSamplesPerSec=17.60421286134035, CurrSamplesPerSec=17.6653445073596, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.7933333333333334e-06, 'epoch': 32.01}
+ [2022-12-19 03:47:44,486] [INFO] [logging.py:68:log_dist] [Rank 0] step=3760, skipped=6, lr=[2.771111111111111e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 03:47:44,487] [INFO] [timer.py:196:stop] epoch=0/micro_step=3760/global_step=3760, RunningAvgSamplesPerSec=17.60421484650127, CurrSamplesPerSec=17.76200482461907, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 03:50:36,617] [INFO] [logging.py:68:log_dist] [Rank 0] step=3770, skipped=6, lr=[2.748888888888889e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 03:50:36,619] [INFO] [timer.py:196:stop] epoch=0/micro_step=3770/global_step=3770, RunningAvgSamplesPerSec=17.603986155075713, CurrSamplesPerSec=17.835418109567822, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.7377777777777783e-06, 'epoch': 32.01}
+ [2022-12-19 03:53:34,204] [INFO] [logging.py:68:log_dist] [Rank 0] step=3780, skipped=6, lr=[2.726666666666667e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 03:53:34,206] [INFO] [timer.py:196:stop] epoch=0/micro_step=3780/global_step=3780, RunningAvgSamplesPerSec=17.604210162616486, CurrSamplesPerSec=17.369442334662356, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 03:56:31,105] [INFO] [logging.py:68:log_dist] [Rank 0] step=3790, skipped=6, lr=[2.7044444444444447e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 03:56:31,106] [INFO] [timer.py:196:stop] epoch=0/micro_step=3790/global_step=3790, RunningAvgSamplesPerSec=17.60431255398128, CurrSamplesPerSec=17.567898813891517, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 03:59:23,749] [INFO] [logging.py:68:log_dist] [Rank 0] step=3800, skipped=6, lr=[2.6822222222222223e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 03:59:23,750] [INFO] [timer.py:196:stop] epoch=0/micro_step=3800/global_step=3800, RunningAvgSamplesPerSec=17.60431104909582, CurrSamplesPerSec=17.70104341253724, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.6822222222222223e-06, 'epoch': 32.02}
+ [2022-12-19 04:02:19,616] [INFO] [logging.py:68:log_dist] [Rank 0] step=3810, skipped=6, lr=[2.6600000000000004e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:02:19,617] [INFO] [timer.py:196:stop] epoch=0/micro_step=3810/global_step=3810, RunningAvgSamplesPerSec=17.604126410600333, CurrSamplesPerSec=17.600326048215376, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:05:13,803] [INFO] [logging.py:68:log_dist] [Rank 0] step=3820, skipped=6, lr=[2.637777777777778e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:05:13,805] [INFO] [timer.py:196:stop] epoch=0/micro_step=3820/global_step=3820, RunningAvgSamplesPerSec=17.60442226776567, CurrSamplesPerSec=17.521556925430364, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.6266666666666668e-06, 'epoch': 32.02}
+ [2022-12-19 04:08:01,470] [INFO] [logging.py:68:log_dist] [Rank 0] step=3830, skipped=6, lr=[2.6155555555555556e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:08:01,472] [INFO] [timer.py:196:stop] epoch=0/micro_step=3830/global_step=3830, RunningAvgSamplesPerSec=17.605964489961178, CurrSamplesPerSec=17.66188200789337, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:10:57,833] [INFO] [logging.py:68:log_dist] [Rank 0] step=3840, skipped=6, lr=[2.5933333333333336e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:10:57,835] [INFO] [timer.py:196:stop] epoch=0/micro_step=3840/global_step=3840, RunningAvgSamplesPerSec=17.60612382198526, CurrSamplesPerSec=17.728355572744416, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:13:53,027] [INFO] [logging.py:68:log_dist] [Rank 0] step=3850, skipped=6, lr=[2.5711111111111112e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:13:53,028] [INFO] [timer.py:196:stop] epoch=0/micro_step=3850/global_step=3850, RunningAvgSamplesPerSec=17.606310071771578, CurrSamplesPerSec=17.81534865653578, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.5711111111111112e-06, 'epoch': 33.0}
+ [2022-12-19 04:16:48,115] [INFO] [logging.py:68:log_dist] [Rank 0] step=3860, skipped=6, lr=[2.5488888888888893e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:16:48,116] [INFO] [timer.py:196:stop] epoch=0/micro_step=3860/global_step=3860, RunningAvgSamplesPerSec=17.60637394058431, CurrSamplesPerSec=17.27665271608909, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:19:44,284] [INFO] [logging.py:68:log_dist] [Rank 0] step=3870, skipped=6, lr=[2.526666666666667e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:19:44,286] [INFO] [timer.py:196:stop] epoch=0/micro_step=3870/global_step=3870, RunningAvgSamplesPerSec=17.606653162319823, CurrSamplesPerSec=17.38970420612655, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.5155555555555557e-06, 'epoch': 33.01}
+ [2022-12-19 04:22:40,524] [INFO] [logging.py:68:log_dist] [Rank 0] step=3880, skipped=6, lr=[2.504444444444445e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:22:40,525] [INFO] [timer.py:196:stop] epoch=0/micro_step=3880/global_step=3880, RunningAvgSamplesPerSec=17.606719113033467, CurrSamplesPerSec=17.757985076265896, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:25:38,514] [INFO] [logging.py:68:log_dist] [Rank 0] step=3890, skipped=6, lr=[2.4822222222222225e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:25:38,516] [INFO] [timer.py:196:stop] epoch=0/micro_step=3890/global_step=3890, RunningAvgSamplesPerSec=17.60689324457354, CurrSamplesPerSec=17.80211727021623, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:28:36,061] [INFO] [logging.py:68:log_dist] [Rank 0] step=3900, skipped=6, lr=[2.46e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:28:36,063] [INFO] [timer.py:196:stop] epoch=0/micro_step=3900/global_step=3900, RunningAvgSamplesPerSec=17.607112140420526, CurrSamplesPerSec=17.612398120420085, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.46e-06, 'epoch': 33.01}
+ [2022-12-19 04:31:31,396] [INFO] [logging.py:68:log_dist] [Rank 0] step=3910, skipped=6, lr=[2.437777777777778e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:31:31,398] [INFO] [timer.py:196:stop] epoch=0/micro_step=3910/global_step=3910, RunningAvgSamplesPerSec=17.607239439571316, CurrSamplesPerSec=17.700039648390714, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:34:29,911] [INFO] [logging.py:68:log_dist] [Rank 0] step=3920, skipped=6, lr=[2.415555555555556e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:34:29,912] [INFO] [timer.py:196:stop] epoch=0/micro_step=3920/global_step=3920, RunningAvgSamplesPerSec=17.60749252768418, CurrSamplesPerSec=17.79277653010592, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.4044444444444446e-06, 'epoch': 33.02}
+ [2022-12-19 04:37:31,484] [INFO] [logging.py:68:log_dist] [Rank 0] step=3930, skipped=6, lr=[2.3933333333333334e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:37:31,485] [INFO] [timer.py:196:stop] epoch=0/micro_step=3930/global_step=3930, RunningAvgSamplesPerSec=17.607728287544596, CurrSamplesPerSec=17.50504171864284, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:39:30,333] [INFO] [logging.py:68:log_dist] [Rank 0] step=3940, skipped=6, lr=[2.371111111111111e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:39:30,335] [INFO] [timer.py:196:stop] epoch=0/micro_step=3940/global_step=3940, RunningAvgSamplesPerSec=17.60803244226165, CurrSamplesPerSec=17.897954595339627, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:43:16,008] [INFO] [logging.py:68:log_dist] [Rank 0] step=3950, skipped=6, lr=[2.348888888888889e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:43:16,010] [INFO] [timer.py:196:stop] epoch=0/micro_step=3950/global_step=3950, RunningAvgSamplesPerSec=17.609317404387298, CurrSamplesPerSec=17.338718477525717, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.348888888888889e-06, 'epoch': 34.0}
+ [2022-12-19 04:46:17,542] [INFO] [logging.py:68:log_dist] [Rank 0] step=3960, skipped=6, lr=[2.3266666666666667e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:46:17,543] [INFO] [timer.py:196:stop] epoch=0/micro_step=3960/global_step=3960, RunningAvgSamplesPerSec=17.609609085156343, CurrSamplesPerSec=17.51949968202784, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:49:16,425] [INFO] [logging.py:68:log_dist] [Rank 0] step=3970, skipped=6, lr=[2.3044444444444447e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:49:16,426] [INFO] [timer.py:196:stop] epoch=0/micro_step=3970/global_step=3970, RunningAvgSamplesPerSec=17.609706216869625, CurrSamplesPerSec=17.619688175337636, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.2933333333333335e-06, 'epoch': 34.01}
+ [2022-12-19 04:52:13,918] [INFO] [logging.py:68:log_dist] [Rank 0] step=3980, skipped=6, lr=[2.2822222222222223e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:52:13,919] [INFO] [timer.py:196:stop] epoch=0/micro_step=3980/global_step=3980, RunningAvgSamplesPerSec=17.60986454647832, CurrSamplesPerSec=17.457695134212692, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:55:16,822] [INFO] [logging.py:68:log_dist] [Rank 0] step=3990, skipped=6, lr=[2.2600000000000004e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:55:16,825] [INFO] [timer.py:196:stop] epoch=0/micro_step=3990/global_step=3990, RunningAvgSamplesPerSec=17.610149442844655, CurrSamplesPerSec=17.861005041208294, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-19 04:58:14,067] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=6, lr=[2.237777777777778e-06], mom=[[0.9, 0.999]]
+ [2022-12-19 04:58:14,068] [INFO] [timer.py:196:stop] epoch=0/micro_step=4000/global_step=4000, RunningAvgSamplesPerSec=17.61035876229219, CurrSamplesPerSec=17.883168119082438, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0002, 'learning_rate': 2.237777777777778e-06, 'epoch': 34.01}
+ {'eval_loss': 0.338134765625, 'eval_wer': 17.7977496284764, 'eval_runtime': 1277.681, 'eval_samples_per_second': 3.02, 'eval_steps_per_second': 0.095, 'epoch': 34.01}
+ [2022-12-19 05:19:32,876] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step4000 is begin to save!
+ [2022-12-19 05:19:32,885] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt
+ [2022-12-19 05:19:32,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt...
+ [2022-12-19 05:19:33,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt.
+ [2022-12-19 05:19:33,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+ [2022-12-19 05:19:38,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+ [2022-12-19 05:19:38,828] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+ [2022-12-19 05:19:38,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
runs/Dec18_08-41-04_fe2747a042f0/events.out.tfevents.1671381730.fe2747a042f0.46148.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a37a4cf089b78acaa4b81c1ba9e905b2d9114aaab16e4c765d044d3015162c0d
- size 24055
+ oid sha256:f809800ca6de0a5be418d3f2830d96d4dc56d4a9ab41574b9b5ebe7730f0eee9
+ size 30653