Training in progress, step 4000
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:51a199ee5e10cdb49b6781596fab3d076f0c14f34e7f0a0212b16834cbb296c6
 size 483536061
run.log
CHANGED
@@ -1173,3 +1173,254 @@ Rank: 0 partition count [1] and sizes[(241734912, False)]
|
|
1173 |
[2022-12-19 00:04:18,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-3000/global_step3000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
|
1174 |
[2022-12-19 00:04:18,797] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-3000/global_step3000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
1175 |
[2022-12-19 00:04:18,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now!
|
1176 |
+
[2022-12-19 00:06:55,197] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
|
1177 |
+
[2022-12-19 00:07:13,410] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
|
1178 |
+
[2022-12-19 00:07:53,402] [INFO] [logging.py:68:log_dist] [Rank 0] step=3010, skipped=6, lr=[4.437777777777778e-06], mom=[[0.9, 0.999]]
|
1179 |
+
[2022-12-19 00:07:53,403] [INFO] [timer.py:196:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=17.591068861221267, CurrSamplesPerSec=17.764448583346613, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1180 |
+
[2022-12-19 00:11:06,978] [INFO] [logging.py:68:log_dist] [Rank 0] step=3020, skipped=6, lr=[4.415555555555556e-06], mom=[[0.9, 0.999]]
|
1181 |
+
[2022-12-19 00:11:06,980] [INFO] [timer.py:196:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=17.592794094238695, CurrSamplesPerSec=17.57114513059809, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1182 |
+
{'loss': 0.0003, 'learning_rate': 4.404444444444445e-06, 'epoch': 26.0}
|
1183 |
+
[2022-12-19 00:14:04,963] [INFO] [logging.py:68:log_dist] [Rank 0] step=3030, skipped=6, lr=[4.393333333333334e-06], mom=[[0.9, 0.999]]
|
1184 |
+
[2022-12-19 00:14:04,964] [INFO] [timer.py:196:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=17.593074428336145, CurrSamplesPerSec=17.549051816080418, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1185 |
+
[2022-12-19 00:16:56,281] [INFO] [logging.py:68:log_dist] [Rank 0] step=3040, skipped=6, lr=[4.371111111111112e-06], mom=[[0.9, 0.999]]
|
1186 |
+
[2022-12-19 00:16:56,282] [INFO] [timer.py:196:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=17.593342880204634, CurrSamplesPerSec=17.68683907958883, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1187 |
+
[2022-12-19 00:19:47,359] [INFO] [logging.py:68:log_dist] [Rank 0] step=3050, skipped=6, lr=[4.348888888888889e-06], mom=[[0.9, 0.999]]
|
1188 |
+
[2022-12-19 00:19:47,360] [INFO] [timer.py:196:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=17.593460947035926, CurrSamplesPerSec=17.70367941895282, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1189 |
+
{'loss': 0.0003, 'learning_rate': 4.348888888888889e-06, 'epoch': 26.01}
|
1190 |
+
[2022-12-19 00:22:42,032] [INFO] [logging.py:68:log_dist] [Rank 0] step=3060, skipped=6, lr=[4.326666666666667e-06], mom=[[0.9, 0.999]]
|
1191 |
+
[2022-12-19 00:22:42,034] [INFO] [timer.py:196:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=17.59375044318361, CurrSamplesPerSec=17.384547334188284, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1192 |
+
[2022-12-19 00:25:34,669] [INFO] [logging.py:68:log_dist] [Rank 0] step=3070, skipped=6, lr=[4.304444444444445e-06], mom=[[0.9, 0.999]]
|
1193 |
+
[2022-12-19 00:25:34,671] [INFO] [timer.py:196:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=17.5937122892984, CurrSamplesPerSec=17.613768735779285, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1194 |
+
{'loss': 0.0003, 'learning_rate': 4.2933333333333334e-06, 'epoch': 26.01}
|
1195 |
+
[2022-12-19 00:28:25,299] [INFO] [logging.py:68:log_dist] [Rank 0] step=3080, skipped=6, lr=[4.282222222222222e-06], mom=[[0.9, 0.999]]
|
1196 |
+
[2022-12-19 00:28:25,300] [INFO] [timer.py:196:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=17.593914468788185, CurrSamplesPerSec=17.718895564207596, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1197 |
+
[2022-12-19 00:31:21,700] [INFO] [logging.py:68:log_dist] [Rank 0] step=3090, skipped=6, lr=[4.26e-06], mom=[[0.9, 0.999]]
|
1198 |
+
[2022-12-19 00:31:21,701] [INFO] [timer.py:196:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=17.593445388805577, CurrSamplesPerSec=17.325692189061847, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1199 |
+
[2022-12-19 00:34:16,335] [INFO] [logging.py:68:log_dist] [Rank 0] step=3100, skipped=6, lr=[4.2377777777777775e-06], mom=[[0.9, 0.999]]
|
1200 |
+
[2022-12-19 00:34:16,336] [INFO] [timer.py:196:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=17.59348166098487, CurrSamplesPerSec=17.3108812570425, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1201 |
+
{'loss': 0.0003, 'learning_rate': 4.2377777777777775e-06, 'epoch': 26.02}
|
1202 |
+
[2022-12-19 00:37:12,056] [INFO] [logging.py:68:log_dist] [Rank 0] step=3110, skipped=6, lr=[4.215555555555556e-06], mom=[[0.9, 0.999]]
|
1203 |
+
[2022-12-19 00:37:12,058] [INFO] [timer.py:196:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=17.593623928588435, CurrSamplesPerSec=17.863522480875613, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1204 |
+
[2022-12-19 00:40:07,743] [INFO] [logging.py:68:log_dist] [Rank 0] step=3120, skipped=6, lr=[4.1933333333333336e-06], mom=[[0.9, 0.999]]
|
1205 |
+
[2022-12-19 00:40:07,744] [INFO] [timer.py:196:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=17.593697302631174, CurrSamplesPerSec=17.08051252585759, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1206 |
+
{'loss': 0.0003, 'learning_rate': 4.182222222222222e-06, 'epoch': 26.02}
|
1207 |
+
[2022-12-19 00:41:38,634] [INFO] [logging.py:68:log_dist] [Rank 0] step=3130, skipped=6, lr=[4.171111111111111e-06], mom=[[0.9, 0.999]]
|
1208 |
+
[2022-12-19 00:41:38,636] [INFO] [timer.py:196:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=17.593999749820743, CurrSamplesPerSec=17.741973511374713, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1209 |
+
[2022-12-19 00:45:51,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=3140, skipped=6, lr=[4.148888888888889e-06], mom=[[0.9, 0.999]]
|
1210 |
+
[2022-12-19 00:45:51,865] [INFO] [timer.py:196:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=17.595327756966622, CurrSamplesPerSec=17.445340938254706, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1211 |
+
[2022-12-19 00:48:49,392] [INFO] [logging.py:68:log_dist] [Rank 0] step=3150, skipped=6, lr=[4.126666666666667e-06], mom=[[0.9, 0.999]]
|
1212 |
+
[2022-12-19 00:48:49,394] [INFO] [timer.py:196:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=17.595551909670867, CurrSamplesPerSec=17.525080176118156, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1213 |
+
{'loss': 0.0003, 'learning_rate': 4.126666666666667e-06, 'epoch': 27.0}
|
1214 |
+
[2022-12-19 00:51:44,912] [INFO] [logging.py:68:log_dist] [Rank 0] step=3160, skipped=6, lr=[4.104444444444445e-06], mom=[[0.9, 0.999]]
|
1215 |
+
[2022-12-19 00:51:44,914] [INFO] [timer.py:196:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=17.59533705410767, CurrSamplesPerSec=17.492849569178176, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1216 |
+
[2022-12-19 00:54:39,253] [INFO] [logging.py:68:log_dist] [Rank 0] step=3170, skipped=6, lr=[4.0822222222222225e-06], mom=[[0.9, 0.999]]
|
1217 |
+
[2022-12-19 00:54:39,254] [INFO] [timer.py:196:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=17.595644102704394, CurrSamplesPerSec=17.775633328909876, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1218 |
+
{'loss': 0.0003, 'learning_rate': 4.071111111111111e-06, 'epoch': 27.01}
|
1219 |
+
[2022-12-19 00:57:32,029] [INFO] [logging.py:68:log_dist] [Rank 0] step=3180, skipped=6, lr=[4.060000000000001e-06], mom=[[0.9, 0.999]]
|
1220 |
+
[2022-12-19 00:57:32,031] [INFO] [timer.py:196:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=17.59573223021548, CurrSamplesPerSec=17.551840136671807, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1221 |
+
[2022-12-19 01:00:26,835] [INFO] [logging.py:68:log_dist] [Rank 0] step=3190, skipped=6, lr=[4.0377777777777786e-06], mom=[[0.9, 0.999]]
|
1222 |
+
[2022-12-19 01:00:26,836] [INFO] [timer.py:196:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=17.59571717237336, CurrSamplesPerSec=17.490995086617858, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1223 |
+
[2022-12-19 01:03:19,686] [INFO] [logging.py:68:log_dist] [Rank 0] step=3200, skipped=6, lr=[4.015555555555556e-06], mom=[[0.9, 0.999]]
|
1224 |
+
[2022-12-19 01:03:19,688] [INFO] [timer.py:196:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=17.59580399188835, CurrSamplesPerSec=17.34269068867683, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1225 |
+
{'loss': 0.0003, 'learning_rate': 4.015555555555556e-06, 'epoch': 27.01}
|
1226 |
+
[2022-12-19 01:06:14,654] [INFO] [logging.py:68:log_dist] [Rank 0] step=3210, skipped=6, lr=[3.993333333333334e-06], mom=[[0.9, 0.999]]
|
1227 |
+
[2022-12-19 01:06:14,655] [INFO] [timer.py:196:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=17.596046138360002, CurrSamplesPerSec=17.725228814430604, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1228 |
+
[2022-12-19 01:09:07,511] [INFO] [logging.py:68:log_dist] [Rank 0] step=3220, skipped=6, lr=[3.971111111111111e-06], mom=[[0.9, 0.999]]
|
1229 |
+
[2022-12-19 01:09:07,512] [INFO] [timer.py:196:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=17.59647348251674, CurrSamplesPerSec=17.602937912133203, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1230 |
+
{'loss': 0.0003, 'learning_rate': 3.96e-06, 'epoch': 27.02}
|
1231 |
+
[2022-12-19 01:12:01,533] [INFO] [logging.py:68:log_dist] [Rank 0] step=3230, skipped=6, lr=[3.948888888888889e-06], mom=[[0.9, 0.999]]
|
1232 |
+
[2022-12-19 01:12:01,535] [INFO] [timer.py:196:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=17.596052986386066, CurrSamplesPerSec=17.566659481255172, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1233 |
+
[2022-12-19 01:14:55,505] [INFO] [logging.py:68:log_dist] [Rank 0] step=3240, skipped=6, lr=[3.926666666666667e-06], mom=[[0.9, 0.999]]
|
1234 |
+
[2022-12-19 01:14:55,507] [INFO] [timer.py:196:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=17.596036037139132, CurrSamplesPerSec=17.498455300383174, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1235 |
+
[2022-12-19 01:17:42,447] [INFO] [logging.py:68:log_dist] [Rank 0] step=3250, skipped=6, lr=[3.904444444444444e-06], mom=[[0.9, 0.999]]
|
1236 |
+
[2022-12-19 01:17:42,449] [INFO] [timer.py:196:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=17.597616914434525, CurrSamplesPerSec=17.628235600063963, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1237 |
+
{'loss': 0.0003, 'learning_rate': 3.904444444444444e-06, 'epoch': 28.0}
|
1238 |
+
[2022-12-19 01:20:40,342] [INFO] [logging.py:68:log_dist] [Rank 0] step=3260, skipped=6, lr=[3.882222222222223e-06], mom=[[0.9, 0.999]]
|
1239 |
+
[2022-12-19 01:20:40,344] [INFO] [timer.py:196:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=17.59767515749172, CurrSamplesPerSec=17.64395546265293, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1240 |
+
[2022-12-19 01:23:41,372] [INFO] [logging.py:68:log_dist] [Rank 0] step=3270, skipped=6, lr=[3.86e-06], mom=[[0.9, 0.999]]
|
1241 |
+
[2022-12-19 01:23:41,373] [INFO] [timer.py:196:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=17.597920364006136, CurrSamplesPerSec=17.849487124433125, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1242 |
+
{'loss': 0.0003, 'learning_rate': 3.848888888888889e-06, 'epoch': 28.01}
|
1243 |
+
[2022-12-19 01:26:34,140] [INFO] [logging.py:68:log_dist] [Rank 0] step=3280, skipped=6, lr=[3.837777777777778e-06], mom=[[0.9, 0.999]]
|
1244 |
+
[2022-12-19 01:26:34,141] [INFO] [timer.py:196:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=17.597884174195094, CurrSamplesPerSec=17.131544523208934, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1245 |
+
[2022-12-19 01:29:24,859] [INFO] [logging.py:68:log_dist] [Rank 0] step=3290, skipped=6, lr=[3.8155555555555555e-06], mom=[[0.9, 0.999]]
|
1246 |
+
[2022-12-19 01:29:24,861] [INFO] [timer.py:196:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=17.597922188226125, CurrSamplesPerSec=17.589932443660675, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1247 |
+
[2022-12-19 01:32:24,242] [INFO] [logging.py:68:log_dist] [Rank 0] step=3300, skipped=6, lr=[3.793333333333334e-06], mom=[[0.9, 0.999]]
|
1248 |
+
[2022-12-19 01:32:24,244] [INFO] [timer.py:196:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=17.59799470754608, CurrSamplesPerSec=17.711194827166782, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1249 |
+
{'loss': 0.0003, 'learning_rate': 3.793333333333334e-06, 'epoch': 28.01}
|
1250 |
+
[2022-12-19 01:35:16,557] [INFO] [logging.py:68:log_dist] [Rank 0] step=3310, skipped=6, lr=[3.7711111111111116e-06], mom=[[0.9, 0.999]]
|
1251 |
+
[2022-12-19 01:35:16,558] [INFO] [timer.py:196:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=17.598131006457837, CurrSamplesPerSec=17.669246821453136, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1252 |
+
[2022-12-19 01:38:09,745] [INFO] [logging.py:68:log_dist] [Rank 0] step=3320, skipped=6, lr=[3.7488888888888892e-06], mom=[[0.9, 0.999]]
|
1253 |
+
[2022-12-19 01:38:09,747] [INFO] [timer.py:196:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=17.59810478665201, CurrSamplesPerSec=17.712917471280523, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1254 |
+
{'loss': 0.0003, 'learning_rate': 3.737777777777778e-06, 'epoch': 28.02}
|
1255 |
+
[2022-12-19 01:40:59,612] [INFO] [logging.py:68:log_dist] [Rank 0] step=3330, skipped=6, lr=[3.726666666666667e-06], mom=[[0.9, 0.999]]
|
1256 |
+
[2022-12-19 01:40:59,613] [INFO] [timer.py:196:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=17.5981314985267, CurrSamplesPerSec=17.433335941926863, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1257 |
+
[2022-12-19 01:43:50,657] [INFO] [logging.py:68:log_dist] [Rank 0] step=3340, skipped=6, lr=[3.704444444444445e-06], mom=[[0.9, 0.999]]
|
1258 |
+
[2022-12-19 01:43:50,659] [INFO] [timer.py:196:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=17.59849399704335, CurrSamplesPerSec=17.73495453276351, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1259 |
+
[2022-12-19 01:46:42,496] [INFO] [logging.py:68:log_dist] [Rank 0] step=3350, skipped=6, lr=[3.6822222222222225e-06], mom=[[0.9, 0.999]]
|
1260 |
+
[2022-12-19 01:46:42,497] [INFO] [timer.py:196:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=17.598006416218745, CurrSamplesPerSec=16.950657243750207, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1261 |
+
{'loss': 0.0003, 'learning_rate': 3.6822222222222225e-06, 'epoch': 28.02}
|
1262 |
+
[2022-12-19 01:48:39,809] [INFO] [logging.py:68:log_dist] [Rank 0] step=3360, skipped=6, lr=[3.66e-06], mom=[[0.9, 0.999]]
|
1263 |
+
[2022-12-19 01:48:39,811] [INFO] [timer.py:196:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=17.59821946036748, CurrSamplesPerSec=17.671893143144626, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1264 |
+
[2022-12-19 01:52:23,796] [INFO] [logging.py:68:log_dist] [Rank 0] step=3370, skipped=6, lr=[3.6377777777777777e-06], mom=[[0.9, 0.999]]
|
1265 |
+
[2022-12-19 01:52:23,797] [INFO] [timer.py:196:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=17.599458275412836, CurrSamplesPerSec=17.74689643878602, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1266 |
+
{'loss': 0.0003, 'learning_rate': 3.6266666666666674e-06, 'epoch': 29.0}
|
1267 |
+
[2022-12-19 01:55:17,361] [INFO] [logging.py:68:log_dist] [Rank 0] step=3380, skipped=6, lr=[3.615555555555556e-06], mom=[[0.9, 0.999]]
|
1268 |
+
[2022-12-19 01:55:17,362] [INFO] [timer.py:196:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=17.599580206175794, CurrSamplesPerSec=17.84562699614934, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1269 |
+
[2022-12-19 01:58:09,807] [INFO] [logging.py:68:log_dist] [Rank 0] step=3390, skipped=6, lr=[3.593333333333334e-06], mom=[[0.9, 0.999]]
|
1270 |
+
[2022-12-19 01:58:09,809] [INFO] [timer.py:196:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=17.599584006196714, CurrSamplesPerSec=17.690764887441887, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1271 |
+
[2022-12-19 02:01:00,280] [INFO] [logging.py:68:log_dist] [Rank 0] step=3400, skipped=6, lr=[3.5711111111111114e-06], mom=[[0.9, 0.999]]
|
1272 |
+
[2022-12-19 02:01:00,282] [INFO] [timer.py:196:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=17.599061452055388, CurrSamplesPerSec=17.68211014301976, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1273 |
+
{'loss': 0.0003, 'learning_rate': 3.5711111111111114e-06, 'epoch': 29.01}
|
1274 |
+
[2022-12-19 02:03:54,109] [INFO] [logging.py:68:log_dist] [Rank 0] step=3410, skipped=6, lr=[3.548888888888889e-06], mom=[[0.9, 0.999]]
|
1275 |
+
[2022-12-19 02:03:54,110] [INFO] [timer.py:196:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=17.599020297195587, CurrSamplesPerSec=17.79227531486809, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1276 |
+
[2022-12-19 02:06:47,910] [INFO] [logging.py:68:log_dist] [Rank 0] step=3420, skipped=6, lr=[3.526666666666667e-06], mom=[[0.9, 0.999]]
|
1277 |
+
[2022-12-19 02:06:47,911] [INFO] [timer.py:196:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=17.598935188193543, CurrSamplesPerSec=17.586873905797386, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1278 |
+
{'loss': 0.0003, 'learning_rate': 3.515555555555556e-06, 'epoch': 29.01}
|
1279 |
+
[2022-12-19 02:09:44,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=3430, skipped=6, lr=[3.5044444444444447e-06], mom=[[0.9, 0.999]]
|
1280 |
+
[2022-12-19 02:09:44,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=17.598879764126053, CurrSamplesPerSec=17.206409451147653, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1281 |
+
[2022-12-19 02:12:43,002] [INFO] [logging.py:68:log_dist] [Rank 0] step=3440, skipped=6, lr=[3.4822222222222223e-06], mom=[[0.9, 0.999]]
|
1282 |
+
[2022-12-19 02:12:43,004] [INFO] [timer.py:196:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=17.59877386122984, CurrSamplesPerSec=17.647455022848266, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1283 |
+
[2022-12-19 02:15:49,935] [INFO] [logging.py:68:log_dist] [Rank 0] step=3450, skipped=6, lr=[3.46e-06], mom=[[0.9, 0.999]]
|
1284 |
+
[2022-12-19 02:15:49,936] [INFO] [timer.py:196:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=17.59846592973248, CurrSamplesPerSec=17.705098144125525, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1285 |
+
{'loss': 0.0002, 'learning_rate': 3.46e-06, 'epoch': 29.02}
|
1286 |
+
[2022-12-19 02:19:02,277] [INFO] [logging.py:68:log_dist] [Rank 0] step=3460, skipped=6, lr=[3.4377777777777784e-06], mom=[[0.9, 0.999]]
|
1287 |
+
[2022-12-19 02:19:02,279] [INFO] [timer.py:196:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=17.59887474281011, CurrSamplesPerSec=17.339971779112254, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1288 |
+
[2022-12-19 02:21:59,258] [INFO] [logging.py:68:log_dist] [Rank 0] step=3470, skipped=6, lr=[3.415555555555556e-06], mom=[[0.9, 0.999]]
|
1289 |
+
[2022-12-19 02:21:59,259] [INFO] [timer.py:196:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=17.598957745973937, CurrSamplesPerSec=17.47932554863768, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1290 |
+
{'loss': 0.0002, 'learning_rate': 3.404444444444445e-06, 'epoch': 29.02}
|
1291 |
+
[2022-12-19 02:23:02,302] [INFO] [logging.py:68:log_dist] [Rank 0] step=3480, skipped=6, lr=[3.3933333333333336e-06], mom=[[0.9, 0.999]]
|
1292 |
+
[2022-12-19 02:23:02,304] [INFO] [timer.py:196:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=17.600449447233288, CurrSamplesPerSec=23.367460582411866, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1293 |
+
[2022-12-19 02:27:40,487] [INFO] [logging.py:68:log_dist] [Rank 0] step=3490, skipped=6, lr=[3.371111111111111e-06], mom=[[0.9, 0.999]]
|
1294 |
+
[2022-12-19 02:27:40,488] [INFO] [timer.py:196:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=17.600039649295883, CurrSamplesPerSec=17.403154157900058, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1295 |
+
[2022-12-19 02:30:34,939] [INFO] [logging.py:68:log_dist] [Rank 0] step=3500, skipped=6, lr=[3.3488888888888892e-06], mom=[[0.9, 0.999]]
|
1296 |
+
[2022-12-19 02:30:34,941] [INFO] [timer.py:196:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=17.600239248768926, CurrSamplesPerSec=17.787628920083662, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1297 |
+
{'loss': 0.0003, 'learning_rate': 3.3488888888888892e-06, 'epoch': 30.0}
|
1298 |
+
[2022-12-19 02:33:30,948] [INFO] [logging.py:68:log_dist] [Rank 0] step=3510, skipped=6, lr=[3.326666666666667e-06], mom=[[0.9, 0.999]]
|
1299 |
+
[2022-12-19 02:33:30,950] [INFO] [timer.py:196:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=17.60011342639632, CurrSamplesPerSec=17.535298747587174, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1300 |
+
[2022-12-19 02:36:27,492] [INFO] [logging.py:68:log_dist] [Rank 0] step=3520, skipped=6, lr=[3.3044444444444445e-06], mom=[[0.9, 0.999]]
|
1301 |
+
[2022-12-19 02:36:27,493] [INFO] [timer.py:196:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=17.60047227468804, CurrSamplesPerSec=17.78531311367044, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1302 |
+
{'loss': 0.0002, 'learning_rate': 3.2933333333333333e-06, 'epoch': 30.01}
|
1303 |
+
[2022-12-19 02:39:27,522] [INFO] [logging.py:68:log_dist] [Rank 0] step=3530, skipped=6, lr=[3.282222222222223e-06], mom=[[0.9, 0.999]]
|
1304 |
+
[2022-12-19 02:39:27,523] [INFO] [timer.py:196:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=17.600453018305693, CurrSamplesPerSec=17.688962622702615, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1305 |
+
[2022-12-19 02:42:17,713] [INFO] [logging.py:68:log_dist] [Rank 0] step=3540, skipped=6, lr=[3.2600000000000006e-06], mom=[[0.9, 0.999]]
|
1306 |
+
[2022-12-19 02:42:17,714] [INFO] [timer.py:196:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=17.600259041781936, CurrSamplesPerSec=17.62578056332801, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1307 |
+
[2022-12-19 02:45:09,551] [INFO] [logging.py:68:log_dist] [Rank 0] step=3550, skipped=6, lr=[3.237777777777778e-06], mom=[[0.9, 0.999]]
|
1308 |
+
[2022-12-19 02:45:09,553] [INFO] [timer.py:196:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=17.60029625206999, CurrSamplesPerSec=17.671377775279126, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1309 |
+
{'loss': 0.0002, 'learning_rate': 3.237777777777778e-06, 'epoch': 30.01}
|
1310 |
+
[2022-12-19 02:48:06,078] [INFO] [logging.py:68:log_dist] [Rank 0] step=3560, skipped=6, lr=[3.2155555555555558e-06], mom=[[0.9, 0.999]]
|
1311 |
+
[2022-12-19 02:48:06,079] [INFO] [timer.py:196:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=17.600329628889238, CurrSamplesPerSec=17.71231088469093, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1312 |
+
[2022-12-19 02:51:00,756] [INFO] [logging.py:68:log_dist] [Rank 0] step=3570, skipped=6, lr=[3.193333333333334e-06], mom=[[0.9, 0.999]]
|
1313 |
+
[2022-12-19 02:51:00,758] [INFO] [timer.py:196:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=17.60022112966504, CurrSamplesPerSec=17.219200593227026, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1314 |
+
{'loss': 0.0002, 'learning_rate': 3.1822222222222226e-06, 'epoch': 30.02}
|
1315 |
+
[2022-12-19 02:53:51,858] [INFO] [logging.py:68:log_dist] [Rank 0] step=3580, skipped=6, lr=[3.1711111111111114e-06], mom=[[0.9, 0.999]]
|
1316 |
+
[2022-12-19 02:53:51,859] [INFO] [timer.py:196:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=17.60064739187315, CurrSamplesPerSec=17.62833168565665, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1317 |
+
[2022-12-19 02:56:13,648] [INFO] [logging.py:68:log_dist] [Rank 0] step=3590, skipped=6, lr=[3.148888888888889e-06], mom=[[0.9, 0.999]]
|
1318 |
+
[2022-12-19 02:56:13,649] [INFO] [timer.py:196:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=17.600825925087175, CurrSamplesPerSec=17.855607656356614, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1319 |
+
[2022-12-19 02:59:27,493] [INFO] [logging.py:68:log_dist] [Rank 0] step=3600, skipped=6, lr=[3.1266666666666667e-06], mom=[[0.9, 0.999]]
|
1320 |
+
[2022-12-19 02:59:27,494] [INFO] [timer.py:196:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=17.601838241347792, CurrSamplesPerSec=17.581900003903662, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1321 |
+
{'loss': 0.0002, 'learning_rate': 3.1266666666666667e-06, 'epoch': 31.0}
|
1322 |
+
[2022-12-19 03:02:16,659] [INFO] [logging.py:68:log_dist] [Rank 0] step=3610, skipped=6, lr=[3.104444444444445e-06], mom=[[0.9, 0.999]]
|
1323 |
+
[2022-12-19 03:02:16,661] [INFO] [timer.py:196:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=17.601819994166515, CurrSamplesPerSec=17.664054192793216, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1324 |
+
[2022-12-19 03:05:08,497] [INFO] [logging.py:68:log_dist] [Rank 0] step=3620, skipped=6, lr=[3.0822222222222227e-06], mom=[[0.9, 0.999]]
|
1325 |
+
[2022-12-19 03:05:08,498] [INFO] [timer.py:196:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=17.60192426766313, CurrSamplesPerSec=17.875462043633274, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1326 |
+
{'loss': 0.0002, 'learning_rate': 3.0711111111111115e-06, 'epoch': 31.01}
|
1327 |
+
[2022-12-19 03:07:58,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=3630, skipped=6, lr=[3.0600000000000003e-06], mom=[[0.9, 0.999]]
|
1328 |
+
[2022-12-19 03:07:58,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=17.60179920974838, CurrSamplesPerSec=17.7938144302562, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1329 |
+
[2022-12-19 03:10:48,446] [INFO] [logging.py:68:log_dist] [Rank 0] step=3640, skipped=6, lr=[3.037777777777778e-06], mom=[[0.9, 0.999]]
|
1330 |
+
[2022-12-19 03:10:48,447] [INFO] [timer.py:196:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=17.6017878320369, CurrSamplesPerSec=17.83189928619381, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1331 |
+
[2022-12-19 03:13:33,096] [INFO] [logging.py:68:log_dist] [Rank 0] step=3650, skipped=6, lr=[3.015555555555556e-06], mom=[[0.9, 0.999]]
|
1332 |
+
[2022-12-19 03:13:33,098] [INFO] [timer.py:196:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=17.601565935375557, CurrSamplesPerSec=17.83511830383586, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1333 |
+
{'loss': 0.0002, 'learning_rate': 3.015555555555556e-06, 'epoch': 31.01}
|
1334 |
+
[2022-12-19 03:16:20,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=3660, skipped=6, lr=[2.9933333333333336e-06], mom=[[0.9, 0.999]]
|
1335 |
+
[2022-12-19 03:16:20,756] [INFO] [timer.py:196:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=17.601526408905308, CurrSamplesPerSec=17.55972566487229, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
1336 |
+
[2022-12-19 03:19:08,899] [INFO] [logging.py:68:log_dist] [Rank 0] step=3670, skipped=6, lr=[2.9711111111111112e-06], mom=[[0.9, 0.999]]
|
1337 |
+
[2022-12-19 03:19:08,901] [INFO] [timer.py:196:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=17.601866020071387, CurrSamplesPerSec=17.7394257348548, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.96e-06, 'epoch': 31.02}
+[2022-12-19 03:22:05,005] [INFO] [logging.py:68:log_dist] [Rank 0] step=3680, skipped=6, lr=[2.948888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:22:05,006] [INFO] [timer.py:196:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=17.602148084364135, CurrSamplesPerSec=17.71423246925165, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:25:01,391] [INFO] [logging.py:68:log_dist] [Rank 0] step=3690, skipped=6, lr=[2.9266666666666673e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:25:01,392] [INFO] [timer.py:196:stop] epoch=0/micro_step=3690/global_step=3690, RunningAvgSamplesPerSec=17.602265105171185, CurrSamplesPerSec=17.839620005142503, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:27:58,636] [INFO] [logging.py:68:log_dist] [Rank 0] step=3700, skipped=6, lr=[2.904444444444445e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:27:58,637] [INFO] [timer.py:196:stop] epoch=0/micro_step=3700/global_step=3700, RunningAvgSamplesPerSec=17.602024317328592, CurrSamplesPerSec=17.13545189933907, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.904444444444445e-06, 'epoch': 31.02}
+[2022-12-19 03:29:28,476] [INFO] [logging.py:68:log_dist] [Rank 0] step=3710, skipped=6, lr=[2.8822222222222225e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:29:28,478] [INFO] [timer.py:196:stop] epoch=0/micro_step=3710/global_step=3710, RunningAvgSamplesPerSec=17.60249366241964, CurrSamplesPerSec=17.683989065307227, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:33:40,636] [INFO] [logging.py:68:log_dist] [Rank 0] step=3720, skipped=6, lr=[2.86e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:33:40,637] [INFO] [timer.py:196:stop] epoch=0/micro_step=3720/global_step=3720, RunningAvgSamplesPerSec=17.603601659594958, CurrSamplesPerSec=17.493026261358214, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.8488888888888894e-06, 'epoch': 32.0}
+[2022-12-19 03:36:44,989] [INFO] [logging.py:68:log_dist] [Rank 0] step=3730, skipped=6, lr=[2.837777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:36:44,990] [INFO] [timer.py:196:stop] epoch=0/micro_step=3730/global_step=3730, RunningAvgSamplesPerSec=17.604015035004593, CurrSamplesPerSec=17.839234699560244, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:39:39,022] [INFO] [logging.py:68:log_dist] [Rank 0] step=3740, skipped=6, lr=[2.815555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:39:39,024] [INFO] [timer.py:196:stop] epoch=0/micro_step=3740/global_step=3740, RunningAvgSamplesPerSec=17.60423435088669, CurrSamplesPerSec=17.872316502218464, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:42:36,477] [INFO] [logging.py:68:log_dist] [Rank 0] step=3750, skipped=6, lr=[2.7933333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:42:36,479] [INFO] [timer.py:196:stop] epoch=0/micro_step=3750/global_step=3750, RunningAvgSamplesPerSec=17.60421286134035, CurrSamplesPerSec=17.6653445073596, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.7933333333333334e-06, 'epoch': 32.01}
+[2022-12-19 03:47:44,486] [INFO] [logging.py:68:log_dist] [Rank 0] step=3760, skipped=6, lr=[2.771111111111111e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:47:44,487] [INFO] [timer.py:196:stop] epoch=0/micro_step=3760/global_step=3760, RunningAvgSamplesPerSec=17.60421484650127, CurrSamplesPerSec=17.76200482461907, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:50:36,617] [INFO] [logging.py:68:log_dist] [Rank 0] step=3770, skipped=6, lr=[2.748888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:50:36,619] [INFO] [timer.py:196:stop] epoch=0/micro_step=3770/global_step=3770, RunningAvgSamplesPerSec=17.603986155075713, CurrSamplesPerSec=17.835418109567822, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.7377777777777783e-06, 'epoch': 32.01}
+[2022-12-19 03:53:34,204] [INFO] [logging.py:68:log_dist] [Rank 0] step=3780, skipped=6, lr=[2.726666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:53:34,206] [INFO] [timer.py:196:stop] epoch=0/micro_step=3780/global_step=3780, RunningAvgSamplesPerSec=17.604210162616486, CurrSamplesPerSec=17.369442334662356, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:56:31,105] [INFO] [logging.py:68:log_dist] [Rank 0] step=3790, skipped=6, lr=[2.7044444444444447e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:56:31,106] [INFO] [timer.py:196:stop] epoch=0/micro_step=3790/global_step=3790, RunningAvgSamplesPerSec=17.60431255398128, CurrSamplesPerSec=17.567898813891517, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 03:59:23,749] [INFO] [logging.py:68:log_dist] [Rank 0] step=3800, skipped=6, lr=[2.6822222222222223e-06], mom=[[0.9, 0.999]]
+[2022-12-19 03:59:23,750] [INFO] [timer.py:196:stop] epoch=0/micro_step=3800/global_step=3800, RunningAvgSamplesPerSec=17.60431104909582, CurrSamplesPerSec=17.70104341253724, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.6822222222222223e-06, 'epoch': 32.02}
+[2022-12-19 04:02:19,616] [INFO] [logging.py:68:log_dist] [Rank 0] step=3810, skipped=6, lr=[2.6600000000000004e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:02:19,617] [INFO] [timer.py:196:stop] epoch=0/micro_step=3810/global_step=3810, RunningAvgSamplesPerSec=17.604126410600333, CurrSamplesPerSec=17.600326048215376, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:05:13,803] [INFO] [logging.py:68:log_dist] [Rank 0] step=3820, skipped=6, lr=[2.637777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:05:13,805] [INFO] [timer.py:196:stop] epoch=0/micro_step=3820/global_step=3820, RunningAvgSamplesPerSec=17.60442226776567, CurrSamplesPerSec=17.521556925430364, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.6266666666666668e-06, 'epoch': 32.02}
+[2022-12-19 04:08:01,470] [INFO] [logging.py:68:log_dist] [Rank 0] step=3830, skipped=6, lr=[2.6155555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:08:01,472] [INFO] [timer.py:196:stop] epoch=0/micro_step=3830/global_step=3830, RunningAvgSamplesPerSec=17.605964489961178, CurrSamplesPerSec=17.66188200789337, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:10:57,833] [INFO] [logging.py:68:log_dist] [Rank 0] step=3840, skipped=6, lr=[2.5933333333333336e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:10:57,835] [INFO] [timer.py:196:stop] epoch=0/micro_step=3840/global_step=3840, RunningAvgSamplesPerSec=17.60612382198526, CurrSamplesPerSec=17.728355572744416, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:13:53,027] [INFO] [logging.py:68:log_dist] [Rank 0] step=3850, skipped=6, lr=[2.5711111111111112e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:13:53,028] [INFO] [timer.py:196:stop] epoch=0/micro_step=3850/global_step=3850, RunningAvgSamplesPerSec=17.606310071771578, CurrSamplesPerSec=17.81534865653578, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.5711111111111112e-06, 'epoch': 33.0}
+[2022-12-19 04:16:48,115] [INFO] [logging.py:68:log_dist] [Rank 0] step=3860, skipped=6, lr=[2.5488888888888893e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:16:48,116] [INFO] [timer.py:196:stop] epoch=0/micro_step=3860/global_step=3860, RunningAvgSamplesPerSec=17.60637394058431, CurrSamplesPerSec=17.27665271608909, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:19:44,284] [INFO] [logging.py:68:log_dist] [Rank 0] step=3870, skipped=6, lr=[2.526666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:19:44,286] [INFO] [timer.py:196:stop] epoch=0/micro_step=3870/global_step=3870, RunningAvgSamplesPerSec=17.606653162319823, CurrSamplesPerSec=17.38970420612655, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.5155555555555557e-06, 'epoch': 33.01}
+[2022-12-19 04:22:40,524] [INFO] [logging.py:68:log_dist] [Rank 0] step=3880, skipped=6, lr=[2.504444444444445e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:22:40,525] [INFO] [timer.py:196:stop] epoch=0/micro_step=3880/global_step=3880, RunningAvgSamplesPerSec=17.606719113033467, CurrSamplesPerSec=17.757985076265896, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:25:38,514] [INFO] [logging.py:68:log_dist] [Rank 0] step=3890, skipped=6, lr=[2.4822222222222225e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:25:38,516] [INFO] [timer.py:196:stop] epoch=0/micro_step=3890/global_step=3890, RunningAvgSamplesPerSec=17.60689324457354, CurrSamplesPerSec=17.80211727021623, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:28:36,061] [INFO] [logging.py:68:log_dist] [Rank 0] step=3900, skipped=6, lr=[2.46e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:28:36,063] [INFO] [timer.py:196:stop] epoch=0/micro_step=3900/global_step=3900, RunningAvgSamplesPerSec=17.607112140420526, CurrSamplesPerSec=17.612398120420085, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.46e-06, 'epoch': 33.01}
+[2022-12-19 04:31:31,396] [INFO] [logging.py:68:log_dist] [Rank 0] step=3910, skipped=6, lr=[2.437777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:31:31,398] [INFO] [timer.py:196:stop] epoch=0/micro_step=3910/global_step=3910, RunningAvgSamplesPerSec=17.607239439571316, CurrSamplesPerSec=17.700039648390714, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:34:29,911] [INFO] [logging.py:68:log_dist] [Rank 0] step=3920, skipped=6, lr=[2.415555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:34:29,912] [INFO] [timer.py:196:stop] epoch=0/micro_step=3920/global_step=3920, RunningAvgSamplesPerSec=17.60749252768418, CurrSamplesPerSec=17.79277653010592, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.4044444444444446e-06, 'epoch': 33.02}
+[2022-12-19 04:37:31,484] [INFO] [logging.py:68:log_dist] [Rank 0] step=3930, skipped=6, lr=[2.3933333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:37:31,485] [INFO] [timer.py:196:stop] epoch=0/micro_step=3930/global_step=3930, RunningAvgSamplesPerSec=17.607728287544596, CurrSamplesPerSec=17.50504171864284, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:39:30,333] [INFO] [logging.py:68:log_dist] [Rank 0] step=3940, skipped=6, lr=[2.371111111111111e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:39:30,335] [INFO] [timer.py:196:stop] epoch=0/micro_step=3940/global_step=3940, RunningAvgSamplesPerSec=17.60803244226165, CurrSamplesPerSec=17.897954595339627, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:43:16,008] [INFO] [logging.py:68:log_dist] [Rank 0] step=3950, skipped=6, lr=[2.348888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:43:16,010] [INFO] [timer.py:196:stop] epoch=0/micro_step=3950/global_step=3950, RunningAvgSamplesPerSec=17.609317404387298, CurrSamplesPerSec=17.338718477525717, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.348888888888889e-06, 'epoch': 34.0}
+[2022-12-19 04:46:17,542] [INFO] [logging.py:68:log_dist] [Rank 0] step=3960, skipped=6, lr=[2.3266666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:46:17,543] [INFO] [timer.py:196:stop] epoch=0/micro_step=3960/global_step=3960, RunningAvgSamplesPerSec=17.609609085156343, CurrSamplesPerSec=17.51949968202784, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:49:16,425] [INFO] [logging.py:68:log_dist] [Rank 0] step=3970, skipped=6, lr=[2.3044444444444447e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:49:16,426] [INFO] [timer.py:196:stop] epoch=0/micro_step=3970/global_step=3970, RunningAvgSamplesPerSec=17.609706216869625, CurrSamplesPerSec=17.619688175337636, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.2933333333333335e-06, 'epoch': 34.01}
+[2022-12-19 04:52:13,918] [INFO] [logging.py:68:log_dist] [Rank 0] step=3980, skipped=6, lr=[2.2822222222222223e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:52:13,919] [INFO] [timer.py:196:stop] epoch=0/micro_step=3980/global_step=3980, RunningAvgSamplesPerSec=17.60986454647832, CurrSamplesPerSec=17.457695134212692, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:55:16,822] [INFO] [logging.py:68:log_dist] [Rank 0] step=3990, skipped=6, lr=[2.2600000000000004e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:55:16,825] [INFO] [timer.py:196:stop] epoch=0/micro_step=3990/global_step=3990, RunningAvgSamplesPerSec=17.610149442844655, CurrSamplesPerSec=17.861005041208294, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-19 04:58:14,067] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=6, lr=[2.237777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-19 04:58:14,068] [INFO] [timer.py:196:stop] epoch=0/micro_step=4000/global_step=4000, RunningAvgSamplesPerSec=17.61035876229219, CurrSamplesPerSec=17.883168119082438, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0002, 'learning_rate': 2.237777777777778e-06, 'epoch': 34.01}
+{'eval_loss': 0.338134765625, 'eval_wer': 17.7977496284764, 'eval_runtime': 1277.681, 'eval_samples_per_second': 3.02, 'eval_steps_per_second': 0.095, 'epoch': 34.01}
+[2022-12-19 05:19:32,876] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step4000 is begin to save!
+[2022-12-19 05:19:32,885] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt
+[2022-12-19 05:19:32,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt...
+[2022-12-19 05:19:33,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt.
+[2022-12-19 05:19:33,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+[2022-12-19 05:19:38,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+[2022-12-19 05:19:38,828] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+[2022-12-19 05:19:38,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
runs/Dec18_08-41-04_fe2747a042f0/events.out.tfevents.1671381730.fe2747a042f0.46148.0
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f809800ca6de0a5be418d3f2830d96d4dc56d4a9ab41574b9b5ebe7730f0eee9
+size 30653