yukiontheiceberg committed on
Commit 7520d45
1 Parent(s): d51728b

Update README.md

  1. README.md +1 -0
README.md CHANGED
@@ -6,6 +6,7 @@ license: apache-2.0
  We encountered two major loss spikes while [training K2](https://huggingface.co/LLM360/K2).
  * The [first loss spike](https://huggingface.co/LLM360/K2-Spike-1/) occurred after 160 checkpoints and lasted for ~34 checkpoints. We restarted training at checkpoint 160 and training returned to normal.
  * The second loss spike occurred at checkpoint 186, after we restarted training to fix the first loss spike, and lasted for ~8 checkpoints.
+ * For every spike checkpoint, we also uploaded the corresponding normal checkpoint for easy comparison. You can find the different checkpoints in different branches.
 
  We are releasing these checkpoints so others can study this interesting phenomenon in large model training.
  <img src="loss_spike.png" alt="k2 loss spikes"/>