About the `rope_theta` values of `Yi-34B` and `Yi-34B-200k`
A fantastic open-source endeavor!
I'm puzzled by a few aspects:
First, why do both Yi-34B and Yi-34B-200k have such large `rope_theta` values (5,000,000 and 10,000,000 respectively) in their config.json files? Moreover, before the latest update, Yi-34B-200k even shared the same `rope_theta` value as Yi-34B. In line with other open-source projects, shouldn't the `rope_theta` of a base model typically be around 10,000?

I'm also keen to understand the `rope_theta` values and training seq_len used during the pre-training and window extrapolation stages of the Yi-34B(-200k) models. Unfortunately, this information isn't provided in the recently released technical report.
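For reference, this is how I'm reading those values (a minimal sketch; it assumes the Hugging Face repo IDs `01-ai/Yi-34B` and `01-ai/Yi-34B-200K` and that the configs expose a `rope_theta` field, as Llama-style configs do):

```python
# Minimal sketch: read rope_theta and the context length straight from each
# model's config.json via transformers. Repo IDs are assumptions; adjust if
# your local copies differ.
from transformers import AutoConfig

for repo_id in ("01-ai/Yi-34B", "01-ai/Yi-34B-200K"):
    cfg = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
    print(
        repo_id,
        "rope_theta =", getattr(cfg, "rope_theta", "n/a"),
        "max_position_embeddings =", getattr(cfg, "max_position_embeddings", "n/a"),
    )
```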
Additionally, there might be a typo in your report. You mentioned:
> We continue to pretrain the model on 5B tokens with a 4M batch size, which translates to 100 optimization steps.
However, shouldn't it be 5000M / 4M, resulting in 1250 steps?
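Just to spell out the arithmetic (using only the two figures quoted above):

```python
# Sanity check of the step count implied by the reported numbers:
# 5B tokens of continued pretraining at a 4M-token batch size.
tokens_total = 5_000_000_000   # 5B tokens
tokens_per_step = 4_000_000    # 4M-token batch
print(tokens_total // tokens_per_step)  # -> 1250, not 100
```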
Thank you.
@horiz94 Regarding the large `rope_theta` values, I can only say that it is a decision we made after many trials and careful consideration. Beyond that, I can't share more than what the report already provides.
As for the typo, you are right; I will look into whether there is anything our team can do about it at this point. Thank you for your support.