How did you train m3-retromae?
Hello bge-m3 team! Great approach: it performs well and I love it. Thank you for the great models and papers you have published.
I would like to know about the 8192-token support of XLM-Roberta, as I could not find it explained in the paper.
Is it correct that you first set the max_position_embeddings of XLM-Roberta to 8194 and then created bge-m3-retromae by training it on long sequences with RetroMAE?
I would also appreciate it if you could tell me which training datasets you used for that stage, if possible.
Thanks for your attention to our work!
We extended the max_position_embeddings of XLM-Roberta to 8194 and trained this model on the Pile, mC4, and Wudao datasets with the RetroMAE loss.
For the details of pre-training, you can refer to Appendix B.1 of our paper.
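For reference, extending the table itself looks roughly like this with HuggingFace transformers (a minimal sketch, not our actual pre-training script; the model name and the copy-then-train initialization here are only for illustration):

```python
# Minimal sketch: grow XLM-Roberta's position embedding table from 514 to 8194 rows.
import torch
from transformers import XLMRobertaModel

model = XLMRobertaModel.from_pretrained("xlm-roberta-base")
old_emb = model.embeddings.position_embeddings.weight.data   # (514, hidden)

new_max = 8194                      # 8192 tokens + 2 reserved position ids
hidden = old_emb.size(1)
new_pos = torch.nn.Embedding(new_max, hidden, padding_idx=model.config.pad_token_id)

# Keep the 514 pretrained rows and leave the remaining rows at their default
# random initialization; they are then learned during long-sequence pre-training.
new_pos.weight.data[: old_emb.size(0)] = old_emb

model.embeddings.position_embeddings = new_pos
model.config.max_position_embeddings = new_max
# Some transformers versions cache a position_ids buffer that also has to grow.
if hasattr(model.embeddings, "position_ids"):
    model.embeddings.register_buffer(
        "position_ids", torch.arange(new_max).unsqueeze(0), persistent=False
    )
```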
Thank you!
I have also read Appendix B.1, which deepened my understanding. I'm very grateful.
How did you extend the positional embeddings to 8192 exactly? Did you randomly initialize the new embeddings past position 512? Or did you use some interpolation technique based on the original pretrained positional embeddings?
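By interpolation I mean something like the sketch below (dummy tensors just to illustrate the question; the hidden size is a placeholder and this is not claiming to be your implementation):

```python
# Sketch of the interpolation option I am asking about.
import torch
import torch.nn.functional as F

old_table = torch.randn(514, 1024)     # stand-in for the pretrained position table
num_special = 2                        # XLM-Roberta reserves the first 2 position ids
old_body = old_table[num_special:]     # (512, hidden)

# Linearly stretch the 512 learned positions to cover 8192 positions.
new_body = F.interpolate(
    old_body.t().unsqueeze(0),         # (1, hidden, 512)
    size=8192,
    mode="linear",
    align_corners=True,
).squeeze(0).t()                       # (8192, hidden)

new_table = torch.cat([old_table[:num_special], new_body], dim=0)   # (8194, hidden)
```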