Clarification regarding the stage of model training
Can you please clarify whether this is a Stage-1-only model, or whether it has undergone both Stage-1 and Stage-2 training?
The README mentions the following, which should make it clear that the model has only undergone Stage-1 training:
> This model has only been trained on self-supervised data and has not yet been fine-tuned on any downstream task!
It should nevertheless perform notably better than the Stage-1 model in the paper, as it has been trained:
- for much, much longer
- with explicit token alignment between LLM2Vec and NLLB-LLM2Vec
- with a least-squares-initialized up-projection that improves convergence speed and fit (see the sketch after this list)
- and released with merged LoRAs, which strongly helps in few-shot tasks like NusaX
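On the least-squares-initialized up-projection: the sketch below only illustrates the general idea and is not the training code used for this model. Hidden sizes, token counts, and variable names are illustrative, and the paired hidden states would come from the token alignment mentioned above rather than from random tensors.

```python
import torch

# Illustrative dimensions: NLLB-style encoder width vs. Llama-style width.
d_nllb, d_llm, n_tokens = 1024, 4096, 8192

# Stand-ins for aligned token representations: in practice X would hold NLLB
# hidden states and Y the LLM2Vec hidden states of the corresponding tokens.
X = torch.randn(n_tokens, d_nllb)
Y = torch.randn(n_tokens, d_llm)

# Closed-form solution of W = argmin_W ||X W - Y||_F^2.
W = torch.linalg.lstsq(X, Y).solution  # shape (d_nllb, d_llm)

# Initialize the up-projection from the least-squares solution instead of randomly.
up_proj = torch.nn.Linear(d_nllb, d_llm, bias=False)
with torch.no_grad():
    up_proj.weight.copy_(W.T)  # nn.Linear stores weights as (out_features, in_features)
```

The point is simply that a projection which already maps NLLB representations close to the LLM2Vec space gives training a better starting point than a random initialization, which is where the faster convergence and better fit come from.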
Unfortunately, I won't be releasing Stage-2 models, as I will not be fine-tuning models per task due to a lack of time and compute.
There would be an argument for doing something like GritLM (cf. the paper) and then distilling down to a single model for 'all tasks', but I don't have the capacity (GPUs, time) to do that. Instead, I invested a lot of time in improving the self-supervised stage as much as possible.
NLLB-LLM2Vec should be used if you need sequence-level embeddings for less-resourced languages that industry-scale models (in terms of samples, supervision, etc.) like NVEmbed, GritLM, E5, or BGE don't cover, or in academic settings where you want to be more confident that the task has not leaked into training (although instruction fine-tuning of Llama itself may have some leakage).
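As a rough illustration of how sequence-level embeddings could be obtained, here is a minimal sketch. The model ID is a placeholder and the loading and pooling details (AutoModel, trust_remote_code, mean pooling) are assumptions; please follow the usage instructions in the README rather than this snippet.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder model ID; use the actual repository name from the model card.
model_id = "fdschmidt93/NLLB-LLM2Vec"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)  # assumption: custom model code
model.eval()

sentences = ["How are you?", "Wie geht es dir?"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one embedding per sequence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
```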
I hope this clarifies things. Let me know if there's any follow-up you would like to discuss.