Clarification regarding the stage of model training
Can you please clarify whether this is a Stage-1-only model, or whether it has undergone both Stage-1 and Stage-2 training?
The README mentions the following, which should make it clear that the model has only undergone Stage-1 training:
> This model has only been trained on self-supervised data and has not yet been fine-tuned on any downstream task!
It should nevertheless perform notably better than the Stage-1 model in the paper, as it has been trained:
- for much, much longer
- with explicit token alignment between LLM2Vec and NLLB-LLM2Vec
- with a least-squares-initialized up-projection that improves convergence speed and fit (see the sketch after this list)
- and released with merged LoRAs, which strongly helps in few-shot tasks like NusaX
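On the least-squares-initialized up-projection: the sketch below only illustrates the general idea and is not the training code used for this model. Hidden sizes, token counts, and variable names are illustrative, and the paired hidden states would come from the token alignment mentioned above rather than from random tensors.

```python
import torch

# Illustrative dimensions: NLLB-style encoder width vs. Llama-style width.
d_nllb, d_llm, n_tokens = 1024, 4096, 8192

# Stand-ins for aligned token representations: in practice X would hold NLLB
# hidden states and Y the LLM2Vec hidden states of the corresponding tokens.
X = torch.randn(n_tokens, d_nllb)
Y = torch.randn(n_tokens, d_llm)

# Closed-form solution of W = argmin_W ||X W - Y||_F^2.
W = torch.linalg.lstsq(X, Y).solution  # shape (d_nllb, d_llm)

# Initialize the up-projection from the least-squares solution instead of randomly.
up_proj = torch.nn.Linear(d_nllb, d_llm, bias=False)
with torch.no_grad():
    up_proj.weight.copy_(W.T)  # nn.Linear stores weights as (out_features, in_features)
```

The point is simply that a projection which already maps NLLB representations close to the LLM2Vec space gives training a better starting point than a random initialization, which is where the faster convergence and better fit come from.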
Unfortunately, I won't be releasing Stage-2 models, as I will not be fine-tuning models per task due to a lack of time and compute.
There would be an argument for doing something like GritLM (cf. the paper) and then distilling down to a single model for 'all tasks', but I don't have the capacity (GPUs, time) to do that. Instead, I invested a lot of time in improving the self-supervised stage as much as possible.
NLLB-LLM2Vec should be used if you need sequence-level embeddings for less-resourced languages that industry-scale models (in terms of samples, supervision, etc.) like NVEmbed, GritLM, E5, or BGE don't cover, or in academic settings where you want to be more confident that the task has not leaked into training (although instruction fine-tuning of Llama itself may have some leakage).
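As a rough illustration of how sequence-level embeddings could be obtained, here is a minimal sketch. The model ID is a placeholder and the loading and pooling details (AutoModel, trust_remote_code, mean pooling) are assumptions; please follow the usage instructions in the README rather than this snippet.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder model ID; use the actual repository name from the model card.
model_id = "fdschmidt93/NLLB-LLM2Vec"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)  # assumption: custom model code
model.eval()

sentences = ["How are you?", "Wie geht es dir?"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one embedding per sequence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
```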
I hope this clarifies things. Let me know if there's any follow-up you would like to discuss.