How to train this model on my own domain
Is there any code to continue training?
Any framework for dense retriever training will do the job. We use the codebase at https://github.com/microsoft/unilm/tree/master/simlm
Hi friend! That framework (https://github.com/microsoft/unilm/tree/master/simlm) is a little complex to understand. If you have an easier example to catch the idea, that would be great. I would like to continue training this model on my own domain.
Thanks in advance bro!
I was able to fine-tune this model quite well on a domain-specific information retrieval task using FlagEmbedding's fine-tuning methods. You can find the fine-tuning examples at https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune.
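In case it helps, the fine-tuning examples there consume a JSONL file of triplets: one JSON object per line with a query, a list of positive passages, and a list of (hard) negative passages. A minimal sketch of how to write such a file (the texts and file name are just placeholders, not my actual data):

import json

# Placeholder triplets in the query / pos / neg format used by the
# FlagEmbedding fine-tuning examples.
examples = [
    {
        "query": "Introduction to machine learning with Python",
        "pos": ["Apply supervised learning algorithms to tabular data"],
        "neg": ["Prepare traditional French pastry dough"],
    },
]

# One JSON object per line (JSONL), keeping non-ASCII characters readable.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")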
@pascalhuerten, how many GPUs did you use, and how long did it take? Is it possible to train it on a Google Colab Pro A100 (40 GB VRAM)?
@wilfoderek Just a single T4 GPU with 15 GB VRAM was enough to fine-tune this model in about 15 minutes on 2000 data points. Fine-tuning on 26000 triplets took about an hour, so it shouldn’t be a problem for an A100. 😊
FYI: My goal was to fine-tune this model to quickly retrieve the most relevant skills from a database of over 13,000 skills for German-language course descriptions. By fine-tuning on the smaller dataset, I was able to increase the Mean Reciprocal Rank (MRR@10) in this specific domain from 0.32 to 0.69, which is a significant improvement, so fine-tuning is definitely recommended. Even a dataset of just 250 triplets showed notable improvements. I also fine-tuned bge_reranker_base on the same dataset, which further increased the MRR to 0.74. The only embedding model that performed even better for me on this dataset was BAAI/bge-m3, but it also takes about four times as long to compute an embedding as intfloat/multilingual-e5-base.
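For context, MRR@10 can be computed along these lines with sentence-transformers (a minimal sketch with a tiny placeholder corpus and eval set, not my actual skill database; note the "query: " / "passage: " prefixes that E5-style models expect):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

# Placeholder corpus: skill id -> skill text.
passages = {
    "skill_1": "Grundlagen des maschinellen Lernens anwenden",
    "skill_2": "Klassischen Blätterteig herstellen",
}
# Placeholder eval set: (query text, id of the relevant passage).
eval_pairs = [
    ("Einführung in Machine Learning mit Python", "skill_1"),
]

passage_ids = list(passages)
passage_emb = model.encode(
    ["passage: " + passages[pid] for pid in passage_ids],
    normalize_embeddings=True,
)

reciprocal_ranks = []
for query, relevant_id in eval_pairs:
    q_emb = model.encode("query: " + query, normalize_embeddings=True)
    scores = passage_emb @ q_emb            # cosine similarity (embeddings are normalized)
    top10 = np.argsort(-scores)[:10]        # indices of the 10 best-scoring passages
    ranked_ids = [passage_ids[i] for i in top10]
    rr = 1.0 / (ranked_ids.index(relevant_id) + 1) if relevant_id in ranked_ids else 0.0
    reciprocal_ranks.append(rr)

print("MRR@10:", sum(reciprocal_ranks) / len(reciprocal_ranks))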
I used the following training parameters:
torchrun --nproc_per_node 1 \
-m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir multilingual_e5_base_finetuned \
--model_name_or_path intfloat/multilingual-e5-base \
--train_data ./course_competency_alignment_de.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 512 \
--passage_max_len 64 \
--train_group_size 4 \
--negatives_cross_device \
--logging_steps 10 \
--save_steps 1500 \
--query_instruction_for_retrieval ""
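After training, the checkpoint can be loaded for inference, for example with sentence-transformers (a minimal sketch that assumes the output directory from the command above; remember the E5 prefixes at inference time as well):

from sentence_transformers import SentenceTransformer

# Load the fine-tuned checkpoint from the training output directory.
model = SentenceTransformer("multilingual_e5_base_finetuned")

# E5-style models expect "query: " / "passage: " prefixes.
query_emb = model.encode("query: Einführung in Machine Learning", normalize_embeddings=True)
skill_emb = model.encode(
    ["passage: Grundlagen des maschinellen Lernens anwenden"],
    normalize_embeddings=True,
)
print(skill_emb @ query_emb)  # cosine similarity score(s)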
@pascalhuerten I am so grateful for your help! It is truly very valuable.