Checkpoint saving when fine-tuning

#91
by NhatNguyen2101 - opened

This is my fine-tuning command on a custom dataset:
!torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --output_dir /kaggle/working/ \
    --model_name_or_path BAAI/bge-m3 \
    --train_data /kaggle/working/train_data.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --negatives_cross_device \
    --logging_steps 999999999 \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --same_task_within_batch True \
    --unified_finetuning True \
    --use_self_distill True \
    --deepspeed /kaggle/working/ds_config.json
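
For reference, the command points at a DeepSpeed config that is not shown in the post. A minimal ZeRO stage 1 sketch that is compatible with the HF Trainer integration might look like the following; the contents are an assumption for illustration, not the poster's actual file (the "auto" values are filled in by the Trainer from its own arguments):

```python
import json

# Hypothetical minimal DeepSpeed config (NOT the poster's actual file).
# "auto" lets the HF Trainer integration fill in values from TrainingArguments.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 1},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("/kaggle/working/ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```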
The problem is that it keeps saving checkpoints (checkpoint-500, checkpoint-1000, checkpoint-1500, ...) in the output directory. Each save contains an optimizer.pt file, and these files alone consume far too much disk space, so I cannot fine-tune the model properly. How can I stop this saving?
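
For context, checkpoint-500, checkpoint-1000, checkpoint-1500, ... is exactly the pattern produced by the HF Trainer defaults (save_strategy="steps" with save_steps=500), which suggests the "epoch" setting may not be reaching the Trainer at all. A minimal sketch of the relevant checkpointing knobs, assuming the FlagEmbedding CLI forwards them to a transformers TrainingArguments under the hood:

```python
from transformers import TrainingArguments

# Sketch of the checkpointing knobs (assumes the FlagEmbedding CLI forwards
# these to a transformers Trainer under the hood).
args = TrainingArguments(
    output_dir="/kaggle/working/",
    save_strategy="epoch",   # save once per epoch instead of every 500 steps
    save_total_limit=1,      # prune older checkpoints, keep only the newest
    # save_only_model=True,  # transformers >= 4.36: skip optimizer/scheduler
    #                        # state in each checkpoint (assumption: this is
    #                        # compatible with your DeepSpeed setup)
)
```

If save_strategy="epoch" were being parsed, save_total_limit=1 should already prune the older step checkpoints, so persistent checkpoint-500/1000/1500 folders would point to the flags not being applied (for example, a broken line continuation in the notebook cell cutting the command short).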
[Screenshot attached: Screen Shot 2024-11-11 at 05.13.22.png]
