Need help debugging my training process
Hello everyone,
My friend and I are fine-tuning the model on our own dataset. The task was too heavy for our local PCs, so we moved to SageMaker. Now we have a few questions:
- First, is it normal for training to take over 5 hours on an ml.g5.24xlarge instance? This surprises us because, for testing, we're using a very small dataset (ten audio files).
- Is it necessary to have all the demo files? And how can we better understand the parameters in demo_cfg? (We paste what we think the relevant config section looks like right after this list.)
- Is there any step we did that isn't necessary and that might be causing the heavy computation? Batch sizes, GPUs, CUDA settings, etc. (See also the batch-size arithmetic below the training command.)
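For the demo_cfg question, here is how we are inspecting the demo settings in our model config. The key names in the comments follow the stable-audio-open example config; they are our best guess at what each one means, not official documentation, so corrections are welcome:

```python
import json

# Print the demo settings from our model config. The key names commented
# below follow the stable-audio-open example config; this is our reading
# of them, not official documentation.
with open("stable_open_model_files/model_config.json") as f:
    model_config = json.load(f)

print(json.dumps(model_config["training"]["demo"], indent=2))

# Our current understanding of the keys:
#   demo_every      - generate demo audio every N training steps
#   demo_steps      - diffusion sampling steps used for each demo
#   num_demos       - number of demo clips generated each time
#   demo_cond       - list of conditioning dicts (prompt, seconds_start, seconds_total)
#   demo_cfg_scales - classifier-free-guidance scales the demos are rendered at
```

We're also wondering whether these periodic demo generations account for part of the long runtime.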
We're attaching our whole training process to help the collective debugging.
a) The model architecture
In Jupyter notebooks:
b) First imports
c) Model loading
d) CUDA import and training command
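For reference, (c) and (d) are roughly the following (local file names are ours; we used create_model_from_config and load_ckpt_state_dict from stable_audio_tools, so please tell us if there is a preferred way to load the checkpoint):

```python
import json

import torch
from stable_audio_tools.models.factory import create_model_from_config
from stable_audio_tools.models.utils import load_ckpt_state_dict

# (c) Model loading: build the model from our config and load the
# pretrained weights (local paths are ours).
with open("stable_open_model_files/model_config.json") as f:
    model_config = json.load(f)

model = create_model_from_config(model_config)
model.load_state_dict(load_ckpt_state_dict("stable_open_model_files/model.safetensors"))

# (d) CUDA sanity check before launching training.
print(torch.__version__)
print(torch.cuda.is_available())    # expect True on SageMaker GPU instances
print(torch.cuda.device_count())    # expect 4 (ml.g5.24xlarge has 4x A10G)
```

We're not even sure this notebook-side loading is needed, since the training command below already passes --pretrained-ckpt-path; maybe that's one of the unnecessary steps from our third question.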
OUR TRAINING COMMAND:
!python stable-audio-tools-sam/train.py --model-config stable_open_model_files/model_config.json --dataset-config stable_open_model_files/dataset_config.json --name rayan-training --save-dir checkpoints --pretrained-ckpt-path stable_open_model_files/model.safetensors --batch-size 16 --num-gpus 4 --strategy deepspeed
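And here is the quick arithmetic behind our batch-size worry (assuming --batch-size is per GPU, which we believe is how PyTorch Lightning multi-GPU strategies treat it; please correct us if that's wrong):

```python
# Effective samples consumed per optimizer step, under our assumption
# that --batch-size is the per-GPU batch size.
batch_size_per_gpu = 16
num_gpus = 4
dataset_size = 10  # our tiny test set

effective_batch = batch_size_per_gpu * num_gpus
print(effective_batch)                  # 64
print(effective_batch > dataset_size)   # True: a single step draws more
                                        # samples than our dataset contains
```

If that reading is correct, every step re-samples our ten files several times over, so maybe a much smaller batch size would make more sense for this smoke test?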
Outputs:
e) Models loaded
f) Some warnings and CUDA loading
g) Training in action
h) After 5 hours with no sign of finishing, our keyboard interrupt...
We can also post the SageMaker logs here if that would help.
Thanks in advance!