training XTTSv2 leads to weird training lags

#67
by kononoff174 - opened

Hello, training XTTSv2 leads to weird training lags: training gets stuck with no errors.

With DDP (6x RTX A6000, 512 GB RAM).
Here is the GPU-load monitoring graph: purple = GPU 0, green = GPU 1 (all remaining GPUs behave like GPU 1).
[image: GPU load graph with DDP]

Without DDP:
[image: GPU load graph without DDP]

Tried different dataset sizes (2500 hrs, 250 hrs); the result is the same.

I think there may be a bug in Trainer or in the XTTS scripts, but I don't know where to dig. Thank you.
There is no swap usage, no CPU overload, and no RAM overload (according to ClearML, htop, and top at least).
The disk is a fast NVMe drive.
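Since the run stalls with no errors, one way to see which collective operation a stuck rank is waiting on is to enable PyTorch's distributed debug logging. A minimal sketch (these are standard PyTorch/NCCL environment variables, not Trainer options; they must be set before the first `torch` import, or exported in the shell that launches `trainer.distribute`):

```python
import os

# Verbose NCCL logs: show which rank / which collective op is in flight.
os.environ["NCCL_DEBUG"] = "INFO"
# Extra torch.distributed consistency checks and logging (PyTorch >= 1.9);
# valid values are OFF, INFO, DETAIL.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```

With these set, a hang usually leaves the last attempted collective visible in each rank's log, which helps distinguish a desynced rank from an I/O or dataloader stall.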

To Reproduce

```
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1
python3 recipes/ljspeech/xtts_v2/train_gpt_xtts.py
```

Tried num_workers=0 and >0, MP_THREADS_NUM, and so on; nothing helps.
There is plenty of RAM and shared memory.
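Another way to locate a silent hang is a stack-dump watchdog: if no training step completes for a while, dump every thread's traceback. A minimal stdlib sketch (the helper name and timeout are made up; this is not part of the Trainer API):

```python
import faulthandler
import sys

def arm_hang_watchdog(timeout=600.0, out=sys.stderr):
    """Arm (or re-arm) a watchdog: if `timeout` seconds pass without
    another call, faulthandler prints every thread's traceback to `out`.

    Call once before the training loop and again at the end of every
    step; a stalled step then produces a stack dump showing exactly
    where the process is blocked (e.g. in a collective op or a
    DataLoader worker join).
    """
    faulthandler.cancel_dump_traceback_later()
    faulthandler.dump_traceback_later(timeout, repeat=True, file=out)
```

Running this on each rank (writing to per-rank files instead of stderr) would show whether all ranks are blocked on the same collective or one rank is stuck somewhere else.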
