Reproducing the fine-tuning gets stuck with 100% CPU on one process
#20 opened by felipemv
Hi, I'm trying to reproduce your results, but early in the run one process gets stuck. This is how I launch the fine-tuning:
echo '
{
"fp16": {
"enabled": true,
⋮ (identical to yours)
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
' > ./ds_config.json
deepspeed \
./trainer_sft.py \
--configs defaults reference-data reference-pythia-12b \
--cache_dir /root/.cache/huggingface \
--output_dir .saved/oasst-sft-3-pythia-12b-reference_2kpre \
--num_train_epochs 8 \
--use_flash_attention false \
--verbose true \
--logging_steps 1 \
--dtype fp16 \
--residual_dropout 0.2 \
--model_name andreaskoepf/pythia-12b-pre-2000
So I get the following logs (abbreviated):
Evaluation set sizes:
oasst_export: 2026 (16.55%)
alpaca: 10212 (83.45%)
Total eval: 12238
--------------------------------------------------------------------------------
⋮
Number of trainable parameters: 11841M
Loading checkpoint shards: 100%|██████████| 3/3 [00:17<00:00, 5.83s/it]
Resizing embeddings to 50288
⋮
warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
I get a burst of GPU activity about 3 minutes after starting the process. It lasts for roughly 10 seconds, then everything halts and I'm left with a single process using 100% of one CPU core:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
…       …    …  …  …     …     …      …  100  …    …         /usr/local/bin/python3 -u ./trainer_sft.py --local_rank=6
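If a Python-level stack trace of that stuck rank would help, I can capture one. A minimal sketch, assuming py-spy can be installed inside the container (the <PID> placeholder stands for the python3 process shown above):

pip install py-spy
# dump the current Python call stack of the rank that is pinning one CPU core
py-spy dump --pid <PID>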
Do you have any idea what that might be?
Let me know if more logs/info would help. I'm using 8 GPUs, which should fit this model comfortably in memory.
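If this looks like a collective-communication hang, I can also re-run the exact same deepspeed command with NCCL debug logging enabled and attach those logs. Roughly like this, assuming the only change is the extra environment variables:

# verbose NCCL logs usually show which rank/collective the job stops at
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
deepspeed ./trainer_sft.py <same arguments as above>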