rm_30k / stdout.log
[2024-01-05 20:02:29,789] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-05 20:02:34,703] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2024-01-05 20:02:34,703] [INFO] [runner.py:571:main] cmd = /data/jiongxiao_wang/anaconda3/envs/safe-rlhf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=47607 --module --enable_each_rank_log=None safe_rlhf.values.reward --train_datasets PKU-SafeRLHF/train:1.0:PKU-SafeRLHF-harmless-only-30k --eval_datasets PKU-SafeRLHF/test --model_name_or_path output/sft --max_length 512 --trust_remote_code True --loss_type sequence-wise --epochs 2 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --gradient_accumulation_steps 2 --gradient_checkpointing --normalize_score_during_training False --normalizer_type ExponentialMovingAverage --normalizer_momentum 0.9 --learning_rate 2e-5 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.1 --seed 42 --eval_strategy epoch --output_dir /data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/rm_30k --log_type wandb --log_project Safe-RLHF-RM --zero_stage 3 --bf16 True --tf32 True
[2024-01-05 20:02:36,745] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-05 20:02:39,776] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-01-05 20:02:39,776] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-01-05 20:02:39,776] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-01-05 20:02:39,776] [INFO] [launch.py:163:main] dist_world_size=4
[2024-01-05 20:02:39,776] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-01-05 20:02:42,171] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-05 20:02:42,172] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-05 20:02:42,174] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-05 20:02:42,175] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-05 20:02:52,431] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-05 20:02:52,431] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-05 20:02:52,437] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-05 20:02:52,442] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-05 20:02:52,443] [INFO] [comm.py:637:init_distributed] cdb=None
Set logger level to WARNING.
ninja: no work to do.
Time to load fused_adam op: 0.13281846046447754 seconds
Time to load fused_adam op: 0.20154452323913574 seconds
Time to load fused_adam op: 0.2013847827911377 seconds
Time to load fused_adam op: 0.2015242576599121 seconds
Parameter Offload: Total persistent parameters: 270336 in 66 params
***** Running training *****
Saving model to "/data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/rm_30k" ...
Saving DeepSpeed Checkpoints...
Converting DeepSpeed Checkpoints to Hugging Face format...
[2024-01-05 20:56:50,901] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint './global_step420'
Detected checkpoint of type zero stage 3, world_size: 4
Parsing checkpoint created by deepspeed==0.12.6
Reconstructed Trainable fp32 state dict with 291 params 6607351808 elements
Saving fp32 state dict to pytorch_model.bin
Model saved!
[2024-01-05 20:58:03,159] [INFO] [launch.py:347:main] Process 250383 exits successfully.
[2024-01-05 20:58:03,159] [INFO] [launch.py:347:main] Process 250381 exits successfully.
[2024-01-05 20:58:04,160] [INFO] [launch.py:347:main] Process 250382 exits successfully.
[2024-01-05 20:58:11,167] [INFO] [launch.py:347:main] Process 250380 exits successfully.