diff --git "a/stderr.log" "b/stderr.log"
new file mode 100644
--- /dev/null
+++ "b/stderr.log"
@@ -0,0 +1,92 @@
++ deepspeed --num_nodes=1 --num_gpus=4 --master_port 47607 --module safe_rlhf.values.reward --train_datasets PKU-SafeRLHF/train:1.0:PKU-SafeRLHF-harmless-only-30k --eval_datasets PKU-SafeRLHF/test --model_name_or_path output/sft --max_length 512 --trust_remote_code True --loss_type sequence-wise --epochs 2 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --gradient_accumulation_steps 2 --gradient_checkpointing --normalize_score_during_training False --normalizer_type ExponentialMovingAverage --normalizer_momentum 0.9 --learning_rate 2e-5 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.1 --seed 42 --eval_strategy epoch --output_dir /data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/rm_30k --log_type wandb --log_project Safe-RLHF-RM --zero_stage 3 --bf16 True --tf32 True
+2024-01-05 20:02:46.835068: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+2024-01-05 20:02:46.835067: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+2024-01-05 20:02:46.835067: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+2024-01-05 20:02:46.835114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+2024-01-05 20:02:46.835114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+2024-01-05 20:02:46.835114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+2024-01-05 20:02:46.835826: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+2024-01-05 20:02:46.835865: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+2024-01-05 20:02:46.836421: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+2024-01-05 20:02:46.836422: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+2024-01-05 20:02:46.836424: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+2024-01-05 20:02:46.836771: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+2024-01-05 20:02:48.497891: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+2024-01-05 20:02:48.498124: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+2024-01-05 20:02:48.498360: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+2024-01-05 20:02:48.498588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+Some weights of the model checkpoint at output/sft were not used when initializing LlamaModelForScore: ['lm_head.weight']
+- This IS expected if you are initializing LlamaModelForScore from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
+- This IS NOT expected if you are initializing LlamaModelForScore from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
+Some weights of the model checkpoint at output/sft were not used when initializing LlamaModelForScore: ['lm_head.weight']
+- This IS expected if you are initializing LlamaModelForScore from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
+- This IS NOT expected if you are initializing LlamaModelForScore from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
+Some weights of the model checkpoint at output/sft were not used when initializing LlamaModelForScore: ['lm_head.weight']
+- This IS expected if you are initializing LlamaModelForScore from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
+- This IS NOT expected if you are initializing LlamaModelForScore from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
+Some weights of LlamaModelForScore were not initialized from the model checkpoint at output/sft and are newly initialized: ['normalizer.count', 'normalizer.mean', 'normalizer.var']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+Some weights of LlamaModelForScore were not initialized from the model checkpoint at output/sft and are newly initialized: ['normalizer.count', 'normalizer.var', 'normalizer.mean']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+Some weights of LlamaModelForScore were not initialized from the model checkpoint at output/sft and are newly initialized: ['normalizer.mean', 'normalizer.var', 'normalizer.count']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+Some weights of the model checkpoint at output/sft were not used when initializing LlamaModelForScore: ['lm_head.weight']
+- This IS expected if you are initializing LlamaModelForScore from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
+- This IS NOT expected if you are initializing LlamaModelForScore from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
+Some weights of LlamaModelForScore were not initialized from the model checkpoint at output/sft and are newly initialized: ['normalizer.var', 'normalizer.mean', 'normalizer.count']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Detected CUDA files, patching ldflags
+Emitting ninja build file /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
+Building extension module fused_adam...
+Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+wandb: Currently logged in as: jayfeather (jayfeather1024). Use `wandb login --relogin` to force relogin
+wandb: Tracking run with wandb version 0.16.1
+wandb: Run data is saved locally in /data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/rm_30k/wandb/run-20240105_200327-0bh9htd8
+wandb: Run `wandb offline` to turn off syncing.
+wandb: Syncing run reward-2024-01-05-20-03-25
+wandb: ⭐️ View project at https://wandb.ai/jayfeather1024/Safe-RLHF-RM
+wandb: 🚀 View run at https://wandb.ai/jayfeather1024/Safe-RLHF-RM/runs/0bh9htd8
+ Training 1/2 epoch:   0%|          | 0/840 [00:00