slurm submission log: 2024-05-08 15:15:14.383115
created following sbatch script:

###############################

#!/bin/bash

#SBATCH --account=nlp
#SBATCH --cpus-per-task=16
#SBATCH --dependency=afterok:7590680
#SBATCH --gres=gpu:2
#SBATCH --job-name=tthrush-job-3657394
#SBATCH --mem=400G
#SBATCH --nodelist=sphinx2
#SBATCH --open-mode=append
#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en/train_job_output.txt
#SBATCH --partition=sphinx
#SBATCH --time=14-0

# activate your desired anaconda environment
. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection

# cd to working directory
cd .

# launch commands
srun --unbuffered run_as_child_processes 'torchrun --master_port 29501 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en --output_hub_id pythia-70m_xnli_en --model_id /pythia-70m --num_train_epochs 1 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2'

###############################
submission to slurm complete!
###############################
slurm submission output

Submitted batch job 7590683
###############################

###############################
start time: 2024-05-08 16:29:41.045821
machine: sphinx2
conda env: pretraining-coreset-selection
###############################
running following processes

torchrun --master_port 29501 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en --output_hub_id pythia-70m_xnli_en --model_id /pythia-70m --num_train_epochs 1 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2

###############################
command outputs:

[2024-05-08 16:29:42,576] torch.distributed.run: [WARNING]
[2024-05-08 16:29:42,576] torch.distributed.run: [WARNING] *****************************************
[2024-05-08 16:29:42,576] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-08 16:29:42,576] torch.distributed.run: [WARNING] *****************************************
05/08/2024 16:29:46 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en', output_hub_id='pythia-70m_xnli_en', hf_hub_token=True, model_id='/pythia-70m', per_device_train_batch_size=256, num_train_epochs=1, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
05/08/2024 16:29:46 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en', output_hub_id='pythia-70m_xnli_en', hf_hub_token=True, model_id='/pythia-70m', per_device_train_batch_size=256, num_train_epochs=1, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
Traceback (most recent call last):
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 164, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/pythia-70m'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 202, in <module>
    train_model()
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 166, in train_model
    config = AutoConfig.from_pretrained(script_args.model_id)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1138, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/pythia-70m'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
Traceback (most recent call last):
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 164, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/pythia-70m'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 202, in <module>
    train_model()
  File "/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_llm.py", line 166, in train_model
    config = AutoConfig.from_pretrained(script_args.model_id)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1138, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/pythia-70m'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
[2024-05-08 16:29:52,595] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2307807) of binary: /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/bin/python
Traceback (most recent call last):
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_llm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-08_16:29:52
  host      : sphinx2.stanford.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2307808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-08_16:29:52
  host      : sphinx2.stanford.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2307807)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

###############################
end time: 2024-05-08 16:30:01.071965
elapsed time: 0:00:20.026144

slurm submission log: 2024-05-09 07:34:36.661583
created following sbatch script:

###############################

#!/bin/bash

#SBATCH --account=nlp
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2
#SBATCH --job-name=tthrush-job-782921
#SBATCH --mem=400G
#SBATCH --nodelist=sphinx2
#SBATCH --open-mode=append
#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en/train_job_output.txt
#SBATCH --partition=sphinx
#SBATCH --time=14-0

# activate your desired anaconda environment
. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection

# cd to working directory
cd .

# launch commands
srun --unbuffered run_as_child_processes 'torchrun --master_port 29501 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en --output_hub_id pythia-70m_xnli_en --model_id EleutherAI/pythia-70m --num_train_epochs 1 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2'

###############################
submission to slurm complete!
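The only material change in this resubmission is `--model_id`: '/pythia-70m' names neither an existing local directory nor a well-formed Hub repo id, so `AutoConfig.from_pretrained` fails before training starts, while 'EleutherAI/pythia-70m' resolves on the Hub. As a minimal illustrative sketch, the checker below approximates the rules quoted in the HFValidationError above; the helper name is mine, and the authoritative check is `huggingface_hub.utils._validators.validate_repo_id`, not this code:

```python
import re

def looks_like_valid_repo_id(repo_id: str) -> bool:
    """Approximate the repo-id rules quoted in the HFValidationError
    (illustrative only; the real validator lives in huggingface_hub)."""
    if len(repo_id) > 96:
        return False
    # Alphanumerics plus '-', '_', '.', with at most one 'namespace/' prefix.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+(?:/[A-Za-z0-9_.\-]+)?", repo_id):
        return False
    # '--' and '..' are forbidden anywhere in the id.
    if "--" in repo_id or ".." in repo_id:
        return False
    # '-' and '.' cannot start or end either component.
    return all(part[0] not in "-." and part[-1] not in "-."
               for part in repo_id.split("/"))

print(looks_like_valid_repo_id("/pythia-70m"))            # False: empty namespace
print(looks_like_valid_repo_id("EleutherAI/pythia-70m"))  # True
```

The leading slash leaves the namespace component empty, which is why the same string also fails the local-path fallback; dropping the slash and using the Hub namespace, as in the script above, fixes both.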
###############################
slurm submission output

Submitted batch job 7591646
###############################

###############################
start time: 2024-05-09 08:56:15.201472
machine: sphinx2
conda env: pretraining-coreset-selection
###############################
running following processes

torchrun --master_port 29501 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en --output_hub_id pythia-70m_xnli_en --model_id EleutherAI/pythia-70m --num_train_epochs 1 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2

###############################
command outputs:

[2024-05-09 08:56:17,398] torch.distributed.run: [WARNING]
[2024-05-09 08:56:17,398] torch.distributed.run: [WARNING] *****************************************
[2024-05-09 08:56:17,398] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-09 08:56:17,398] torch.distributed.run: [WARNING] *****************************************
05/09/2024 08:56:22 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en', output_hub_id='pythia-70m_xnli_en', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=1, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
05/09/2024 08:56:22 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/training_data/xnli_en', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en', output_hub_id='pythia-70m_xnli_en', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=1, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
  0%| | 0/10682 [00:00
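For reading the progress bar: the 10682 total appears to count optimizer updates over the single epoch, and with the logged hyperparameters each update aggregates gradients across both GPU workers and two accumulation micro-steps. A quick back-of-the-envelope sketch of that arithmetic (variable names are mine, not from train_llm.py, and the warmup-step rounding is approximate rather than Trainer's exact formula):

```python
# Hyperparameters as logged in ScriptArguments and the torchrun command above.
per_device_train_batch_size = 256  # per-GPU micro-batch
nproc_per_node = 2                 # --nproc_per_node=2: one process per GPU
gradient_accumulation_steps = 2    # micro-batches accumulated per update

# Examples consumed by a single optimizer update:
effective_batch_size = (per_device_train_batch_size
                        * nproc_per_node
                        * gradient_accumulation_steps)
print(effective_batch_size)        # 1024

# With warmup_ratio=0.1 over the 10682 updates in the progress bar, roughly
# the first tenth of the run is linear warmup before the cosine decay.
warmup_steps = int(0.1 * 10682)
print(warmup_steps)                # 1068
```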