Training in progress, epoch 1

Files changed (4)

eval_job_output.txt +100 -4
logs/events.out.tfevents.1715270185.sphinx2 +3 -0
model.safetensors +1 -1
train_job_output.txt +0 -0

eval_job_output.txt CHANGED

@@ -1,4 +1,4 @@
-slurm submission log: 2024-05-07 15:08:34.265333
+slurm submission log: 2024-05-08 15:15:14.783860
 created following sbatch script:
 ###############################
@@ -7,9 +7,9 @@ created following sbatch script:
 #SBATCH --account=nlp
 #SBATCH --cpus-per-task=16
-#SBATCH --dependency=afterok:7543205
+#SBATCH --dependency=afterok:7590683
 #SBATCH --gres=gpu:1
-#SBATCH --job-name=tthrush-job-3132094
+#SBATCH --job-name=tthrush-job-534086
 #SBATCH --mem=60G
 #SBATCH --nodelist=sphinx2
 #SBATCH --open-mode=append
@@ -34,7 +34,103 @@ submission to slurm complete!
 ###############################
 slurm submission output
-Submitted batch job 7543206
+Submitted batch job 7590684
+###############################
+###############################
+start time: 2024-05-08 16:30:21.427634
+machine: sphinx2
+conda env: pretraining-coreset-selection
+###############################
+running following processes
+	lm_eval --model hf --model_args pretrained=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en,revision=main,dtype=float16,trust_remote_code=True --tasks xnli_en,xnli_fr,sciq,piqa,lambada,arc_easy --device cuda --output_path /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en/perf
+###############################
+command outputs:
+2024-05-08:16:30:23,469 INFO     [utils.py:145] Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
+2024-05-08:16:30:23,469 INFO     [utils.py:148] Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
+2024-05-08:16:30:23,469 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
+2024-05-08:16:30:23,683 INFO     [config.py:58] PyTorch version 2.2.2 available.
+2024-05-08:16:30:26,810 INFO     [__main__.py:156] Verbosity set to INFO
+2024-05-08:16:30:32,728 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
+/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/datasets/load.py:1429: FutureWarning: The repository for hails/mmlu_no_train contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/hails/mmlu_no_train
+You can avoid this message in future by passing the argument `trust_remote_code=True`.
+Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
+  warnings.warn(
+2024-05-08:16:31:48,007 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
+2024-05-08:16:31:48,012 INFO     [__main__.py:229] Selected Tasks: ['arc_easy', 'lambada', 'piqa', 'sciq', 'xnli_en', 'xnli_fr']
+2024-05-08:16:31:48,364 INFO     [huggingface.py:148] Using device 'cuda'
+Traceback (most recent call last):
+  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/bin/lm_eval", line 8, in <module>
+    sys.exit(cli_evaluate())
+  File "/sailhome/tthrush/lm-evaluation-harness/lm_eval/__main__.py", line 231, in cli_evaluate
+    results = evaluator.simple_evaluate(
+  File "/sailhome/tthrush/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
+    return fn(*args, **kwargs)
+  File "/sailhome/tthrush/lm-evaluation-harness/lm_eval/evaluator.py", line 98, in simple_evaluate
+    lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
+  File "/sailhome/tthrush/lm-evaluation-harness/lm_eval/api/model.py", line 134, in create_from_arg_string
+    return cls(**args, **args2)
+  File "/sailhome/tthrush/lm-evaluation-harness/lm_eval/models/huggingface.py", line 174, in __init__
+    self._get_config(
+  File "/sailhome/tthrush/lm-evaluation-harness/lm_eval/models/huggingface.py", line 420, in _get_config
+    self._config = transformers.AutoConfig.from_pretrained(
+  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1138, in from_pretrained
+    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
+  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
+    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
+  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
+    resolved_config_file = cached_file(
+  File "/nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/lib/python3.10/site-packages/transformers/utils/hub.py", line 369, in cached_file
+    raise EnvironmentError(
+OSError: /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en does not appear to have a file named config.json. Checkout 'https://huggingface.co//juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en/tree/main' for available files.
+###############################
+end time: 2024-05-08 16:31:51.524193
+elapsed time: 0:01:30.096559
+slurm submission log: 2024-05-09 07:34:36.889218
+created following sbatch script:
+###############################
+#!/bin/bash
+#SBATCH --account=nlp
+#SBATCH --cpus-per-task=16
+#SBATCH --dependency=afterok:7591646
+#SBATCH --gres=gpu:1
+#SBATCH --job-name=tthrush-job-4681876
+#SBATCH --mem=60G
+#SBATCH --nodelist=sphinx2
+#SBATCH --open-mode=append
+#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en/eval_job_output.txt
+#SBATCH --partition=sphinx
+#SBATCH --time=14-0
+# activate your desired anaconda environment
+. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
+# cd to working directory
+cd .
+# launch commands
+srun --unbuffered run_as_child_processes 'lm_eval --model hf --model_args pretrained=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en,revision=main,dtype=float16,trust_remote_code=True --tasks xnli_en,xnli_fr,sciq,piqa,lambada,arc_easy --device cuda --output_path /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms/pythia-70m_xnli_en/perf'
+###############################
+submission to slurm complete!
+###############################
+slurm submission output
+Submitted batch job 7591647

logs/events.out.tfevents.1715270185.sphinx2 ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2b1c1c687b1100aa90b7c7592fa5bf79868362accc8d1fa9250cb6649bc893ab
+size 95282

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7d7656dba55ad749f46457103a1d9eb71eba4eaea53ba76a6d594e8311d1934b
+oid sha256:b5637eacff0cbb3d000b41088168f088919a8cc4a1f613b4eec974fc29ba7202
 size 281715176

train_job_output.txt CHANGED

The diff for this file is too large to render.