Request for adding mistral with batch_size=8 or batch_size=4
#3
by
michaelfeil
- opened
I am on the latest Hugging Face AMI, using optimum (neuronxcc-2.12.68.0+4480452af).
I would like to speed up the compile time for mistralai/Mistral-7B-v0.1 with the following configurations:
{'num_cores': 24, 'auto_cast_type': 'bf16', 'batch_size': 8, 'sequence_length': 2048}
and
{'num_cores': 2, 'auto_cast_type': 'bf16', 'batch_size': 8, 'sequence_length': 2048}
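For reference, here is a minimal sketch of how one of these configurations could be passed to an export. The split into compiler arguments and input shapes mirrors the "loading model to neuron with ..." line in the log further down. The helper names `split_neuron_config` and `export_model` are hypothetical, and the `NeuronModelForCausalLM.from_pretrained(..., export=True, ...)` call assumes the installed optimum-neuron version accepts these keyword arguments.

```python
from typing import Any, Dict, Tuple

# Keys taken from the configurations quoted in this thread.
COMPILER_KEYS = ("num_cores", "auto_cast_type")
SHAPE_KEYS = ("batch_size", "sequence_length")


def split_neuron_config(config: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a flat config dict into (compiler_args, input_shapes)."""
    unknown = set(config) - set(COMPILER_KEYS) - set(SHAPE_KEYS)
    if unknown:
        raise ValueError(f"unexpected config keys: {sorted(unknown)}")
    compiler_args = {k: config[k] for k in COMPILER_KEYS if k in config}
    input_shapes = {k: config[k] for k in SHAPE_KEYS if k in config}
    return compiler_args, input_shapes


def export_model(model_id: str, config: Dict[str, Any]):
    """Hypothetical helper: compile a model for Neuron with the given config.

    Assumption: optimum-neuron is installed and its NeuronModelForCausalLM
    accepts these keyword arguments together with export=True.
    """
    from optimum.neuron import NeuronModelForCausalLM  # imported lazily; needs a Neuron environment

    compiler_args, input_shapes = split_neuron_config(config)
    return NeuronModelForCausalLM.from_pretrained(
        model_id, export=True, **compiler_args, **input_shapes
    )


if __name__ == "__main__":
    cfg = {"num_cores": 2, "auto_cast_type": "bf16", "batch_size": 8, "sequence_length": 2048}
    print(split_neuron_config(cfg))
```

Only `split_neuron_config` runs without Neuron hardware; `export_model` triggers the actual compilation, which the log below shows can take a long time when the graphs are not cached.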
michaelfeil
changed discussion title from
Request for adding mistral with batch_size=8
to Request for adding mistral with batch_size=8 or batch_size=4
I just noticed that batch_size=8 leads to compile errors on inf2.48xlarge and takes very long on inf2.xlarge. The largest batch_size I have seen compile successfully was 4.
python -m lm_eval --model "neuronx" --model_args "pretrained=$MODEL_ID,dtype=bfloat16" --batch_size 8 --tasks gsm8k
Downloading builder script: 100%|██████████| 5.67k/5.67k [00:00<00:00, 20.6MB/s]
INFO:lm-eval:Verbosity set to INFO
INFO:lm-eval:lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
INFO:lm-eval:Selected Tasks: ['gsm8k']
INFO:lm-eval:Loading selected tasks...
inferring nc_count from `neuron-ls` b'[\n {\n "neuron_device": 0,\n "bdf": "00:1e.0",\n "connected_to": null,\n "nc_count": 2,\n "memory_size": 34359738368,\n "neuron_processes": []\n }\n]'
nc_count=2
config.json: 100%|██████████| 571/571 [00:00<00:00, 213kB/s]
tokenizer_config.json: 100%|██████████| 967/967 [00:00<00:00, 388kB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 9.42MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 20.1MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 81.5kB/s]
====================
loading model to neuron with {'num_cores': 2, 'auto_cast_type': 'bf16'}, {'batch_size': 8, 'sequence_length': 2048}...
model.safetensors.index.json: 100%|██████████| 25.1k/25.1k [00:00<00:00, 10.1MB/s]
model-00001-of-00002.safetensors: 100%|██████████| 9.94G/9.94G [01:07<00:00, 148MB/s]
model-00002-of-00002.safetensors: 100%|██████████| 4.54G/4.54G [00:21<00:00, 214MB/s]
Downloading shards: 100%|██████████| 2/2 [01:28<00:00, 44.30s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [03:55<00:00, 117.82s/it]
generation_config.json: 100%|██████████| 116/116 [00:00<00:00, 187B/s]
2024-02-08 00:29:23.000001: 2822 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-08 00:29:23.000137: 2823 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_798e49273cfea1916bed+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_798e49273cfea1916bed+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_5418a1f3d9c8941648ef+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:23.000687: 2827 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_798e49273cfea1916bed+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:23.000767: 2823 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/238701b5-5769-4540-a463-53c5c737bca9/model.MODULE_798e49273cfea1916bed+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/238701b5-5769-4540-a463-53c5c737bca9/model.MODULE_798e49273cfea1916bed+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.68.0+4480452af/MODULE_5418a1f3d9c8941648ef+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:23.000802: 2824 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_5418a1f3d9c8941648ef+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:23.000881: 2822 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/508bf987-f00b-42ec-8c64-df1043db37f2/model.MODULE_5418a1f3d9c8941648ef+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/508bf987-f00b-42ec-8c64-df1043db37f2/model.MODULE_5418a1f3d9c8941648ef+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.68.0+4480452af/MODULE_19efcf02f204ed8d1b74+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000075: 2828 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_8be6608305b3cf1772e8+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_19efcf02f204ed8d1b74+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_8be6608305b3cf1772e8+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_19efcf02f204ed8d1b74+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000255: 2827 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/b9a2e7f2-5fb4-4746-b0cf-e4b55f4e8483/model.MODULE_19efcf02f204ed8d1b74+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/b9a2e7f2-5fb4-4746-b0cf-e4b55f4e8483/model.MODULE_19efcf02f204ed8d1b74+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.68.0+4480452af/MODULE_8be6608305b3cf1772e8+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000322: 2824 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/b7d72dbe-45f6-4434-a31a-317fc86f79fa/model.MODULE_8be6608305b3cf1772e8+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/b7d72dbe-45f6-4434-a31a-317fc86f79fa/model.MODULE_8be6608305b3cf1772e8+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-02-08 00:29:24.000359: 2829 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-08 00:29:24.000465: 2830 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_fbc43912d29fea5eb61a+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_fbc43912d29fea5eb61a+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000605: 2831 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_fbc43912d29fea5eb61a+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000678: 2828 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/6fa8c3ba-df86-4a20-914b-6da56c634576/model.MODULE_fbc43912d29fea5eb61a+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/6fa8c3ba-df86-4a20-914b-6da56c634576/model.MODULE_fbc43912d29fea5eb61a+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.68.0+4480452af/MODULE_ed30c8f43082adbfca9b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_a09916bf3e2525d35501+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_ed30c8f43082adbfca9b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_a09916bf3e2525d35501+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_0c31b79289b165fc10bd+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_ed30c8f43082adbfca9b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000960: 2829 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/9b2d6340-4159-47db-8454-dd2b5fe07ad4/model.MODULE_ed30c8f43082adbfca9b+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/9b2d6340-4159-47db-8454-dd2b5fe07ad4/model.MODULE_ed30c8f43082adbfca9b+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.68.0+4480452af/MODULE_a09916bf3e2525d35501+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:24.000980: 2830 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/b9eb5fdb-6ee7-4c24-a39e-723939b1ae07/model.MODULE_a09916bf3e2525d35501+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/b9eb5fdb-6ee7-4c24-a39e-723939b1ae07/model.MODULE_a09916bf3e2525d35501+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.68.0+4480452af/MODULE_0c31b79289b165fc10bd+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_0c31b79289b165fc10bd+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:25.000154: 2831 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/d5e5b675-4d43-4849-9cad-30dc1b80946e/model.MODULE_0c31b79289b165fc10bd+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/d5e5b675-4d43-4849-9cad-30dc1b80946e/model.MODULE_0c31b79289b165fc10bd+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-02-08 00:29:26.000333: 2825 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_55f17a0d407f8a37dca2+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_55f17a0d407f8a37dca2+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_55f17a0d407f8a37dca2+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:26.000841: 2825 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/1872ff9e-6028-4a43-a971-12555514ebe0/model.MODULE_55f17a0d407f8a37dca2+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/1872ff9e-6028-4a43-a971-12555514ebe0/model.MODULE_55f17a0d407f8a37dca2+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-02-08 00:29:32.000047: 2826 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.68.0+4480452af/MODULE_e2b63c2c15189aeb654d+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_e2b63c2c15189aeb654d+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_e2b63c2c15189aeb654d+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-02-08 00:29:32.000608: 2826 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/e88ea9f6-356d-4cbd-b94d-c762b88ddf2c/model.MODULE_e2b63c2c15189aeb654d+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/e88ea9f6-356d-4cbd-b94d-c762b88ddf2c/model.MODULE_e2b63c2c15189aeb654d+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
......................................................................................................................... ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Compiler status PASS
.................................................................................
Compiler status PASS
........................
Compiler status PASS
.........
Compiler status PASS
........................................................
Compiler status PASS
.......
Compiler status PASS
........................................
2024-02-08 02:23:30.000260: 2826 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/e88ea9f6-356d-4cbd-b94d-c762b88ddf2c/model.MODULE_e2b63c2c15189aeb654d+2c2d707e.hlo.pb after 0 retries.
....................................................
Compiler status PASS
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
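A log like the one above is easier to read once it is reduced to a count of passed graphs and the modules that failed. The following is a minimal sketch that relies only on the line formats visible in this log ("Compiler status PASS" and the NEURON_CC_WRAPPER "Compilation failed for ..." error). The function name `summarize_compile_log` is hypothetical, and other compiler versions may print different messages.

```python
import re
from typing import Dict, List, Union

# Patterns derived from the log lines shown in this thread (an assumption
# about the format; other neuronx-cc versions may differ).
FAILED_RE = re.compile(
    r"ERROR \|\|NEURON_CC_WRAPPER\|\|: Compilation failed for (\S+) after (\d+) retries"
)
MODULE_RE = re.compile(r"MODULE_[0-9a-f]+\+[0-9a-f]+")


def summarize_compile_log(text: str) -> Dict[str, Union[int, List[str]]]:
    """Count 'Compiler status PASS' lines and collect the modules that failed."""
    passed = 0
    failed: List[str] = []
    for line in text.splitlines():
        if line.strip() == "Compiler status PASS":
            passed += 1
            continue
        match = FAILED_RE.search(line)
        if match:
            path = match.group(1)
            module = MODULE_RE.search(path)
            # Fall back to the full path if no module id can be extracted.
            failed.append(module.group(0) if module else path)
    return {"passed": passed, "failed": failed}


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < compile.log
    print(summarize_compile_log(sys.stdin.read()))
```

The module id it reports for a failure can be matched against the "model.neff not found in aws-neuron/optimum-neuron-cache" lines to see which graph did not compile.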
I raised a similar issue for Llama two months ago in transformers-neuronx: https://github.com/aws-neuron/transformers-neuronx/issues/59.
That looks like the same issue!
The configurations mentioned above are now cached.
dacorvo
changed discussion status to
closed