KeyError: 'model.layers.0.mlp.down_proj.g_idx' ..?
aphrodite cli
aphrodite run ./llama-3-marlin/ --quantization marlin --tensor-parallel-size 2 --gpu-memory-utilization 1.0 --kv-cache-dtype fp8 --max-model-len 8192 --host 0.0.0.0 --port 8888 --served-model-name custom_model
aphrodite log output
INFO: CUDA_HOME is not found in the environment. Using /usr/local/cuda as CUDA_HOME.
INFO: Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the
performance. But it may cause slight accuracy drop without scaling factors. FP8_E5M2 (without scaling) is
only supported on cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for
common inference criteria.
2024-06-11 12:26:02,678 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = './llama-3-marlin/'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = marlin
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = fp8
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO: Using FlashAttention backend.
(RayWorkerAphrodite pid=78653) WARNING: Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerAphrodite pid=78653) INFO: Using FlashAttention backend.
INFO: Aphrodite is using nccl==2.20.5
(RayWorkerAphrodite pid=78653) INFO: Aphrodite is using nccl==2.20.5
INFO: reading GPU P2P access cache from
/home/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed.
To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=78653) INFO: reading GPU P2P access cache from
(RayWorkerAphrodite pid=78653) /home/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=78653) WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed.
(RayWorkerAphrodite pid=78653) To silence this warning, specify disable_custom_all_reduce=True explicitly.
-- ERROR --
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/miniforge3/envs/aphrodite/bin/aphrodite", line 8, in
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/cli.py", line 25, in main
[rank0]: args.func(args)
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
[rank0]: engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
[rank0]: engine = cls(engine_config.parallel_config.worker_use_ray,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/aphrodite_engine.py", line 131, in init
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/executor_base.py", line 39, in init
[rank0]: self._init_executor()
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 45, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 193, in _init_workers_ray
[rank0]: self._run_workers(
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 309, in _run_workers
[rank0]: driver_worker_output = getattr(self.driver_worker,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/task_handler/worker.py", line 125, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/task_handler/model_runner.py", line 179, in load_model
[rank0]: self.model = get_model(
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/modeling/loader.py", line 103, in get_model
[rank0]: model.load_weights(model_config.model, model_config.download_dir,
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/modeling/models/llama.py", line 497, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.mlp.down_proj.g_idx'
--
I just downloaded the Marlin model and ran it with the Aphrodite engine.
I also tried AutoGPTQ.
This model uses gptq_marlin, which is comparable to Marlin. vLLM will automatically convert this gptq_marlin format to Marlin upon initialization. Please try vLLM for now, as we plan to switch to the actual Marlin format soon. Apologies for any confusion.
ref: https://github.com/vllm-project/vllm/issues/5080
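For reference, the route suggested above would look roughly like this through vLLM's Python API. This is only a minimal sketch, assuming a vLLM build recent enough to ship gptq_marlin support; the model path, tensor-parallel size, and context length are copied from the command at the top, and the quantization flag is deliberately left unset so vLLM reads quantization_config from config.json and picks the kernel itself:

# Minimal sketch (assumption: installed vLLM supports gptq_marlin).
# Not forcing quantization="marlin" lets vLLM auto-detect the checkpoint format.
from vllm import LLM

llm = LLM(
    model="./llama-3-marlin/",
    tensor_parallel_size=2,
    max_model_len=8192,
)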
- config.json:
"quantization_config": {
"bits": 4,
"damp_percent": 0.01,
"desc_act": false,
"group_size": 128,
"is_marlin_format": true,
"model_file_base_name": null,
"model_name_or_path": null,
"quant_method": "gptq",
"static_groups": false,
"sym": true,
"true_sequential": true
},
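One way to see what the checkpoint actually stores is to list its tensor names and check whether GPTQ-style g_idx entries exist for the layer the loader trips on. A rough diagnostic sketch follows; the single-file name model.safetensors is an assumption, so adjust it for a sharded export:

# Diagnostic sketch: list the quantization tensors saved for the failing layer.
# NOTE: "model.safetensors" is a guess; sharded checkpoints use different file names.
from safetensors import safe_open

with safe_open("./llama-3-marlin/model.safetensors", framework="pt") as f:
    for key in f.keys():
        if "layers.0.mlp.down_proj" in key:
            print(key)  # e.g. qweight / qzeros / scales / g_idx, depending on the export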
vLLM is not working either...
vLLM/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 427, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'model.layers.0.mlp.down_proj.g_idx'
^C
Hi andreass123, due to difficulties with managing version dependencies, we just added a GPTQ version of the model. Sorry for the inconvenience!
https://huggingface.co/allganize/Llama-3-Alpha-Ko-8B-Instruct-GPTQ
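For anyone landing here later, loading that GPTQ repo should not require the marlin flag at all. A minimal sketch with vLLM; the quantization value mirrors the quant_method field in config.json, and the generation call is purely illustrative:

# Minimal sketch: load the plain GPTQ export instead of the marlin one.
from vllm import LLM, SamplingParams

llm = LLM(
    model="allganize/Llama-3-Alpha-Ko-8B-Instruct-GPTQ",
    quantization="gptq",  # matches "quant_method": "gptq" in config.json
    max_model_len=8192,
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)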
thx!