Cannot run inference with the vLLM OpenAI server

#1 opened by jjqsdq
export MODEL_DIR=/root/workspace/model/neuralmagic/Mistral-Nemo-Instruct-2407-FP8
export MODEL_NAME=neuralmagic/Mistral-Nemo-Instruct-2407-FP8
export MAX_MODEL_LEN=16384
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size 1 \
    --quantization="fp8" \
    --host 0.0.0.0 \
    --port 8080 \
    --disable-log-requests \
    --model $MODEL_DIR \
    --served-model-name $MODEL_NAME \
    --max-model-len $MAX_MODEL_LEN
WARNING 07-19 03:04:14 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 395, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 147, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 249, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 416, in load_weights
[rank0]:     weight_loader(param, loaded_weight, shard_id)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 662, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(output_dim, start_idx,
[rank0]: RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
Neural Magic org

You need to install vLLM from source. The model is not supported in v0.5.2.
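
For context, the failure above appears consistent with Mistral-Nemo defining an explicit head_dim of 128 in its config: vLLM v0.5.2 instead derives the head size as hidden_size / num_attention_heads (5120 / 32 = 160), so the loader expects a KV-projection shard of 8 x 160 = 1280 columns while the FP8 checkpoint stores 8 x 128 = 1024, which matches the narrow() error. A minimal sketch of a source install at the time (assuming a CUDA-capable machine and the upstream GitHub repo; not official Neural Magic instructions):

# Remove the pinned release, then build from the current main branch
pip uninstall -y vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Compiles the CUDA kernels; this can take a while
pip install -e .

After the source build, the same launch command as above should load the checkpoint without the shard-size error.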

robertgshaw2 changed discussion status to closed
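
Once the server starts cleanly, a quick sanity check against the OpenAI-compatible completions endpoint (same host/port and served model name as the launch command above; the prompt is arbitrary):

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
        "prompt": "Hello,",
        "max_tokens": 16
    }'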
