vllm

OverflowError: out of range integral type conversion attempted

#28 by yangqingyou37 - opened

I get an error when decoding tokens. What should I do? The error is:

  from vllm.version import __version__ as VLLM_VERSION
INFO 10-29 10:40:01 config.py:1670] Downcasting torch.float32 to torch.float16.
INFO 10-29 10:40:29 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='./Pixtral-12B-2409', speculative_config=None, tokenizer='./Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1024, served_model_name=./Pixtral-12B-2409, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-29 10:40:34 model_runner.py:1060] Starting to load model ./Pixtral-12B-2409...
/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:43<00:00, 43.24s/it]

INFO 10-29 10:41:30 model_runner.py:1071] Loading model weights took 23.6552 GB
WARNING 10-29 10:41:31 model_runner.py:1251] Computed max_num_seqs (min(256, 8192 // 20480)) to be less than 1. Setting it to the minimum value of 1.
INFO 10-29 10:41:34 gpu_executor.py:122] # GPU blocks: 18991, # CPU blocks: 1638
INFO 10-29 10:41:34 gpu_executor.py:126] Maximum concurrency for 8192 tokens per request: 37.09x
INFO 10-29 10:41:39 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-29 10:41:39 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-29 10:41:51 model_runner.py:1530] Graph capturing finished in 12 secs.
image_name: 0
WARNING 10-29 10:42:28 chat_utils.py:570] 'add_generation_prompt' is not supported for mistral tokenizer, so it will be ignored.
WARNING 10-29 10:42:28 chat_utils.py:574] 'continue_final_message' is not supported for mistral tokenizer, so it will be ignored.
Processed prompts:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
INFO 10-29 10:42:28 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241029-104228.pkl...
WARNING 10-29 10:42:28 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
Processed prompts:   0%| | 0/1 [01:31<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 10-29 10:44:06 chat_utils.py:570] 'add_generation_prompt' is not supported for mistral tokenizer, so it will be ignored.
WARNING 10-29 10:44:06 chat_utils.py:574] 'continue_final_message' is not supported for mistral tokenizer, so it will be ignored.
Processed prompts:   0%| | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
INFO 10-29 10:44:06 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241029-104406.pkl...
WARNING 10-29 10:44:06 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
Processed prompts:   0%| | 0/2 [00:19<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 10-29 10:48:10 chat_utils.py:570] 'add_generation_prompt' is not supported for mistral tokenizer, so it will be ignored.
WARNING 10-29 10:48:10 chat_utils.py:574] 'continue_final_message' is not supported for mistral tokenizer, so it will be ignored.
Processed prompts:   0%| | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
INFO 10-29 10:48:11 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241029-104811.pkl...
WARNING 10-29 10:48:11 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
INFO 10-29 10:49:01 config.py:1670] Downcasting torch.float32 to torch.float16.
INFO 10-29 10:49:01 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='./Pixtral-12B-2409', speculative_config=None, tokenizer='./Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1024, served_model_name=./Pixtral-12B-2409, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-29 10:49:02 model_runner.py:1060] Starting to load model ./Pixtral-12B-2409...
INFO 10-29 10:49:23 config.py:1670] Downcasting torch.float32 to torch.float16.
INFO 10-29 10:49:23 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='./Pixtral-12B-2409', speculative_config=None, tokenizer='./Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1024, served_model_name=./Pixtral-12B-2409, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-29 10:49:24 model_runner.py:1060] Starting to load model ./Pixtral-12B-2409...
INFO 10-29 10:49:46 config.py:1670] Downcasting torch.float32 to torch.float16.
INFO 10-29 10:49:46 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='./Pixtral-12B-2409', speculative_config=None, tokenizer='./Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1024, served_model_name=./Pixtral-12B-2409, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/worker/model_runner.py", line 1705, in execute_model
[rank0]:     model_input.async_callback()
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/utils.py", line 1125, in weak_bound
[rank0]:     unbound(inst, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 1123, in _process_model_outputs
[rank0]:     self.output_processor.process_outputs(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/output_processor/single_step.py", line 95, in process_outputs
[rank0]:     return self._process_sequence_group_outputs(sequence_group, outputs[0],
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/output_processor/single_step.py", line 123, in _process_sequence_group_outputs
[rank0]:     new_char_count = self.detokenizer.decode_sequence_inplace(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/transformers_utils/detokenizer.py", line 115, in decode_sequence_inplace
[rank0]:     seq.read_offset) = convert_prompt_ids_to_tokens(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/transformers_utils/detokenizer.py", line 224, in convert_prompt_ids_to_tokens
[rank0]:     new_tokens = tokenizer.convert_ids_to_tokens(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/transformers_utils/tokenizers/mistral.py", line 227, in convert_ids_to_tokens
[rank0]:     tokens = [self.tokenizer.id_to_byte_piece(id) for id in ids]
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/transformers_utils/tokenizers/mistral.py", line 227, in <listcomp>
[rank0]:     tokens = [self.tokenizer.id_to_byte_piece(id) for id in ids]
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/mistral_common/tokens/tokenizers/tekken.py", line 280, in id_to_byte_piece
[rank0]:     return self._model.decode_single_token_bytes(token_id - self.num_special_tokens)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/tiktoken/core.py", line 272, in decode_single_token_bytes
[rank0]:     return self._core_bpe.decode_single_token_bytes(token)
[rank0]: OverflowError: out of range integral type conversion attempted

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/debugpy/__main__.py", line 71, in <module>
[rank0]:     cli.main()
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/debugpy/server/cli.py", line 501, in main
[rank0]:     run()
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/debugpy/server/cli.py", line 351, in run_file
[rank0]:     runpy.run_path(target, run_name="__main__")
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
[rank0]:     return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
[rank0]:     _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/root/qwen-vl/test_pixtral.py", line 154, in <module>
[rank0]:   File "/root/qwen-vl/test_pixtral.py", line 120, in gen_reponse
[rank0]:     
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/entrypoints/llm.py", line 571, in chat
[rank0]:     return self.generate(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/utils.py", line 1063, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/entrypoints/llm.py", line 353, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/entrypoints/llm.py", line 879, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 1386, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/executor/gpu_executor.py", line 134, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
[rank0]:     raise type(err)(f"Error in model execution: "
[rank0]: OverflowError: Error in model execution: out of range integral type conversion attempted
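
For reference, the failing call path at the bottom of the traceback (convert_ids_to_tokens -> Tekkenizer.id_to_byte_piece -> tiktoken decode_single_token_bytes(token_id - num_special_tokens)) can be exercised with mistral_common alone. The sketch below is only a diagnostic, and it makes two assumptions: that the checkpoint directory ./Pixtral-12B-2409 contains a tekken.json, and that mistral_common's MistralTokenizer.from_file loader can read it. It shows that any token id below num_special_tokens goes negative after the subtraction, which is exactly what produces this OverflowError:

# Diagnostic sketch only (assumption: ./Pixtral-12B-2409 contains tekken.json).
# It mimics the call path shown in the traceback:
#   MistralTokenizer.convert_ids_to_tokens -> Tekkenizer.id_to_byte_piece
#   -> tiktoken.decode_single_token_bytes(token_id - num_special_tokens)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

mistral_tok = MistralTokenizer.from_file("./Pixtral-12B-2409/tekken.json")
tekken = mistral_tok.instruct_tokenizer.tokenizer  # underlying Tekkenizer

print("num_special_tokens:", tekken.num_special_tokens)

# ids at/above num_special_tokens decode normally; ids below it (special or
# image tokens) become negative after the subtraction and overflow in tiktoken
for token_id in [tekken.num_special_tokens + 1, tekken.num_special_tokens, 0]:
    try:
        print(token_id, "->", tekken.id_to_byte_piece(token_id))
    except OverflowError as err:
        print(token_id, "-> OverflowError:", err)

If that reproduces, it suggests the detokenizer is handing the tekken tokenizer a special (e.g. image) token id, rather than the sampling itself being wrong.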

My example code is:

from vllm import LLM
from vllm.sampling_params import SamplingParams
import base64
import pathlib
model_name = "./Pixtral-12B-2409"

def gen_reponse():
    max_img_per_msg = 5

    sampling_params = SamplingParams(max_tokens=8192, temperature=0.7, seed=1024)

    # Lower max_num_seqs or max_model_len on low-VRAM GPUs.
    llm = LLM(model=model_name, tokenizer_mode="mistral", limit_mm_per_prompt={"image": max_img_per_msg}, max_model_len=8192, seed=1024)

    # Chinese prompt: "describe the content of the image in detail"
    prompt = """详细描述图片的内容"""

    raw_image_path_list = pathlib.Path("data/small_data").glob("*.jpg")

    for image_path in raw_image_path_list:
        image_name = image_path.name
        image_name = image_name.split(".")[0]  # e.g. "1.jpg" -> "1"
        print(f"image_name: {image_name}")
        # load to base64
        with open(image_path, "rb") as f:
            image_data = f.read()
            image_data = base64.b64encode(image_data).decode("utf-8")

            messages = [
                {
                    "role": "user",
                    "content": [{"type": "text", "text": prompt},
                                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},]
                },
            ]

            # print(messages)
            outputs = llm.chat(messages, sampling_params=sampling_params)

            print(outputs[0].outputs[0].text)

gen_reponse()
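
Separately, the max_num_seqs warning in the log is just arithmetic from my own settings: limit_mm_per_prompt={"image": 5} times a per-image budget of 4096 tokens gives 20480, and 8192 // 20480 == 0, so vLLM forces max_num_seqs to 1. Below is a hedged variant of the setup with illustrative values that keeps the image budget inside max_model_len; I have not confirmed that it avoids the OverflowError, which looks like a detokenizer issue rather than a memory/budget one:

# Variant of the engine setup (illustrative values only; this addresses the
# max_num_seqs warning, not necessarily the OverflowError during decode).
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(
    model="./Pixtral-12B-2409",
    tokenizer_mode="mistral",
    # one image per prompt: ~4096 image tokens (per the 8192 // 20480
    # arithmetic in the warning) fits comfortably inside max_model_len
    limit_mm_per_prompt={"image": 1},
    max_model_len=8192,
    seed=1024,
)

# leave room in the 8192-token context for the prompt and image tokens
sampling_params = SamplingParams(max_tokens=2048, temperature=0.7, seed=1024)

The rest of the loop (reading the jpg, base64-encoding it, calling llm.chat) stays the same.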
