TGI - RuntimeError: mat1 and mat2 shapes cannot be multiplied (4145x3072 and 1x14155776)
#3 by turjo4nis - opened
The config file is using all the defaults.
Command used to launch TGI:
docker run --gpus all --shm-size 1g -p 8080:80 -v $model:$model ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --hostname 0.0.0.0 --num-shard 1 --trust-remote-code --quantize bitsandbytes-nf4
Full Log output:
2024-05-30T18:00:15.018611Z INFO text_generation_launcher: Args { model_id: "/media/turjo/hdd/CSE498r/Phi-3-mini-4k-instruct-bnb-4bit/", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-05-30T18:00:15.018654Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-05-30T18:00:15.018658Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-05-30T18:00:15.018659Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-05-30T18:00:15.018660Z INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-05-30T18:00:15.018662Z WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/media/turjo/hdd/CSE498r/Phi-3-mini-4k-instruct-bnb-4bit/` do not contain malicious code.
2024-05-30T18:00:15.018703Z INFO download: text_generation_launcher: Starting download process.
2024-05-30T18:00:17.328268Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-05-30T18:00:17.621313Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-30T18:00:17.621436Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-30T18:00:20.720863Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-05-30T18:00:20.724419Z INFO shard-manager: text_generation_launcher: Shard ready in 3.102418385s rank=0
2024-05-30T18:00:20.823765Z INFO text_generation_launcher: Starting Webserver
2024-05-30T18:00:20.865539Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'
2024-05-30T18:00:20.865552Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'
2024-05-30T18:00:20.865554Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'
2024-05-30T18:00:20.865556Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'
2024-05-30T18:00:20.865557Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'
2024-05-30T18:00:20.865558Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'
2024-05-30T18:00:20.865559Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'
2024-05-30T18:00:20.865561Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'
2024-05-30T18:00:20.865562Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'
2024-05-30T18:00:20.865563Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'
2024-05-30T18:00:20.865564Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'
2024-05-30T18:00:20.865838Z INFO text_generation_router: router/src/main.rs:253: Using config Some(Mistral)
2024-05-30T18:00:20.865844Z INFO text_generation_router: router/src/main.rs:260: Using local tokenizer config
2024-05-30T18:00:20.865860Z WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model /media/turjo/hdd/CSE498r/Phi-3-mini-4k-instruct-bnb-4bit/
2024-05-30T18:00:20.867713Z INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-05-30T18:00:21.047898Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 98, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 768, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
out, speculative_logits = self.forward(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 514, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 461, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 393, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 321, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 194, in forward
qkv = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 419, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 312, in forward
out = bnb.matmul_4bit(
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 509, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4145x3072 and 1x14155776)
2024-05-30T18:00:21.116801Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
Error: Warmup(Generation("CANCELLED"))
2024-05-30T18:00:21.136724Z ERROR text_generation_launcher: Webserver Crashed
2024-05-30T18:00:21.136732Z INFO text_generation_launcher: Shutting down shards
Error: WebserverFailed
2024-05-30T18:00:21.385222Z INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Can you try upgrading transformers and see if it still works?
pip install "git+https://github.com/huggingface/transformers.git"
Didn't work, unfortunately. Still getting the same error.
@turjo4nis did you try it with the upgrade flag?
pip install --upgrade "git+https://github.com/huggingface/transformers.git"
Oh wait, I don't think TGI supports pre-quantized 4-bit checkpoints. The --quantize bitsandbytes-nf4 flag quantizes full-precision weights at load time, so a model that was already saved in bnb-4bit form, like this one, won't load.
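For what it's worth, the shapes in the error line up exactly with that diagnosis. A minimal sketch of the arithmetic, assuming Phi-3-mini's published layer sizes and bitsandbytes' packing of two 4-bit values per byte:

# Why the matmul sees (4145 x 3072) @ (1 x 14155776) for Phi-3-mini.
hidden_size = 3072                 # Phi-3-mini hidden dimension
qkv_out = 3 * hidden_size          # fused query_key_value projection: 9216
n_params = qkv_out * hidden_size   # full-precision weight: 28_311_552 values

# bitsandbytes NF4 packs two 4-bit values per uint8 byte and stores the
# result as a flat (n/2, 1) tensor; transposed, that is mat2 above.
packed = n_params // 2
assert packed == 14_155_776

# mat1 is the warmup batch: max_batch_prefill_tokens x hidden_size.
assert (4145, hidden_size) == (4145, 3072)

So the kernel is multiplying the raw packed 4-bit storage instead of an unpacked (9216, 3072) weight, which is what you'd expect if TGI read the pre-quantized checkpoint as ordinary weights. Pointing --model-id at the full-precision microsoft/Phi-3-mini-4k-instruct and keeping --quantize bitsandbytes-nf4, so TGI quantizes at load time, should avoid the mismatch.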