Inference endpoint deployment for 'meta-llama/Meta-Llama-3.1-8B-Instruct' fails
The inference endpoint deployment for 'meta-llama/Meta-Llama-3.1-8B-Instruct' fails. I chose the 'GPU · Nvidia L4 · 1x GPU · 24 GB' instance.
If anyone has resolved this issue, it would be great if you could share the solution.
The error log:
Endpoint encountered an error.
You can try restarting it using the "pause" button above. Check logs for more details.
[Server message]Endpoint failed to start
ine 253, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model\n return FlashLlama(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 70, in __init__\n config = AutoConfig.from_pretrained(\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 952, in from_pretrained\n return config_class.from_dict(config_dict, **unused_kwargs)\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 761, in from_dict\n config = cls(**config_dict)\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 161, in __init__\n self._rope_scaling_validation()\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 181, in _rope_scaling_validation\n raise ValueError(\n\nValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2024-07-29T13:51:16.757745Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-07-29T13:51:16.757779Z","level":"INFO","fields":{"message":"Shutting down shards"},"targ
I'm also getting the same error:
ine 253, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 333, in get_model\n return FlashLlama(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py\", line 70, in __init__\n config = AutoConfig.from_pretrained(\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py\", line 952, in from_pretrained\n return config_class.from_dict(config_dict, **unused_kwargs)\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py\", line 761, in from_dict\n config = cls(**config_dict)\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py\", line 161, in __init__\n self._rope_scaling_validation()\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py\", line 181, in _rope_scaling_validation\n raise ValueError(\n\nValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2024-07-31T10:23:40.823645Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-07-31T10:23:40.823662Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`
This is the main error.
In my case I used a LoRA Adapter + Base Model.
I tried installing the latest transformers package by creating a requirements.txt file and adding the new transformers version, but it still isn't working. Reference
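For context, the llama3-style `rope_scaling` block (with `rope_type`, `low_freq_factor`, etc.) is only parsed by transformers 4.43 or newer; older versions run the strict two-field validation that raises the error above. A quick local sanity check, just a sketch assuming you are logged in and have access to the gated repo:

import transformers
from transformers import AutoConfig

# Print the installed version; with transformers >= 4.43 the llama3-style
# rope_scaling parses cleanly instead of raising the ValueError above.
print(transformers.__version__)

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(config.rope_scaling)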
Try This
I tried this and it's working. The model I'm using is my own fine-tuned model based on Llama 3.1. Please check the link below.
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15#66a0857b5d5f5950b2f62f8c
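If the workaround in that thread is the one about rewriting `rope_scaling` in the fine-tuned repo's config.json to the older two-field format (which is what the config posted further down in this thread does), a rough sketch of applying it with huggingface_hub could look like the following. The repo id is a placeholder, and note that this changes the RoPE scaling behaviour compared to the native llama3 scheme, so upgrading the serving image (see the custom-container suggestion below) is the cleaner fix.

import json
from huggingface_hub import hf_hub_download, upload_file

REPO_ID = "your-username/your-llama3.1-finetune"  # placeholder for your fine-tuned repo

# Download the current config.json from the Hub
path = hf_hub_download(repo_id=REPO_ID, filename="config.json")
with open(path) as f:
    cfg = json.load(f)

# Replace the llama3-style rope_scaling block with the legacy two-field form
# expected by older transformers / TGI builds.
cfg["rope_scaling"] = {"type": "dynamic", "factor": 8.0}

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

# Push the edited config back to the repo
upload_file(
    path_or_fileobj="config.json",
    path_in_repo="config.json",
    repo_id=REPO_ID,
    commit_message="Use two-field rope_scaling for older transformers",
)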
Try using a custom container:
url: ghcr.io/huggingface/text-generation-inference:2.2.0
environment variables:
MODEL_ID=/repository
(use the actual word "repository")
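A minimal sketch of creating such an endpoint from the huggingface_hub Python client; the endpoint name, vendor, region, and instance identifiers below are assumptions to adjust to your own account and quota.

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama-3-1-8b-instruct",                 # endpoint name (placeholder)
    repository="meta-llama/Meta-Llama-3.1-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                            # assumption
    region="us-east-1",                      # assumption
    instance_size="x1",                      # assumption: 1x GPU
    instance_type="nvidia-l4",               # assumption: L4 24 GB
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:2.2.0",
        "env": {
            # literal word "repository", as noted above
            "MODEL_ID": "/repository",
        },
    },
)
endpoint.wait()   # block until the endpoint is running
print(endpoint.url)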
@nbroad thank you for your response. Could you please share the full code?
I used the Hugging Face Inference Endpoint to deploy the model. Here is the config.json file:
{
"_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128256,
"eos_token_id": 128257,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pad_token_id": 128257,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"type": "dynamic"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.43.3",
"use_cache": true,
"vocab_size": 128258
}