Inference endpoint deployment for 'meta-llama/Meta-Llama-3.1-8B-Instruct' fails

#62
by Keertiraj - opened

The inference endpoint deployment for 'meta-llama/Meta-Llama-3.1-8B-Instruct' fails. I chose the 'GPU · Nvidia L4 · 1x GPU · 24 GB' instance.

If anyone has resolved this issue, it would be great if you could share the solution.

The error log:

Endpoint encountered an error.
You can try restarting it using the "pause" button above. Check logs for more details.
[Server message]Endpoint failed to start
See details
ine 253, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 333, in get_model\n return FlashLlama(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 70, in __init__\n config = AutoConfig.from_pretrained(\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 952, in from_pretrained\n return config_class.from_dict(config_dict, **unused_kwargs)\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 761, in from_dict\n config = cls(**config_dict)\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 161, in __init__\n self._rope_scaling_validation()\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 181, in _rope_scaling_validation\n raise ValueError(\n\nValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2024-07-29T13:51:16.757745Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-07-29T13:51:16.757779Z","level":"INFO","fields":{"message":"Shutting down shards"},"targ

I'm also getting the same error

ine 253, in serve\n    asyncio.run(\n\n  File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n    return future.result()\n\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n    model = get_model(\n\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 333, in get_model\n    return FlashLlama(\n\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py\", line 70, in __init__\n    config = AutoConfig.from_pretrained(\n\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py\", line 952, in from_pretrained\n    return config_class.from_dict(config_dict, **unused_kwargs)\n\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py\", line 761, in from_dict\n    config = cls(**config_dict)\n\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py\", line 161, in __init__\n    self._rope_scaling_validation()\n\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py\", line 181, in _rope_scaling_validation\n    raise ValueError(\n\nValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2024-07-31T10:23:40.823645Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-07-31T10:23:40.823662Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}

The main error is: ValueError: rope_scaling must be a dictionary with two fields, type and factor. This happens when the transformers version running on the endpoint is too old to parse the new Llama 3.1 rope_scaling format.

In my case, I used a LoRA adapter + base model.

I tried installing a newer transformers package by creating a requirements.txt file and adding the new transformers version, but it still doesn't work. Reference
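For what it's worth, here is a minimal sketch for checking whether a given transformers install can even parse the new config format (it assumes you have access to the gated meta-llama repo with a valid token, or a local copy of its config.json):

# Minimal check: transformers >= 4.43 parses the Llama 3.1 rope_scaling block;
# older versions raise the same ValueError shown in the logs above.
# Assumes access to the gated meta-llama repo (or point it at a local config.json).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(config.rope_scaling)  # on new versions, expect {'rope_type': 'llama3', 'factor': 8.0, ...}

If this raises the rope_scaling ValueError locally, the environment is on an older transformers release than the model config requires.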

Try This

I tried this and it's working. The model I'm using is my own fine-tuned model based on Llama 3.1. Please check below.

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15#66a0857b5d5f5950b2f62f8c

Try using a custom container

url: ghcr.io/huggingface/text-generation-inference:2.2.0

Environment variables:

MODEL_ID=/repository

(use the actual word "repository")
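In case it helps, here is a rough sketch of doing the same thing programmatically with huggingface_hub's create_inference_endpoint instead of the UI. The endpoint name, vendor, region, and instance values below are placeholders I picked, not something from this thread, so adjust them to your account:

# Hedged sketch: create the endpoint with a custom TGI 2.2.0 image via the API.
# Name, vendor, region, and instance_* values are assumptions; adjust them.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama-3-1-8b-instruct-test",                        # hypothetical endpoint name
    repository="meta-llama/Meta-Llama-3.1-8B-Instruct",  # or your fine-tuned repo
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-l4",                           # check the UI for the exact instance name
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:2.2.0",
        "env": {"MODEL_ID": "/repository"},              # literal "/repository", as noted above
    },
)
endpoint.wait()
print(endpoint.url)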

@nbroad thank you for your response. Could you please share the full code?

I used a Hugging Face Inference Endpoint to deploy the model.

Here is the config.json file:

{
  "_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128256,
  "eos_token_id": 128257,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128257,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "type": "dynamic"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128258
}
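A minimal sketch of how the config.json above could be patched, assuming a local copy of the fine-tuned model repo (note that this swaps in a different RoPE scaling method than the original llama3 one, so upgrading the serving stack is the cleaner fix):

# Minimal sketch (assumes a local copy of the fine-tuned model repo):
# rewrite rope_scaling into the legacy {type, factor} form that older
# transformers releases accept, as in the config.json shown above.
import json

path = "config.json"  # path inside your local model repo
with open(path) as f:
    cfg = json.load(f)

# Older transformers only accepts the two fields "type" and "factor".
# Note: this changes the RoPE scaling behaviour, so validate model outputs.
cfg["rope_scaling"] = {"type": "dynamic", "factor": 8.0}

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)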

Click the "advanced configuration" dropdown, then use the settings below

(screenshot of the advanced configuration settings)
