Inference endpoint deployment for 'meta-llama/Meta-Llama-3.1-8B-Instruct' fails
The inference endpoint deployment for 'meta-llama/Meta-Llama-3.1-8B-Instruct' fails. I chose the 'GPU · Nvidia L4 · 1x GPU · 24 GB' instance.
If anyone has resolved this issue, it would be great if you could share the solution.
The error log:
Endpoint encountered an error.
You can try restarting it using the "pause" button above. Check logs for more details.
[Server message]Endpoint failed to start
ine 253, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model\n return FlashLlama(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 70, in __init__\n config = AutoConfig.from_pretrained(\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 952, in from_pretrained\n return config_class.from_dict(config_dict, **unused_kwargs)\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 761, in from_dict\n config = cls(**config_dict)\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 161, in __init__\n self._rope_scaling_validation()\n\n File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 181, in _rope_scaling_validation\n raise ValueError(\n\nValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2024-07-29T13:51:16.757745Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-07-29T13:51:16.757779Z","level":"INFO","fields":{"message":"Shutting down shards"},"targ
I'm also getting the same error:
ine 253, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 333, in get_model\n return FlashLlama(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py\", line 70, in __init__\n config = AutoConfig.from_pretrained(\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py\", line 952, in from_pretrained\n return config_class.from_dict(config_dict, **unused_kwargs)\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py\", line 761, in from_dict\n config = cls(**config_dict)\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py\", line 161, in __init__\n self._rope_scaling_validation()\n\n File \"/opt/conda/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py\", line 181, in _rope_scaling_validation\n raise ValueError(\n\nValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2024-07-31T10:23:40.823645Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-07-31T10:23:40.823662Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`
This is the main error.
In my case I used a LoRA Adapter + Base Model.
I tried installing the latest transformers package by creating a requirements.txt file and adding the new transformers version, but it still isn't working. Reference
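For context, the llama3-style `rope_scaling` block (with `rope_type`, `low_freq_factor`, etc.) is only parsed by transformers 4.43 or newer; older versions run the strict two-field validation that raises the error above. A quick local sanity check, just a sketch assuming you are logged in and have access to the gated repo:

import transformers
from transformers import AutoConfig

# Print the installed version; with transformers >= 4.43 the llama3-style
# rope_scaling parses cleanly instead of raising the ValueError above.
print(transformers.__version__)

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(config.rope_scaling)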
Try This
I tried this and it's working. The model I'm using is my own fine-tuned model based on Llama 3.1. Please check the link below.
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15#66a0857b5d5f5950b2f62f8c
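If the workaround in that thread is the one about rewriting `rope_scaling` in the fine-tuned repo's config.json to the older two-field format (which is what the config posted further down in this thread does), a rough sketch of applying it with huggingface_hub could look like the following. The repo id is a placeholder, and note that this changes the RoPE scaling behaviour compared to the native llama3 scheme, so upgrading the serving image (see the custom-container suggestion below) is the cleaner fix.

import json
from huggingface_hub import hf_hub_download, upload_file

REPO_ID = "your-username/your-llama3.1-finetune"  # placeholder for your fine-tuned repo

# Download the current config.json from the Hub
path = hf_hub_download(repo_id=REPO_ID, filename="config.json")
with open(path) as f:
    cfg = json.load(f)

# Replace the llama3-style rope_scaling block with the legacy two-field form
# expected by older transformers / TGI builds.
cfg["rope_scaling"] = {"type": "dynamic", "factor": 8.0}

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

# Push the edited config back to the repo
upload_file(
    path_or_fileobj="config.json",
    path_in_repo="config.json",
    repo_id=REPO_ID,
    commit_message="Use two-field rope_scaling for older transformers",
)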
Try using a custom container:
url: ghcr.io/huggingface/text-generation-inference:2.2.0
environment variables:
MODEL_ID=/repository
(use the actual word "repository")
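A minimal sketch of creating such an endpoint from the huggingface_hub Python client; the endpoint name, vendor, region, and instance identifiers below are assumptions to adjust to your own account and quota.

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama-3-1-8b-instruct",                 # endpoint name (placeholder)
    repository="meta-llama/Meta-Llama-3.1-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                            # assumption
    region="us-east-1",                      # assumption
    instance_size="x1",                      # assumption: 1x GPU
    instance_type="nvidia-l4",               # assumption: L4 24 GB
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:2.2.0",
        "env": {
            # literal word "repository", as noted above
            "MODEL_ID": "/repository",
        },
    },
)
endpoint.wait()   # block until the endpoint is running
print(endpoint.url)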
@nbroad thank you for your response. Could you please share the full code?
I used the Hugging Face Inference Endpoint to deploy the model. Here is the config.json file:
{
"_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128256,
"eos_token_id": 128257,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pad_token_id": 128257,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"type": "dynamic"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.43.3",
"use_cache": true,
"vocab_size": 128258
}