UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-07-12-08-18-29-406
I followed this guide:
https://www.philschmid.de/sagemaker-falcon-llm
It crashed at "Deploy model to an endpoint":
https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    volume_size=400,  # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
    container_startup_health_check_timeout=health_check_timeout,  # 10 minutes to be able to load the model
)
I used instance: ml.g4dn.xlarge
UnexpectedStatusException Traceback (most recent call last)
in
5 instance_type=instance_type,
6 # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
----> 7 container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
8 )
/opt/conda/lib/python3.7/site-packages/sagemaker/huggingface/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, async_inference_config, serverless_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config, **kwargs)
326 container_startup_health_check_timeout=container_startup_health_check_timeout,
327 inference_recommendation_id=inference_recommendation_id,
--> 328 explainer_config=explainer_config,
329 )
330
/opt/conda/lib/python3.7/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, async_inference_config, serverless_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config, **kwargs)
1334 data_capture_config_dict=data_capture_config_dict,
1335 explainer_config_dict=explainer_config_dict,
-> 1336 async_inference_config_dict=async_inference_config_dict,
1337 )
1338
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict, async_inference_config_dict, explainer_config_dict)
4575 self.sagemaker_client.create_endpoint_config(**config_options)
4576
-> 4577 return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
4578
4579 def expand_role(self, role):
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
3968 )
3969 if wait:
-> 3970 self.wait_for_endpoint(endpoint_name)
3971 return endpoint_name
3972
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
4323 message=message,
4324 allowed_statuses=["InService"],
-> 4325 actual_status=status,
4326 )
4327 return desc
UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-07-12-08-18-29-406: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
The endpoint container is not healthy and is restarting. Check the endpoint cloudwatch logs for details.
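If it helps, here is a rough sketch of pulling the most recent log events for the endpoint with boto3 (SageMaker endpoint logs land in the /aws/sagemaker/Endpoints/<endpoint-name> log group; substitute your own endpoint name):

import boto3

endpoint_name = "huggingface-pytorch-tgi-inference-2023-07-12-08-18-29-406"  # replace with your failed endpoint
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

logs = boto3.client("logs")

# find the most recent log stream written by the endpoint container
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)["logStreams"]

if streams:
    events = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=streams[0]["logStreamName"],
        limit=50,
    )["events"]
    for event in events:
        print(event["message"])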
I am trying to deploy falcon-40b and have been experiencing the same error since I moved to
# install supported sagemaker SDK
!pip install "sagemaker==2.175.0" --upgrade --quiet
and
# retrieve the llm image uri
from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.9.3"
)
Everything works fine with llm_image version "0.8.2".
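To double-check which container each version actually resolves to, I just print the URIs (quick sketch, nothing account-specific in it):

# compare the resolved TGI container URIs for both versions
from sagemaker.huggingface import get_huggingface_llm_image_uri

for tgi_version in ["0.8.2", "0.9.3"]:
    image = get_huggingface_llm_image_uri("huggingface", version=tgi_version)
    print(tgi_version, "->", image)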
Both tests were done with:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# TGI config
config = {
    'HF_MODEL_ID': "tiiuae/falcon-40b-instruct",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
    # 'HF_MODEL_QUANTIZE': "bitsandbytes",  # comment in to quantize
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
    role=role,  # SageMaker execution role, defined earlier (e.g. via sagemaker.get_execution_role())
    image_uri=llm_image,
    env=config
)
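The deploy call is the same one from the guide, using the variables above (sketch, assuming llm_model, instance_type, and health_check_timeout as defined):

# deploy the model to a SageMaker endpoint, as in the guide
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # time allowed for the model to load
)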
With falcon-7b I am able to successfully deploy using version "0.9.3".
Details of my error:
UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-08-08-09-21-35-398: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
What is the error you see in CloudWatch?