Deploying this on Text Generation Inference (TGI) server on AWS SageMaker
#38 opened by ZaydJamadar
Now that the Hugging Face TGI server can be deployed on AWS SageMaker as a Deep Learning Container, deployment requires the LLM to be wrapped in the HuggingFaceModel class. How can I deploy this GPTQ-quantised LLM that way?
Hi there,
It is possible to deploy the model using TGI on SageMaker; you just have to adapt the configuration.
Here is an example of how to deploy the main branch to SageMaker:
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the TGI (LLM) container image URI for the given version
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3", session=sess)
If you don't know how to set up a session, just check out this resource: https://huggingface.co/docs/sagemaker/inference#installation-and-setup
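For completeness, here is a minimal sketch of that setup, assuming the sagemaker and boto3 SDKs are installed and an execution role already exists (the role name below is just a placeholder, adapt it to your account):

import sagemaker
import boto3

sess = sagemaker.Session()
try:
    # works when running inside a SageMaker notebook / Studio
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker: look up an existing execution role by name (placeholder)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]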
# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
quantize = "gptq"
num_shard = 1
bits = 4
group_size = 128
revision = 'main'  # branch of the repo; must match the desired GPTQ branch of HF_MODEL_ID
# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-13B-chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'QUANTIZE': quantize,
    'NUM_SHARD': json.dumps(num_shard),
    'GPTQ_BITS': json.dumps(bits),
    'GPTQ_GROUPSIZE': json.dumps(group_size),
    'REVISION': revision,
}
# create HuggingFaceModel with the image URI
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config, sagemaker_session=sess)
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 5 minutes for the model to load
)
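Once the endpoint is InService, you can sanity-check it with a test prompt. This is just a sketch: TGI expects an "inputs" payload, and the prompt and generation parameters below are arbitrary.

# send a test prompt to the TGI endpoint (payload format: {"inputs": ..., "parameters": {...}})
response = llm.predict({
    "inputs": "What is AWS SageMaker?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7, "top_p": 0.9},
})
print(response)

# clean up when you are done to avoid charges for an idle instance
llm.delete_model()
llm.delete_endpoint()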