Deploying this on Text Generation Inference (TGI) server on AWS SageMaker

#38
by ZaydJamadar - opened

Now that the Hugging Face TGI server can be deployed on AWS SageMaker as a deep learning container, it expects the model to be wrapped in a HuggingFaceModel class. How can I deploy this GPTQ-quantised LLM that way?

Hi there,

It is possible to deploy the model with TGI on SageMaker; you just have to adapt the configuration.

Here is an example of how to deploy the main branch to SageMaker:

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the TGI (LLM) deep learning container image URI
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3", session=sess)

If you don't know how to set up a session, just check out this resource: https://huggingface.co/docs/sagemaker/inference#installation-and-setup
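For reference, a minimal session and role setup could look like the sketch below (assuming you run inside a SageMaker notebook or Studio; otherwise swap in the ARN of an IAM role with SageMaker permissions, the role name sagemaker_execution_role below is just a placeholder):

import boto3
import sagemaker

sess = sagemaker.Session()
try:
    # inside SageMaker (Studio/notebook) this resolves the attached execution role
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker, look up (or hard-code) an IAM role ARN instead
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]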

SageMaker config

instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
quantize = "gptq"
num_shard = 1
bits = 4
group_size = 128
revision = 'main'  # branch of the HF_MODEL_ID repo to load

Define the Model and Endpoint configuration parameters

config = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-13B-chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'QUANTIZE': quantize,
    'NUM_SHARD': json.dumps(num_shard),
    'GPTQ_BITS': json.dumps(bits),
    'GPTQ_GROUPSIZE': json.dumps(group_size),
    'REVISION': revision
}

Create the HuggingFaceModel with the image URI


llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config, sagemaker_session=sess)

Deploy the model to an endpoint

https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 5 minutes to be able to load the model
)
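Once the endpoint is up, you can query it through the predictor returned by deploy(). A minimal sketch, assuming the default JSON serialization of the HuggingFace predictor (the prompt and generation parameters are placeholders):

# run inference against the deployed TGI endpoint
payload = {
    "inputs": "Explain GPTQ quantization in one sentence.",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9,
    },
}

response = llm.predict(payload)
print(response[0]["generated_text"])

When you are done, you can clean up with llm.delete_model() and llm.delete_endpoint() to avoid ongoing charges.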
