Deploying this on Text Generation Inference (TGI) server on AWS SageMaker
#38 opened by ZaydJamadar
Now that the Hugging Face TGI server can be deployed on AWS SageMaker as a Deep Learning Container, deployment requires the LLM to be wrapped in the HuggingFaceModel class. How can I deploy this GPTQ-quantised LLM that way?
Hi there,
It is possible to deploy the model using TGI on SageMaker; you just have to adapt the configuration.
Here is an example of how to deploy the main branch to SageMaker:
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the TGI (LLM) container image URI for the given version
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3", session=sess)
If you don't know how to set up a session, just check out this resource: https://huggingface.co/docs/sagemaker/inference#installation-and-setup
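For completeness, here is a minimal sketch of that setup, assuming the sagemaker and boto3 SDKs are installed and an execution role already exists (the role name below is just a placeholder, adapt it to your account):

import sagemaker
import boto3

sess = sagemaker.Session()
try:
    # works when running inside a SageMaker notebook / Studio
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker: look up an existing execution role by name (placeholder)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]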
# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
quantize = "gptq"
num_shard = 1
bits = 4
group_size = 128
revision = 'main'  # branch of the repo; must match the desired GPTQ branch of HF_MODEL_ID
# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-13B-chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'QUANTIZE': quantize,
    'NUM_SHARD': json.dumps(num_shard),
    'GPTQ_BITS': json.dumps(bits),
    'GPTQ_GROUPSIZE': json.dumps(group_size),
    'REVISION': revision,
}
# create HuggingFaceModel with the image URI
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config, sagemaker_session=sess)
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 5 minutes for the model to load
)
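Once the endpoint is InService, you can sanity-check it with a test prompt. This is just a sketch: TGI expects an "inputs" payload, and the prompt and generation parameters below are arbitrary.

# send a test prompt to the TGI endpoint (payload format: {"inputs": ..., "parameters": {...}})
response = llm.predict({
    "inputs": "What is AWS SageMaker?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7, "top_p": 0.9},
})
print(response)

# clean up when you are done to avoid charges for an idle instance
llm.delete_model()
llm.delete_endpoint()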