# AWS Neuron optimum model cache
This repository contains cached neuron compilation artifacts for the most popular models on the Hugging Face Hub.
## Inference

### LLM models
The transparent caching mechanism included in optimum-neuron and NeuronX TGI makes it easier to export and deploy cached models to Neuron platforms such as Trainium and Inferentia.
To deploy any cached model directly to SageMaker:
- go to the model page,
- select "Deploy" in the top right corner,
- select "AWS SageMaker" in the drop-down,
- select the "AWS Inferentia & Trainium" tab,
- copy the code snippet.
You can now paste the code snippet into your deployment script or notebook and follow the instructions in the comments.
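As a rough illustration of what such a snippet looks like, here is a minimal sketch using the SageMaker Python SDK with the NeuronX TGI container. The helper name `neuron_tgi_env`, the chosen environment variable values, and the instance type are assumptions for this example, not the exact output of the "Deploy" button; always prefer the snippet generated on the model page.

```python
def neuron_tgi_env(model_id: str, batch_size: int = 4, sequence_length: int = 4096) -> dict:
    # Hypothetical helper: builds the container environment for a cached model.
    # Variable names/values are illustrative; use the ones from the model page snippet.
    return {
        "HF_MODEL_ID": model_id,
        "HF_BATCH_SIZE": str(batch_size),
        "HF_SEQUENCE_LENGTH": str(sequence_length),
        "HF_AUTO_CAST_TYPE": "fp16",
    }

def deploy(model_id: str):
    # Requires the sagemaker SDK and AWS credentials; not executed here.
    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    role = sagemaker.get_execution_role()
    model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
        env=neuron_tgi_env(model_id),
        role=role,
    )
    # inf2 instance type is an example; pick one matching the cached configuration.
    return model.deploy(initial_instance_count=1, instance_type="ml.inf2.xlarge")
```

Because the compilation artifacts are cached, the endpoint can skip recompilation at startup, provided the deployment configuration (batch size, sequence length, cores, dtype) matches a cached one.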
To export a model to Neuron and save it locally, please follow the instructions in the optimum-neuron documentation.
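For orientation, a local export typically follows the pattern below, using optimum-neuron's `NeuronModelForCausalLM` with `export=True`. The specific compilation parameters (`batch_size`, `sequence_length`, `num_cores`, `auto_cast_type`) are example values; to benefit from the cache they should match one of the cached configurations, and the code must run on a Neuron-capable instance.

```python
def export_locally(model_id: str, save_dir: str):
    # Requires optimum-neuron on a Trainium/Inferentia instance; not executed here.
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(
        model_id,
        export=True,          # compile the model for Neuron
        batch_size=1,         # example values: match a cached configuration
        sequence_length=2048,
        num_cores=2,
        auto_cast_type="fp16",
    )
    model.save_pretrained(save_dir)  # store the compiled artifacts locally
```

If the configuration is found in this cache, the precompiled artifacts are fetched instead of recompiling, which can reduce export time from tens of minutes to a few minutes.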
For a list of the cached models and configurations, please refer to the inference cache configuration files.
Alternatively, you can use the `optimum-cli neuron cache lookup` command to check whether a specific model is cached and list its cached configurations.