# AWS Neuron optimum model cache
This repository contains cached neuron compilation artifacts for the most popular models on the Hugging Face Hub.
## Inference

### LLM models
The transparent caching mechanism included in optimum-neuron and NeuronX TGI makes it easier to export and deploy cached models to Neuron platforms such as Trainium and Inferentia.
To deploy any cached model directly to SageMaker:
- go to the model page,
- select "Deploy" in the top right corner,
- select "AWS SageMaker" in the drop-down,
- select the "AWS Inferentia & Trainium" tab,
- copy the code snippet.
You can now paste the code snippet into your deployment script or notebook and follow the instructions in the comments.
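As a rough illustration of what such a snippet looks like, here is a minimal sketch using the SageMaker Python SDK with the NeuronX TGI container. The helper name `neuron_tgi_env`, the chosen environment variable values, and the instance type are assumptions for this example, not the exact output of the "Deploy" button; always prefer the snippet generated on the model page.

```python
def neuron_tgi_env(model_id: str, batch_size: int = 4, sequence_length: int = 4096) -> dict:
    # Hypothetical helper: builds the container environment for a cached model.
    # Variable names/values are illustrative; use the ones from the model page snippet.
    return {
        "HF_MODEL_ID": model_id,
        "HF_BATCH_SIZE": str(batch_size),
        "HF_SEQUENCE_LENGTH": str(sequence_length),
        "HF_AUTO_CAST_TYPE": "fp16",
    }

def deploy(model_id: str):
    # Requires the sagemaker SDK and AWS credentials; not executed here.
    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    role = sagemaker.get_execution_role()
    model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
        env=neuron_tgi_env(model_id),
        role=role,
    )
    # inf2 instance type is an example; pick one matching the cached configuration.
    return model.deploy(initial_instance_count=1, instance_type="ml.inf2.xlarge")
```

Because the compilation artifacts are cached, the endpoint can skip recompilation at startup, provided the deployment configuration (batch size, sequence length, cores, dtype) matches a cached one.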
To export a model to Neuron and save it locally, please follow the instructions in the optimum-neuron documentation.
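For orientation, a local export typically follows the pattern below, using optimum-neuron's `NeuronModelForCausalLM` with `export=True`. The specific compilation parameters (`batch_size`, `sequence_length`, `num_cores`, `auto_cast_type`) are example values; to benefit from the cache they should match one of the cached configurations, and the code must run on a Neuron-capable instance.

```python
def export_locally(model_id: str, save_dir: str):
    # Requires optimum-neuron on a Trainium/Inferentia instance; not executed here.
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(
        model_id,
        export=True,          # compile the model for Neuron
        batch_size=1,         # example values: match a cached configuration
        sequence_length=2048,
        num_cores=2,
        auto_cast_type="fp16",
    )
    model.save_pretrained(save_dir)  # store the compiled artifacts locally
```

If the configuration is found in this cache, the precompiled artifacts are fetched instead of recompiling, which can reduce export time from tens of minutes to a few minutes.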
For a list of the cached models and configurations, please refer to the inference cache configuration files.
Alternatively, you can use the `optimum-cli neuron cache lookup` command to check whether a specific model is cached and list its cached configurations.