Custom LLM Support

#2
by aneeshbhattacharya - opened

Hi, I am planning to use a custom, smaller LLM. I have gone through the codebase for Llama and implemented some changes like replacing Linears with Convs. The model is 1B. According to the instructions provided by QC's Llama docs, it seems models < 7B can run in a single go on the hardware.
My question is, how do I use Llama's export function for my custom model? Is it compatible if I replace Llama's model with my own and make similar changes in all the required files? Can the export function, which calls AI Hub, export the QNN version?
TLDR: Are hub export, submit_inference_job, submit_profile_job, etc. all model specific (implemented per model on the QC side) or model agnostic?
Further, are the quantization operations etc. performed directly on the QC side, or am I required to pre-quantize the weights on my end?

Qualcomm org

If you run the export scripts for Llama 2 and Llama 3, they will pull down pre-computed quantization parameters and export the quantized model to an ONNX file plus AIMET encodings. This format is then passed to AI Hub for compilation.
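In code, that hand-off is model-agnostic: once you have an ONNX file and its AIMET encodings, you submit them to AI Hub the same way for any model. A minimal sketch (the path, device name, and compile options below are illustrative placeholders, not the exact values the Llama scripts use):

```python
# Hedged sketch: compiling an exported ONNX model with AIMET encodings on AI Hub.
# The directory path, device name, and options are placeholders.
import qai_hub as hub

compile_job = hub.submit_compile_job(
    model="build/llama_part_1.aimet",          # assumed layout: model.onnx + *.encodings
    device=hub.Device("Samsung Galaxy S23"),   # pick the device you actually target
    options="--target_runtime qnn_context_binary",
)
target_model = compile_job.get_target_model()

# Profiling (and inference) jobs use the same generic API, regardless of the model.
profile_job = hub.submit_profile_job(
    model=target_model,
    device=hub.Device("Samsung Galaxy S23"),
)
```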

The difficulty of adapting this recipe to a custom model is the quantization parameters. We do not have a full recipe explaining how to quantize your own model. This is something we are actively working on, but cannot offer today. We are working toward being able to quantize LLMs directly through AI Hub, which will significantly simplify the user story for custom models.

Models of this size (7B) have to be quantized to be viable to run on device, and even then the model has to be split up into multiple context binaries.

If you want to deploy something as small as 1B, you may find success without quantizing the activations of the model, which would simplify the endeavor. You can use the Llama 2/3 code as a reference, but skip the encodings section; a rough sketch of that flow is below. Meanwhile, please look out for announcements by joining our Slack channel or following our repositories.
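If you go the float route for a 1B model, the flow is essentially the generic AI Hub one: trace your PyTorch model and submit compile and profile jobs. A minimal sketch, assuming a toy stand-in for your architecture (the module, device name, input shape, and compile options are placeholders to adapt):

```python
# Hedged sketch: compiling a small float model through AI Hub, with no
# pre-computed activation encodings. TinyLM is a toy stand-in for your own
# 1B architecture; device name, shapes, and options are placeholders.
import torch
import torch.nn as nn
import qai_hub as hub

class TinyLM(nn.Module):
    def __init__(self, vocab=32000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Conv1d in place of a Linear head, mirroring the kind of change described above.
        self.head = nn.Conv1d(dim, vocab, kernel_size=1)

    def forward(self, input_ids):
        x = self.embed(input_ids).transpose(1, 2)  # (batch, dim, seq)
        return self.head(x)                        # (batch, vocab, seq)

model = TinyLM().eval()
example = torch.randint(0, 32000, (1, 128), dtype=torch.int32)
traced = torch.jit.trace(model, example)

compile_job = hub.submit_compile_job(
    model=traced,
    device=hub.Device("Samsung Galaxy S23"),
    input_specs={"input_ids": ((1, 128), "int32")},
    options="--target_runtime qnn_lib_aarch64_android",
)
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23"),
)
```

Since producing the encodings is the hard part of the current recipe, skipping them removes most of the model-specific work.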
