Quantizing ONNX FP32 to q4f16 for Web

#2
by nickelshh - opened

The web has a model size limitation, and Phi-3.5 uses q4f16 to reduce the weight size. Is there any public framework that can do that?

Pretty common to use 4-bit quantization for LLMs. I used this script, which takes care of it:
https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/builder.py
and under the hood it will use
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/quantization
for the quantization.
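
For reference, the core of that 4-bit step can also be run directly. A minimal sketch, assuming the MatMul4BitsQuantizer API from onnxruntime.quantization (builder.py drives this internally with its own settings, so the parameters and file names here are only illustrative):

import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Load the full-precision model (placeholder path).
model = onnx.load("model_fp32.onnx")

# Block-wise 4-bit weight-only quantization of the MatMul weights.
quant = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quant.process()

# Large initializers are written to an external .data file next to the model.
quant.model.save_model_to_file("model_q4.onnx", use_external_data_format=True)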

Thanks for the info. It seems that even using this builder.py I cannot make the external data small enough to fit the Chrome browser fetch limitation, and the model.onnx is only around 1 MB. Is there any specific parameter to make model.onnx bigger and the external data smaller? Thanks.

It seems I can change the size_threshold so that only the larger tensors go into the external data file. But even with this approach, I cannot get the same inference result as your original onnx_web model. Is there any specific setting for converting to web, such as builder.py xxx -e web? Thank you.
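
For context, size_threshold is the standard ONNX external-data parameter: any tensor larger than the threshold (in bytes) is written to the external data file, and everything smaller stays embedded in model.onnx, so raising it makes model.onnx bigger and the external file smaller. A minimal sketch using onnx.save_model (file names are placeholders):

import onnx

model = onnx.load("model_q4.onnx")

# Tensors above size_threshold bytes go to the external file,
# smaller ones stay inside model.onnx itself.
onnx.save_model(
    model,
    "model_resaved.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_resaved.onnx.data",
    size_threshold=1024,
    convert_attribute=False,
)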

I use this script to shape the external data:
https://github.com/guschmue/ort-web-perf/blob/master/onnx-chunk-external-data.py
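
Conceptually, chunking just means writing the initializers into several external files that each stay below the size the browser will fetch, and recording the file name and offset per tensor. This is only a rough sketch of that idea, not the actual script; the file names, the 1 GB limit, and the simple single-pass packing are assumptions:

import onnx
from onnx.external_data_helper import set_external_data

MAX_CHUNK = 1_000_000_000  # bytes per external data file (assumed limit)

model = onnx.load("model_in.onnx")  # placeholder; loads external data into raw_data

chunk, offset = 0, 0
out = open(f"model_out.onnx.data_{chunk}", "wb")
for tensor in model.graph.initializer:
    data = tensor.raw_data
    if not data:
        continue  # small tensors without raw_data stay embedded in model.onnx
    if offset > 0 and offset + len(data) > MAX_CHUNK:
        out.close()
        chunk, offset = chunk + 1, 0
        out = open(f"model_out.onnx.data_{chunk}", "wb")
    out.write(data)
    set_external_data(tensor, location=f"model_out.onnx.data_{chunk}",
                      offset=offset, length=len(data))
    tensor.ClearField("raw_data")
    offset += len(data)
out.close()

onnx.save(model, "model_out.onnx")  # placeholder output path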

and this script to cast the logits to fp32 so JavaScript does not need to deal with fp16:
https://github.com/guschmue/ort-web-perf/blob/master/onnx-wrap-fp16.py
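
Conceptually the fp16 wrapping does something like this: rename the fp16 logits tensor inside the graph, append a Cast back to fp32, and re-declare the graph output as float. Again, just a sketch of the idea rather than the actual script, and it assumes the output is literally named "logits":

import onnx
from onnx import TensorProto, helper

# Placeholder paths; with load_external_data=False the initializers keep
# referencing the existing external data file, so save next to it.
model = onnx.load("model_in.onnx", load_external_data=False)
graph = model.graph

for output in graph.output:
    if output.name != "logits":
        continue
    internal_name = "logits_fp16"
    # Re-point the node that produces "logits" to an internal fp16 name.
    for node in graph.node:
        for i, name in enumerate(node.output):
            if name == "logits":
                node.output[i] = internal_name
    # Cast back to fp32 so the JavaScript side never has to handle fp16 logits.
    graph.node.append(
        helper.make_node("Cast", [internal_name], ["logits"], to=TensorProto.FLOAT)
    )
    output.type.tensor_type.elem_type = TensorProto.FLOAT

onnx.save(model, "model_wrapped.onnx")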

The entire thing looks like this:

root=$PWD
model=models/tjs/Phi-3.5-mini-instruct-onnx-web

python builder.py -m models/microsoft/Phi-3.5-mini-instruct -o $model -p int4 -e web
rm -rf /tmp/opt.* /tmp/model.onnx* $model/onnx

mkdir $model/onnx
python onnx/onnx-wrap-fp16.py --input $model/model.onnx  --output /tmp/model.onnx --external_data --name logits
python onnx/onnx-chunk-external-data.py --threshhold 1 --maxchunks 1 --input /tmp/model.onnx --output $model/onnx/model_q4f16.onnx
cp models/microsoft/Phi-3.5-mini-instruct/*.json $model/
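
A quick way to check the result before loading it in the browser is to inspect the graph outputs of the final model and confirm that logits is declared as fp32 (the path is the one produced above):

import onnx
from onnx import TensorProto

model = onnx.load(
    "models/tjs/Phi-3.5-mini-instruct-onnx-web/onnx/model_q4f16.onnx",
    load_external_data=False,
)
for out in model.graph.output:
    elem = out.type.tensor_type.elem_type
    print(out.name, TensorProto.DataType.Name(elem))  # expect: logits FLOAT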
