seba committed
Commit e6c9b34
1 parent: edfe3bc

Create README.md

Files changed (1): README.md (+21 -0)
---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---
ANE-compatible stateful CoreML models with a maximum context length of 512 tokens.
Multifunction models that process either 1 or 64 tokens per call.

6-bit quantized models apply a grouped per-output-channel LUT with group size 4.
For example, if the weights have shape (32, 64), the LUT has shape (8, 1, 64): 8 groups of 4 output channels, each with its own 64-entry table. ANE does not support
per-input-channel grouping; smaller group sizes are considerably slower, while larger group sizes are barely faster.

After LUT dequantization, a per-output-channel scale is applied (it would have shape (32, 1) for the same example shapes).
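The scheme above can be sketched in NumPy. This is only an illustration of the math with random data, not the actual CoreML kernel; all array names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

out_ch, in_ch, group_size, n_bits = 32, 64, 4, 6
n_groups = out_ch // group_size  # 8 groups of output channels

# 6-bit codes, one 64-entry table per group, and a per-output-channel scale.
indices = rng.integers(0, 2**n_bits, size=(out_ch, in_ch))
lut = rng.standard_normal((n_groups, 1, 2**n_bits)).astype(np.float32)
scale = rng.standard_normal((out_ch, 1)).astype(np.float32)

# Each output channel looks up its group's table...
lut_rows = lut[np.arange(out_ch) // group_size, 0]       # (32, 64): one table row per output channel
dequant = np.take_along_axis(lut_rows, indices, axis=1)  # (32, 64) dequantized values
dequant *= scale                                         # ...then applies its own scale
```

Note that the same output channel always reads from a single table, which is what "grouped per-output-channel" means here.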

Quantization is not applied to the first and last layers, nor to the embeddings (the head weights are shared with the input embeddings).

Current issues:
- Input embeddings are duplicated, once for the input and once for the prediction head. Since ANE supports a maximum size of `16_384` along a dimension, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.
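To make the splitting concrete, here is a small sketch of chunking an embedding table's vocabulary axis to fit the `16_384` limit. The table shape is an assumption (roughly Qwen2-0.5B-sized), used only for illustration:

```python
MAX_DIM = 16_384  # ANE's maximum size along one dimension, as noted above

# Assumed, Qwen2-0.5B-like embedding table shape: ~152k vocabulary, 896 hidden dims.
vocab, hidden = 151_936, 896

# Split the vocabulary axis into ANE-sized chunks; it is this split that
# leads CoreML to store the embedding weights twice.
chunk_sizes = [min(MAX_DIM, vocab - start) for start in range(0, vocab, MAX_DIM)]
print(len(chunk_sizes), chunk_sizes[-1])  # 10 chunks, the last holding 4_480 rows
```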

This model requires iOS 18 or macOS 15 to run, and the coremltools beta if running from Python (`pip install coremltools==8.0b2`).

An example of how to use the models can be found in `coreml_example.py`; it can be run with the following command: `python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5`