---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---
ANE-compatible stateful CoreML models with a maximum context length of 512.
Multifunction models that process either 1 or 64 tokens per call.
|
7 |
+
|
8 |
+
6 bits quantized models apply a grouped-per-output-channel LUT with group size 4.
|
9 |
+
For example if the weights have shape (32, 64), the LUT has shape (8, 1, 36), ANE does not support
|
10 |
+
per-input channel grouping, and smaller group sizes are considerably slower, while larger group size are barely faster.
|
11 |
+
|
12 |
+
After LUT dequantization a per-output-channel scaling is applied (would have size (32, 1) for the same example shapes).
|
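The grouped dequantization above can be sketched in NumPy. This is a minimal illustration, not the CoreML implementation: the shapes follow the example in the text, and it assumes a 6-bit LUT stores `2**6 = 64` entries per group of output channels.

```python
import numpy as np

# Example shapes from the text: weights (32, 64), group size 4 along the
# output-channel axis -> 32 / 4 = 8 groups. Assumed 2**6 entries per LUT.
out_ch, in_ch, group_size, nbits = 32, 64, 4, 6
n_groups = out_ch // group_size                        # 8

rng = np.random.default_rng(0)
lut = rng.normal(size=(n_groups, 1, 2**nbits)).astype(np.float32)
indices = rng.integers(0, 2**nbits, size=(out_ch, in_ch))  # stored 6-bit codes
scale = rng.normal(size=(out_ch, 1)).astype(np.float32)    # per-output-channel

# Each output channel reads values from its group's table, then is rescaled.
group = np.arange(out_ch) // group_size                # group id per channel
dequant = lut[group[:, None], 0, indices]              # (32, 64) float weights
weights = dequant * scale                              # per-output-channel scale
```

Channels in the same group of four share one lookup table, while the final scale is per individual output channel.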

Quantization is not applied to the first and last layers or to the embeddings (the head weights are shared with the input embeddings).

Current issues:
- Input embeddings are duplicated: once for the input and once for the prediction head. Since the ANE supports a maximum dimension size of `16_384`, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.

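The splitting mentioned above can be illustrated with a short NumPy sketch. The shapes here are hypothetical placeholders, not the model's real vocabulary or hidden size; only the `16_384` per-dimension limit comes from the text.

```python
import numpy as np

# Hypothetical (vocab, hidden) embedding matrix whose first axis exceeds
# the ANE's assumed per-dimension limit of 16_384 elements.
ANE_MAX = 16_384
vocab, hidden = 40_000, 896
weights = np.zeros((vocab, hidden), dtype=np.float16)

# Split along the vocabulary axis into chunks of at most ANE_MAX rows;
# each chunk becomes a separate weight tensor in the converted model.
chunks = [weights[i:i + ANE_MAX] for i in range(0, vocab, ANE_MAX)]
sizes = [c.shape[0] for c in chunks]
```

Because the input embedding and the head each need their own split copies, CoreML ends up storing the weights twice.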
This model requires iOS 18 or macOS 15 to run, and the CoreMLTools beta if running in Python (`pip install coremltools==8.0b2`).

An example of how to use the models can be found in `coreml_example.py` and can be run with the following command: `python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5`