seba committed
Commit e6c9b34
1 parent: edfe3bc

Create README.md

Files changed (1): README.md (+21 -0)
---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---
ANE-compatible stateful CoreML models with a maximum context length of 512 tokens.
Multifunction models that process either 1 or 64 tokens per call.

6-bit quantized models apply a grouped per-output-channel LUT with group size 4.
For example, if the weights have shape (32, 64), the LUT has shape (8, 1, 64): 8 groups of 4 output channels, each with its own 64-entry table. ANE does not support
per-input-channel grouping; smaller group sizes are considerably slower, while larger group sizes are barely faster.

After LUT dequantization, a per-output-channel scale is applied (it would have shape (32, 1) for the same example shapes).
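The scheme above can be sketched in NumPy. This is only an illustration of the math with random data, not the actual CoreML kernel; all array names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

out_ch, in_ch, group_size, n_bits = 32, 64, 4, 6
n_groups = out_ch // group_size  # 8 groups of output channels

# 6-bit codes, one 64-entry table per group, and a per-output-channel scale.
indices = rng.integers(0, 2**n_bits, size=(out_ch, in_ch))
lut = rng.standard_normal((n_groups, 1, 2**n_bits)).astype(np.float32)
scale = rng.standard_normal((out_ch, 1)).astype(np.float32)

# Each output channel looks up its group's table...
lut_rows = lut[np.arange(out_ch) // group_size, 0]       # (32, 64): one table row per output channel
dequant = np.take_along_axis(lut_rows, indices, axis=1)  # (32, 64) dequantized values
dequant *= scale                                         # ...then applies its own scale
```

Note that the same output channel always reads from a single table, which is what "grouped per-output-channel" means here.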

Quantization is not applied to the first and last layers, nor to the embeddings (the head weights are shared with the input embeddings).

Current issues:
- Input embeddings are duplicated, once for the input and once for the prediction head. Since ANE supports a maximum size of `16_384` along a dimension, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.
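To make the splitting concrete, here is a small sketch of chunking an embedding table's vocabulary axis to fit the `16_384` limit. The table shape is an assumption (roughly Qwen2-0.5B-sized), used only for illustration:

```python
MAX_DIM = 16_384  # ANE's maximum size along one dimension, as noted above

# Assumed, Qwen2-0.5B-like embedding table shape: ~152k vocabulary, 896 hidden dims.
vocab, hidden = 151_936, 896

# Split the vocabulary axis into ANE-sized chunks; it is this split that
# leads CoreML to store the embedding weights twice.
chunk_sizes = [min(MAX_DIM, vocab - start) for start in range(0, vocab, MAX_DIM)]
print(len(chunk_sizes), chunk_sizes[-1])  # 10 chunks, the last holding 4_480 rows
```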

This model requires iOS 18 or macOS 15 to run, and the coremltools beta if running from Python (`pip install coremltools==8.0b2`).

An example of how to use the models can be found in `coreml_example.py`; it can be run with the following command: `python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5`