Can you make ARM-optimized quants too?
Like Q4_0_4_4 GGUF. 7B models can run OK even on a good mid-range phone under KoboldAI in Termux, but the normal quants are slow on ARM.
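For anyone curious what that involves: producing such a quant is, as far as I understand, just a call to llama.cpp's quantize tool with the Q4_0_4_4 type. A minimal sketch follows, assuming you have llama.cpp built and an F16 GGUF of the model on hand; the file paths are placeholders, and newer llama.cpp builds have dropped the dedicated Q4_0_4_4 type in favor of repacking plain Q4_0 at load time, so check your version.

```python
import subprocess
from pathlib import Path

# Placeholder paths -- adjust to wherever your llama.cpp build and F16 GGUF live.
LLAMA_QUANTIZE = Path("llama.cpp/build/bin/llama-quantize")
SRC = Path("EVA-Qwen2.5-7B-F16.gguf")
DST = Path("EVA-Qwen2.5-7B-Q4_0_4_4.gguf")

# Q4_0_4_4 is the ARM-friendly repacked variant of Q4_0 understood by
# older llama.cpp quantize builds; recent ones repack Q4_0 at runtime instead.
subprocess.run(
    [str(LLAMA_QUANTIZE), str(SRC), str(DST), "Q4_0_4_4"],
    check=True,
)
print(f"Wrote {DST}")
```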
I suspect he means using the TPU that is embedded in most current ARM SoCs. Hardly anyone supports that yet.
That is something I'd love to see too. I've messed with it some, with limited success, but when it worked it was a great thing.
You mean Google Tensor? Not really. I just want something like this that runs fast enough on my Poco X6 Pro under Termux with KoboldAI:
https://huggingface.co/SicariusSicariiStuff/EVA-UNIT-01_EVA-Qwen2.5-7B-v0.0_ARM
I don't know the technical details.
No, I do not mean Google. Different companies call them different things (love standards), but modern ARM SoCs have on-board processing for AI. Some call them TPUs, some call them NPUs, and there are a few other names. Using that is the only way to get decent speed out of an ARM machine (unless you go Apple with a GPU). Unfortunately that part of the industry is still in disarray, so it's not 'plug and play' yet the way NVIDIA would be on x86. It's doable, but it's not easy. With luck that stabilizes and something like llama.cpp can include it in the driver set.
(And actually an IoT-level TPU from Google isn't going to work either. It has its place in the AI world, but LLMs are not it.)