How do I run it?
Sorry, which loader currently works? I've tried a few without success, but it could be me doing something wrong.
With oobabooga's text-generation-webui and the ExLlama_HF loader, I got the following error: KeyError: 'model.embed_tokens.weight'
Other GPTQ models load without issue; for example, TheBloke/Synthia-34B-v1.2-GPTQ loads without error (using 21 GB of VRAM).
I heard that ExLlama added support for Yi recently, so you might just need to update ExLlama
(4-bit versions only of course, not 3-bit or 8-bit)
Or the Transformers loader should work. AutoGPTQ doesn't support it yet.
Feel free to try this new project to serve the model locally: https://github.com/vectorch-ai/ScaleLLM
1: Start the model inference server:

```shell
docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=TheBloke/Yi-34B-GPTQ \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest --logtostderr
```
2: Start the REST API server:

```shell
docker run -it --net=host \
  docker.io/vectorchai/scalellm-gateway:latest --logtostderr
```
You will get the following running services:
- ScaleLLM gRPC server on port 8888: localhost:8888
- ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
- ScaleLLM REST API server on port 8080: localhost:8080
Then send requests:

```shell
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "TheBloke/Yi-34B-GPTQ",
  "prompt": "what is vue.js",
  "max_tokens": 32,
  "temperature": 0.7
}'
```
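If you'd rather call the REST API from code than curl, the same request can be sent with Python's standard library. A minimal sketch, assuming the gateway is running on localhost:8080 as above; the payload fields mirror the curl example, and the helper function names here are just illustrative:

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=32, temperature=0.7):
    """Build the JSON payload for the /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_completion(payload, base_url="http://localhost:8080"):
    """POST the payload to the REST API server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    payload = build_completion_request("TheBloke/Yi-34B-GPTQ", "what is vue.js")
    print(send_completion(payload))
```

The response should follow the OpenAI-style completions format, so the generated text is typically under `choices[0]["text"]`.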
You can also run inference in text-generation-webui with ExLlamaV2:
https://huggingface.co/01-ai/Yi-34B/discussions/22#654fb707380ee26b49b3b180