How do I run it?
Sorry, which loader currently works? I've tried a few without success, but it could be me doing something wrong.
With oobabooga's text-generation-webui and the ExLlama_HF loader, I got the following error: KeyError: 'model.embed_tokens.weight'
Other GPTQ models load without issue; for example, TheBloke/Synthia-34B-v1.2-GPTQ loads without error (using 21 GB of VRAM).
I heard that ExLlama added support for Yi recently, so you might just need to update ExLlama
(4-bit versions only of course, not 3-bit or 8-bit)
Or the Transformers loader should work. AutoGPTQ doesn't support it yet.
Feel free to try this new project to serve the model locally: https://github.com/vectorch-ai/ScaleLLM
1: Start the model inference server:

```shell
docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=TheBloke/Yi-34B-GPTQ \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest --logtostderr
```
2: Start the REST API server:

```shell
docker run -it --net=host \
  docker.io/vectorchai/scalellm-gateway:latest --logtostderr
```
You will get the following running services:
- ScaleLLM gRPC server on port 8888: localhost:8888
- ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
- ScaleLLM REST API server on port 8080: localhost:8080
Then send requests:

```shell
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "TheBloke/Yi-34B-GPTQ",
  "prompt": "what is vue.js",
  "max_tokens": 32,
  "temperature": 0.7
}'
```
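If you'd rather call the REST API from code than curl, the same request can be sent with Python's standard library. A minimal sketch, assuming the gateway is running on localhost:8080 as above; the payload fields mirror the curl example, and the helper function names here are just illustrative:

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=32, temperature=0.7):
    """Build the JSON payload for the /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_completion(payload, base_url="http://localhost:8080"):
    """POST the payload to the REST API server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    payload = build_completion_request("TheBloke/Yi-34B-GPTQ", "what is vue.js")
    print(send_completion(payload))
```

The response should follow the OpenAI-style completions format, so the generated text is typically under `choices[0]["text"]`.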
You can also run inference in text-generation-webui with ExLlamaV2:
https://huggingface.co/01-ai/Yi-34B/discussions/22#654fb707380ee26b49b3b180