How could I deploy liuhaotian/llava-v1.5-7b on a server?

#7
by andreydmitr20 - opened

Hi,
I've got llama.cpp working with ggml-model-q4_k.gguf on my notebook.
Now, I'm trying to run:

python3 -m llava.serve.controller --host 0.0.0.0 --port 10000
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b --load-4bit

and I'm getting this error:

modeling_utils.py", line 2842, in from_pretrained
2023-11-21 16:41:10 | ERROR | stderr | raise ValueError(
2023-11-21 16:41:10 | ERROR | stderr | ValueError:
2023-11-21 16:41:10 | ERROR | stderr | Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
2023-11-21 16:41:10 | ERROR | stderr | the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
2023-11-21 16:41:10 | ERROR | stderr | these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
2023-11-21 16:41:10 | ERROR | stderr | device_map to from_pretrained. Check
2023-11-21 16:41:10 | ERROR | stderr | https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
2023-11-21 16:41:10 | ERROR | stderr | for more details.
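
For reference, the fix the traceback points at would look roughly like this when loading the model by hand. This is only a sketch: the LlavaLlamaForCausalLM import mirrors what the LLaVA worker uses internally, the kwargs shown are assumptions, and since the worker builds its own quantization config while loading, the same change would have to be made in its loading code rather than on the command line.

from llava.model import LlavaLlamaForCausalLM
from transformers import BitsAndBytesConfig

# Sketch: let layers that do not fit in GPU RAM stay on the CPU in fp32
# instead of raising the ValueError above. Recent transformers versions
# expose this switch as llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = LlavaLlamaForCausalLM.from_pretrained(
    "liuhaotian/llava-v1.5-7b",
    quantization_config=quant_config,
    device_map="auto",        # or a hand-written map sending specific modules to "cpu"
    low_cpu_mem_usage=True,
)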

What should I do to run this model on CPU only?
Thanks.
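
One thing to try for a CPU-only run (a sketch, not verified here): bitsandbytes 4-bit/8-bit quantization needs a CUDA GPU, so drop --load-4bit entirely and, assuming the worker exposes a --device flag, point it at the CPU. Expect it to be much slower than the llama.cpp GGUF route.

python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b --device cpu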

@andreydmitr20 Could you please help me fine-tune this model? Please let me know when we can connect on this.
