16-bit version?
Do you have plans to upload a 16-bit version of your model? That would make it a lot more accessible for inference on smaller GPUs.
@dirkgr can correct me, but I am not aware of such plans. You should be able to load the model and then call, say, `model = model.bfloat16()` to convert the weights to 16 bits. You may need to load the model on the CPU, downcast it to 16 bits, and then move it to the GPU. An alternative with higher memory requirements (which we used while training the model) is to use `torch.autocast` with a 16-bit dtype.
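For reference, here is a minimal sketch of the CPU-downcast approach described above, assuming the `allenai/OLMo-7B` checkpoint (substitute whichever checkpoint you are using); depending on your transformers version, loading OLMo may require `trust_remote_code=True`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # example checkpoint, swap in the one you need

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load on the CPU in full precision, downcast to bf16, then move to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model = model.bfloat16()
model = model.to("cuda")

# Alternative with higher memory requirements (used during training): keep the
# fp32 weights and run the forward pass under autocast with a 16-bit dtype.
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     outputs = model(**inputs)
```

Passing `torch_dtype=torch.bfloat16` to `from_pretrained` may also avoid the intermediate fp32 copy, but I haven't verified that path with this checkpoint.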
@shanearora I completely get that, but if I'm loading the model with vLLM, I get OOM errors before any conversion can happen. I guess I could convert it and upload it myself, but it would be a bit more official if you all had a 16-bit version uploaded. Same goes for quantised and GGUF versions, as those are required by other applications like llama.cpp and LM Studio. But it's up to you - feel free to close this issue if you're not planning on it 🙂
@akshitab Do you know about OLMo plans in relation to vLLM?
vLLM integration for OLMo is currently in progress here: https://github.com/vllm-project/vllm/issues/2763