Extremely high GPU requirements for both basic (demo.ipynb) and batch (batch_inference.ipynb) notebooks

#3 opened by dwb2023

I created some updated notebooks for both basic and batch inference.

GPU requirements were much higher than expected, given that the LLM is Phi-3 mini. I couldn't find any xgen model documentation recommending flash-attention, but it may help.

Don't take my word for it; I recommend running the code to verify the behavior.

I take responsibility for any issues with the markdown -- the examples in the model repository were pure code.

Basic Inference:
https://colab.research.google.com/drive/1suykCYjRUzJBDQaBJQQqyPzq8vJ9bg6w?usp=sharing

Batch inference:
https://colab.research.google.com/drive/1CklfRSGN95QqoDK8VVNUfamtUisqyRp7?usp=sharing

Salesforce org

Hi @dwb2023 ,

Thank you for sharing the Colab notebook.
The GPU memory usage for inference in full precision (fp32) will be around 16 GB for our model, which has around 4B parameters (at 4 bytes per parameter in fp32, the weights alone take roughly 16 GB).
In our local dev code, we have confirmed that running the model in bf16 precision with flash-attention gives the same results most of the time, so it is possible to reduce the memory usage by loading our model in bf16 precision.

I'll add bf16 inference code in the coming days. You're also more than welcome to contribute to this. Thank you for your feedback!
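
In the meantime, here is a minimal sketch of what bf16 loading could look like. The model ID and the flash-attention setup are assumptions based on this thread, not the final notebook code:

```python
# Sketch: load the model in bf16 to roughly halve weight memory vs. fp32.
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "Salesforce/xgen-mm-phi3-mini-instruct-r-v1",  # assumed model ID
    torch_dtype=torch.bfloat16,                    # ~8 GB of weights instead of ~16 GB in fp32
    trust_remote_code=True,
).to("cuda")

# If the flash-attn package is installed and the remote code supports it,
# passing attn_implementation="flash_attention_2" to from_pretrained may
# reduce memory further during generation.
```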

Salesforce org

Hi @dwb2023 ,
This is relevant to the high memory usage issue you mentioned, so I'll post the update in this thread.

I just made a change to our demo inference notebook. In this commit, I overwrite the eos_token in our tokenizer so the model stops at <|end|> as expected, instead of generating nonstop until the max length is hit (which is what caused the high memory usage).
And here's a more readable change list on GitHub for your reference.
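
For anyone reading along, a minimal sketch of that fix follows. The model ID is an assumption, and the generate() call is shown only as a commented usage example:

```python
# Sketch: overwrite the tokenizer's eos token so generation stops at <|end|>.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-mm-phi3-mini-instruct-r-v1",  # assumed model ID
    trust_remote_code=True,
)
tokenizer.eos_token = "<|end|>"  # stop token used by the Phi-3 chat template

# At generation time, pass the corresponding id so generate() stops early
# instead of running to max_new_tokens, e.g.:
# outputs = model.generate(
#     **inputs,
#     eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
#     max_new_tokens=512,
# )
```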

(And just to clarify: we only caught this issue now because it doesn't occur in our local evaluation/inference. We've been using the model and tokenizer defined in our local training code, which differ from the Huggingface converted versions, so our local models don't have this issue.)
