Extremely high GPU requirements for both basic (demo.ipynb) and batch (batch_inference.ipynb) notebooks

#3 opened by dwb2023

I created some updated notebooks for both basic and batch inference.

GPU requirements were much higher than expected, given that the LLM is Phi-3 mini. I couldn't find any xgen model documentation recommending flash-attention, but it may help.

Don't take my word for it; I recommend running the code to verify the behavior.

I take responsibility for any issues with the markdown -- the examples in the model repository were pure code.

Basic Inference:
https://colab.research.google.com/drive/1suykCYjRUzJBDQaBJQQqyPzq8vJ9bg6w?usp=sharing

Batch inference:
https://colab.research.google.com/drive/1CklfRSGN95QqoDK8VVNUfamtUisqyRp7?usp=sharing

Salesforce org

Hi @dwb2023 ,

Thank you for sharing the Colab notebook.
The GPU memory usage for inference in full precision (fp32) will be around 16 GB for our model, which has around 4B parameters (at 4 bytes per parameter in fp32, the weights alone take roughly 16 GB).
In our local dev code, we have confirmed that running the model in bf16 precision with flash-attention gives the same results most of the time, so it is possible to reduce the memory usage by loading our model in bf16 precision.

I'll add bf16 inference code in the coming days. You're also more than welcome to contribute to this. Thank you for your feedback!
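
In the meantime, here is a minimal sketch of what bf16 loading could look like. The model ID and the flash-attention setup are assumptions based on this thread, not the final notebook code:

```python
# Sketch: load the model in bf16 to roughly halve weight memory vs. fp32.
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "Salesforce/xgen-mm-phi3-mini-instruct-r-v1",  # assumed model ID
    torch_dtype=torch.bfloat16,                    # ~8 GB of weights instead of ~16 GB in fp32
    trust_remote_code=True,
).to("cuda")

# If the flash-attn package is installed and the remote code supports it,
# passing attn_implementation="flash_attention_2" to from_pretrained may
# reduce memory further during generation.
```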

Salesforce org

Hi @dwb2023 ,
This is relevant to the high memory usage issue you mentioned, so I'll post the update in this thread.

I just made a change to our demo inference notebook. In this commit, I overwrite the eos_token in our tokenizer so the model stops at <|end|> as expected, instead of generating nonstop until the max length is hit (which is what caused the high memory usage).
And here's a more readable change list on GitHub for your reference.
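
For anyone reading along, a minimal sketch of that fix follows. The model ID is an assumption, and the generate() call is shown only as a commented usage example:

```python
# Sketch: overwrite the tokenizer's eos token so generation stops at <|end|>.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-mm-phi3-mini-instruct-r-v1",  # assumed model ID
    trust_remote_code=True,
)
tokenizer.eos_token = "<|end|>"  # stop token used by the Phi-3 chat template

# At generation time, pass the corresponding id so generate() stops early
# instead of running to max_new_tokens, e.g.:
# outputs = model.generate(
#     **inputs,
#     eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
#     max_new_tokens=512,
# )
```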

(And just to clarify: we only caught this issue now because it doesn't occur in our local evaluation/inference. We've been using the model and tokenizer defined in our local training code, which differ from the Huggingface converted versions, so our local models don't have this issue.)
