use Flash Attention

#8
by kakascode - opened

I attempted to use Flash Attention, but encountered the following error: "NewModel does not support Flash Attention 2.0 yet." Does the gte-multilingual-base model not support Flash Attention 2.0 yet?

Alibaba-NLP org

Could you please paste the code for your model inference here? It would help us with debugging.


from transformers import AutoModel

model = AutoModel.from_pretrained(model_path, trust_remote_code=True, attn_implementation="flash_attention_2")

ValueError: NewModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted
Alibaba-NLP org

xformers includes a Flash Attention 2 kernel and will dispatch to it automatically when running on a supported device and data type. See https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers
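For illustration, here is a minimal sketch of that recommended setup, assuming xformers is installed and using the `unpad_inputs` / `use_memory_efficient_attention` options described on the linked page (the model id and input text are placeholders):

# pip install xformers  # provides the memory-efficient attention kernels
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "Alibaba-NLP/gte-multilingual-base"  # assumed model id for illustration

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Instead of attn_implementation="flash_attention_2", enable the xformers-backed
# acceleration flags from the linked recommendation (flag names taken from that page).
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    unpad_inputs=True,                    # unpadding, per the recommendation
    use_memory_efficient_attention=True,  # xformers kernel; dispatches to FA2 when supported
    torch_dtype=torch.float16,            # half precision is required for the FA2 kernel path
).to("cuda").eval()

inputs = tokenizer(
    ["What is the weather like today?"],  # example sentence, placeholder
    padding=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0]  # CLS-token embedding

Whether the xformers kernel is actually used depends on the GPU and dtype; on unsupported setups xformers silently falls back to another attention implementation.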
