Bug on AMD MI250 with flash-attention
#13 opened by PierreColombo
Hello,
Thanks a lot for the model and congrats on publishing a strong model.
The current model is not working on AMD MI250 with flash attention.
Concretely, take a node of MI250 and load the model with attn_implementation="flash_attention_2" (a minimal sketch is below).
If you load without flash attention it works. Other MoE models seem to work (Mixtral, Grok)!
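A minimal repro sketch of what I mean, assuming a standard transformers from_pretrained call; the model id is a placeholder (not the actual repo name), and the dtype/device settings are my assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-name"  # placeholder, replace with the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # produces garbage output on MI250
    # attn_implementation="eager",            # works fine on MI250
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```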
Congrats again, hopefully it will be working on @amd soon :)
Pierre
Btw, no problem on A100 :)
Does your setup work for other models with flash attention (e.g. Llama)? What error do you get?
Yes, it works for both inference and training.
There are no errors; the model just generates garbage. You can check the screenshot.