Bug on AMD MI250 with flash-attention
#13 opened by PierreColombo
Hello,
Thanks a lot for the model and congrats on publishing a strong model.
The current model is not working on AMD MI250 with flash attention.
Concretely, take a node of MI250 and load the model with attn_implementation="flash_attention_2" (a minimal sketch is below).
If you load without flash attention it works. Other MoE models seem to work (Mixtral, Grok)!
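A minimal repro sketch of what I mean, assuming a standard transformers from_pretrained call; the model id is a placeholder (not the actual repo name), and the dtype/device settings are my assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-name"  # placeholder, replace with the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # produces garbage output on MI250
    # attn_implementation="eager",            # works fine on MI250
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```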
Congrats again, hopefully it will be working on @amd soon :)
Pierre
Btw, no problem on A100 :)
Does your setup work for other models with flash attention (e.g. Llama)? What error do you get?
Yes, it works for both inference and training.
There are no errors; the model just generates garbage. You can check the screenshot.