compatible with Llama #29
by cArlIcon · opened
No description provided.
richardllin changed pull request status to open
richardllin changed pull request status to merged
Yi-34B's generation became 10x slower on 4xA10 GPUs after replacing YiForCausalLM with LlamaForCausalLM.
Any idea why?
Hi @rodrigo-nogueira, I'm not sure what the root cause is, but would you like to give Flash Attention a try by loading the model with use_flash_attention_2=True?
More context can be found at:
https://huggingface.co/docs/transformers/v4.35.2/en/perf_infer_gpu_one#Flash-Attention-2
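For reference, a minimal sketch of what that could look like with transformers v4.35 (the model id "01-ai/Yi-34B", the dtype, and the device_map setting are assumptions based on the thread, not part of the original discussion):

```python
# Sketch: load Yi-34B with Flash Attention 2 enabled (requires the flash-attn
# package and a fp16/bf16 dtype). Model id and device_map are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-34B"  # assumed model id for this repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # Flash Attention 2 needs fp16 or bf16
    device_map="auto",            # shard across the available GPUs (e.g. 4xA10)
    use_flash_attention_2=True,   # the flag suggested above (transformers v4.35)
)

inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```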
Thank you very much, it is much faster now.