Flash Attention
Do you plan to implement Flash Attention 2? Or maybe I am doing something wrong here:
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).eval().cuda()
I am getting this error:
Exception has occurred: ValueError
  File "/home/MULTIMODAL_TESTS/Mini-InternVL-Chat-4B-V1-5_test1.py", line 89, in <module>
    model = AutoModel.from_pretrained(
ValueError: InternVLChatModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co//mnt/disk2/LLM_MODELS/models/MULTIMODAL/Mini-InternVL-Chat-4B-V1-5/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new
Hi, thank you for your attention.
Flash Attention is already enabled by default through the model's own configuration, so you do not need to pass attn_implementation="flash_attention_2" when loading it.
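As a minimal sketch under that assumption, simply drop the attn_implementation argument and let the remote InternVLChatModel code pick its attention backend from the checkpoint config. The optional check at the end assumes the config exposes a flash-attention flag such as use_flash_attn (the exact field name may differ between InternVL releases), so it falls back gracefully if the field is absent.

import torch
from transformers import AutoConfig, AutoModel

# Local checkpoint path taken from the question above.
path = "/mnt/disk2/LLM_MODELS/models/MULTIMODAL/Mini-InternVL-Chat-4B-V1-5"

# Load without attn_implementation; the custom model code shipped with the
# checkpoint (trust_remote_code=True) selects its own attention backend.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

# Optional sanity check. Assumption: the vision sub-config carries a
# flash-attention flag named `use_flash_attn`; if it does not, print a hint
# to inspect config.json instead of raising.
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
vision_cfg = getattr(config, "vision_config", config)
print(getattr(vision_cfg, "use_flash_attn", "flag not found; check the checkpoint's config.json"))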