Gemma 2's Flash attention 2 implementation is strange...

#23
by GPT007 - opened

I tested with torch.manual_seed(0).

eager attention => normal result
flash attention 2 => garbled output like: 1's not. The 2's "to be's for.3' for4. 2 That 4 2 the 4 that 4 for. 4's 4' to 4''' the 4'' to. 4' 4 4 4to lose to. 4 the' 4 4 4' 4' 4 the 4 the 4 4 4 ...

It is almost as if there were no attention at all.

With "eager", it works fine.

Yes, it should be fixed when you install the new version of flash-attention from source.

I installed it yesterday 😅
And on Windows, so it took a few hours 😨

pip freeze | findstr flash-attn
flash-attn==2.5.9.post1

Took 2 hours, but finally installed flash-attention >= 2.6.0

GPT007 changed discussion status to closed
GPT007 changed discussion status to open
It is still producing weird output, for example:

4. 4: 4's over 4. I 4's. 4's. 4.

5's a great. 5' 5' 5' 6' 6. 6' 6' 6' 6' 6. 6' 6' 6' 6' 6 7. 7' 7' 7' 8' 8 8' 8' 8 8 9 9 9 9' 9 9 0 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

pip freeze | findstr flash-attn
flash-attn==2.6.0.post1
pip install --upgrade transformers
  1. Go to transformers/utils/import_utils.py and change the function at line 815 to:

```python
def is_flash_attn_greater_or_equal(str_version):
    # True only if the installed flash-attn package is at least str_version.
    if not _is_package_available("flash_attn"):
        return False

    return version.parse(importlib.metadata.version("flash_attn")) >= version.parse(str_version)
```
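
You can sanity-check the version gate from Python afterwards; I'm assuming here that the helper is re-exported from transformers.utils in your installed version:

```python
# Assumption: is_flash_attn_greater_or_equal is importable from transformers.utils.
from transformers.utils import is_flash_attn_greater_or_equal

print(is_flash_attn_greater_or_equal("2.6.0"))  # True only if flash-attn >= 2.6.0 is installed
```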

I made a PR

Still facing the same issues...

Yeah, I installed 2.6 (from 11 July) via pip and it still does not work. I thought this was because the pip release did not include the latest fix.

flash-attn 2.6.0.post1

Maybe something went wrong with the install, even though I'm on the latest transformers.

I'm investigating (hard)

I made two scripts (with torch.manual_seed(0)) that are EXACTLY the same except for the max_new_tokens kwarg.

32 as max_new_tokens gives normal output, but
1024 as max_new_tokens gives weird output.

I tested with all powers of two and concluded that the bigger max_new_tokens is, the less precise the output gets.

1-126: good
256 or more: bad
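
For what it's worth, the sweep looks roughly like this (a sketch; the checkpoint and prompt are placeholders, and only max_new_tokens changes between runs):

```python
# Sketch of the max_new_tokens sweep described above, assuming a CUDA GPU and bf16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
).to("cuda")
inputs = tokenizer("Explain how attention works.", return_tensors="pt").to("cuda")

for max_new_tokens in (32, 64, 128, 256, 512, 1024):  # powers of two, as in the test above
    torch.manual_seed(0)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    print(f"--- max_new_tokens={max_new_tokens} ---")
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```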

Another problem with Flash Attention 2! I'm tired of all these stupid bugs!

GPT007 changed discussion status to closed
GPT007 changed discussion status to open

There is no direct link to the max_new_tokens kwarg

attn_implementation="sdpa"

this works?

SDPA = Scaled Dot Product Attention = the equivalent of Flash Attention, but in native PyTorch
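
Concretely, it's a one-argument switch when loading the model (sketch; the checkpoint is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# "sdpa" routes attention through torch.nn.functional.scaled_dot_product_attention.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```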

aha, but does this work?

But SDPA is ~ 5 times slower

So I recommend just google/gemma-2b or google/gemma-7b

Why would you use it? With FA it is not that slow.

But FA2 with Gemma 2 is buggy.

I mean without using FA at all: is it faster than with SDPA or not? I will try it, though.

I ran some tests:

The eager attention implementation took 13.01 seconds to complete on 100 tokens with a context length of 1k.
The sdpa attention implementation took 11.83 seconds to complete on 100 tokens with a context length of 1k.
The flash_attention_2 attention implementation took 11.30 seconds to complete on 100 tokens with a context length of 1k (but is weird).
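
The harness behind numbers like these is roughly the following (a sketch with assumptions: a ~1k-token dummy context, 100 new tokens, bf16 on a single GPU; the checkpoint is a placeholder):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
context = "word " * 1000  # roughly a 1k-token context
inputs = tokenizer(context, return_tensors="pt").to("cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, attn_implementation=impl
    ).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    print(f"The {impl} attention implementation took {time.perf_counter() - start:.2f} seconds "
          f"to complete on 100 tokens with a context length of 1k.")
    del model
    torch.cuda.empty_cache()
```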

Awesome, sdpa is a bit faster and probably uses less memory as well.

For me FA2 generates much longer text, so it ends up around 2x slower than PyTorch.


I get the same time and memory usage for both sdpa and eager, so it probably is not working. Around 9.3 GB of VRAM in 4-bit.

And I installed 2.6.1; it still does not work well.

What is your GPU?

Not sure. Should flash attention work with 4-bit as well?

The script: script

Did the script work for you? I use quite a similar one. I guess there will be an FA update soon.

flash attention works with bitsandbytes
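
For example, something like this loads a 4-bit bitsandbytes model with flash attention enabled (a sketch; the checkpoint is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # attention itself still runs in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",  # placeholder checkpoint
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)
```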

Thanks for clarifying; then FA2 is the problem. I will do a complete reinstall afterwards.

The time difference between them is mainly caused by the time to first token, so it's not a problem to use FA1 (a.k.a. SDPA).

Started process.
The eager attention implementation took 1.53s to first token.
Started process.
The sdpa attention implementation took 0.14s to first token.
Started process.
The flash_attention_2 attention implementation took 0.15s to first token.
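
One simple way to approximate time to first token is to time a single-token generation per backend (a sketch; the checkpoint and prompt are placeholders):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, attn_implementation=impl
    ).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)  # prefill + first token only
    torch.cuda.synchronize()
    print(f"The {impl} attention implementation took {time.perf_counter() - start:.2f}s to first token.")
    del model
    torch.cuda.empty_cache()
```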

You can now install transformers-4.43.0.dev0 from source!

But it doesn't fix FA2:

Started process.
The eager attention implementation took 13.43s to infer 100 tokens.
The eager attention works great!
Started process.
The sdpa attention implementation took 12.06s to infer 100 tokens.
The sdpa attention works great!
Started process.
The flash_attention_2 attention implementation took 11.53s to infer 100 tokens.
The flash_attention_2 attention is weird!

With transformers-4.43.0.dev0 and flash-attn 2.6.1, flash attention 2 seems a bit off, but not too much.
On longer contexts, it fails.

Wait... I think I've got it

GPT007 changed discussion status to closed
GPT007 changed discussion status to open
GPT007 changed discussion status to closed

What is the solution, actually?

Use Gemma 1 👇

@rsdfsfas SOLUTION:

  1. pip install git+https://github.com/zucchini-nlp/transformers@gemma2
  2. pip install git+https://github.com/huggingface/transformers once this PR is merged!
  3. When transformers 4.44.0 is released on pip, simply pip install transformers!

Perfect, thanks.

Also, it seems like just 4-5 lines to add in the gemma2 modeling file, so it is pretty easy to patch in.
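
For context on what those lines do: Gemma 2 soft-caps its attention logits, and flash-attn only gained a softcap argument in version 2.6.0, so as far as I understand the PR, the modeling change essentially forwards that cap to the flash-attn kernel instead of dropping it. A rough sketch of the concept (not the actual diff; shapes and values are illustrative):

```python
import torch
from flash_attn import flash_attn_func  # requires flash-attn >= 2.6.0 for `softcap`

# Dummy (batch, seq_len, num_heads, head_dim) tensors just to make the call concrete.
q = torch.randn(1, 128, 8, 256, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 128, 4, 256, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 128, 4, 256, dtype=torch.bfloat16, device="cuda")

# The gist of the fix: pass Gemma 2's attn_logit_softcapping (50.0 in its config)
# to the kernel so the logits are capped the same way as in the eager path.
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)
```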

It works just by changing these lines; it is a bit slower than without flash attention and it uses the same amount of memory.

Maybe there is still something broken.

It does output a good response.

Started process with the eager attn_implementation.
The eager attn_implementation took 15.17s to infer {tokens} tokens.
Started process with the sdpa attn_implementation.
The sdpa attn_implementation took 21.51s to infer {tokens} tokens.
Started process with the flash_attention_2 attn_implementation.
The flash_attention_2 attn_implementation took 30.53s to infer {tokens} tokens.

Yes, something is very wrong. It probably won't be fixed.

This might fix this, at least memory part: https://github.com/huggingface/transformers/pull/31292

I know, but we need to ask for it to be applied to gemma2, not only to gemma (1).

GPT007 changed discussion status to open

All Gemmas are included, as far as I know.

I looked at the commits, and it changed the global generation utils AND "the most used models", which include Gemma (1) but not Gemma 2.

Gemma 2 was not released yet when I started this, but don't worry, I will add it as well; it's on the roadmap 🤗
