What does it mean that inject_fused_attention has to be disabled for the 70B model?
#24 opened by neo-benjamin
What does it mean that inject_fused_attention has to be disabled for the 70B model?
It means that if you use AutoGPTQ from Python code, you need to set inject_fused_attention=False in your AutoGPTQForCausalLM.from_quantized() call. Look at the Python code example I gave; it is already set there.
This is because Llama 2 70B changes the model architecture (it uses grouped-query attention), and AutoGPTQ needs to be updated to support this, which will happen soon.
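For reference, here is a minimal sketch of such a call. The repo name, prompt, and generation settings are assumptions used only for illustration and are not taken from this thread:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Assumed repo name, used here only for illustration.
model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,
    # Disable fused attention until AutoGPTQ supports the 70B architecture.
    inject_fused_attention=False,
)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, do_sample=True, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```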