What does it mean that inject_fused_attention has to be disabled for the 70B model?

#24
by neo-benjamin - opened

What does it mean that inject_fused_attention has to be disabled for the 70B model?

It means that if you load the model with AutoGPTQ from Python code, you need to set inject_fused_attention=False in your AutoGPTQForCausalLM.from_quantized() call. The Python code example I gave already sets it.
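
For reference, here is a minimal sketch of such a call. The repo name `TheBloke/Llama-2-70B-GPTQ` and the generation settings are assumptions for illustration, not necessarily the exact example from the original post:

```python
# Minimal sketch: loading a GPTQ-quantized Llama 2 70B with fused attention disabled.
# The model repo name below is an assumption used purely for illustration.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    inject_fused_attention=False,  # required for the 70B model until AutoGPTQ supports its attention layout
    quantize_config=None,
)

# Quick generation check
prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, do_sample=True, temperature=0.7, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```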

This is because Llama 2 70B uses grouped-query attention, a change to the model architecture that AutoGPTQ's fused attention does not yet support. AutoGPTQ will be updated to handle it soon.
