What does it mean that inject_fused_attention has to be disabled for the 70B model?
#24 opened by neo-benjamin
What does it mean that inject_fused_attention has to be disabled for the 70B model?
It means that if you use AutoGPTQ from Python code, you need to set inject_fused_attention=False in your AutoGPTQForCausalLM.from_quantized() call. Look at the Python code example I gave; it is already set there.
This is because Llama 2 70B changes the model architecture (it uses grouped-query attention), and AutoGPTQ needs to be updated to support this, which will happen soon.
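For reference, here is a minimal sketch of such a call. The repo name, prompt, and generation settings are assumptions used only for illustration and are not taken from this thread:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Assumed repo name, used here only for illustration.
model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,
    # Disable fused attention until AutoGPTQ supports the 70B architecture.
    inject_fused_attention=False,
)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, do_sample=True, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```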