Is it possible to add logic for handling output_attentions?
Hi,
I am trying to add GPTQ support for MPT models in the AutoGPTQ repository. Adding support for a new model is relatively simple; for example, looking at the opt.py script for Facebook's OPT models, all one needs to do is specify the names of the nn.Linear layers that need to be quantized.
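For reference, a new model definition modeled on opt.py would look roughly like the sketch below. The MPT module names (MPTBlock, transformer.blocks, attn.Wqkv, ffn.up_proj, etc.) are my assumptions about the checkpoint layout rather than something taken from either repo, so they may need adjusting.

```python
# Hypothetical mpt.py for AutoGPTQ, mirroring the structure of opt.py.
# All module names below are assumptions about the MPT layout and should
# be checked against modeling_mpt.py.
from ._base import BaseGPTQForCausalLM


class MPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # class name of the repeated transformer block
    layer_type = "MPTBlock"
    # attribute path to the list of transformer blocks
    layers_block_name = "transformer.blocks"
    # modules outside the blocks that are left unquantized
    outside_layer_modules = ["transformer.wte", "transformer.norm_f"]
    # nn.Linear layers inside each block to quantize, grouped by stage
    inside_layer_modules = [
        ["attn.Wqkv"],
        ["attn.out_proj"],
        ["ffn.up_proj"],
        ["ffn.down_proj"],
    ]
```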
I did the same for MPT models; however, I seem to be running into a problem at this line number. It seems the attentions are not being passed in the kwargs. How can that be remedied?
Thanks!
Hi @abhinavkulkarni, is the ask here whether attention_mask is being passed as a kwarg to the forward of MPTForCausalLM?
Hey @sam-mosaic,
Thanks for the reply. You can see here that the output_attentions option is not handled yet in modeling_mpt.py: https://huggingface.co/mosaicml/mpt-7b/blob/main/modeling_mpt.py#L140
It would be nice if this if block were filled in instead of raising NotImplementedError. I think it should be trivial given that MPT uses a traditional transformer: collect the attention outputs from every hidden layer in the forward function and then return them in a tuple (a rough sketch of this pattern follows the links below).
You can see these lines from modeling_opt.py for reference:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L245
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L368
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L725
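For what it's worth, the pattern in those lines of modeling_opt.py boils down to something like the following. This is only a loose sketch transplanted onto MPT-style naming; the block attribute and argument names are illustrative, not the actual modeling_mpt.py code.

```python
# Loose sketch of the output_attentions collection pattern used in
# modeling_opt.py; block/argument names here are illustrative only.
all_self_attns = () if output_attentions else None

for block in self.blocks:
    # each block would need to return its attention weights when asked
    hidden_states, attn_weights, past_key_value = block(
        hidden_states,
        attn_bias=attn_bias,
        attention_mask=attention_mask,
        output_attentions=output_attentions,
    )
    if output_attentions:
        all_self_attns = all_self_attns + (attn_weights,)

# ...and the collected tuple is returned alongside the hidden states,
# e.g. as the attentions field of the model output.
```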
Thanks for the great work!
Thanks @abhinavkulkarni, I get it now. IIUC, output_attentions outputs the attention matrix from the attention module?
We do not use the torch code path much; we usually train with Triton Flash or CUDA Flash. However, neither of those attention implementations can output the attention matrix, so if we supported this flag, it would only be for torch. Does AutoGPTQ mainly focus on lower-resource inference and fine-tuning?
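(To illustrate the constraint, any support would probably need a guard along these lines. This is purely a sketch; the attn_impl attribute and the error message are assumptions, not actual MPT code.)

```python
# Sketch only: flash-style kernels never materialize the full attention
# matrix, so attention weights could only be returned on the torch path.
if output_attentions and self.attn_impl != 'torch':
    raise NotImplementedError(
        'output_attentions is only supported with attn_impl=torch; '
        'triton/flash kernels do not materialize attention weights.'
    )
```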
Hey @sam-mosaic,
So, it seems the recent changes have solved most of the issues, except for line 110 of modeling_mpt.py, which needs to be changed from:
return (attn_bias, None)
to
return (attn_bias, attention_mask).
I made these changes in my local copy of modeling_mpt.py in site-packages and was able to GPTQ-quantize this model using the AutoGPTQ repo.
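In case it's useful, the quantization itself was just the standard AutoGPTQ flow, roughly as below; the calibration text, quantization settings, and output directory are placeholders rather than the exact values I used.

```python
# Rough sketch of the standard AutoGPTQ quantization flow; calibration
# data, settings, and paths are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "mosaicml/mpt-7b"
quantized_dir = "mpt-7b-4bit-gptq"  # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library.")
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(
    model_name, quantize_config, trust_remote_code=True
)
model.quantize(examples)
model.save_quantized(quantized_dir)
```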
To improve efficiency, on line 109 of modeling_mpt.py, we integrate attention_mask into attn_bias if it exists. If the requested attn_impl does not support an attn bias, then we use attention_mask (e.g., attn_impl: flash does not support an attn bias, and therefore the output of the _attn_bias fn is (None, attention_mask); see line 88).
This does not control whether output_attentions are available.
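If it helps to see it spelled out, the behavior described above amounts to roughly the following. This is a paraphrased sketch of the logic, not the actual _attn_bias implementation, and the masking details are my own filling-in.

```python
# Paraphrased sketch of the _attn_bias behavior described above; not the
# actual modeling_mpt.py code.
import torch

def _attn_bias_sketch(attn_impl, attn_bias, attention_mask):
    if attn_impl == 'flash':
        # flash attention cannot consume an additive bias, so the padding
        # mask is passed through separately (see line 88)
        return (None, attention_mask)
    if attention_mask is not None:
        # fold key padding into the additive bias: masked-out key positions
        # get a large negative value so they receive ~zero attention weight
        min_val = torch.finfo(attn_bias.dtype).min
        key_pad = ~attention_mask.view(attention_mask.size(0), 1, 1, -1).bool()
        attn_bias = attn_bias.masked_fill(key_pad, min_val)
    # once the mask is folded into attn_bias, the separate attention_mask is
    # no longer needed downstream (hence line 110 returning None there)
    return (attn_bias, None)
```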