IndexError: index out of range in self
Hi, I'm running the provided example in the model card and I'm getting the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-8-5181f41be2fc> in <cell line: 1>()
----> 1 torch_outs = model(
      2     tokens_ids,
      3     attention_mask=attention_mask,
      4     encoder_attention_mask=attention_mask,
      5     output_hidden_states=True

8 frames
/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2208 # remove once script supports set_grad_enabled
2209 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2211
2212
IndexError: index out of range in self
Looking at the model's and the tokenizer's vocab sizes, I see a mismatch. Could that be the problem, or am I missing something else?
model.config.vocab_size
> 4105
tokenizer.vocab_size
> 4107
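For what it's worth, here is a quick check I can run (a sketch, assuming the standard model-card loading code): any token id at or above the size of the model's embedding table will trigger exactly this IndexError in torch.embedding.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Sketch: compare the largest token id the tokenizer emits against the size of
# the model's input embedding table; an id >= num_embeddings is what makes
# torch.embedding raise "index out of range in self".
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")

tokens_ids = tokenizer.batch_encode_plus(["ATTCCGATTCCGATTCCG"], return_tensors="pt")["input_ids"]
print(tokens_ids.max().item(), model.get_input_embeddings().num_embeddings)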
Hi @phosseini - this error is fixed by a PR we pushed to transformers, but the fix is unfortunately only available on main right now. Please try installing from main with pip install --upgrade git+https://github.com/huggingface/transformers.git and see if that fixes your issue!
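Once you've reinstalled (and restarted the runtime if you're in Colab), you can confirm the notebook picked up the dev build before re-running the example:

# The version string should end in ".dev0" after installing from main
import transformers
print(transformers.__version__)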
Hi @Rocketknight1,
I've been trying to fine-tune this model into a classifier. I was able to get the model through the Trainer, but when I go to run inference I get the same error listed above.
The model was trained with this config:
{
  "architectures": ["EsmForSequenceClassification"],
  "attention_probs_dropout_prob": 0,
  "emb_layer_norm_before": false,
  "esmfold_config": null,
  "hidden_dropout_prob": 0,
  "hidden_size": 1280,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 5120,
  "is_folding_model": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "mask_token_id": 2,
  "max_position_embeddings": 1002,
  "model_type": "esm",
  "num_attention_heads": 20,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "tie_word_embeddings": false,
  "token_dropout": true,
  "torch_dtype": "float32",
  "transformers_version": "4.30.0.dev0",
  "use_cache": false,
  "vocab_list": null,
  "vocab_size": 4105
}
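For reference, this is roughly the config you get when loading the base checkpoint with a sequence-classification head and seven labels; a sketch of that setup (approximate, not my exact training code):

from transformers import AutoModelForSequenceClassification

# Hypothetical reconstruction of how the classification head was attached;
# num_labels=7 is what produces the LABEL_0..LABEL_6 id2label mapping above.
model = AutoModelForSequenceClassification.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-500m-1000g",
    num_labels=7,
    problem_type="single_label_classification",
)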
I am using the InstaDeepAI/nucleotide-transformer-500m-1000g tokenizer and have the same transformers version, 4.30.0.dev0, loaded in my notebook, from which I can run the fill-mask example from the model card successfully.
Any ideas would be appreciated.
Hi @esko2213, can you send me some code to reproduce the issue?
Thanks for the quick response.
I had been deploying the model through SageMaker and calling it like this:
predictor = huggingface_estimator.deploy(1, "ml.m5.xlarge")
input_sequence= {"inputs":"CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAGATAATCCTTCATCTTTATGACAAAGGAGGAATTGAGCAAACAATTTGTATGTTGGATGGTGTGTTTGCATTTGTTTTACTGGATACTGCCAATAAGAAAGTGTTCCTGGGTAGAGATACATATGGAGTCAGACCTTTGTTTAAAGCAATGACAGAAGATGGATTTTTGGCTGTATGTTCAGAAGCTAAAGGTCTTGTTACATTGAAGCACTCCGCGACTCCCTTTTTAAAAGTGGAGCCTTTTCTTCCTGGACACTATGAAGTTTTGGATTTAAAGCCAAATGGCAAAGTTGCATCCGTGGAAATGGTTAAATATCATCACTGTCGGGATGTACCCCTGCACGCCCTCTATGACAATGTGGAGAAACTCTTTCCAGGTTTTGAGATAGAAACTGTGAAGAACAACCTCAGGATCCTTTTTAATAATGCTGTAAAGAAACGTTTGATGACAGACAGAAGGATTGGCTGCCTTTTATCAGGGGGCTTGGACTCCAGCTTGGTTGCTGCCACTCTGTTGAAGCAGCTGAAAGAAGCCCAAGTACAGTATCCTCTCCAGACATTTGCAATTGGCATGGAAGACAGCCCCGATTTACTGGCTGCTAGAAAGGTGGCAGATCATATTGGAAGTGAACATTATGAAGTCCTTTTTAACTCTGAGGAAGGCATTCAGGCTCTGGATGAAGTCATATTTTCCTTGGAAACTTATGACATTACAACAGTTCGTGCTTCAGTAGGTATGTATTTAATTTCCAAGTATATTCGGAAGAACACAGATAGCGTGGTGATCTTCTCTGGAGAAGGATCAGATGAACTTACGCAGGGTTACATATATTTTCACAAGGCTCCTTCTCCTGAAAAAGCCGAGGAGGAGAGTGAGAGGCTTCTGAGGGAACTCTATTTGTTTGATGTTCTCCGCGCAGATCGAACTACTGCTGCCCATGGTCTTGAACTGAGAGTCCCATTTCTAGATCATCGATTTTCTTCCTATTACTTGTCTCTGCCACCAGAAATGAGAATTCCAAAGAATGGGATAGAAAAACATCTCCTGAGAGAGACGTTTGAGGATTCCAATCTGATACCCAAAGAGATTCTCTGGCGACCAAAAGAAGCCTTCAGTGATGGAATAACTTCAGTTAAGAATTCCTGGTTTAAGATTTTACAGGAATACGTTGAACATCAGGTTGATGATGCAATGATGGCAAATGCAGCCCAGAAATTTCCCTTCAATACTCCTAAAACCAAAGAAGGATATTACTACCGTCAAGTCTTTGAACGCCATTACCCAGGCCGGGCTGACTGGCTGAGCCATTACTGGATGCCCAAGTGGATCAATGCCACTGACCCTTCTGCCCGCACGCTGACCCACTACAAGTCAGCTGTCAAAGCTTAG"}
predictor.predict(input_sequence)
IndexError: index out of range in self
This produced the same error reported above. I've tried it with both a single string and an array of strings as input; both throw the error.
I have not pushed it to the Hub yet, but I wanted to see what happens if I run it locally first.
Running it locally via pipeline:
from transformers import pipeline

# Build a text-classification pipeline from the local fine-tuned checkpoint
classifier = pipeline(task="text-classification", model="./nucl_class_model/")
for x in classifier("CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAGATAATCCTTCATCTTTATGACAAAGGAGGAATTGAGCAAACAATTTGTATGTTGGATGGTGTGTTTGCATTTGTTTTACTGGATACTGCCAATAAGAAAGTGTTCCTGGGTAGAGATACATATGGAGTCAGACCTTTGTTTAAAGCAATGACAGAAGATGGATTTTTGGCTGTATGTTCAGAAGCTAAAGGTCTTGTTACATTGAAGCACTCCGCGACTCCCTTTTTAAAAGTGGAGCCTTTTCTTCCTGGACACTATGAAGTTTTGGATTTAAAGCCAAATGGCAAAGTTGCATCCGTGGAAATGGTTAAA"):
    print(x)
IndexError: index out of range in self
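One thing I may also try (just a guess, not sure it's related): passing the nucleotide-transformer tokenizer to the pipeline explicitly, in case the tokenizer files saved next to ./nucl_class_model/ are not the ones used during fine-tuning; a mismatched tokenizer can emit ids outside the model's embedding table, which is exactly this IndexError.

from transformers import AutoTokenizer, pipeline

# Sketch: load the original tokenizer explicitly instead of relying on whatever
# tokenizer files sit inside the local checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
classifier = pipeline(task="text-classification", model="./nucl_class_model/", tokenizer=tokenizer)
print(classifier("CAGCATTTTGAATTTGAATACCAGACC"))  # short example input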
However, when I run it via AutoModel and remove the encoder_attention_mask argument:
# Import the tokenizer and the model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
model = AutoModelForSequenceClassification.from_pretrained("./nucl_class_model/", local_files_only=True)

sequences = ['ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCATACTCCTTACACTATTCCTCATCACCCAACTAAAAATATTAAACACAAACTACCACCTACCTCCCTCACCAAAGCCCATAAAAATAAAAAATTATAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG']
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt")["input_ids"]
attention_mask = tokens_ids != tokenizer.pad_token_id  # mask padding positions, as in the model card example

torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    # encoder_attention_mask=attention_mask,
    output_hidden_states=True
)
probs = torch.softmax(torch_outs.logits, dim=1)
I can get probabilities:
tensor([[2.8926e-04, 2.9352e-04, 3.5966e-04, 1.9120e-04, 1.0060e-03, 4.8940e-05,
9.9781e-01]], grad_fn=<SoftmaxBackward0>)
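For completeness, mapping the probabilities back to the label names stored in the config:

# Map the highest-probability class back to the label name in the fine-tuned config
pred_id = probs.argmax(dim=1).item()
print(model.config.id2label[pred_id], probs[0, pred_id].item())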
Let me know if you want me to push up the model.