How to handle padding?

#5
by GabrielePrato - opened

What is the correct way to handle padding with RedPajama? Currently I do the following (right padding for training, left padding for generation):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Base')

train_tokenizer = AutoTokenizer.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Base', use_fast=True)
eval_tokenizer = AutoTokenizer.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Base', use_fast=True, padding_side='left')
train_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
eval_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(train_tokenizer))  # the new [PAD] token needs an embedding row

train_inputs = train_tokenizer(train_prompts, padding=True, return_tensors='pt')
train_output = model(**train_inputs) # model(input_ids=train_inputs.input_ids, attention_mask=train_inputs.attention_mask)

eval_inputs = eval_tokenizer(eval_prompts, padding=True, return_tensors='pt')
eval_output = model.generate(**eval_inputs, max_new_tokens=128) # model.generate(input_ids=eval_inputs.input_ids, attention_mask=eval_inputs.attention_mask)
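The reason the eval tokenizer uses padding_side='left' can be illustrated without loading the model: with left padding, every sequence's last real token sits in the final column of the batch, which is where generate() starts appending new tokens. Below is a minimal sketch of what the tokenizer's padding produces (pad_batch is a hypothetical helper for illustration, not a transformers API):

```python
def pad_batch(seqs, pad_id, side="left"):
    """Pad variable-length token-id lists to equal length and build attention masks."""
    width = max(len(s) for s in seqs)
    ids, mask = [], []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        if side == "left":
            # Real tokens end at the last position: right for generate().
            ids.append(pad + s)
            mask.append([0] * len(pad) + [1] * len(s))
        else:
            # Right padding, as used for the training batch.
            ids.append(s + pad)
            mask.append([1] * len(s) + [0] * len(pad))
    return ids, mask

# Left padding aligns the final real token of every row to the last column.
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
```

With right padding, the shorter sequence would end in pad tokens, and generation would continue from a pad position instead of the prompt's last token; the attention mask (0 on pads, 1 on real tokens) is what both model(...) and model.generate(...) use to ignore the padding.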
