Getting "RuntimeError: No executable batch size found, reached zero." error when trying to fine-tune the flan-ul2 model

#16
by cyt79

Hi there,

I'm trying to fine-tune the flan-ul2 model with LoRA as explained here (https://www.philschmid.de/fine-tune-flan-t5-peft). First I walked through the blog post without changing anything and was able to fine-tune the flan-t5-xxl model. Then I tried to do the same with flan-ul2. All I did was change the model and tokenizer initialization lines as follows:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq

# model_id = "google/flan-t5-xxl"
model_id = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model_id = "google/flan-ul2"
# load in 8-bit and let accelerate spread the weights across the available GPUs
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
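
Everything else is unchanged from the blog post, i.e. the 8-bit model is prepared for training and wrapped with LoRA before being passed to the trainer. Roughly, using the peft API from the post (flan-ul2 is T5-based, so the same q/v target modules should apply):

from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# LoRA config from the blog post; flan-ul2 shares T5's attention projection names
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

# make the int8 model trainable, then attach the LoRA adapters
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()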

Then I ran the trainer as shown below:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
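
(data_collator is defined earlier in the script, following the blog post; it pads labels with -100 so they are ignored by the loss. Roughly:)

from transformers import DataCollatorForSeq2Seq

# ignore padding tokens in the loss and pad to a multiple of 8 for efficiency
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,
)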

When I ran trainer.train() with the above setup, I got the following error:

Traceback (most recent call last):
  File "peft_finetuning_flan-ul2.py", line 145, in <module>
    trainer.train()
  File "/home/ubuntu/miniconda3/envs/finetuning/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/finetuning/lib/python3.10/site-packages/accelerate/utils/memory.py", line 122, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.

So I wonder if there is something special about the flan-ul2 model that prevents me from using it in this way. Could it be that Seq2SeqTrainer and Seq2SeqTrainingArguments are not the correct Trainer and TrainingArguments classes to use with flan-ul2? (I've tried the regular Trainer and TrainingArguments classes as well, but I got the same error.) If so, could you please point me to the correct ones?

@ybelkada or @stas, do you have any idea about this?

Hey! Did you manage to solve it?

@cyt79
auto_find_batch_size=True is responsible for this error. It uses accelerate's find_executable_batch_size decorator (the memory.py in your traceback), which halves the batch size every time training hits a CUDA out-of-memory error; the RuntimeError is raised once the batch size reaches zero, i.e. even a batch size of 1 did not fit.
Set auto_find_batch_size=False and pass per_device_train_batch_size=8 and per_device_eval_batch_size=8 manually.
If per_device_train_batch_size=8 throws a CUDA out-of-memory error, reduce the batch size until the error no longer occurs.
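
Something like this (a sketch of the adjusted arguments; gradient_accumulation_steps is an optional extra I'd add to keep the effective batch size up while lowering the per-device one):

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=False,       # no automatic batch size search
    per_device_train_batch_size=8,    # lower this if you hit CUDA OOM
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,    # optional: compensates for a smaller per-device batch
    learning_rate=1e-3,
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)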
