How do you fine-tune LLaVA-NeXT?

by Nishgop

Is there a way to fine tune LLaVA-NeXT?

Llava Hugging Face org

cc @lewtun: the TRL team is going to make it super easy to fine-tune models like these.

For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.

Thanks Niels, this is great!
I assume the same approach also works for LLaVA-NeXT. Is that correct?

Nishant

Llava Hugging Face org

Yes it should, although Llava-NeXT is a bit more complex than Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.

For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb. I have tested this with both models.
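For reference, a minimal sketch of the setup such notebooks use: load the model in 4-bit with bitsandbytes and attach LoRA adapters via peft. The rank, alpha and target modules below are illustrative placeholders, not the notebook's exact values.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "llava-hf/llava-1.5-7b-hf"

# The processor wraps both the tokenizer and the image preprocessing.
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 4-bit quantization so the 7B model fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)

# Attach LoRA adapters; target_modules here are placeholders, check the notebook
# for the modules it actually targets.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```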

Hi @nielsr , thanks for all the work! If I understand correctly, now that the PR you mentioned above has been merged, training should work properly for the LLaVA-NeXT (LLaMA 8B, 72B and 110B) models, and it already worked for LLaVA 1.6? Do you know of any example scripts or articles?

Llava Hugging Face org

Hi @lcolonn ! Yes, the PR was merged and LLaVa-NeXT is tunable now. The fine-tuning script is almost the same as for LLaVa, with a few changes in the input arguments; you can find my adaptation of Niels' notebook here.

Hey @RaushanTurganbay , very cool! I was a little confused because the PR also says that it's fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the following prompt format: "[INST] <image>\nWhat is shown in this image? [/INST]", as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next

Llava Hugging Face org

Yes, that's right. LLaVa-NeXT does not have a chat template yet, which means that for now you need to manually make sure the right format is used. Looks like @RaushanTurganbay might need to update that.
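Until then, a minimal sketch of formatting the prompt by hand for llava-v1.6-mistral-7b-hf (the prompt string follows the format from the docs linked above; the image path is a placeholder):

```python
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

image = Image.open("example.jpg")  # placeholder: any local image

# No chat template yet, so the Mistral-style prompt has to be written manually.
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```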

Llava Hugging Face org

Okay, thanks for noting that. I will change it in the notebook, and I will try to add chat templates to all LLaVa models.

Hi @nielsr , sorry, it's still not quite clear to me whether training for LLaVA-NeXT supports training with batched images. The PR says that only support for training without images was added: https://github.com/huggingface/transformers/pull/29850

Llava Hugging Face org

I updated the comment in the PR to say "with and w/o images". The model should be tunable with images as well.


@RaushanTurganbay , thanks for sharing the notebook on fine-tuning LLaVA-NeXT! Is there a similar one for fine-tuning LLaVA-NeXT-Video, or can I easily adapt this notebook for LLaVA-NeXT-Video as well? @nielsr

Llava Hugging Face org

Yes here it is: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VideoLLaVa. Should be very similar for LLaVA-Next-Video.

Llava Hugging Face org

There is actually a notebook for llava-next-video here; I will port it to the Tutorials repo for easier discovery.

Hey, thanks so much for the great examples! I'm trying to follow along, but I only have small GPUs and am trying to use DeepSpeed. Do you know if your code would work with DeepSpeed on 4 GPUs?

Llava Hugging Face org

We support DeepSpeed when using the Trainer, but the example notebook relies on a custom training loop. Take a look at https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/deepspeed#deepspeed-non-trainer-integration for more information on how to use DeepSpeed with custom training code.
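A rough sketch of the non-Trainer integration described on that page, applied to LLaVa-NeXT (the config filename and ZeRO settings are placeholder assumptions):

```python
import deepspeed
import torch
from transformers import LlavaNextForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

# Path to your DeepSpeed JSON config (placeholder); ZeRO-3 is the usual choice
# when the model doesn't fit on a single GPU.
ds_config = "ds_config_zero3.json"

# Must be created BEFORE the model so that from_pretrained is aware of ZeRO-3
# and can shard the weights while loading.
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
)

# The returned engine wraps forward/backward/step for your custom training loop.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

You would then launch the script with the deepspeed launcher, e.g. deepspeed --num_gpus 4 train.py.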

Sorry, a slightly different question: how many images and/or videos can LLaVA-NeXT-Video take? I couldn't find it stated anywhere. Thanks in advance. @RaushanTurganbay @nielsr

Llava Hugging Face org

@tjiang217 LLaVA-NeXT-Video was not trained in a multi-image/multi-video setting afaik, but that doesn't mean we can't try feeding it several visuals. Note, however, that the generation quality might not be as good as in the single-image setting.

You can also take a look at https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19, which contains models trained on interleaved images/videos. It doesn't state, however, how many images/videos per prompt were used in training; I'd guess it was 2 in most examples.
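If you do want to experiment with several images, a rough sketch with one of the interleave checkpoints (these ship a chat template, so the prompt string is built for you; the image files and question are placeholders):

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-interleave-qwen-7b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# Two placeholder images; the interleave models were trained on interleaved inputs.
images = [Image.open("frame1.jpg"), Image.open("frame2.jpg")]

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What changed between these two images?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```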

@RaushanTurganbay I tried to run the llava-next-video fine-tuning notebook you shared, without changing any code, on a 4x A10 GPU EC2 instance and ran into the following issue. The inference code works; it's only the training part that fails. Do you have any idea why? It seems related to device_map = 'auto', but putting everything on one GPU causes a CUDA out-of-memory error. Any help would be greatly appreciated.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

@RaushanTurganbay sorry, just wanted to follow up here. I was able to bypass the previous bug by making the batch size smaller and removing device_map = 'auto', but I ran into the following bug using the same code from the llava-next-video fine-tuning notebook. Do you know which transformers version (and other package versions) you used for this notebook? Thanks in advance!

Error I ran into.
RuntimeError: Input tensor at index 1 has invalid shape [1, 1595, 32064], but expected [1, 1500, 32064]

Llava Hugging Face org

Further discussion and solutions are in https://github.com/huggingface/trl/issues/1785#issuecomment-2314793662 for anyone running into the same issue.

What changes do I need to make in the notebook if my dataset has unique_id, image and conversations columns? I can't see any notebook that trains on conversations.

Llava Hugging Face org

You can find an SFT example for VLMs here: https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py. The general idea is the same: you just have to prepare the inputs in the format you want, which means writing your own data collator. You can also take a look at how LLMs are tuned with dialog datasets to see how the inputs have to be formatted/masked.
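As a rough sketch of such a collator, assuming each example has an image column (a PIL image) and a conversations column with {"from": "human"/"gpt", "value": ...} turns (a guess at your schema), and using a simplified LLaVa-1.5-style prompt with only padding masked out:

```python
class LlavaConversationCollator:
    """Hypothetical collator: flattens a dialog into a single prompt string."""

    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts, images = [], []
        for ex in examples:
            # Build a LLaVa-1.5-style prompt from the turns; adjust the template
            # to whatever checkpoint you are fine-tuning.
            prompt = ""
            first_user_turn = True
            for turn in ex["conversations"]:
                if turn["from"] == "human":
                    img_tag = "<image>\n" if first_user_turn else ""
                    prompt += f"USER: {img_tag}{turn['value']} "
                    first_user_turn = False
                else:
                    prompt += f"ASSISTANT: {turn['value']} "
            texts.append(prompt.strip())
            images.append(ex["image"])

        batch = self.processor(
            text=texts, images=images, padding=True, truncation=True, return_tensors="pt"
        )

        # Labels: copy input_ids and ignore padding in the loss. For proper SFT you
        # may also want to mask the user turns, as mentioned above.
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch
```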

@RaushanTurganbay I understand the current llava-next-video model processes each frame as 12x12 tokens (the result of stride-2 pooling applied to 24x24 tokens). I am working with a soccer video dataset that has fine-grained details, such as the soccer ball, so I'm concerned that 12x12 tokens may not capture enough detail. The LLaVA-NeXT-Video blog talked about testing different pooling strides. Do you know if we could tweak the current model, or access another model, so that each frame is represented by more than 12x12 tokens?

Thanks in advance, much appreciated!
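One heavily caveated thing that could be tried (the model was trained with stride-2 pooling, so changing the stride at inference or fine-tuning time may degrade quality, and the attribute names below should be double-checked against your installed transformers version): the pooling settings are exposed on the config, so the model can be instantiated with a different stride.

```python
from transformers import AutoConfig, LlavaNextVideoForConditionalGeneration

MODEL_ID = "llava-hf/LLaVA-NeXT-Video-7B-hf"

config = AutoConfig.from_pretrained(MODEL_ID)
# Default is average pooling with stride 2, i.e. 24x24 patch tokens -> 12x12 per frame.
print(config.spatial_pool_mode, config.spatial_pool_stride)

# Stride 1 would keep the full 24x24 grid per frame (4x more video tokens, so
# expect much longer sequences and higher memory use, with no guarantee the
# pretrained weights handle it well).
config.spatial_pool_stride = 1
model = LlavaNextVideoForConditionalGeneration.from_pretrained(MODEL_ID, config=config)
```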
