Memory optimization for long sequences with many small images: reduce resampler_n_latents
Hi there,
Congrats on releasing this amazing model! I am fine-tuning it for a VQA task involving 10 to 20 low-res images (224x224 pixels) per example. I have followed https://huggingface.co/HuggingFaceM4/idefics2-8b#model-optimizations and set `do_image_splitting=False` and a smaller `size` for the processor. Still, I seem bottlenecked by sequence length (and thus GPU memory), because each image still takes up 64 `<image>` tokens.
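For context, here is roughly how I set up the processor (the exact `size` values are just what I am trying for my low-res inputs, not necessarily optimal):

```python
from transformers import AutoProcessor

# Sketch of my processor setup: no image splitting, smaller image size.
# The size values below are only what I am experimenting with for 224x224 inputs.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    size={"longest_edge": 448, "shortest_edge": 378},
)
```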
I am toying with the idea of reducing `config.perceiver_config.resampler_n_latents` from the default 64 to 32 or even 16. Is it possible at all to reuse the existing weights but with fewer than 64 latents in idefics2-8b?
Thanks!
Thanks for your comment!
Yes, with `do_image_splitting=False` you will use 64 tokens per image.
Do you mean that you have 10 to 20 images per example for your task?
Idefics2-base has a maximum sequence length of 2048, while we used a maximum of 1024 for the SFT that led to Idefics2, so exceeding these numbers might give unexpected results (but since you are fine-tuning, the model can of course also learn to go beyond that).
It's not recommended to change `config.perceiver_config.resampler_n_latents`.
However, if you really want to encode your images with a very low number of tokens, you could apply average pooling to the 64 tokens to reduce them to 32 or 16, and fine-tune the model this way.
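For example, here is a minimal sketch of that pooling step, assuming the resampler output has shape `(num_images, 64, hidden_dim)`; where exactly you hook it into the forward pass (e.g. by wrapping the connector) is up to you, and the function name is just illustrative:

```python
import torch
import torch.nn.functional as F

def pool_image_tokens(image_hidden_states: torch.Tensor, target_len: int = 32) -> torch.Tensor:
    # image_hidden_states: (num_images, 64, hidden_dim), i.e. the 64 resampled
    # visual tokens per image before they get merged into the text sequence.
    x = image_hidden_states.transpose(1, 2)    # (num_images, hidden_dim, 64)
    x = F.adaptive_avg_pool1d(x, target_len)   # average-pool 64 -> target_len
    return x.transpose(1, 2)                   # (num_images, target_len, hidden_dim)

# Example: 5 images, hidden size 4096 -> 32 visual tokens per image instead of 64.
pooled = pool_image_tokens(torch.randn(5, 64, 4096), target_len=32)
print(pooled.shape)  # torch.Size([5, 32, 4096])
```

Keep in mind that the text side would then need only `target_len` `<image>` placeholder tokens per image so the pooled features still line up with the sequence.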
Note @ch272h that technically, the modeling supports much longer sequences out of the box through Mistral's sliding-window attention; we just tuned up to 2048. So if you are open to fine-tuning on long sequences, you can do that out of the box without any additional pooling. That would hopefully close the gap you would potentially see when exceeding the sequence lengths we trained on.
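As a rough illustration (dummy images and a placeholder prompt; 4096 is just an example length), fine-tuning on long sequences mostly means not truncating your batches at the lengths we trained on:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b", do_image_splitting=False
)

# Dummy batch: 10 blank 224x224 images and a prompt with one <image> placeholder each.
images = [Image.new("RGB", (224, 224)) for _ in range(10)]
prompt = "User:" + "<image>" * len(images) + " What do these images show?<end_of_utterance>\nAssistant:"

inputs = processor(
    text=[prompt],
    images=[images],
    padding=True,
    truncation=True,
    max_length=4096,  # beyond the 1024/2048 lengths used during Idefics2 training
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # roughly 10 * 64 image tokens plus the text tokens
```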
Thank you so much for the recommendations. @HugoLaurencon I will look into your suggestion.
Feel free to reopen this discussion if you run into any problems.