Here is a simple multimodal-style training script to see the model working.
https://github.com/grahamannett/finetune-fuyu/blob/main/train-simple.py
If anyone would like to test their machine with Fuyu, here is a small script that generates fake text + images but runs a complete training loop. It is fully self-contained and only needs transformers/torch/simple_parsing installed.
The idea is that since you may not know whether the model will fit on your hardware, it is better to try this before digging into FSDP/QLoRA/Accelerate.
I can add an FSDP/Accelerate/QLoRA example as well, since those can be hard to get working with this model on limited resources.
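For reference, the shape of such a fake-data smoke test can be sketched as below. This is not the linked script: `TinyMultimodalLM` and `fake_batch` are stand-ins invented here so the loop runs anywhere without downloading the 8B checkpoint; the real script loads `FuyuForCausalLM`/`FuyuProcessor` from transformers instead.

```python
import torch
from torch import nn

VOCAB, DIM, PATCHES = 256, 32, 16  # tiny sizes, just for the smoke test

class TinyMultimodalLM(nn.Module):
    """Hypothetical stand-in for Fuyu: token embeddings + projected 'image patches'."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.img_proj = nn.Linear(PATCHES, DIM)  # fake image-patch projection
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, input_ids, image_patches, labels):
        h = self.tok_emb(input_ids) + self.img_proj(image_patches)
        logits = self.head(h)
        return nn.functional.cross_entropy(logits.view(-1, VOCAB), labels.view(-1))

def fake_batch(bsz=2, seq=8):
    # Random token ids and random "image patches" stand in for real data.
    ids = torch.randint(0, VOCAB, (bsz, seq))
    patches = torch.rand(bsz, seq, PATCHES)
    return ids, patches, ids.clone()  # labels = inputs, fine for a smoke test

model = TinyMultimodalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(3):  # a complete, if trivial, training loop
    ids, patches, labels = fake_batch()
    loss = model(ids, patches, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
final_loss = float(loss)
```

If this loop runs (swap in the real model and processor), you know the forward/backward pass fits before layering on FSDP or QLoRA.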
Can FuyuProcessor be modified to handle both multi-resolution and multiple images?
I looked through its code and noticed it only processes one image at a time and doesn't support this feature.
It would be great if the training process could support settings for both multi-resolution and multi-image processing.
FuyuProcessor handles multi-resolution images and multiple images, as long as each image belongs to a different sample in the batch.
The current model does not allow multiple images per sample, but it does seem to work with them if you change gather_continuous_embeddings.
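To make that last point concrete, here is a rough sketch of the scatter that `gather_continuous_embeddings` performs (argument names follow transformers' Fuyu modeling code, but the function below is a simplified illustration, not the library implementation): patch embeddings are written into the token-embedding sequence at the positions marked by `image_patch_input_indices`. Supporting several images per sample would then mostly mean concatenating each image's patch embeddings into one per-sample table before this scatter.

```python
import torch

def gather_continuous_embeddings_sketch(word_embeddings, continuous_embeddings,
                                        image_patch_input_indices):
    """Simplified sketch of Fuyu's patch-embedding scatter.

    word_embeddings: (batch, seq, dim) token embeddings
    continuous_embeddings: list of (num_patches_i, dim) tensors, one per sample
    image_patch_input_indices: (batch, seq); entries >= 0 say which patch row
        fills that sequence position, -1 means a normal text token
    """
    output = word_embeddings.clone()
    for b in range(word_embeddings.shape[0]):
        # Sequence positions that should receive image-patch embeddings...
        dst = torch.nonzero(image_patch_input_indices[b] >= 0, as_tuple=True)[0]
        # ...and which patch row goes into each of them.
        src = image_patch_input_indices[b][dst]
        output[b, dst] = continuous_embeddings[b][src]
    return output

# For multiple images per sample, torch.cat each image's patch embeddings into
# one flat per-sample table so the indices can address all of them.
```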