ViT-GPT2 is an image captioning model built by combining a ViT model (as encoder) with a French GPT2 model (as decoder).
Part of the Hugging Face JAX/Flax community event.
The GPT2 model's source code was modified so that it can accept an encoder's output. The pretrained weights of both models were loaded, while the cross-attention weights connecting them were randomly initialized. The model was trained on 65,000 images from the COCO dataset for about 1,500 steps (batch_size=256), with the original English captions translated to French for training.
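Below is a minimal sketch of this setup, not the exact training code used here: current versions of transformers ship a `FlaxVisionEncoderDecoderModel` that wires a ViT encoder to a GPT2 decoder with freshly initialized cross-attention, which is what the modified GPT2 source achieved at the time of the event. The checkpoint names and the example inputs are assumptions for illustration, not necessarily the ones used for this model.

```python
from PIL import Image
from transformers import AutoTokenizer, FlaxVisionEncoderDecoderModel, ViTImageProcessor

# Pretrained ViT encoder + pretrained French GPT2 decoder; only the
# cross-attention weights connecting them start from random initialization.
model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed ViT checkpoint
    "asi/gpt-fr-cased-small",             # assumed French GPT2 checkpoint
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("asi/gpt-fr-cased-small")

# One training-style forward pass: the image goes through the ViT encoder, the
# French caption tokens through the GPT2 decoder, which cross-attends to the
# encoder's output.
image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
pixel_values = image_processor(images=image, return_tensors="np").pixel_values
caption_ids = tokenizer("un chat assis sur un canapé", return_tensors="np").input_ids

outputs = model(pixel_values=pixel_values, decoder_input_ids=caption_ids)
logits = outputs.logits  # shape: (batch, caption_length, vocab_size)
```

During training, a cross-entropy loss over these logits against the (shifted) French caption tokens updates both the pretrained weights and the new cross-attention layers.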