|
[ViT-GPT2](https://huggingface.co/flax-community/vit-gpt2/tree/main) is an image captioning model built by combining a ViT image encoder with a French GPT2 decoder.
|
|
|
Part of the [Hugging Face JAX/Flax community event](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/).
|
|
|
The GPT2 model source code is modified so that the decoder can attend to an encoder's output through cross-attention layers.
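As a rough illustration (not the repository's actual code), the sketch below shows in Flax how a GPT2-style decoder block can be extended with a cross-attention layer whose keys and values come from the ViT encoder's hidden states. All module and parameter names here are made up for the example.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class BlockWithCrossAttention(nn.Module):
    """A GPT2-style decoder block extended with cross-attention (illustrative only)."""
    hidden_size: int = 768
    num_heads: int = 12

    @nn.compact
    def __call__(self, hidden_states, encoder_hidden_states):
        # Self-attention over the caption tokens (causal mask omitted for brevity).
        residual = hidden_states
        x = nn.LayerNorm()(hidden_states)
        x = residual + nn.SelfAttention(num_heads=self.num_heads)(x)

        # New cross-attention layer: queries come from the decoder,
        # keys/values from the ViT encoder's output.
        residual = x
        y = nn.LayerNorm()(x)
        y = residual + nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(
            y, encoder_hidden_states
        )

        # Standard position-wise feed-forward network.
        residual = y
        z = nn.LayerNorm()(y)
        z = nn.Dense(4 * self.hidden_size)(z)
        z = nn.Dense(self.hidden_size)(nn.gelu(z))
        return residual + z


# Shape check: a batch of 2 captions (16 tokens) attending to 197 ViT patches.
block = BlockWithCrossAttention()
tokens = jnp.zeros((2, 16, 768))
patches = jnp.zeros((2, 197, 768))
params = block.init(jax.random.PRNGKey(0), tokens, patches)
out = block.apply(params, tokens, patches)  # (2, 16, 768)
```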
|
The pretrained weights of both models are loaded, together with a set of randomly initialized cross-attention weights.
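The repository ships its own model class for this, but the same warm-start pattern can be sketched with transformers' generic `FlaxVisionEncoderDecoderModel` (a later addition to the library): pretrained encoder and decoder weights are loaded from their checkpoints, while the decoder's new cross-attention weights are initialized randomly. The checkpoint names below are placeholders, not necessarily the ones used for this model.

```python
from transformers import FlaxVisionEncoderDecoderModel

# Warm-start: pretrained ViT encoder + pretrained GPT2 decoder.
# Checkpoint names are placeholders; the project used a French GPT2 decoder.
model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT encoder
    "gpt2",                               # stand-in for a French GPT2
)
# The decoder's cross-attention layers do not exist in the GPT2 checkpoint,
# so they are randomly initialized (the library logs a warning about this).
```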
|
The model is trained on 65,000 images from the COCO dataset for about 1,500 steps (batch\_size=256), with the original English captions translated to French for training.
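The write-up does not say which translation system produced the French captions; as a hypothetical illustration, an off-the-shelf MarianMT model could be used to translate the English COCO captions:

```python
from transformers import MarianMTModel, MarianTokenizer

# Off-the-shelf English-to-French translation model (illustrative choice only).
name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

# A couple of English COCO-style captions.
captions = [
    "A man riding a wave on top of a surfboard.",
    "Two dogs playing in the snow.",
]
batch = tokenizer(captions, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```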