The model is trained on 65000 images from the COCO dataset for about 1500 steps.

- The provided training script `run_summarization.py` is modified to send pixel values to the model instead of a sequence of input token ids; a further change is necessary because the ViT model does not accept an `attention_mask` argument (see the sketch after this list).

- We first tried to use the [WIT : Wikipedia-based Image Text Dataset](https://github.com/google-research-datasets/wit), but found it to be a very challenging task: unlike traditional image captioning, it requires the model to generate different texts even when two images are similar (for example, two famous dogs might have completely different Wikipedia texts).

- We finally decided to use the [COCO image dataset](https://cocodataset.org/#home) on the final day of this Flax community event. We were able to translate only about 65000 examples to French for training (a translation sketch follows below), and the model was trained for only 5 epochs (beyond that, it started to overfit), which explains the poor performance.

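A minimal sketch of the modified forward call, using the generic `FlaxVisionEncoderDecoderModel` API from `transformers` as a stand-in for this repo's actual model code (the checkpoint names below are illustrative assumptions, not the ones used here):

```python
import jax.numpy as jnp
import numpy as np
from transformers import (
    AutoTokenizer,
    FlaxVisionEncoderDecoderModel,
    ViTImageProcessor,
)

# Illustrative encoder/decoder checkpoints, not the ones used by this repo.
model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A stand-in image; during training this would come from the COCO dataset.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
pixel_values = processor(images=image, return_tensors="np").pixel_values

# A French caption used as the decoder target.
decoder_input_ids = tokenizer("Un chien joue dans le parc", return_tensors="np").input_ids

# The key change vs. the stock run_summarization.py: the encoder receives
# `pixel_values` rather than `input_ids`, and no `attention_mask` is passed,
# since the ViT encoder does not accept one.
outputs = model(
    pixel_values=jnp.asarray(pixel_values),
    decoder_input_ids=jnp.asarray(decoder_input_ids),
)
print(outputs.logits.shape)  # (batch, target_length, vocab_size)
```
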
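The README does not say how the COCO captions were translated; as a hypothetical illustration, the English captions could be run through an off-the-shelf Marian MT checkpoint (the model name below is an assumption):

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed off-the-shelf EN->FR model; the repo does not state which system was used.
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_captions(captions):
    """Translate a batch of English COCO captions to French."""
    batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_captions(["A dog is playing in the park."]))
# e.g. ['Un chien joue dans le parc.']
```
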
A HuggingFace Spaces demo for this model: [🖼️ French Image Captioning Demo 📝](https://huggingface.co/spaces/flax-community/image-caption-french)