---
license: apache-2.0
tags:
datasets:
- imagenet-21k
---

# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2

Vision-and-Language Transformer (ViLT) model fine-tuned on [VQAv2](https://visualqa.org/). It was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

(to do)

## Intended uses & limitations

You can use the raw model for visual question answering.

### How to use

(to do)

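Until an official snippet is added here, the following is a minimal sketch of how the model could be queried with the `ViltProcessor` and `ViltForQuestionAnswering` classes from 🤗 Transformers. The checkpoint id `dandelin/vilt-b32-finetuned-vqa` and the COCO example image URL are assumptions used purely for illustration; substitute the id of this repository if it differs.

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Example image + question (the URL is an arbitrary COCO validation image)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# NOTE: the checkpoint id is an assumption; replace it with this repository's id if needed
checkpoint = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForQuestionAnswering.from_pretrained(checkpoint)

# Encode the image-question pair and run a forward pass
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)

# The index of the highest logit corresponds to the predicted answer class
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

The model treats VQA as classification over a fixed answer vocabulary, so `model.config.id2label` maps the winning logit index back to the answer string.
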
## Training data

(to do)

## Training procedure

### Preprocessing

(to do)

### Pretraining

(to do)

## Evaluation results

(to do)

### BibTeX entry and citation info

```bibtex
@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}
```