tufa15nik/vilt-finetuned-vqasi

Visual Question Answering

tufa15nik committed on Dec 15, 2022

Commit c1fa3a1 • 1 Parent(s): 171b061

Delete README.md

Files changed (1)

README.md +0 -79

README.md DELETED

@@ -1,79 +0,0 @@
----
-tags:
-- visual-question-answering
-license: apache-2.0
-widget:
-- text: "What's the animal doing?"
-  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
-- text: "What is on top of the building?"
-  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
----
-# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2
-Vision-and-Language Transformer (ViLT) model fine-tuned on [VQAv2](https://visualqa.org/). It was introduced in the paper [ViLT: Vision-and-Language Transformer
-Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).
-Disclaimer: The team releasing ViLT did not write a model card for this model so this model card has been written by the Hugging Face team.
-## Intended uses & limitations
-You can use the raw model for visual question answering.
-### How to use
-Here is how to use this model in PyTorch:
-```python
-from transformers import ViltProcessor, ViltForQuestionAnswering
-import requests
-from PIL import Image
-# prepare image + question
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-text = "How many cats are there?"
-processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
-model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
-# prepare inputs
-encoding = processor(image, text, return_tensors="pt")
-# forward pass
-outputs = model(**encoding)
-logits = outputs.logits
-idx = logits.argmax(-1).item()
-print("Predicted answer:", model.config.id2label[idx])
-```
-## Training data
-(to do)
-## Training procedure
-### Preprocessing
-(to do)
-### Pretraining
-(to do)
-## Evaluation results
-(to do)
-### BibTeX entry and citation info
-```bibtex
-@misc{kim2021vilt,
-      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
-      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
-      year={2021},
-      eprint={2102.03334},
-      archivePrefix={arXiv},
-      primaryClass={stat.ML}
-}
-```