tufa15nik committed on
Commit
c1fa3a1
1 Parent(s): 171b061

Delete README.md

Files changed (1)
  1. README.md +0 -79
README.md DELETED
@@ -1,79 +0,0 @@
---
tags:
- visual-question-answering
license: apache-2.0
widget:
- text: "What's the animal doing?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "What is on top of the building?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---

# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2

Vision-and-Language Transformer (ViLT) model fine-tuned on [VQAv2](https://visualqa.org/). It was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Intended uses & limitations

You can use the raw model for visual question answering.

### How to use

Here is how to use this model in PyTorch:

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

# prepare image + question
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass
outputs = model(**encoding)
logits = outputs.logits

# the model treats VQA as classification over a fixed answer vocabulary,
# so the predicted answer is the label with the highest logit
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```
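
The steps above can also be wrapped in a single call. The following is a minimal sketch using the `visual-question-answering` pipeline, assuming a Transformers version recent enough to include that pipeline task:

```python
from transformers import pipeline

# build a VQA pipeline around this checkpoint; the pipeline handles
# image fetching, preprocessing, and mapping logits back to answer strings
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# the image can be a URL, a local path, or a PIL.Image; answers come back
# as a list of dicts with "score" and "answer", highest score first
result = vqa(
    image="http://images.cocodataset.org/val2017/000000039769.jpg",
    question="How many cats are there?",
)
print(result[0]["answer"])
```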

## Training data

(to do)

## Training procedure

### Preprocessing

(to do)

### Pretraining

(to do)

## Evaluation results

(to do)

### BibTeX entry and citation info

```bibtex
@misc{kim2021vilt,
  title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
  author={Wonjae Kim and Bokyung Son and Ildoo Kim},
  year={2021},
  eprint={2102.03334},
  archivePrefix={arXiv},
  primaryClass={stat.ML}
}
```