---
license: apache-2.0
inference: true
widget:
 - text: "<s>[INST] <<SYS>>\nGiven an image description, generate one or two multiple-choice questions that verifies if the image description is correct.\nClassify each concept into a type (object, human, animal, food, activity, attribute, counting, color, material, spatial, location, shape, other), and then generate a question for each type.\n\n<</SYS>>\n\nDescription: a blue rabbit and a red plane [/INST] Entities:"
pipeline_tag: text-generation
tags:
- text-generation-inference
- llama2
- text-to-image
datasets:
- TIFA
language:
- en
---
This is the text parsing and question generation model for the ICCV 2023 paper [TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering](https://arxiv.org/abs/2303.11897).

We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. 
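As a rough illustration of the scoring step described above, the sketch below assumes the faithfulness score is simply the fraction of generated questions that a VQA model answers correctly. The names `tifa_score`, `answer_fn`, and `toy_vqa` are hypothetical and are not part of the TIFA code base; `toy_vqa` is a canned stand-in for a real VQA model.

```python
from typing import Callable, Dict, List


def tifa_score(qa_pairs: List[Dict], answer_fn: Callable[[str, List[str]], str]) -> float:
    """Return the fraction of questions whose predicted answer matches the expected one."""
    if not qa_pairs:
        raise ValueError("need at least one question-answer pair")
    correct = sum(
        1
        for qa in qa_pairs
        if answer_fn(qa["question"], qa["choices"]) == qa["answer"]
    )
    return correct / len(qa_pairs)


# Hypothetical stand-in for a VQA model looking at one fixed generated image.
def toy_vqa(question: str, choices: List[str]) -> str:
    canned = {"is this a rabbit?": "yes", "what color is the rabbit?": "red"}
    return canned.get(question, choices[0])


qa_pairs = [
    {"question": "is this a rabbit?", "choices": ["yes", "no"], "answer": "yes"},
    {
        "question": "what color is the rabbit?",
        "choices": ["blue", "red", "yellow", "green"],
        "answer": "blue",
    },
]

score = tifa_score(qa_pairs, toy_vqa)
print(score)  # 0.5: one of the two questions is answered correctly
```

In practice `answer_fn` would wrap a real VQA model applied to the generated image; the toy function only makes the arithmetic visible.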

This fine-tuned LLaMA 2 model replaces the GPT-3 model used in the paper. It parses an arbitrary prompt into visual entities, attributes, relations, etc., and generates question-answer tuples for each of them. See the examples below.


# QuickStart

All code is from <https://github.com/Yushi-Hu/tifa>. Clone that repo to use this model together with the other modules (e.g. VQA) provided in TIFA.

Follow the prompt format shown below; it gives the best performance.


```python
import torch
import transformers

# prepare the LLaMA 2 model
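# Hub ID of this model (the same ID used in the tifascore example in this card);
# the original snippet pointed at a local checkpoint path on the authors' cluster.
model_name = "tifa-benchmark/llama2_tifa_question_generation"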
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)


# format the prompt following the LLaMA 2 chat style
def create_qg_prompt(caption):
    INTRO_BLURB = "Given an image description, generate one or two multiple-choice questions that verifies if the image description is correct.\nClassify each concept into a type (object, human, animal, food, activity, attribute, counting, color, material, spatial, location, shape, other), and then generate a question for each type.\n"
    formated_prompt = f"<s>[INST] <<SYS>>\n{INTRO_BLURB}\n<</SYS>>\n\n"
    formated_prompt += f"Description: {caption} [/INST] Entities:"
    return formated_prompt


test_caption = "a blue rabbit and a red plane"

# create prompt
prompt = create_qg_prompt(test_caption)

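# text completion (beam search, no sampling)
sequences = pipeline(
    prompt, do_sample=False, num_beams=5, num_return_sequences=1, max_length=512)
# drop the echoed prompt, keeping only the newly generated text
output = sequences[0]['generated_text'][len(prompt):]
# keep only the text before the first blank line
output = output.split('\n\n')[0]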

# output
print(output)

#### Expected output ###
#  rabbit, plane
# Activites:
# Colors: blue, red
# Counting:
# Other attributes:
# About rabbit (animal):
# Q: is this a rabbit?
# Choices: yes, no
# A: yes
# About rabbit (animal):
# Q: what animal is in the picture?
# Choices: rabbit, dog, cat, fish
# A: rabbit
# About plane (object):
# Q: is this a plane?
# Choices: yes, no
# A: yes
# About plane (object):
# Q: what type of vehicle is this?
# Choices: plane, car, motorcycle, bus
# A: plane
# About blue (color):
# Q: is the rabbit blue?
# Choices: yes, no
# A: yes
# About blue (color):
# Q: what color is the rabbit?
# Choices: blue, red, yellow, green
# A: blue
# About red (color):
# Q: is the plane red?
# Choices: yes, no
# A: yes
# About red (color):
# Q: what color is the plane?
# Choices: red, blue, yellow, green
# A: red
```
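The raw completion printed above is plain text. A minimal sketch of turning it into structured records is shown here, assuming each question block has exactly the `About ... (...):` / `Q:` / `Choices:` / `A:` layout of the expected output. The function `parse_qg_output` is hypothetical; it is not the parser shipped with tifascore, which may apply further filtering and type mapping.

```python
import re
from typing import Dict, List

# Matches header lines such as "About rabbit (animal):"
ABOUT_RE = re.compile(r"^About (.+) \((.+)\):$")


def parse_qg_output(raw: str) -> List[Dict]:
    """Parse the model's text completion into a list of question records."""
    records = []
    current = None
    for line in raw.splitlines():
        line = line.strip()
        header = ABOUT_RE.match(line)
        if header:
            # Start a new question block.
            current = {"element": header.group(1), "element_type": header.group(2)}
        elif current is None:
            # Summary lines before the first block (entities, colors, ...) are skipped.
            continue
        elif line.startswith("Q: "):
            current["question"] = line[len("Q: "):]
        elif line.startswith("Choices: "):
            current["choices"] = [c.strip() for c in line[len("Choices: "):].split(",")]
        elif line.startswith("A: "):
            current["answer"] = line[len("A: "):]
            # Only keep blocks that carried all three fields.
            if "question" in current and "choices" in current:
                records.append(current)
            current = None
    return records


sample = """ rabbit, plane
Activites:
Colors: blue, red
About rabbit (animal):
Q: is this a rabbit?
Choices: yes, no
A: yes
About blue (color):
Q: what color is the rabbit?
Choices: blue, red, yellow, green
A: blue"""

parsed = parse_qg_output(sample)
print(len(parsed))           # 2
print(parsed[1]["choices"])  # ['blue', 'red', 'yellow', 'green']
```

For real use, prefer the parsing functions provided by tifascore, described in the next section.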

# Use this LM with the tifascore package

tifascore provides extra functions, such as parsing this output. First install tifascore by following the instructions at <https://github.com/Yushi-Hu/tifa>. Usage is shown below.

```python
from tifascore import get_llama2_pipeline, get_llama2_question_and_answers

pipeline = get_llama2_pipeline("tifa-benchmark/llama2_tifa_question_generation")

print(get_llama2_question_and_answers(pipeline, "a blue rabbit and a red plane"))

#### Expected output ###
# [{'caption': 'a blue rabbit and a red plane', 'element': 'rabbit', 'question': 'what animal is in the picture?', 'choices': ['rabbit', 'dog', 'cat', 'fish'], 'answer': 'rabbit', 'element_type': 'animal/human'}, {'caption': 'a blue rabbit and a red plane', 'element': 'plane', 'question': 'is this a plane?', 'choices': ['yes', 'no'], 'answer': 'yes', 'element_type': 'object'}, {'caption': 'a blue rabbit and a red plane', 'element': 'plane', 'question': 'what type of vehicle is this?', 'choices': ['plane', 'car', 'motorcycle', 'bus'], 'answer': 'plane', 'element_type': 'object'}, {'caption': 'a blue rabbit and a red plane', 'element': 'blue', 'question': 'is the rabbit blue?', 'choices': ['yes', 'no'], 'answer': 'yes', 'element_type': 'color'}, {'caption': 'a blue rabbit and a red plane', 'element': 'blue', 'question': 'what color is the rabbit?', 'choices': ['blue', 'red', 'yellow', 'green'], 'answer': 'blue', 'element_type': 'color'}, {'caption': 'a blue rabbit and a red plane', 'element': 'red', 'question': 'is the plane red?', 'choices': ['yes', 'no'], 'answer': 'yes', 'element_type': 'color'}, {'caption': 'a blue rabbit and a red plane', 'element': 'red', 'question': 'what color is the plane?', 'choices': ['red', 'blue', 'yellow', 'green'], 'answer': 'red', 'element_type': 'color'}]
```
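Because every returned record carries an `element_type`, the questions can be grouped by type, for example to inspect which kinds of concepts a caption covers. The sketch below uses a hand-copied subset of the expected output above; `group_by_element_type` is a hypothetical helper, not a tifascore function.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_element_type(qa_list: List[Dict]) -> Dict[str, List[str]]:
    """Map each element type to the list of questions generated for it."""
    grouped = defaultdict(list)
    for qa in qa_list:
        grouped[qa["element_type"]].append(qa["question"])
    return dict(grouped)


# Subset of the records shown in the expected output above (some fields omitted).
qa_list = [
    {"element": "rabbit", "question": "what animal is in the picture?", "element_type": "animal/human"},
    {"element": "plane", "question": "is this a plane?", "element_type": "object"},
    {"element": "blue", "question": "is the rabbit blue?", "element_type": "color"},
    {"element": "red", "question": "is the plane red?", "element_type": "color"},
]

grouped = group_by_element_type(qa_list)
for element_type, questions in grouped.items():
    print(element_type, len(questions))
```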

## BibTeX
```
@article{hu2023tifa,
  title={Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering},
  author={Hu, Yushi and Liu, Benlin and Kasai, Jungo and Wang, Yizhong and Ostendorf, Mari and Krishna, Ranjay and Smith, Noah A},
  journal={arXiv preprint arXiv:2303.11897},
  year={2023}
}
```