JustinLin610 committed on
Commit
54f523d
1 Parent(s): 4793a6a
.idea/.gitignore ADDED
@@ -0,0 +1,3 @@
+ # Default ignored files
+ /shelf/
+ /workspace.xml
.idea/OFA-Image_Caption.iml ADDED
@@ -0,0 +1,8 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <module type="PYTHON_MODULE" version="4">
+ <component name="NewModuleRootManager">
+ <content url="file://$MODULE_DIR$" />
+ <orderEntry type="jdk" jdkName="Python 3.7 (py37)" jdkType="Python SDK" />
+ <orderEntry type="sourceFolder" forTests="false" />
+ </component>
+ </module>
.idea/inspectionProfiles/profiles_settings.xml ADDED
@@ -0,0 +1,6 @@
+ <component name="InspectionProjectProfileManager">
+ <settings>
+ <option name="USE_PROJECT_PROFILE" value="false" />
+ <version value="1.0" />
+ </settings>
+ </component>
.idea/misc.xml ADDED
@@ -0,0 +1,4 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="ProjectRootManager" version="2" project-jdk-name="Python 3.7 (py37)" project-jdk-type="Python SDK" />
+ </project>
.idea/modules.xml ADDED
@@ -0,0 +1,8 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="ProjectModuleManager">
+ <modules>
+ <module fileurl="file://$PROJECT_DIR$/.idea/OFA-Image_Caption.iml" filepath="$PROJECT_DIR$/.idea/OFA-Image_Caption.iml" />
+ </modules>
+ </component>
+ </project>
.idea/vcs.xml ADDED
@@ -0,0 +1,6 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="VcsDirectoryMappings">
+ <mapping directory="$PROJECT_DIR$" vcs="Git" />
+ </component>
+ </project>
README.md CHANGED
@@ -1,12 +1,102 @@
- ---
- title: OFA Image_Caption
- emoji: 🐨
- colorFrom: gray
- colorTo: blue
- sdk: gradio
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # OFA
+
+ [[Paper]](http://arxiv.org/abs/2202.03052) [Blog] [[Colab](colab.md)]
+
+ ![Overview](examples/overview.png)
+
+ OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks
+ (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.)
+ into a simple sequence-to-sequence learning framework. For more information, please refer to our paper: [Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](http://arxiv.org/abs/2202.03052).
+
+
+ ## News
+ * 2022.2.11: Released the Colab notebook for image captioning [![][colab]](https://colab.research.google.com/drive/1Q4eNhhhLcgOP4hHqwZwU1ijOlabgve1W?usp=sharing). Enjoy!
+ * 2022.2.11: Released the pretrained checkpoint of OFA-Large and the complete two-stage finetuning code for image captioning.
+ * 2022.2.10: Released the inference code & finetuned checkpoint for image captioning, which can reproduce **the results on the COCO Karpathy test split (149.6 CIDEr)**.
+
+ [colab]: <https://colab.research.google.com/assets/colab-badge.svg>
+
+ ## TODO
+ * To release finetuning and inference code for multimodal downstream tasks soon, including image captioning, VQA, text-to-image generation, SNLI-VE, referring expression comprehension, etc.
+ * To release pretraining code soon.
+
+
+ ## Approach
+ ![approach](examples/approach.jpg)
+
+
+ ## Requirements
+ * python 3.7.4
+ * pytorch 1.8.1
+ * Java 1.8 (for COCO evaluation)
+
+
+ ## Installation
+ ```bash
+ git clone https://github.com/OFA-Sys/OFA
+ pip install -r requirements.txt
+ ```
+
+
+ ## Datasets and Checkpoints
+ See [datasets.md](datasets.md) and [checkpoints.md](checkpoints.md).
+
+
+ ## Pretraining
+ To be released soon :)
+
+
+ # Finetuning & Inference
+ Below we provide the steps for finetuning and inference on different downstream tasks.
+ ## Caption
+ 1. Download the data and checkpoint files and put them in the correct directories.
+ 2. Train
+ ```bash
+ cd run_scripts/caption
+ nohup sh train_caption_stage1.sh &  # stage 1: train with cross-entropy loss
+ nohup sh train_caption_stage2.sh &  # stage 2: load the best stage-1 checkpoint and train with CIDEr optimization
+ ```
+ 3. Inference
+ ```bash
+ cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluate
+ ```
+
+ # Gallery
+ Below we provide examples of OFA on text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (grounded QA) as well as an unseen domain (visual grounding on images from unseen domains).
+
+ ## Text-to-Image Generation (normal query)
+ ![t2i_normal](examples/normal_images.png)
+
+ ## Text-to-Image Generation (counterfactual query)
+ ![t2i_counterfactual](examples/counterfactual_images.png)
+
+ ## Open-Ended VQA
+ ![open_vqa](examples/open_vqa.png)
+
+ ## Grounded QA (unseen task)
+ ![grounded_qa](examples/grounded_qa.png)
+
+ ## Visual Grounding (unseen domain)
+ ![vg](examples/viusal_grounding.png)
+
+
+ ## Citation
+ Please cite our paper if you find it helpful :)
+
+ ```
+ @article{wang2022OFA,
+   title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
+   author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
+   journal={arXiv e-prints},
+   pages={arXiv--2202},
+   year={2022}
+ }
+ ```
+
+
+ ## Related Codebase
+ * [fairseq](https://github.com/pytorch/fairseq)
+
+
+ ## License
+ Apache-2.0
app.py ADDED
@@ -0,0 +1,112 @@
+ import gradio as gr
+ import os
+ import torch
+ import numpy as np
+ from fairseq import utils, tasks
+ from utils import checkpoint_utils
+ from utils.eval_utils import eval_step
+ from tasks.mm_tasks.caption import CaptionTask
+ from models.ofa import OFAModel
+ from PIL import Image
+ from torchvision import transforms
+
+
+ # Register caption task
+ tasks.register_task('caption', CaptionTask)
+ # turn on cuda if GPU is available
+ use_cuda = torch.cuda.is_available()
+ # use fp16 only when GPU is available
+ use_fp16 = False
+
+ os.system('wget https://ofa-silicon.oss-us-west-1.aliyuncs.com/checkpoints/caption_large_best_clean.pt')
+ os.system('mkdir -p checkpoints')
+ os.system('mv caption_large_best_clean.pt checkpoints/caption.pt')
+
+ # Load pretrained ckpt & config
+ overrides = {"bpe_dir": "utils/BPE", "eval_cider": False, "beam": 5,
+              "max_len_b": 16, "no_repeat_ngram_size": 3, "seed": 7}
+ models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
+     utils.split_paths('checkpoints/caption.pt'),
+     arg_overrides=overrides
+ )
+
+ # Move models to GPU
+ for model in models:
+     model.eval()
+     if use_fp16:
+         model.half()
+     if use_cuda and not cfg.distributed_training.pipeline_model_parallel:
+         model.cuda()
+     model.prepare_for_inference_(cfg)
+
+ # Initialize generator
+ generator = task.build_generator(models, cfg.generation)
+
+ mean = [0.5, 0.5, 0.5]
+ std = [0.5, 0.5, 0.5]
+
+ patch_resize_transform = transforms.Compose([
+     lambda image: image.convert("RGB"),
+     transforms.Resize((cfg.task.patch_image_size, cfg.task.patch_image_size), interpolation=Image.BICUBIC),
+     transforms.ToTensor(),
+     transforms.Normalize(mean=mean, std=std),
+ ])
+
+ # Text preprocess
+ bos_item = torch.LongTensor([task.src_dict.bos()])
+ eos_item = torch.LongTensor([task.src_dict.eos()])
+ pad_idx = task.src_dict.pad()
+
+
+ def encode_text(text, length=None, append_bos=False, append_eos=False):
+     s = task.tgt_dict.encode_line(
+         line=task.bpe.encode(text),
+         add_if_not_exist=False,
+         append_eos=False
+     ).long()
+     if length is not None:
+         s = s[:length]
+     if append_bos:
+         s = torch.cat([bos_item, s])
+     if append_eos:
+         s = torch.cat([s, eos_item])
+     return s
+
+
+ # Construct input for caption task
+ def construct_sample(image: Image):
+     patch_image = patch_resize_transform(image).unsqueeze(0)
+     patch_mask = torch.tensor([True])
+     src_text = encode_text(" what does the image describe?", append_bos=True, append_eos=True).unsqueeze(0)
+     src_length = torch.LongTensor([s.ne(pad_idx).long().sum() for s in src_text])
+     sample = {
+         "id": np.array(['42']),
+         "net_input": {
+             "src_tokens": src_text,
+             "src_lengths": src_length,
+             "patch_images": patch_image,
+             "patch_masks": patch_mask
+         }
+     }
+     return sample
+
+
+ # Function to turn FP32 to FP16
+ def apply_half(t):
+     if t.dtype is torch.float32:
+         return t.to(dtype=torch.half)
+     return t
+
+
+ # Function for image captioning
+ def image_caption(inp):
+     sample = construct_sample(inp)
+     sample = utils.move_to_cuda(sample) if use_cuda else sample
+     sample = utils.apply_to_sample(apply_half, sample) if use_fp16 else sample
+     with torch.no_grad():
+         result, scores = eval_step(task, generator, models, sample)
+     return result[0]['caption']
+
+
+ io = gr.Interface(fn=image_caption, inputs=gr.inputs.Image(type='pil'), outputs='text')
+ io.launch(debug=True)
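For reference, a minimal sketch of calling the `image_caption` function defined above outside the Gradio UI. It assumes `io.launch(debug=True)` is guarded by `if __name__ == "__main__":` (or commented out) so that importing `app` returns, and the image path is a placeholder.

```python
# Sketch only: drive the captioning pipeline from app.py without the web UI.
# Assumes app.py's `io.launch(debug=True)` is guarded so that this import does
# not block, and that the checkpoint download at module import time succeeded.
from PIL import Image

from app import image_caption  # model loading and transforms run on import

image = Image.open("example.jpg")   # placeholder path; any RGB-convertible image works
print(image_caption(image))         # prints the generated caption string
```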
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ -e ./fairseq/
+ ftfy==6.0.3
+ tensorboardX==2.4.1
+ pycocotools==2.0.4
+ pycocoevalcap==1.2