Commit 54f523d by JustinLin610: "add code"
1 Parent(s): 4793a6a
Changed files:
- .idea/.gitignore (+3, -0)
- .idea/OFA-Image_Caption.iml (+8, -0)
- .idea/inspectionProfiles/profiles_settings.xml (+6, -0)
- .idea/misc.xml (+4, -0)
- .idea/modules.xml (+8, -0)
- .idea/vcs.xml (+6, -0)
- README.md (+102, -12)
- app.py (+112, -0)
- requirements.txt (+5, -0)
.idea/.gitignore ADDED
@@ -0,0 +1,3 @@
+# Default ignored files
+/shelf/
+/workspace.xml
.idea/OFA-Image_Caption.iml ADDED
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<module type="PYTHON_MODULE" version="4">
+  <component name="NewModuleRootManager">
+    <content url="file://$MODULE_DIR$" />
+    <orderEntry type="jdk" jdkName="Python 3.7 (py37)" jdkType="Python SDK" />
+    <orderEntry type="sourceFolder" forTests="false" />
+  </component>
+</module>
.idea/inspectionProfiles/profiles_settings.xml ADDED
@@ -0,0 +1,6 @@
+<component name="InspectionProjectProfileManager">
+  <settings>
+    <option name="USE_PROJECT_PROFILE" value="false" />
+    <version value="1.0" />
+  </settings>
+</component>
.idea/misc.xml ADDED
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectRootManager" version="2" project-jdk-name="Python 3.7 (py37)" project-jdk-type="Python SDK" />
+</project>
.idea/modules.xml ADDED
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectModuleManager">
+    <modules>
+      <module fileurl="file://$PROJECT_DIR$/.idea/OFA-Image_Caption.iml" filepath="$PROJECT_DIR$/.idea/OFA-Image_Caption.iml" />
+    </modules>
+  </component>
+</project>
.idea/vcs.xml ADDED
@@ -0,0 +1,6 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="VcsDirectoryMappings">
+    <mapping directory="$PROJECT_DIR$" vcs="Git" />
+  </component>
+</project>
README.md CHANGED
@@ -1,12 +1,102 @@
[The 12 removed lines are not rendered in the diff view.]
+# OFA
+
+[[Paper]](http://arxiv.org/abs/2202.03052) [Blog] [[Colab](colab.md)]
+
+![Overview](examples/overview.png)
+
+OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks
+(e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.)
+into a simple sequence-to-sequence learning framework. For more information, please refer to our paper: [Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](http://arxiv.org/abs/2202.03052).
+
+
+## News
+* 2022.2.11: Released the Colab notebook for image captioning [![][colab]](https://colab.research.google.com/drive/1Q4eNhhhLcgOP4hHqwZwU1ijOlabgve1W?usp=sharing). Enjoy!
+* 2022.2.11: Released the pretrained checkpoint of OFA-Large and the complete (2-stage) finetuning code for image captioning.
+* 2022.2.10: Released the inference code & finetuned checkpoint for image captioning, which reproduces **the results on the COCO Karpathy test split (149.6 CIDEr)**.
+
+[colab]: <https://colab.research.google.com/assets/colab-badge.svg>
+
+## TODO
+* To release finetuning and inference code for multimodal downstream tasks soon, including image captioning, VQA, text-to-image generation, SNLI-VE, referring expression comprehension, etc.
+* To release pretraining code soon.
+
+
+## Approach
+![approach](examples/approach.jpg)
+
+
+## Requirements
+* python 3.7.4
+* pytorch 1.8.1
+* JAVA 1.8 (for COCO evaluation)
+
+
+## Installation
+```bash
+git clone https://github.com/OFA-Sys/OFA
+pip install -r requirements.txt
+```
+
+
+## Datasets and Checkpoints
+See [datasets.md](datasets.md) and [checkpoints.md](checkpoints.md).
+
+
+## Pretraining
+To be released soon :)
+
+
+# Finetuning & Inference
+Below we provide methods for finetuning and inference on different downstream tasks.
+## Caption
+1. Download the data and checkpoints and put them in the correct directories.
+2. Train:
+```bash
+cd run_scripts/caption
+nohup sh train_caption_stage1.sh &  # stage 1: train with cross-entropy loss
+nohup sh train_caption_stage2.sh &  # stage 2: load the best ckpt of stage 1 and train with CIDEr optimization
+```
+3. Inference:
+```bash
+cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluation
+```
+
+# Gallery
+Below we provide examples of OFA in text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (grounded QA) as well as an unseen domain (visual grounding on images from unseen domains).
+
+## Text-to-Image Generation (normal query)
+![t2i_normal](examples/normal_images.png)
+
+## Text-to-Image Generation (counterfactual query)
+![t2i_counterfactual](examples/counterfactual_images.png)
+
+## Open-Ended VQA
+![open_vqa](examples/open_vqa.png)
+
+## Grounded QA (unseen task)
+![grounded_qa](examples/grounded_qa.png)
+
+## Visual Grounding (unseen domain)
+![vg](examples/viusal_grounding.png)
+
+
+## Citation
+Please cite our paper if you find it helpful :)
+
+```
+@article{wang2022OFA,
+  title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
+  author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
+  journal={arXiv e-prints},
+  pages={arXiv--2202},
+  year={2022}
+}
+```
+
+
+## Related Codebase
+* [fairseq](https://github.com/pytorch/fairseq)
+
+
+## License
+Apache-2.0
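Editor's note on the evaluation stack: the README requires JAVA 1.8 "for COCO evaluation" because the COCO caption metrics toolkit (pinned below in requirements.txt as pycocoevalcap==1.2) shells out to the Java PTBTokenizer for caption normalization; the CIDEr scorer itself is pure Python. The sketch below shows how a CIDEr number like the 149.6 above is computed with that toolkit. It is not taken from this repo's evaluate_caption.sh: the image id and captions are hypothetical, captions are assumed to be already tokenized and lowercased, and a single-image corpus is for illustration only (CIDEr's IDF statistics are meaningful only over a full test split).

```python
# A minimal CIDEr-scoring sketch using pycocoevalcap (pinned in requirements.txt).
# Hypothetical data; assumes captions are pre-tokenized and lowercased.
from pycocoevalcap.cider.cider import Cider

# image_id -> list of reference captions
gts = {"42": ["a man riding a wave on top of a surfboard",
              "a surfer rides a large wave in the ocean"]}
# image_id -> single model-generated caption
res = {"42": ["a man riding a wave on a surfboard"]}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(gts, res)
print(f"CIDEr: {corpus_score * 100:.1f}")  # reported x100, as in the README
```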
app.py ADDED
@@ -0,0 +1,112 @@
+import gradio as gr
+import os
+import torch
+import numpy as np
+from fairseq import utils, tasks
+from utils import checkpoint_utils
+from utils.eval_utils import eval_step
+from tasks.mm_tasks.caption import CaptionTask
+from models.ofa import OFAModel
+from PIL import Image
+from torchvision import transforms
+
+
+# Register caption task
+tasks.register_task('caption', CaptionTask)
+# Turn on CUDA if a GPU is available
+use_cuda = torch.cuda.is_available()
+# Use fp16 only when a GPU is available
+use_fp16 = False
+
+os.system('wget https://ofa-silicon.oss-us-west-1.aliyuncs.com/checkpoints/caption_large_best_clean.pt')
+os.system('mkdir -p checkpoints')
+os.system('mv caption_large_best_clean.pt checkpoints/caption.pt')
+
+# Load pretrained ckpt & config
+overrides = {"bpe_dir": "utils/BPE", "eval_cider": False, "beam": 5,
+             "max_len_b": 16, "no_repeat_ngram_size": 3, "seed": 7}
+models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
+    utils.split_paths('checkpoints/caption.pt'),
+    arg_overrides=overrides
+)
+
+# Move models to GPU
+for model in models:
+    model.eval()
+    if use_fp16:
+        model.half()
+    if use_cuda and not cfg.distributed_training.pipeline_model_parallel:
+        model.cuda()
+    model.prepare_for_inference_(cfg)
+
+# Initialize generator
+generator = task.build_generator(models, cfg.generation)
+
+mean = [0.5, 0.5, 0.5]
+std = [0.5, 0.5, 0.5]
+
+patch_resize_transform = transforms.Compose([
+    lambda image: image.convert("RGB"),
+    transforms.Resize((cfg.task.patch_image_size, cfg.task.patch_image_size), interpolation=Image.BICUBIC),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=mean, std=std),
+])
+
+# Text preprocess
+bos_item = torch.LongTensor([task.src_dict.bos()])
+eos_item = torch.LongTensor([task.src_dict.eos()])
+pad_idx = task.src_dict.pad()
+
+
+def encode_text(text, length=None, append_bos=False, append_eos=False):
+    s = task.tgt_dict.encode_line(
+        line=task.bpe.encode(text),
+        add_if_not_exist=False,
+        append_eos=False
+    ).long()
+    if length is not None:
+        s = s[:length]
+    if append_bos:
+        s = torch.cat([bos_item, s])
+    if append_eos:
+        s = torch.cat([s, eos_item])
+    return s
+
+
+# Construct input for caption task
+def construct_sample(image: Image):
+    patch_image = patch_resize_transform(image).unsqueeze(0)
+    patch_mask = torch.tensor([True])
+    src_text = encode_text(" what does the image describe?", append_bos=True, append_eos=True).unsqueeze(0)
+    src_length = torch.LongTensor([s.ne(pad_idx).long().sum() for s in src_text])
+    sample = {
+        "id": np.array(['42']),
+        "net_input": {
+            "src_tokens": src_text,
+            "src_lengths": src_length,
+            "patch_images": patch_image,
+            "patch_masks": patch_mask
+        }
+    }
+    return sample
+
+
+# Function to turn FP32 to FP16
+def apply_half(t):
+    if t.dtype is torch.float32:
+        return t.to(dtype=torch.half)
+    return t
+
+
+# Function for image captioning
+def image_caption(inp):
+    sample = construct_sample(inp)
+    sample = utils.move_to_cuda(sample) if use_cuda else sample
+    sample = utils.apply_to_sample(apply_half, sample) if use_fp16 else sample
+    with torch.no_grad():
+        result, scores = eval_step(task, generator, models, sample)
+    return result[0]['caption']
+
+
+io = gr.Interface(fn=image_caption, inputs=gr.inputs.Image(type='pil'), outputs='text')
+io.launch(debug=True)
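Editor's note: app.py downloads the checkpoint, builds the model, and launches the Gradio UI all at import time, so the quickest way to smoke-test the captioner without the UI is to call image_caption directly. A minimal sketch, not part of the commit: it assumes it is placed at the bottom of app.py in place of the io.launch(debug=True) line, and example.jpg is a hypothetical local image path.

```python
# Minimal local smoke test (a sketch): run the same inference path as the
# Gradio callback on one image, skipping the web UI entirely.
from PIL import Image

image = Image.open('example.jpg')        # hypothetical path
caption = image_caption(image)           # construct_sample -> eval_step under the hood
print(caption)
```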
requirements.txt ADDED
@@ -0,0 +1,5 @@
+-e ./fairseq/
+ftfy==6.0.3
+tensorboardX==2.4.1
+pycocotools==2.0.4
+pycocoevalcap==1.2
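Editor's note: the first line, `-e ./fairseq/`, installs the fairseq copy vendored inside the repo in editable mode rather than a PyPI release, which presumably is what keeps the checkpoint-loading and generation code in app.py compatible with OFA's customizations. A quick sanity check (a sketch, not from the repo) that the local copy is the one actually being imported:

```python
# Confirm the editable fairseq install resolves to the repo's ./fairseq/
# directory rather than a site-packages release from PyPI.
import os
import fairseq

print(fairseq.__version__)
print(os.path.dirname(fairseq.__file__))  # should point inside the repo's ./fairseq/
```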