File size: 2,192 Bytes
2325ef4
 
 
2ecc0b9
 
 
 
 
 
 
 
 
 
965ce12
2ecc0b9
 
 
 
 
 
 
 
 
f7fd166
2ecc0b9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
---
license: apache-2.0
---

# OFA-base
This is the **base** version of OFA pretrained model. OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.) to a simple sequence-to-sequence learning framework.

The directory includes 4 files, namely `config.json` which consists of model configuration, `vocab.json` and `merge.txt` for our OFA tokenizer, and lastly `pytorch_model.bin` which consists of model weights. There is no need to worry about the mismatch between Fairseq and transformers, since we have addressed the issue yet. 

To use it in transformers, please refer to https://github.com/OFA-Sys/OFA/tree/feature/add_transformers. Install the transformers and download the models as shown below.
```
git clone --single-branch --branch feature/add_transformers https://github.com/OFA-Sys/OFA.git
pip install OFA/transformers/
git clone https://huggingface.co/OFA-Sys/OFA-base
```
After, refer the path to OFA-base to `ckpt_dir`, and prepare an image for the testing example below. Also, ensure that you have pillow and torchvision in your environment. 

```
>>> from PIL import Image
>>> from torchvision import transforms
>>> from transformers import OFATokenizer, OFAForConditionalGeneration

>>> mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
>>> resolution = 384
>>> patch_resize_transform = transforms.Compose([
        lambda image: image.convert("RGB"),
        transforms.Resize((resolution, resolution), interpolation=Image.BICUBIC),
        transforms.ToTensor(), 
        transforms.Normalize(mean=mean, std=std)
    ])

>>> model = OFAForConditionalGeneration.from_pretrained(ckpt_dir)
>>> tokenizer = OFATokenizer.from_pretrained(ckpt_dir)

>>> txt = " what is the description of the image?"
>>> inputs = tokenizer([txt], max_length=1024, return_tensors="pt")["input_ids"]
>>> img = Image.open(path_to_image)
>>> patch_img = patch_resize_transform(img).unsqueeze(0)

>>> gen = model.generate(inputs, patch_images=patch_img, num_beams=4)
>>> print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```