File size: 4,199 Bytes
362aa90 9715e7b 53d2f5a 2c1852b 7c6424d d55fc9e 42c0029 7c6424d 149a580 7c6424d 0fc22c0 7c6424d 0fc22c0 7c6424d 0fc22c0 7c6424d 0fc22c0 7c6424d 0fc22c0 362aa90 9c65126 362aa90 28d2cd7 a27b459 362aa90 a972777 362aa90 d6e1557 a27b459 362aa90 a972777 6d04760 362aa90 f1dafee 6d04760 28d2cd7 a972777 362aa90 a27b459 855b3ec a27b459 08bd2bf a27b459 e1e390a dc74f76 a27b459 362aa90 e1e390a 07b0ef1 e1e390a a972777 362aa90 149a580 0ad4928 074b040 125cdef 51d756b 125cdef 362aa90 a972777 362aa90 d6e1557 241b9d4 10b8fb8 5b995e0 d6e1557 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 |
---
library_name: transformers
datasets:
- laicsiifes/flickr30k-pt-br
language:
- pt
metrics:
- bleu
- rouge
- meteor
- bertscore
base_model:
- adalbertojunior/distilbert-portuguese-cased
pipeline_tag: image-to-text
model-index:
- name: Swin-DistilBERTimbau
results:
- task:
name: Image Captioning
type: image-to-text
dataset:
name: Flickr30K
type: laicsiifes/flickr30k-pt-br
split: test
metrics:
- name: CIDEr-D
type: cider
value: 66.73
- name: BLEU@4
type: bleu
value: 24.65
- name: ROUGE-L
type: rouge
value: 39.98
- name: METEOR
type: meteor
value: 44.71
- name: BERTScore
type: bertscore
value: 72.30
---
# 🎉 Swin-DistilBERTimbau for Brazilian Portuguese Image Captioning
Swin-DistilBERTimbau model trained for image captioning on [Flickr30K Portuguese](https://huggingface.co/datasets/laicsiifes/flickr30k-pt-br) (translated version using Google Translator API)
at resolution 224x224 and max sequence length of 512 tokens.
## 🤖 Model Description
The Swin-DistilBERTimbau is a type of Vision Encoder Decoder which leverage the checkpoints of the [Swin Transformer](https://huggingface.co/microsoft/swin-base-patch4-window7-224)
as encoder and the checkpoints of the [DistilBERTimbau](https://huggingface.co/adalbertojunior/distilbert-portuguese-cased) as decoder.
The encoder checkpoints come from Swin Trasnformer version pre-trained on ImageNet-1k at resolution 224x224.
The code used for training and evaluation is available at: https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-DistilBERTimbau
was trained together with its buddy [Swin-GPorTuguese-2](https://huggingface.co/laicsiifes/swin-gpt2-flickr30k-pt-br).
Other models evaluated did not perform as well as Swin-DistilBERTimbau and Swin-GPorTuguese-2, namely: DeiT-BERTimbau,
DeiT-DistilBERTimbau, DeiT-GPorTuguese-2, Swin-BERTimbau, ViT-BERTimbau, ViT-DistilBERTimbau and ViT-GPorTuguese-2.
## 🧑💻 How to Get Started with the Model
Use the code below to get started with the model.
```python
import requests
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
# load a fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-distilbertimbau")
tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-distilbertimbau")
image_processor = AutoImageProcessor.from_pretrained("laicsiifes/swin-distilbertimbau")
# preprocess an image
url = "http://images.cocodataset.org/val2014/COCO_val2014_000000458153.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values
# generate caption
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
```python
import matplotlib.pyplot as plt
# plot image with caption
plt.imshow(image)
plt.axis("off")
plt.title(generated_text)
plt.show()
```
![image/png](example.png)
## 📈 Results
The evaluation metrics CIDEr-D, BLEU@4, ROUGE-L, METEOR and BERTScore
(using [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased)) are abbreviated as C, B@4, RL, M and BS, respectively.
|Model|Dataset|Eval. Split|C|B@4|RL|M|BS|
|:---:|:------:|:--------:|:-----:|:----:|:-----:|:----:|:-------:|
|Swin-DistilBERTimbau|Flickr30K Portuguese|test|66.73|24.65|39.98|44.71|72.30|
|Swin-GPorTuguese-2|Flickr30K Portuguese|test|64.71|23.15|39.39|44.36|71.70|
## 📋 BibTeX entry and citation info
```bibtex
@inproceedings{bromonschenkel2024comparative,
title={A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning},
author={Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M},
booktitle={2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)},
pages={1--6},
year={2024},
organization={IEEE}
}
``` |