---
library_name: transformers
datasets:
- laicsiifes/flickr30k-pt-br
language:
- pt
metrics:
- bleu
- rouge
- meteor
- bertscore
base_model:
- microsoft/swin-base-patch4-window7-224
pipeline_tag: image-to-text
---
# Swin-DistilBERTimbau

Swin-DistilBERTimbau is an image captioning model trained on Flickr30K Portuguese (translated with the Google Translator API) at resolution 224x224 and with a maximum sequence length of 512 tokens.
## Model Description

Swin-DistilBERTimbau is a Vision Encoder Decoder model that uses the Swin Transformer checkpoints as the encoder and the DistilBERTimbau checkpoints as the decoder. The encoder checkpoints come from a Swin Transformer pre-trained on ImageNet-1k at resolution 224x224.
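To illustrate how this encoder-decoder pairing is wired together, the sketch below builds a tiny, randomly initialized Vision Encoder Decoder model. It is an illustration only: the released model uses the full `microsoft/swin-base-patch4-window7-224` encoder and a DistilBERTimbau decoder, whereas here small made-up configs (and a BERT-style decoder config as a stand-in) are used just to show the cross-attention wiring.

```python
import torch
from transformers import (
    BertConfig,
    SwinConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny randomly initialized stand-ins; all sizes here are illustrative,
# not the values used by the released checkpoint.
encoder_config = SwinConfig(embed_dim=24, depths=[2, 2], num_heads=[2, 4])
decoder_config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    vocab_size=1000,
    is_decoder=True,
    add_cross_attention=True,  # lets the decoder attend to image features
)

config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)

# One random 224x224 RGB image -> caption logits over the decoder vocabulary.
pixel_values = torch.randn(1, 3, 224, 224)
decoder_input_ids = torch.tensor([[101, 7, 8, 9, 102]])
outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (batch, sequence length, vocabulary size)
```

Because the Swin encoder and the text decoder have different hidden sizes, the model inserts a projection between them automatically; this is what lets heterogeneous encoder/decoder checkpoints be combined.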
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import requests

from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# load the fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-distilbert-flickr30k-pt-br")
tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-distilbert-flickr30k-pt-br")
image_processor = ViTImageProcessor.from_pretrained("laicsiifes/swin-distilbert-flickr30k-pt-br")

# perform inference on an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# generate a caption
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
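The call to `generate` above uses the model's default decoding. Beam search often yields more fluent captions; the sketch below repeats the setup and swaps in beam-search generation. The `num_beams` and `max_new_tokens` values are illustrative choices, not settings taken from the model card.

```python
import requests

from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

repo = "laicsiifes/swin-distilbert-flickr30k-pt-br"
model = VisionEncoderDecoderModel.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)
image_processor = ViTImageProcessor.from_pretrained(repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# beam search keeps the 4 most likely partial captions at each step
# (num_beams=4 and max_new_tokens=25 are illustrative values)
generated_ids = model.generate(
    pixel_values,
    num_beams=4,
    max_new_tokens=25,
    early_stopping=True,
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```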
## Results
|Model|Training|Evaluation|CIDEr-D|BLEU@4|ROUGE-L|METEOR|BERTScore|
|---|---|---|---|---|---|---|---|
|Swin-DistilBERTimbau|Flickr30K Portuguese|Flickr30K Portuguese|66.73|24.65|39.98|44.71|72.30|
|Swin-GPT-2|Flickr30K Portuguese|Flickr30K Portuguese|64.71|23.15|39.39|44.36|71.70|
## BibTeX
[More Information Needed]