---
tags:
- image-to-text
- image-captioning
license: apache-2.0
metrics:
- rouge
datasets:
- nlphuji/flickr30k
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- google/vit-base-patch16-224-in21k
model-index:
- name: mozilla/distilvit
  results:
  - task:
      type: image-to-text
      name: Image To Text
    dataset:
      name: nlphuji/flickr30k
      type: nlphuji/flickr30k
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 43.006
      verified: true
    - name: ROUGE-2
      type: rouge
      value: 16.9939
      verified: true
    - name: ROUGE-L
      type: rouge
      value: 38.8923
      verified: true
    - name: ROUGE-LSUM
      type: rouge
      value: 38.8877
      verified: true
    - name: loss
      type: loss
      value: 0.19939416646957397
    - name: gen_len
      type: gen_len
      value: 11.327256736227712
      verified: true
---

# distilvit

This model is a work in progress. It is a fine-tuned version of the following base models:

- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
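
As a rough illustration of how two such checkpoints can be paired, the sketch below uses the `VisionEncoderDecoderModel` class from Transformers. This is an assumption about one possible setup, not a description of the actual training code, which lives in the distilvit repository. The function name `build_untrained_captioner` is hypothetical, and reusing the EOS token as padding is a common GPT-2 convention assumed here.

```python
# Hypothetical sketch: pair the two base checkpoints into one captioning model.
ENCODER_ID = "google/vit-base-patch16-224-in21k"
DECODER_ID = "distilbert/distilgpt2"


def build_untrained_captioner():
    """Combine the ViT encoder and the DistilGPT-2 decoder.

    The cross-attention layers are newly initialized, so the result
    still needs fine-tuning before it can produce useful captions.
    """
    # Imported lazily: requires transformers and torch to be installed.
    from transformers import (
        AutoImageProcessor,
        AutoTokenizer,
        VisionEncoderDecoderModel,
    )

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_ID, DECODER_ID
    )
    image_processor = AutoImageProcessor.from_pretrained(ENCODER_ID)
    tokenizer = AutoTokenizer.from_pretrained(DECODER_ID)

    # GPT-2 has no padding token; reusing EOS is an assumption of this sketch.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, image_processor, tokenizer
```

Calling `build_untrained_captioner()` downloads both base checkpoints from the Hub, so it needs network access on first use.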

This model was trained on:

- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- COCO 2017: https://cocodataset.org

The checkpoint from this first training stage is available at commit `3083a3cef6e3c8dd90df3f088074bbe836b0f403`.
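
A minimal sketch of loading that commit and captioning an image is shown below. It assumes the standard `pipeline` helper from Transformers and its `revision` argument. The helper names `load_captioner` and `caption` are hypothetical and not part of this repository.

```python
MODEL_ID = "mozilla/distilvit"
# Commit hash quoted in this model card.
FIRST_STAGE_REVISION = "3083a3cef6e3c8dd90df3f088074bbe836b0f403"


def load_captioner(revision=None):
    """Return an image-to-text pipeline; revision=None loads the latest weights."""
    # Imported lazily: requires transformers and torch to be installed.
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_ID, revision=revision)


def caption(captioner, image):
    """Caption one image given as a local path, a URL, or a PIL image."""
    outputs = captioner(image)
    # The pipeline returns a list of dicts with a "generated_text" key.
    return outputs[0]["generated_text"]
```

For example, `caption(load_captioner(FIRST_STAGE_REVISION), "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg")` would caption one of the widget images with the earlier checkpoint. Omitting the revision loads the current weights.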

It was then further fine-tuned on:

- [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions)
- [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
- [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)

The Alt Text Validation dataset was annotated by our team to correct the alt text generated by the model, using the [checkvite tool](https://github.com/mozilla/checkvite).

You can find the code used to create the model here: https://github.com/mozilla/distilvit

### Framework versions

- Transformers 4.40.2
- PyTorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
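
To compare a local environment against these versions, something like the following sketch may help. It relies only on the standard library. The PyPI distribution names in `PINNED` (for instance `torch` for PyTorch) and the function name `version_mismatches` are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card; distribution names are assumed.
PINNED = {
    "transformers": "4.40.2",
    "torch": "2.3.0",
    "datasets": "2.19.1",
    "tokenizers": "0.19.1",
}


def version_mismatches(pinned=PINNED):
    """Map each differing package to (expected, installed or None)."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu121" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in version_mismatches().items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

A mismatch does not necessarily mean the model will fail to load. The listed versions are simply the ones recorded for this training run.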