	---
	tags:
	- image-to-text
	- image-captioning
	license: apache-2.0
	metrics:
	- rouge
	datasets:
	- nlphuji/flickr30k
	widget:
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
	  example_title: Savanna
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
	  example_title: Football Match
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
	  example_title: Airport
	base_model:
	- google/vit-base-patch16-224-in21k

	model-index:
	- name: mozilla/distilvit
	  results:
	  - task:
	      type: image-to-text
	      name: Image To Text
	    dataset:
	      name: nlphuji/flickr30k
	      type: nlphuji/flickr30k
	    metrics:
	    - name: ROUGE-1
	      type: rouge
	      value: 43.006
	      verified: true
	    - name: ROUGE-2
	      type: rouge
	      value: 16.9939
	      verified: true
	    - name: ROUGE-L
	      type: rouge
	      value: 38.8923
	      verified: true
	    - name: ROUGE-LSUM
	      type: rouge
	      value: 38.8877
	      verified: true
	    - name: loss
	      type: loss
	      value: 0.19939416646957397
	    - name: gen_len
	      type: gen_len
	      value: 11.327256736227712
	      verified: true
	---

	# distilvit

	This model is a work in progress. It is fine-tuned from the following base models:

	- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
	- a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
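As a rough illustration of how the two base models listed above can be paired, the sketch below composes a ViT encoder and a DistilGPT-2 decoder with the `VisionEncoderDecoderModel` class from Transformers. It is an assumption about the general approach, not the project's training code. The helper name `build_untrained_captioner` is hypothetical, and the pad-token handling is a common convention rather than something this card states.

```python
ENCODER_ID = "google/vit-base-patch16-224-in21k"
DECODER_ID = "distilbert/distilgpt2"


def build_untrained_captioner(encoder_id=ENCODER_ID, decoder_id=DECODER_ID):
    """Pair a pretrained image encoder with a pretrained text decoder.

    Hypothetical helper: the cross-attention weights that connect the two
    models are freshly initialised, so the result still needs fine-tuning
    on image/caption pairs before it produces useful text.
    """
    # Imported lazily so that importing this module downloads nothing.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # GPT-2 style tokenizers ship without a pad token; reusing EOS is a
    # common convention (an assumption here, not taken from this card).
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_untrained_captioner()` downloads both base checkpoints from the Hub and returns a model that has the same encoder/decoder layout but none of the fine-tuning described in this card.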

	This model was trained on:

	- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
	- COCO 2017: https://cocodataset.org

	The checkpoint from this first training stage is available at commit 3083a3cef6e3c8dd90df3f088074bbe836b0f403.
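The commit hash above can be passed as the `revision` argument when loading the model from the Hub. The following is a minimal sketch, assuming the repository ID is `Mozilla/distilvit` and that the checkpoint works with the Transformers `image-to-text` pipeline. The helper names `load_captioner` and `caption` are hypothetical.

```python
MODEL_ID = "Mozilla/distilvit"  # assumed Hub repository ID
FIRST_STAGE_REVISION = "3083a3cef6e3c8dd90df3f088074bbe836b0f403"


def load_captioner(revision=FIRST_STAGE_REVISION):
    """Build an image-to-text pipeline pinned to a specific commit.

    Pass revision=None to load the default branch instead.
    """
    # Imported lazily so that importing this module downloads nothing.
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_ID, revision=revision)


def caption(captioner, image):
    """Return the first generated caption for an image.

    `image` can be anything the pipeline accepts, such as a local path,
    a URL, or a PIL image.
    """
    outputs = captioner(image)
    return outputs[0]["generated_text"]
```

For example, `caption(load_captioner(), "photo.jpg")` would caption a local file with the earlier checkpoint, while `load_captioner(revision=None)` loads the latest fine-tuned weights.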

	It was then further fine-tuned on:

	- [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions)
	- [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
	- [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)

	For the last of these, our team annotated the dataset to correct the alt text generated by the model,
	using the [checkvite tool](https://github.com/mozilla/checkvite).

	You can find the code used to create the model here: https://github.com/mozilla/distilvit

	### Framework versions

	- Transformers 4.40.2
	- PyTorch 2.3.0+cu121
	- Datasets 2.19.1
	- Tokenizers 0.19.1