	---
	license: apache-2.0
	tags:
	- image-captioning
	pipeline_tag: image-to-text
	datasets:
	- michelecafagna26/hl
	language:
	- en
	metrics:
	- sacrebleu
	- rouge
	library_name: transformers
	---
	## BLIP-base fine-tuned for Image Captioning on High-Level descriptions of Actions

	[BLIP](https://arxiv.org/abs/2201.12086) base model fine-tuned on the [HL dataset](https://huggingface.co/datasets/michelecafagna26/hl) to generate high-level descriptions of the actions depicted in images.

	## Model fine-tuning 🏋️

	- Trained for 6 epochs
	- Learning rate: 5e-5
	- Adam optimizer
	- Half-precision (fp16)

	## Test set metrics 🧾

	| CIDEr | SacreBLEU | ROUGE-L |
	|--------|------------|--------|
	| 123.07 | 17.16 | 32.16 |
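
To give a feel for what the ROUGE-L column measures, here is a minimal pure-Python sketch of a sentence-level ROUGE-L F1 score based on the longest common subsequence (LCS). It is an illustration only and not the script used to produce the scores above: the helper names `lcs_length` and `rouge_l_f1` are hypothetical, tokenization is plain lowercase whitespace splitting, and the result is on a 0 to 1 scale, whereas the table appears to report values scaled by 100.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the LCS that ended just before both tokens
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best LCS seen so far
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 (equal weight on precision and recall)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("she is holding an umbrella", "she holds an umbrella")
print(f"ROUGE-L F1: {score:.4f}")
```

In this example the LCS is "she an umbrella" (3 tokens), so precision is 3/5 and recall is 3/4. Established evaluation packages apply their own tokenization and aggregation, so their numbers can differ from this sketch.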

	## Model in Action 🚀

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import BlipProcessor, BlipForConditionalGeneration

	# Use the GPU when available, otherwise fall back to the CPU
	device = "cuda" if torch.cuda.is_available() else "cpu"

	model_id = "michelecafagna26/blip-base-captioning-ft-hl-actions"
	processor = BlipProcessor.from_pretrained(model_id)
	model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)

	img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg'
	raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

	# Preprocess the image into the pixel tensor the model expects
	inputs = processor(raw_image, return_tensors="pt").to(device)
	pixel_values = inputs.pixel_values

	# Sample one caption with top-k / nucleus (top-p) sampling.
	# early_stopping only affects beam search, so it is omitted here.
	generated_ids = model.generate(pixel_values=pixel_values, max_length=50,
	                               do_sample=True,
	                               top_k=120,
	                               top_p=0.9,
	                               num_return_sequences=1)

	captions = processor.batch_decode(generated_ids, skip_special_tokens=True)
	print(captions)
	# Example output (sampling is stochastic, so captions vary between runs):
	# ['she is holding an umbrella.']
	```
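
The `top_k` and `top_p` arguments in the snippet above restrict which tokens can be sampled at each generation step. The following pure-Python sketch illustrates the idea on a toy probability distribution. It is a simplified illustration under stated assumptions, not the `transformers` implementation: the function name `top_k_top_p_filter` and the toy probabilities are hypothetical, and the library operates on logits rather than on a plain list of probabilities.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Return {token_index: probability} after top-k, then top-p filtering."""
    # Rank tokens from most to least probable and keep the top_k best
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:top_k]

    # Renormalize over the top-k survivors before applying top-p
    total_k = sum(p for _, p in ranked)
    ranked = [(idx, p / total_k) for idx, p in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p
    kept = []
    cumulative = 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize again so the kept probabilities sum to 1
    total_p = sum(p for _, p in kept)
    return {idx: p / total_p for idx, p in kept}


# Toy next-token distribution over a 4-token vocabulary (made-up values)
toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)

# Draw one token index from the filtered distribution
token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, token)
```

With these toy values, top-k drops the least likely token and top-p then keeps only the two most likely ones, so sampling can only return index 0 or 1. Larger `top_k` and `top_p` values keep more candidates and produce more varied captions.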

	## BibTeX and citation info

	```BibTeX
	@inproceedings{cafagna2023hl,
	title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
	author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
	booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
	address={Prague, Czech Republic},
	year={2023}
	}
	```