	---
	datasets:
	- UCSC-VLAA/Recap-DataComp-1B
	language:
	- en
	library_name: peft
	tags:
	- florence-2
	- lora
	- adapter
	- image-captioning
	- peft
	model-index:
	- name: Florence-2-DOCCI-FT
	  results:
	  - task:
	      type: image-to-text
	      name: Image Captioning
	    dataset:
	      name: foundation-multimodal-models/DetailCaps-4870
	      type: other
	    metrics:
	    - type: meteor
	      value: 0.240
	    - type: bleu
	      value: 0.150
	    - type: cider
	      value: 0.035
	    - type: capture
	      value: 0.553
	    - type: rouge-l
	      value: 0.294
	---

	# Florence-2 Recap-DataComp LoRA Adapter

	This repository contains a LoRA adapter trained on the UCSC-VLAA/Recap-DataComp-1B dataset for the Florence-2-base-FT model. It's designed to enhance the model's captioning capabilities, providing more detailed and descriptive image captions.

	## Usage

	To use this LoRA adapter, load it on top of the `microsoft/Florence-2-base-ft` model with the PEFT library. Here's an example:

	```python
	from PIL import Image
	from transformers import AutoProcessor, AutoModelForCausalLM
	from peft import PeftModel
	import requests

	def caption(image):
	    # Load the base model and its processor (Florence-2 ships custom modeling code)
	    base_model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
	    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
	    prompt = "<MORE_DETAILED_CAPTION>"
	    # Attach the LoRA adapter to the base model
	    adapter_name = "NikshepShetty/Florence-2-Recap-DataComp"
	    model = PeftModel.from_pretrained(base_model, adapter_name, trust_remote_code=True)
	    inputs = processor(text=prompt, images=image, return_tensors="pt")

	    # Generate with beam search (no sampling)
	    generated_ids = model.generate(
	        input_ids=inputs["input_ids"],
	        pixel_values=inputs["pixel_values"],
	        max_new_tokens=1024,
	        do_sample=False,
	        num_beams=3
	    )
	    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

	    # Convert the raw generated text into the task's output format
	    parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))

	    print(parsed_answer)

	url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
	image = Image.open(requests.get(url, stream=True).raw)
	caption(image)
	```

	This code demonstrates how to:
	1. Load the base Florence-2 model
	2. Load the LoRA adapter
	3. Process an image and generate a detailed caption
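
The example reloads the model on every call to `caption`. When captioning several images, it may be preferable to load everything once and reuse it. The following is a minimal sketch of that idea, not part of the original example: `load_captioner` and `caption_images` are hypothetical helper names, and the code assumes that `post_process_generation` returns a dict keyed by the task prompt.

```python
PROMPT = "<MORE_DETAILED_CAPTION>"

def load_captioner(base_id="microsoft/Florence-2-base-ft",
                   adapter_id="NikshepShetty/Florence-2-Recap-DataComp"):
    """Load the base model, processor, and LoRA adapter once (hypothetical helper)."""
    # Imported here so caption_images can be used without importing these eagerly.
    from transformers import AutoModelForCausalLM, AutoProcessor
    from peft import PeftModel

    base_model = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()  # inference only
    return model, processor

def caption_images(model, processor, images, prompt=PROMPT):
    """Return one caption string per image, reusing the already loaded model."""
    captions = []
    for image in images:
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            do_sample=False,
            num_beams=3,
        )
        text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        parsed = processor.post_process_generation(
            text, task=prompt, image_size=(image.width, image.height)
        )
        # Assumption: the parsed result is a dict keyed by the task prompt.
        captions.append(parsed[prompt])
    return captions
```

A typical call would be `model, processor = load_captioner()` followed by `caption_images(model, processor, [image])`, with `image` being a PIL image as in the example above.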

	Note: Make sure you have the required libraries installed: transformers, peft, einops, flash_attn, timm, Pillow, and requests.
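
As a quick sanity check before running the example, the sketch below reports which of the listed libraries cannot be found in the current environment. The helper name `missing_packages` is hypothetical, and the mapping from package name to import name (for instance Pillow being imported as `PIL`) is the only thing it relies on.

```python
from importlib.util import find_spec

# Package name -> module name used in `import` (Pillow is imported as PIL).
REQUIRED = {
    "transformers": "transformers",
    "peft": "peft",
    "einops": "einops",
    "flash_attn": "flash_attn",
    "timm": "timm",
    "Pillow": "PIL",
    "requests": "requests",
}

def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be located."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```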

	## Evaluation results

	With the `<MORE_DETAILED_CAPTION>` task prompt, evaluated on 1000 images from the foundation-multimodal-models/DetailCaps-4870 dataset, our LoRA adapter improves on the base Florence-2 model on every metric:

	| Metric  | Base Model | Adapted Model | Improvement |
	|---------|------------|---------------|-------------|
	| CAPTURE | 0.546      | 0.553         | +1.3%       |
	| METEOR  | 0.213      | 0.240         | +12.7%      |
	| BLEU    | 0.110      | 0.150         | +36.4%      |
	| CIDEr   | 0.031      | 0.035         | +12.9%      |
	| ROUGE-L | 0.275      | 0.294         | +6.9%       |

	These results indicate that our LoRA adapter improves the detailed captions produced by the Florence-2 base model on this benchmark, with the largest relative gains in BLEU (+36.4%), CIDEr (+12.9%), and METEOR (+12.7%), and a smaller gain in CAPTURE (+1.3%).
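
The Improvement column is consistent with the relative change of the adapted score over the base score, `(adapted - base) / base`. The short sketch below recomputes it from the scores in the table; the names `RESULTS` and `relative_improvement` are illustrative only.

```python
# (base score, adapted score) per metric, copied from the table above.
RESULTS = {
    "CAPTURE": (0.546, 0.553),
    "METEOR": (0.213, 0.240),
    "BLEU": (0.110, 0.150),
    "CIDEr": (0.031, 0.035),
    "ROUGE-L": (0.275, 0.294),
}

def relative_improvement(base, adapted):
    """Relative change of the adapted score over the base score, in percent."""
    return (adapted - base) / base * 100.0

for metric, (base, adapted) in RESULTS.items():
    print(f"{metric}: {relative_improvement(base, adapted):+.1f}%")
```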