---
datasets:
- UCSC-VLAA/Recap-DataComp-1B
language:
- en
library_name: peft
tags:
- florence-2
- lora
- adapter
- image-captioning
- peft
model-index:
- name: Florence-2-DOCCI-FT
  results:
  - task:
      type: image-to-text
      name: Image Captioning
    dataset:
      name: foundation-multimodal-models/DetailCaps-4870
      type: other
    metrics:
    - type: meteor
      value: 0.240
    - type: bleu
      value: 0.150
    - type: cider
      value: 0.035
    - type: capture
      value: 0.553
    - type: rouge-l
      value: 0.294
---

# Florence-2 Recap-DataComp LoRA Adapter

This repository contains a LoRA adapter for the Florence-2-base-ft model, trained on the UCSC-VLAA/Recap-DataComp-1B dataset. It is designed to make the model's image captions more detailed and descriptive.

## Usage

To use this LoRA adapter, load it on top of the Florence-2-base-ft model with the PEFT library. Here's an example:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from peft import PeftModel
import requests


def caption(image):
    # Load the base model and its processor (Florence-2 ships custom model code)
    base_model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    prompt = "<MORE_DETAILED_CAPTION>"

    # Attach the LoRA adapter to the base model
    adapter_name = "NikshepShetty/Florence-2-Recap-DataComp"
    model = PeftModel.from_pretrained(base_model, adapter_name, trust_remote_code=True)
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # Generate deterministically with beam search
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Parse the raw generation into the task-specific output
    parsed_answer = processor.post_process_generation(generated_text, task="<MORE_DETAILED_CAPTION>", image_size=(image.width, image.height))

    print(parsed_answer)


url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
caption(image)
```

This code demonstrates how to:

1. Load the base Florence-2 model
2. Load the LoRA adapter
3. Process an image and generate a detailed caption
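
The example prints `parsed_answer` as a whole. If you only want the caption text, a small helper like the one below may be convenient. This is a sketch that assumes `post_process_generation` returns a dictionary keyed by the task tag, which this README does not state explicitly; `extract_caption` and the placeholder input are hypothetical names used only for illustration.

```python
TASK = "<MORE_DETAILED_CAPTION>"


def extract_caption(parsed_answer, task=TASK):
    """Return the caption string stored under the task tag.

    Assumption: parsed_answer is a dict such as {"<MORE_DETAILED_CAPTION>": "..."}.
    """
    if not isinstance(parsed_answer, dict) or task not in parsed_answer:
        raise KeyError(f"no entry for task {task!r} in parsed output")
    # Trim surrounding whitespace left over from decoding
    return str(parsed_answer[task]).strip()


# Hypothetical stand-in for the dictionary printed by caption(image)
example = {TASK: "  a placeholder caption  "}
print(extract_caption(example))
```

If the processor returns a different structure in your environment, the helper raises a `KeyError` instead of silently returning the wrong value.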

Note: Make sure you have the required libraries installed: transformers, peft, einops, flash_attn, timm, Pillow, and requests.
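
To check which of these libraries are available before loading the model, something like the following may help. It is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list uses import names, assuming Pillow is imported as `PIL` and the others under the names listed in the note.

```python
import importlib.util

# Import names for the libraries listed in the note above.
# Assumption: Pillow is imported as "PIL"; the rest import under their listed names.
REQUIRED = ["transformers", "peft", "einops", "flash_attn", "timm", "PIL", "requests"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only looks for the packages; it does not import them or verify their versions.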

## Evaluation results

The LoRA adapter improves on the base Florence-2 model on all five metrics when prompted with the `<MORE_DETAILED_CAPTION>` task tag, evaluated on 1000 images from the foundation-multimodal-models/DetailCaps-4870 dataset:

| Metric  | Base Model | Adapted Model | Improvement |
|---------|------------|---------------|-------------|
| CAPTURE | 0.546      | 0.553         | +1.3%       |
| METEOR  | 0.213      | 0.240         | +12.7%      |
| BLEU    | 0.110      | 0.150         | +36.4%      |
| CIDEr   | 0.031      | 0.035         | +12.9%      |
| ROUGE-L | 0.275      | 0.294         | +6.9%       |

These results indicate that the LoRA adapter improves the detailed-captioning output of the Florence-2 base model, with the largest relative gain on BLEU (+36.4%) and the smallest on CAPTURE (+1.3%).
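
The Improvement column appears to be the relative change of the adapted score over the base score. The sketch below recomputes it from the table values under that assumption; `relative_improvement` is a hypothetical helper name.

```python
# (base, adapted) scores copied from the table above
scores = {
    "CAPTURE": (0.546, 0.553),
    "METEOR": (0.213, 0.240),
    "BLEU": (0.110, 0.150),
    "CIDEr": (0.031, 0.035),
    "ROUGE-L": (0.275, 0.294),
}


def relative_improvement(base, adapted):
    """Percentage change of the adapted score relative to the base score."""
    return (adapted - base) / base * 100.0


for metric, (base, adapted) in scores.items():
    print(f"{metric}: {relative_improvement(base, adapted):+.1f}%")
```

Rounded to one decimal place, these values match the percentages listed in the table.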