---
library_name: transformers
license: mit
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
base_model:
- microsoft/Florence-2-base
tags:
- transformers
- florence2
- document-vqa
- vqa
- image-to-text
- multimodal
- question-answering
---

# Model Description

A Florence-2 model (`microsoft/Florence-2-base`) fine-tuned on the DocumentVQA dataset to answer questions about document images.

- **[GitHub](https://github.com/sahilnishad/Fine-Tuning-Florence-2-DocumentVQA)**

# Get Started with the Model

#### 1. Installation

```python
!pip install torch transformers datasets flash_attn
```

#### 2. Loading model and processor

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True lets transformers run the custom Florence-2 code shipped with the model repository
model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

#### 3. Running inference

```python
def run_inference(task_prompt, question, image):
    # The task token (e.g. "<DocVQA>") is prepended directly to the question
    prompt = task_prompt + question

    # The processor expects three-channel RGB input
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    # Gradients are not needed for generation
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )

    # Decode the first (and only) sequence in the batch
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text
```

#### 4. Example

```python
from PIL import Image
from datasets import load_dataset

data = load_dataset("HuggingFaceM4/DocumentVQA")

# Ask a question about the first training image
question = "What do you see in this image?"
image = data['train'][0]['image']
print(run_inference("<DocVQA>", question, image))
```
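
The example above asks one generic question about one image. To try the model on the questions that ship with the dataset, a loop like the following may help. This is a minimal sketch, not part of the original repository: `answer_samples` is a hypothetical helper name, and the `question` and `answers` fields are assumed from the dataset schema. The case-insensitive exact match is only a rough sanity check, not the official DocVQA metric. The inference function is passed in as an argument, so a call could look like `answer_samples(run_inference, data['train'].select(range(5)))`.

```python
from typing import Any, Callable, Dict, Iterable, List


def answer_samples(
    infer_fn: Callable[[str, str, Any], str],
    samples: Iterable[Dict[str, Any]],
    task_prompt: str = "<DocVQA>",
) -> List[Dict[str, Any]]:
    """Run infer_fn on each sample and pair the output with its references.

    infer_fn is expected to have the same signature as run_inference:
    (task_prompt, question, image) -> generated text.
    """
    results = []
    for sample in samples:
        # Assumed fields: "question" (str) and "image" (PIL image)
        prediction = infer_fn(task_prompt, sample["question"], sample["image"]).strip()

        # Assumed field: "answers" is a list of acceptable strings (may be missing or None)
        references = sample.get("answers") or []

        # Rough check only: case-insensitive exact match against any reference
        normalized = {ref.strip().lower() for ref in references}
        results.append(
            {
                "question": sample["question"],
                "prediction": prediction,
                "references": references,
                "exact_match": prediction.lower() in normalized,
            }
        )
    return results
```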

---

# BibTeX:

```bibtex
@misc{sahilnishad_florence_2_ft_docvqa,
  author = {Sahil Nishad},
  title = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
  year = {2024},
  url = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
  note = {Model available on HuggingFace Hub},
  howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}
```