---
library_name: transformers
license: mit
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
base_model:
- microsoft/Florence-2-base
tags:
- transformers
- florence2
- document-vqa
- vqa
- image-to-text
- multimodal
- question-answering
---


# Model Description
A Florence-2 model (`microsoft/Florence-2-base`) fine-tuned on the DocumentVQA dataset (`HuggingFaceM4/DocumentVQA`) to answer questions about document images.
- **[GitHub](https://github.com/sahilnishad/Fine-Tuning-Florence-2-DocumentVQA)**

# Get Started with the Model
#### 1. Installation
```python
!pip install torch transformers datasets flash_attn
```
#### 2. Loading model and processor
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True lets transformers run the custom Florence-2 code shipped with the checkpoint
model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
#### 3. Running inference
```python
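def run_inference(task_prompt, question, image):
    """Answer `question` about a PIL `image`; `task_prompt` is the task token, e.g. "<DocVQA>"."""
    # Concatenate the task token and the question into a single prompt
    prompt = task_prompt + question

    # Convert grayscale, palette, or RGBA images to RGB before preprocessing
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    # Generate with beam search; no gradients are needed at inference time
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )
    # Decode the single generated sequence and drop special tokens
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text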
```
#### 4. Example
```python
from PIL import Image
from datasets import load_dataset

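# Load the DocumentVQA dataset from the Hugging Face Hub
data = load_dataset("HuggingFaceM4/DocumentVQA")

question = "What do you see in this image?"
# Take the first training sample; its "image" column holds a PIL image
image = data['train'][0]['image']
# "<DocVQA>" is the task token that run_inference prepends to the question
print(run_inference("<DocVQA>", question, image))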
```
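To check an answer from `run_inference` against reference answers, one option is ANLS (Average Normalized Levenshtein Similarity), the metric commonly reported for DocVQA. This model card does not state how the model was evaluated, so the following is only a minimal sketch of the per-question score; the helper names `levenshtein` and `anls_score` are hypothetical, and the 0.5 threshold is the usual default rather than a value taken from this repository.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls_score(prediction: str, references: Sequence[str], threshold: float = 0.5) -> float:
    """Per-question ANLS: best similarity over all reference answers.

    Similarity is 1 - normalized edit distance, and is set to 0 when the
    normalized distance reaches the threshold (assumed default: 0.5).
    """
    pred = prediction.strip().lower()
    best = 0.0
    for reference in references:
        ref = reference.strip().lower()
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        similarity = 1.0 - distance if distance < threshold else 0.0
        best = max(best, similarity)
    return best
```

Assuming each dataset row exposes its reference answers as a list of strings (for example an `answers` column, which is an assumption about the dataset schema), a single prediction could be scored with `anls_score(run_inference("<DocVQA>", row["question"], row["image"]), row["answers"])`, and the dataset-level ANLS is the mean of these per-question scores.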
---

# BibTeX:
```bibtex
@misc{sahilnishad_florence_2_ft_docvqa,
  author       = {Sahil Nishad},
  title        = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
  year         = {2024},
  url          = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
  note         = {Model available on HuggingFace Hub},
  howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}
```