---
library_name: transformers
license: mit
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
base_model:
- microsoft/Florence-2-base
tags:
- transformers
- florence2
- document-vqa
- vqa
- image-to-text
- multimodal
- question-answering
---

# Model Description

Florence-2 (`microsoft/Florence-2-base`) fine-tuned on the DocumentVQA dataset for question answering on document images.

- **[GitHub](https://github.com/sahilnishad/Fine-Tuning-Florence-2-DocumentVQA)**

# Get Started with the Model

#### 1. Installation
```python
# Notebook cell; drop the leading "!" when running in a shell.
!pip install torch transformers datasets flash_attn
```

#### 2. Loading model and processor
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True is required because Florence-2 ships custom model code.
model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

#### 3. Running inference
```python
def run_inference(task_prompt, question, image):
    """Answer `question` about a PIL `image`; `task_prompt` is prepended to the question."""
    prompt = task_prompt + question

    # The processor expects three-channel input.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    # Beam search without gradient tracking.
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )

    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text
```

#### 4. Example
```python
from datasets import load_dataset

data = load_dataset("HuggingFaceM4/DocumentVQA")

question = "What do you see in this image?"
image = data['train'][0]['image']  # first document image in the training split

print(run_inference("", question, image))
```

---

# BibTeX:
```bibtex
@misc{sahilnishad_florence_2_ft_docvqa,
  author = {Sahil Nishad},
  title = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
  year = {2024},
  url = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
  note = {Model available on HuggingFace Hub},
  howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}
```