Is a special instruction needed to trigger the OCR location function?
#38 opened by liupei0408
As mentioned above, is a special instruction needed for the OCR location feature with Fuyu-8b to get the same result as shown in the blog?
+1
Hi @liupei0408, @Nooodles: you can try this from the new release of transformers! @pcuenq worked on the bbox postprocessing, you can localise text by doing:
import io

import requests
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

# Load the processor and the model
pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')

# The instruction is followed by the text to localise (here "Williams")
bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n Williams"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))

# Generate, then convert the raw box tokens into coordinates on the original image
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')
outputs = model.generate(**model_inputs, max_new_tokens=10)
post_processed_bbox_tokens = processor.post_process_box_coordinates(outputs)[0]
model_outputs = processor.decode(post_processed_bbox_tokens, skip_special_tokens=True)

# Keep only the text after the '\x04' separator, i.e. the generated answer
prediction = model_outputs.split('\x04', 1)[1] if '\x04' in model_outputs else ''
print(prediction)
Running this will output the coordinates of the text Williams in the image.
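If you want the coordinates as numbers rather than a string, something like the sketch below may help. It assumes the decoded prediction contains a span of the form `<box>a, b, c, d</box>` with four integers; that format, the order of the four values, and the helper name `parse_box` are assumptions on my side, so check them against what your transformers version actually prints.

```python
import re
from typing import Optional, Tuple

# Assumed output format: "<box>a, b, c, d</box>" with four integers.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_box(prediction: str) -> Optional[Tuple[int, ...]]:
    """Return the four integers of the first <box>...</box> span, or None.

    Hypothetical helper, not part of transformers. The meaning of each
    value (for example top, left, bottom, right) is not checked here.
    """
    match = BOX_PATTERN.search(prediction)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


# Made-up example string, not real model output.
example = " <box>10, 20, 30, 40</box>"
print(parse_box(example))
```

Returning `None` when no box is found mirrors the snippet above, where `prediction` falls back to an empty string if the separator is missing.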