How can I pass an image to an LLM for analysis and get output as text, bounding boxes, or coordinates? My current use case is detecting specific UI elements, such as cards on a landing page, but the model isn't identifying them. How can I improve the detection process so it recognizes such elements more reliably?
Are you using the "detect" keyword? Keep the prompt as simple as possible: detect X.
Examples:
detect 'calendar'
detect 'submit' button
detect 'file' drop down menu
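If the checkpoint you're using emits PaliGemma-style location tokens, the answer to a `detect X` prompt looks something like `<loc0102><loc0256><loc0512><loc0768> submit button`. This is a sketch of a parser under that assumption (four `<locNNNN>` tokens per box, values binned 0–1023, in y_min, x_min, y_max, x_max order — verify the format and ordering for your specific model):

```python
import re

# Assumption: the model answers "detect X" with PaliGemma-style location
# tokens, e.g. "<loc0102><loc0256><loc0512><loc0768> submit button".
# Each <locNNNN> value is a coordinate bin in the range 0-1023, emitted
# in y_min, x_min, y_max, x_max order.
LOC_TOKEN = re.compile(r"<loc(\d{4})>")

def parse_detection(answer: str):
    """Return ((y_min, x_min, y_max, x_max) normalized to 0-1, label),
    or None if the answer contains no complete bounding box."""
    values = [int(v) for v in LOC_TOKEN.findall(answer)]
    if len(values) < 4:
        return None
    y_min, x_min, y_max, x_max = (v / 1023 for v in values[:4])
    label = LOC_TOKEN.sub("", answer).strip()
    return (y_min, x_min, y_max, x_max), label

box, label = parse_detection("<loc0102><loc0256><loc0512><loc0768> submit button")
```

If `parse_detection` returns `None`, the model likely did not find the element — that is the case where simplifying the prompt or raising the image resolution tends to help.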
Make sure the image resolution is not too low, and consider saving it in a lossless format (e.g. PNG) so compression artifacts don't blur small UI elements.
Consider using the 896 model, as it performs better and handles finer detail.
Hey @iiBLACKii - as Dan said, you only need to prompt the model with something like 'detect X'.
So, for instance, for the image you provided, 'detect start your project button' should give you the normalized coordinates of that button.
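Normalized coordinates need to be scaled back to the original image size before you can draw or crop the detected element. A minimal sketch, assuming the box is normalized to 0–1 in y_min, x_min, y_max, x_max order (check the ordering your model actually uses):

```python
def to_pixels(box, width, height):
    """Scale a (y_min, x_min, y_max, x_max) box normalized to 0-1
    into integer pixel coordinates for an image of the given size."""
    y_min, x_min, y_max, x_max = box
    return (round(y_min * height), round(x_min * width),
            round(y_max * height), round(x_max * width))

# e.g. a box covering the centre of a 1280x800 screenshot
print(to_pixels((0.25, 0.25, 0.75, 0.75), 1280, 800))
# -> (200, 320, 600, 960)
```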
Is it possible to detect UI elements themselves, without naming them in a 'detect X' prompt?