How can I pass an image to an LLM for analysis and get output as text, bounding boxes, or coordinates? My current use case is detecting specific UI elements, such as cards on a landing page, but the model isn't identifying them. How can I improve the detection process so it recognizes such elements more reliably?
Are you using the "detect" keyword? Keep the prompt as simple as possible: detect X.
Examples:
detect 'calendar'
detect 'submit' button
detect 'file' drop down menu
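If the checkpoint you're using emits PaliGemma-style location tokens, the answer to a `detect X` prompt looks something like `<loc0102><loc0256><loc0512><loc0768> submit button`. This is a sketch of a parser under that assumption (four `<locNNNN>` tokens per box, values binned 0–1023, in y_min, x_min, y_max, x_max order — verify the format and ordering for your specific model):

```python
import re

# Assumption: the model answers "detect X" with PaliGemma-style location
# tokens, e.g. "<loc0102><loc0256><loc0512><loc0768> submit button".
# Each <locNNNN> value is a coordinate bin in the range 0-1023, emitted
# in y_min, x_min, y_max, x_max order.
LOC_TOKEN = re.compile(r"<loc(\d{4})>")

def parse_detection(answer: str):
    """Return ((y_min, x_min, y_max, x_max) normalized to 0-1, label),
    or None if the answer contains no complete bounding box."""
    values = [int(v) for v in LOC_TOKEN.findall(answer)]
    if len(values) < 4:
        return None
    y_min, x_min, y_max, x_max = (v / 1023 for v in values[:4])
    label = LOC_TOKEN.sub("", answer).strip()
    return (y_min, x_min, y_max, x_max), label

box, label = parse_detection("<loc0102><loc0256><loc0512><loc0768> submit button")
```

If `parse_detection` returns `None`, the model likely did not find the element — that is the case where simplifying the prompt or raising the image resolution tends to help.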
Make sure the image resolution is not too low, and consider saving it in a lossless format (e.g. PNG) so compression artifacts don't blur small UI elements.
Consider using the 896 model, as it performs better and handles finer detail.
Hey @iiBLACKii - as Dan said, you only need to prompt the model with something like 'detect X'.
So, for instance, for the image you provided, 'detect start your project button' should give you the normalized coordinates of that button.
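Normalized coordinates need to be scaled back to the original image size before you can draw or crop the detected element. A minimal sketch, assuming the box is normalized to 0–1 in y_min, x_min, y_max, x_max order (check the ordering your model actually uses):

```python
def to_pixels(box, width, height):
    """Scale a (y_min, x_min, y_max, x_max) box normalized to 0-1
    into integer pixel coordinates for an image of the given size."""
    y_min, x_min, y_max, x_max = box
    return (round(y_min * height), round(x_min * width),
            round(y_max * height), round(x_max * width))

# e.g. a box covering the centre of a 1280x800 screenshot
print(to_pixels((0.25, 0.25, 0.75, 0.75), 1280, 800))
# -> (200, 320, 600, 960)
```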
Is it possible to detect UI elements themselves, without naming them in a 'detect X' prompt?