metadata

title: OmniScience -- Agentic Imaging Analysis
emoji: 🔬🧫
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: openrail

Using LandingAI's Vision Agent to architect an app for brain tumor detection

A quick overview of the inner workings of LandingAI's Vision Agent: how it breaks down an initial user requirement to identify candidate components in the application architecture.
The diagram below captures what I had in mind for a multi-agent system implementation, but LandingAI's Vision Agent starts much earlier in the process, taking a fresh approach to old-school architecture trade-off analysis.
The design-time flow in the most recent version of the official Vision Agent app has shifted somewhat, but it helped me bring together a surprising number of concepts.
If you want a deeper understanding of the application's run-time flow, I encourage you to instrument it with Weave. Additional information on how to instrument the app can be found in this GitHub repo.

Summary of Vision Agent design-time workflow

MY PROMPT: use detection (bounding box) and segmentation (segmentation and mask) techniques to detect brain tumors in the image.

Creating instructions

Load the image from the given file path '/home/user/xxxx.jpg'.
Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
Save the final image with both bounding boxes and segmentation masks to a specified output path.
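
The six steps above map onto a short script. What follows is a minimal sketch, not the code Vision Agent generated. The tool functions are passed in through a hypothetical `tools` mapping so the sketch does not assume any import path, and the call order of arguments follows the signatures listed in the tool notes in this document. Drawing the masks on top of the already-boxed image is my assumption about how to get both overlays into one output.

```python
from typing import Any, Callable, Dict


def detect_and_segment(
    image_path: str,
    output_path: str,
    tools: Dict[str, Callable[..., Any]],
    prompt: str = "brain tumor",
) -> Dict[str, Any]:
    """Run the six planned steps using caller-supplied tool functions.

    `tools` is a hypothetical mapping from tool name to callable; it is an
    assumption of this sketch, not part of Vision Agent itself.
    """
    # Step 1: load the image from disk.
    image = tools["load_image"](image_path)

    # Steps 2-3: text-prompted detection and segmentation.
    detections = tools["owl_v2"](prompt, image)
    segments = tools["grounding_sam"](prompt, image)

    # Steps 4-5: draw boxes first, then masks on top of the boxed image.
    annotated = tools["overlay_bounding_boxes"](image, detections)
    annotated = tools["overlay_segmentation_masks"](annotated, segments)

    # Step 6: write the combined result.
    tools["save_image"](annotated, output_path)
    return {"detections": detections, "segments": segments}
```

With the real library, `tools` would hold the functions of the same names; passing them in also makes it easy to swap in stubs when testing the flow.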

Retrieving Vision Agent tools

'load_image' is a utility function that loads an image from the given file path string. 'save_image' is a utility function that saves an image to a file path.
'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their location as bounding boxes.
'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
'overlay_heat_map' is a utility function that displays a heat map on an image.
'overlay_segmentation_masks' is a utility function that displays segmentation masks.

Retrieving tools - detailed notes from Vision Agent tool selection

load_image(image_path: str) -> numpy.ndarray: 'load_image' is a utility function that loads an image from the given file path string.

Parameters:
    image_path (str): The path to the image.

Returns:
    np.ndarray: The image as a NumPy array.

Example
-------
    >>> load_image("path/to/image.jpg")

save_image(image: numpy.ndarray, file_path: str) -> None: 'save_image' is a utility function that saves an image to a file path.

Parameters:
    image (np.ndarray): The image to save.
    file_path (str): The path to save the image file.

Example
-------
    >>> save_image(image, "path/to/image.jpg")

owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]: 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.

Parameters:
    prompt (str): The prompt to ground to the image.
    image (np.ndarray): The image to ground the prompt to.
    box_threshold (float, optional): The threshold for the box detection. Defaults
        to 0.10.
    iou_threshold (float, optional): The threshold for the Intersection over Union
        (IoU). Defaults to 0.10.

Returns:
    List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
        bounding box of the detected objects with normalized coordinates between 0
        and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
        top-left and xmax and ymax are the coordinates of the bottom-right of the
        bounding box.

Example
-------
    >>> owl_v2("car, dinosaur", image)
    [
        {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
        {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
    ]
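
Because the boxes come back normalized to the 0-1 range, they have to be scaled by the image size before cropping or drawing by hand. A minimal sketch of that conversion follows; `bbox_to_pixels` is a hypothetical helper of mine, not a Vision Agent function, and rounding to the nearest pixel is my choice.

```python
from typing import List, Tuple


def bbox_to_pixels(
    bbox: List[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    xmin, ymin, xmax, ymax = bbox
    # x values scale with image width, y values with image height.
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


# A 640x480 image and the 'dinosaur' box from the example output.
print(bbox_to_pixels([0.1, 0.11, 0.35, 0.4], width=640, height=480))
# prints (64, 53, 224, 192)
```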

florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]: 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their location as bounding boxes.

Parameters:
    image (np.ndarray): The image used to detect objects.

Returns:
    List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
        bounding box of the detected objects with normalized coordinates between 0
        and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
        top-left and xmax and ymax are the coordinates of the bottom-right of the
        bounding box. The scores are always 1.0 and cannot be thresholded.

Example
-------
    >>> florencev2_object_detection(image)
    [
        {'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
        {'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
        {'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
    ]
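
Since every score is 1.0, the only practical way to narrow these results is by label. A small sketch of that filtering is shown here; `keep_labels` is a hypothetical helper name, and the input mirrors the example output format.

```python
from typing import Any, Dict, Iterable, List


def keep_labels(
    detections: List[Dict[str, Any]], wanted: Iterable[str]
) -> List[Dict[str, Any]]:
    """Return only the detections whose label is in `wanted`."""
    wanted_set = set(wanted)
    return [d for d in detections if d["label"] in wanted_set]


detections = [
    {"score": 1.0, "label": "window", "bbox": [0.1, 0.11, 0.35, 0.4]},
    {"score": 1.0, "label": "car", "bbox": [0.2, 0.21, 0.45, 0.5]},
    {"score": 1.0, "label": "person", "bbox": [0.34, 0.21, 0.85, 0.5]},
]
print([d["label"] for d in keep_labels(detections, {"car", "person"})])
# prints ['car', 'person']
```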

grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]: 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.

Parameters:
    prompt (str): The prompt to ground to the image.
    image (np.ndarray): The image to ground the prompt to.
    box_threshold (float, optional): The threshold for the box detection. Defaults
        to 0.20.
    iou_threshold (float, optional): The threshold for the Intersection over Union
        (IoU). Defaults to 0.20.

Returns:
    List[Dict[str, Any]]: A list of dictionaries containing the score, label,
        bounding box, and mask of the detected objects with normalized coordinates
        (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
        and xmax and ymax are the coordinates of the bottom-right of the bounding box.
        The mask is a binary 2D numpy array where 1 indicates the object and 0 indicates
        the background.

Example
-------
    >>> grounding_sam("car. dinosaur", image)
    [
        {
            'score': 0.99,
            'label': 'dinosaur',
            'bbox': [0.1, 0.11, 0.35, 0.4],
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
    ]
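
Both `owl_v2` and `grounding_sam` expose an `iou_threshold`. These notes do not say how the tools apply it internally, so the sketch below only shows the standard Intersection over Union calculation for two boxes in the (xmin, ymin, xmax, ymax) format. The function name `iou` is my own.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    # Width and height of the overlap rectangle; zero if the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Guard against degenerate zero-area boxes.
    return inter / union if union > 0 else 0.0
```

Two identical boxes give 1.0 and two disjoint boxes give 0.0, so a low threshold such as the 0.2 default treats even modest overlap as significant.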

detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]: 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.

Parameters:
    image (np.ndarray): The image used to segment things and objects

Returns:
    List[Dict[str, Any]]: A list of dictionaries containing the score, label
        and mask of the detected objects. The mask is a binary 2D numpy array where 1
        indicates the object and 0 indicates the background.

Example
-------
    >>> detr_segmentation(image)
    [
        {
            'score': 0.45,
            'label': 'window',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
        {
            'score': 0.70,
            'label': 'bird',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
    ]
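
Unlike the Florence-2 tool, these results carry usable scores, so low-confidence segments can be dropped after the call. The sketch below filters by score and counts object pixels in each kept mask. `confident_segments`, `min_score` and the added `area_px` key are hypothetical names of mine, and the usage example uses tiny nested lists in place of the real numpy masks to stay dependency-free.

```python
from typing import Any, Dict, List


def confident_segments(
    segments: List[Dict[str, Any]], min_score: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep segments scoring at least `min_score` and add their mask area."""
    kept = []
    for seg in segments:
        if seg["score"] < min_score:
            continue
        # Mask values are 0 or 1, so summing them counts object pixels.
        # int() keeps the count a plain Python integer for array inputs too.
        area = sum(int(v) for row in seg["mask"] for v in row)
        kept.append({**seg, "area_px": area})
    return kept


segments = [
    {"score": 0.45, "label": "window", "mask": [[0, 1], [1, 1]]},
    {"score": 0.70, "label": "bird", "mask": [[0, 0], [1, 1]]},
]
print([(s["label"], s["area_px"]) for s in confident_segments(segments)])
# prints [('bird', 2)]
```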

overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray: 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.

Parameters:
    image (np.ndarray): The image to display the bounding boxes on.
    bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding
        boxes.

Returns:
    np.ndarray: The image with the bounding boxes, labels and scores displayed.

Example
-------
    >>> image_with_bboxes = overlay_bounding_boxes(
        image, [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
    )

overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray: 'overlay_heat_map' is a utility function that displays a heat map on an image.

Parameters:
    image (np.ndarray): The image to display the heat map on.
    heat_map (Dict[str, Any]): A dictionary containing the heat map under the key
        'heat_map'.
    alpha (float, optional): The transparency of the overlay. Defaults to 0.8.

Returns:
    np.ndarray: The image with the heat map displayed.

Example
-------
    >>> image_with_heat_map = overlay_heat_map(
        image,
        {
            'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
        },
    )

overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray: 'overlay_segmentation_masks' is a utility function that displays segmentation masks.

Parameters:
    image (np.ndarray): The image to display the masks on.
    masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.

Returns:
    np.ndarray: The image with the masks displayed.

Example
-------
    >>> image_with_masks = overlay_segmentation_masks(
        image,
        [{
            'score': 0.99,
            'label': 'dinosaur',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        }],
    )

Vision Agent Tools - model summary

Any mistakes in the following table are mine; it reflects my QUICK attempt at reverse engineering to identify the target models.

Model Name	Hugging Face Model	Primary Function	Use Cases
OWL-ViT v2	google/owlv2-base-patch16-ensemble	Object detection and localization	Open-world object detection; locating specific objects based on text prompts
Florence-2	microsoft/florence-base	Multi-purpose vision tasks	Image captioning; visual question answering; object detection
Depth Anything V2	LiheYoung/depth-anything-v2-small	Depth estimation	Estimating depth in images; generating depth maps
CLIP	openai/clip-vit-base-patch32	Image-text similarity	Zero-shot image classification; image-text matching
BLIP	Salesforce/blip-image-captioning-base	Image captioning	Generating text descriptions of images
LOCA	Custom implementation	Object counting	Zero-shot object counting; object counting with visual prompts
GIT v2	microsoft/git-base-vqav2	Visual question answering and image captioning	Answering questions about image content; generating text descriptions of images
Grounding DINO	groundingdino/groundingdino-swint-ogc	Object detection and localization	Detecting objects based on text prompts
SAM	facebook/sam-vit-huge	Instance segmentation	Text-prompted instance segmentation
DETR	facebook/detr-resnet-50	Object detection	General object detection
ViT	google/vit-base-patch16-224	Image classification	General image classification; NSFW content detection
DPT	Intel/dpt-hybrid-midas	Monocular depth estimation	Estimating depth from single images