---
title: OmniScience -- Agentic Imaging Analysis
emoji: 🔬🧫
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: openrail
---

# Using Landing AI's Vision Agent to architect an app for brain tumor detection

- a quick overview of the inner workings of LandingAI's Vision Agent and how it breaks down an initial user requirement to identify candidate components in the application architecture.
- the diagram below captures what I had in mind for a multi-agent system implementation -- but LandingAI's Vision Agent starts this much earlier, taking a fresh approach to old-school architecture trade-off analysis.
- the design-time flow in the most recent version of the official [Vision Agent](https://va.landing.ai/agent) app has shifted somewhat, but the number of concepts it helped bring together for me was amazing.
- if you want a deeper understanding of the run-time flow of the application, I encourage you to instrument it with Weave. Additional information on how to instrument the app can be found in [this GitHub repo](https://github.com/donbr/vision-agent).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/653d62fab16f657d28ce2cf2/KPV1Szj6IkY457n3Hqjl6.png)

# Summary of Vision Agent design-time workflow

MY PROMPT: use detection (bounding box) and segmentation (segmentation and mask) techniques to detect brain tumors in the image.

## Creating instructions

- Load the image from the given file path '/home/user/xxxx.jpg'.
- Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
- Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
- Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
- Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
- Save the final image with both bounding boxes and segmentation masks to a specified output path.

## Retrieving Vision Agent tools

- 'load_image' is a utility function that loads an image from the given file path string. 'save_image' is a utility function that saves an image to a file path.
- 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
- 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their locations as bounding boxes.
- 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
- 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
- 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
- 'overlay_heat_map' is a utility function that displays a heat map on an image.
- 'overlay_segmentation_masks' is a utility function that displays segmentation masks.

### Retrieving tools - detailed notes from Vision Agent tool selection

    load_image(image_path: str) -> numpy.ndarray:
        'load_image' is a utility function that loads an image from the given file path string.

        Parameters:
            image_path (str): The path to the image.

        Returns:
            np.ndarray: The image as a NumPy array.
        Example
        -------
        >>> load_image("path/to/image.jpg")

    save_image(image: numpy.ndarray, file_path: str) -> None:
        'save_image' is a utility function that saves an image to a file path.

        Parameters:
            image (np.ndarray): The image to save.
            file_path (str): The path to save the image file.

        Example
        -------
        >>> save_image(image)

    owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]:
        'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as
        category names or referring expressions. The categories in the text prompt are separated by
        commas. It returns a list of bounding boxes with normalized coordinates, label names and
        associated probability scores.

        Parameters:
            prompt (str): The prompt to ground to the image.
            image (np.ndarray): The image to ground the prompt to.
            box_threshold (float, optional): The threshold for the box detection. Defaults to 0.10.
            iou_threshold (float, optional): The threshold for the Intersection over Union (IoU). Defaults to 0.10.

        Returns:
            List[Dict[str, Any]]: A list of dictionaries containing the score, label, and bounding
            box of the detected objects with normalized coordinates between 0 and 1
            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left and xmax
            and ymax are the coordinates of the bottom-right of the bounding box.

        Example
        -------
        >>> owl_v2("car. dinosaur", image)
        [
            {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
            {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
        ]

    florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]:
        'florencev2_object_detection' is a tool that can detect common objects in an image without
        any text prompt or thresholding. It returns a list of detected objects as labels and their
        location as bounding boxes.

        Parameters:
            image (np.ndarray): The image used to detect objects.

        Returns:
            List[Dict[str, Any]]: A list of dictionaries containing the score, label, and bounding
            box of the detected objects with normalized coordinates between 0 and 1
            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left and xmax
            and ymax are the coordinates of the bottom-right of the bounding box. The scores are
            always 1.0 and cannot be thresholded.

        Example
        -------
        >>> florencev2_object_detection(image)
        [
            {'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
            {'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
            {'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
        ]

    grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]:
        'grounding_sam' is a tool that can segment multiple objects given a text prompt such as
        category names or referring expressions. The categories in the text prompt are separated by
        commas or periods. It returns a list of bounding boxes, label names, mask file names and
        associated probability scores.

        Parameters:
            prompt (str): The prompt to ground to the image.
            image (np.ndarray): The image to ground the prompt to.
            box_threshold (float, optional): The threshold for the box detection. Defaults to 0.20.
            iou_threshold (float, optional): The threshold for the Intersection over Union (IoU). Defaults to 0.20.

        Returns:
            List[Dict[str, Any]]: A list of dictionaries containing the score, label, bounding box,
            and mask of the detected objects with normalized coordinates (xmin, ymin, xmax, ymax).
            xmin and ymin are the coordinates of the top-left and xmax and ymax are the coordinates
            of the bottom-right of the bounding box. The mask is a binary 2D numpy array where 1
            indicates the object and 0 indicates the background.

        Example
        -------
        >>> grounding_sam("car. dinosaur", image)
        [
            {
                'score': 0.99,
                'label': 'dinosaur',
                'bbox': [0.1, 0.11, 0.35, 0.4],
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
        ]

    detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
        'detr_segmentation' is a tool that can segment common objects in an image without any text
        prompt. It returns a list of detected objects as labels, their regions as masks and their
        scores.

        Parameters:
            image (np.ndarray): The image used to segment things and objects.

        Returns:
            List[Dict[str, Any]]: A list of dictionaries containing the score, label and mask of
            the detected objects. The mask is a binary 2D numpy array where 1 indicates the object
            and 0 indicates the background.

        Example
        -------
        >>> detr_segmentation(image)
        [
            {
                'score': 0.45,
                'label': 'window',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
            {
                'score': 0.70,
                'label': 'bird',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
        ]

    overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray:
        'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.

        Parameters:
            image (np.ndarray): The image to display the bounding boxes on.
            bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding boxes.

        Returns:
            np.ndarray: The image with the bounding boxes, labels and scores displayed.

        Example
        -------
        >>> image_with_bboxes = overlay_bounding_boxes(
            image,
            [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
        )

    overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray:
        'overlay_heat_map' is a utility function that displays a heat map on an image.

        Parameters:
            image (np.ndarray): The image to display the heat map on.
            heat_map (Dict[str, Any]): A dictionary containing the heat map under the key 'heat_map'.
            alpha (float, optional): The transparency of the overlay. Defaults to 0.8.

        Returns:
            np.ndarray: The image with the heat map displayed.

        Example
        -------
        >>> image_with_heat_map = overlay_heat_map(
            image,
            {
                'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
            },
        )

    overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray:
        'overlay_segmentation_masks' is a utility function that displays segmentation masks.

        Parameters:
            image (np.ndarray): The image to display the masks on.
            masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.

        Returns:
            np.ndarray: The image with the masks displayed.

        Example
        -------
        >>> image_with_masks = overlay_segmentation_masks(
            image,
            [{
                'score': 0.99,
                'label': 'dinosaur',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            }],
        )

## Vision Agent Tools - model summary

- any mistakes in the following table are mine -- the result of some QUICK reverse engineering to identify the target models.

| Model Name | Hugging Face Model | Primary Function | Use Cases |
|------------|--------------------|------------------|-----------|
| OWL-ViT v2 | google/owlv2-base-patch16-ensemble | Object detection and localization | Open-world object detection<br>Locating specific objects based on text prompts |
| Florence-2 | microsoft/florence-base | Multi-purpose vision tasks | Image captioning<br>Visual question answering<br>Object detection |
| Depth Anything V2 | LiheYoung/depth-anything-v2-small | Depth estimation | Estimating depth in images<br>Generating depth maps |
| CLIP | openai/clip-vit-base-patch32 | Image-text similarity | Zero-shot image classification<br>Image-text matching |
| BLIP | Salesforce/blip-image-captioning-base | Image captioning | Generating text descriptions of images |
| LOCA | Custom implementation | Object counting | Zero-shot object counting<br>Object counting with visual prompts |
| GIT v2 | microsoft/git-base-vqav2 | Visual question answering and image captioning | Answering questions about image content<br>Generating text descriptions of images |
| Grounding DINO | groundingdino/groundingdino-swint-ogc | Object detection and localization | Detecting objects based on text prompts |
| SAM | facebook/sam-vit-huge | Instance segmentation | Text-prompted instance segmentation |
| DETR | facebook/detr-resnet-50 | Object detection | General object detection |
| ViT | google/vit-base-patch16-224 | Image classification | General image classification<br>NSFW content detection |
| DPT | Intel/dpt-hybrid-midas | Monocular depth estimation | Estimating depth from single images |
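
## Putting it together - a sketch of the generated program

To tie the pieces together, here is a self-contained sketch of the detect → segment → overlay → save flow the generated instructions describe. The `owl_v2` and `grounding_sam` functions below are hypothetical stand-ins that only echo the documented return shapes (the real tools ship with LandingAI's vision-agent package and call models from the table above), and the overlay helpers are my own simplification of what utilities like `overlay_bounding_boxes` and `overlay_segmentation_masks` have to do with normalized `[xmin, ymin, xmax, ymax]` boxes and binary uint8 masks -- not the library's actual implementation.

```python
import numpy as np

# --- Hypothetical stand-ins for the Vision Agent model tools. ----------------
# The real owl_v2 / grounding_sam invoke the models in the table above; these
# stubs only reproduce the documented return shapes so the sketch runs.
def owl_v2(prompt: str, image: np.ndarray) -> list[dict]:
    # Pretend one object matching the prompt occupies the image centre.
    return [{"score": 0.97, "label": prompt, "bbox": [0.4, 0.4, 0.6, 0.6]}]

def grounding_sam(prompt: str, image: np.ndarray) -> list[dict]:
    h, w = image.shape[:2]
    detections = owl_v2(prompt, image)
    for det in detections:
        xmin, ymin, xmax, ymax = det["bbox"]
        mask = np.zeros((h, w), dtype=np.uint8)  # binary mask, 1 = object
        mask[int(ymin * h):int(ymax * h), int(xmin * w):int(xmax * w)] = 1
        det["mask"] = mask
    return detections

# --- Minimal overlay helpers (my simplification, not the library's code). ----
def overlay_bounding_boxes(image: np.ndarray, bboxes: list[dict]) -> np.ndarray:
    """Draw 1-px red rectangles from normalized [xmin, ymin, xmax, ymax]."""
    out = image.copy()
    h, w = image.shape[:2]
    for det in bboxes:
        xmin, ymin, xmax, ymax = det["bbox"]
        x0, y0 = int(xmin * w), int(ymin * h)  # denormalize to pixel coords
        x1, y1 = int(xmax * w), int(ymax * h)
        out[y0, x0:x1] = (255, 0, 0)      # top edge
        out[y1 - 1, x0:x1] = (255, 0, 0)  # bottom edge
        out[y0:y1, x0] = (255, 0, 0)      # left edge
        out[y0:y1, x1 - 1] = (255, 0, 0)  # right edge
    return out

def overlay_segmentation_masks(image: np.ndarray, masks: list[dict],
                               alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend a green tint wherever a binary mask is 1."""
    out = image.astype(np.float32)
    color = np.array([0, 255, 0], dtype=np.float32)
    for det in masks:
        sel = det["mask"] == 1
        out[sel] = (1 - alpha) * out[sel] + alpha * color
    return out.astype(np.uint8)

# --- The pipeline from the generated instructions. ---------------------------
def detect_brain_tumors(image: np.ndarray) -> np.ndarray:
    detections = owl_v2("brain tumor", image)         # step 2: bounding boxes
    segments = grounding_sam("brain tumor", image)    # step 3: masks
    out = overlay_bounding_boxes(image, detections)   # step 4: draw boxes
    out = overlay_segmentation_masks(out, segments)   # step 5: blend masks
    return out  # step 6 would be save_image(out, output_path)

image = np.zeros((100, 100, 3), dtype=np.uint8)  # stand-in for the loaded scan
result = detect_brain_tumors(image)
```

Swapping the stubs for the real vision-agent tool calls leaves the orchestration unchanged, which is exactly why the planner decomposes the request into these tool-shaped steps.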