---
title: OmniScience -- Agentic Imaging Analysis
emoji: 🔬🧫
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: openrail
---
# Using Landing AI's Vision Agent to architect an app for brain tumor detection
- This is a quick overview of the inner workings of LandingAI's Vision Agent and how it breaks down an initial user requirement to identify candidate components in the application architecture.
- The diagram below captures what I had in mind for a multi-agent system implementation -- but LandingAI's Vision Agent starts this much earlier, taking a fresh approach to old-school architecture trade-off analysis.
- The design-time flow in the most recent version of the official [Vision Agent](https://va.landing.ai/agent) app has shifted somewhat, but the number of concepts it helped bring together for me was amazing.
- If you want a deeper understanding of the run-time flow of the application, I encourage you to instrument it with Weave (a minimal sketch follows the diagram below). Additional information on how to instrument the app can be found in [this GitHub repo](https://github.com/donbr/vision-agent).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/653d62fab16f657d28ce2cf2/KPV1Szj6IkY457n3Hqjl6.png)
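For the Weave instrumentation mentioned above, here is a minimal sketch of what tracing the app could look like. It assumes the W&B `weave` package; the project name and the wrapped function are placeholders, and the linked repo shows the actual setup.

```python
# Minimal Weave instrumentation sketch -- the project name and the wrapped
# function below are placeholders, not the app's actual code.
import weave

weave.init("vision-agent-brain-tumor")  # hypothetical W&B project name


@weave.op()
def run_tumor_pipeline(image_path: str) -> list:
    """Wrap the Vision Agent generated pipeline so each call is traced in Weave."""
    # Call the generated detection/segmentation code here; inputs, outputs,
    # and latency for every invocation show up in the Weave UI.
    ...
```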
# Summary of Vision Agent design-time workflow
MY PROMPT: use detection (bounding box) and segmentation (segmentation mask) techniques to detect brain tumors in the image.
## Creating instructions
- Load the image from the given file path '/home/user/xxxx.jpg'.
- Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
- Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
- Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
- Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
- Save the final image with both bounding boxes and segmentation masks to a specified output path.
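Before looking at the individual tools, here is my rough translation of those instructions into code. It is a sketch, not the agent's generated output: the `vision_agent.tools` import path and the input/output paths are my assumptions, while the function names and signatures come from the tool notes below.

```python
# Sketch of the plan above. Assumptions: the tools are importable from
# vision_agent.tools, and the input/output paths are placeholders.
from vision_agent.tools import (
    load_image,
    owl_v2,
    grounding_sam,
    overlay_bounding_boxes,
    overlay_segmentation_masks,
    save_image,
)


def detect_and_segment_brain_tumors(image_path: str, output_path: str):
    # 1. Load the image from the given file path.
    image = load_image(image_path)

    # 2. Detect brain tumors (bounding boxes) with owl_v2.
    detections = owl_v2("brain tumor", image)

    # 3. Segment brain tumors (masks) with grounding_sam.
    segmentations = grounding_sam("brain tumor", image)

    # 4. Overlay the bounding boxes on the original image.
    result = overlay_bounding_boxes(image, detections)

    # 5. Overlay the segmentation masks on top of the boxed image.
    result = overlay_segmentation_masks(result, segmentations)

    # 6. Save the final image to the specified output path.
    save_image(result, output_path)
    return detections, segmentations


# Hypothetical usage -- both paths are placeholders:
# detect_and_segment_brain_tumors("/home/user/xxxx.jpg", "output/tumor_overlay.jpg")
```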
## Retrieving Vision Agent tools
- 'load_image' is a utility function that loads an image from the given file path string.
- 'save_image' is a utility function that saves an image to a file path.
- 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
- 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their location as bounding boxes.
- 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
- 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
- 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
- 'overlay_heat_map' is a utility function that displays a heat map on an image.
- 'overlay_segmentation_masks' is a utility function that displays segmentation masks.
### Retrieving tools - detailed notes from Vision Agent tool selection

load_image(image_path: str) -> numpy.ndarray:
    'load_image' is a utility function that loads an image from the given file path string.

    Parameters:
        image_path (str): The path to the image.

    Returns:
        np.ndarray: The image as a NumPy array.

    Example
    -------
    >>> load_image("path/to/image.jpg")
save_image(image: numpy.ndarray, file_path: str) -> None:
    'save_image' is a utility function that saves an image to a file path.

    Parameters:
        image (np.ndarray): The image to save.
        file_path (str): The path to save the image file.

    Example
    -------
    >>> save_image(image, "path/to/image.jpg")
owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]:
    'owl_v2' is a tool that can detect and count multiple objects given a text
    prompt such as category names or referring expressions. The categories in text prompt
    are separated by commas. It returns a list of bounding boxes with
    normalized coordinates, label names and associated probability scores.

    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.10.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.10.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left and xmax and ymax are the coordinates of the bottom-right of the
            bounding box.

    Example
    -------
    >>> owl_v2("car. dinosaur", image)
    [
        {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
        {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
    ]
florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]:
    'florencev2_object_detection' is a tool that can detect common objects in an
    image without any text prompt or thresholding. It returns a list of detected objects
    as labels and their location as bounding boxes.

    Parameters:
        image (np.ndarray): The image used to detect objects.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left and xmax and ymax are the coordinates of the bottom-right of the
            bounding box. The scores are always 1.0 and cannot be thresholded.

    Example
    -------
    >>> florencev2_object_detection(image)
    [
        {'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
        {'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
        {'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
    ]
grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]:
    'grounding_sam' is a tool that can segment multiple objects given a
    text prompt such as category names or referring expressions. The categories in text
    prompt are separated by commas or periods. It returns a list of bounding boxes,
    label names, mask file names and associated probability scores.

    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.20.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.20.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
            bounding box, and mask of the detected objects with normalized coordinates
            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
            The mask is a binary 2D numpy array where 1 indicates the object and 0 indicates
            the background.

    Example
    -------
    >>> grounding_sam("car. dinosaur", image)
    [
        {
            'score': 0.99,
            'label': 'dinosaur',
            'bbox': [0.1, 0.11, 0.35, 0.4],
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
    ]
detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
    'detr_segmentation' is a tool that can segment common objects in an
    image without any text prompt. It returns a list of detected objects
    as labels, their regions as masks and their scores.

    Parameters:
        image (np.ndarray): The image used to segment things and objects.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label
            and mask of the detected objects. The mask is a binary 2D numpy array where 1
            indicates the object and 0 indicates the background.

    Example
    -------
    >>> detr_segmentation(image)
    [
        {
            'score': 0.45,
            'label': 'window',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
        {
            'score': 0.70,
            'label': 'bird',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
    ]
overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray:
    'overlay_bounding_boxes' is a utility function that displays bounding boxes on
    an image.

    Parameters:
        image (np.ndarray): The image to display the bounding boxes on.
        bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding
            boxes.

    Returns:
        np.ndarray: The image with the bounding boxes, labels and scores displayed.

    Example
    -------
    >>> image_with_bboxes = overlay_bounding_boxes(
        image, [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
    )
overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray:
    'overlay_heat_map' is a utility function that displays a heat map on an image.

    Parameters:
        image (np.ndarray): The image to display the heat map on.
        heat_map (Dict[str, Any]): A dictionary containing the heat map under the key
            'heat_map'.
        alpha (float, optional): The transparency of the overlay. Defaults to 0.8.

    Returns:
        np.ndarray: The image with the heat map displayed.

    Example
    -------
    >>> image_with_heat_map = overlay_heat_map(
        image,
        {
            'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
        },
    )
overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray:
    'overlay_segmentation_masks' is a utility function that displays segmentation
    masks.

    Parameters:
        image (np.ndarray): The image to display the masks on.
        masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.

    Returns:
        np.ndarray: The image with the masks displayed.

    Example
    -------
    >>> image_with_masks = overlay_segmentation_masks(
        image,
        [{
            'score': 0.99,
            'label': 'dinosaur',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        }],
    )
## Vision Agent Tools - model summary
- Any mistakes in the following table are mine; it reflects a quick attempt at reverse engineering to identify the underlying models.
| Model Name | Hugging Face Model | Primary Function | Use Cases |
|---------------------|-------------------------------------|-------------------------------|--------------------------------------------------------------|
| OWL-ViT v2 | google/owlv2-base-patch16-ensemble | Object detection and localization | - Open-world object detection<br>- Locating specific objects based on text prompts |
| Florence-2 | microsoft/florence-base | Multi-purpose vision tasks | - Image captioning<br>- Visual question answering<br>- Object detection |
| Depth Anything V2 | LiheYoung/depth-anything-v2-small | Depth estimation | - Estimating depth in images<br>- Generating depth maps |
| CLIP | openai/clip-vit-base-patch32 | Image-text similarity | - Zero-shot image classification<br>- Image-text matching |
| BLIP | Salesforce/blip-image-captioning-base | Image captioning | - Generating text descriptions of images |
| LOCA | Custom implementation | Object counting | - Zero-shot object counting<br>- Object counting with visual prompts |
| GIT v2 | microsoft/git-base-vqav2 | Visual question answering and image captioning | - Answering questions about image content<br>- Generating text descriptions of images |
| Grounding DINO | groundingdino/groundingdino-swint-ogc | Object detection and localization | - Detecting objects based on text prompts |
| SAM | facebook/sam-vit-huge | Instance segmentation | - Text-prompted instance segmentation |
| DETR | facebook/detr-resnet-50 | Object detection | - General object detection |
| ViT | google/vit-base-patch16-224 | Image classification | - General image classification<br>- NSFW content detection |
| DPT | Intel/dpt-hybrid-midas | Monocular depth estimation | - Estimating depth from single images |
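As a quick way to sanity-check the table, one of these suspected backing models can be exercised directly with the Hugging Face `transformers` pipeline. The sketch below uses the OWLv2 checkpoint guessed above; the model mapping is my assumption, not something confirmed by LandingAI, and the image path is a placeholder.

```python
# Sketch: zero-shot object detection with the checkpoint I believe backs 'owl_v2'.
# The model mapping is a reverse-engineering guess, not confirmed by LandingAI.
from transformers import pipeline
from PIL import Image

detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlv2-base-patch16-ensemble",
)

image = Image.open("path/to/brain_scan.jpg")  # placeholder path
results = detector(image, candidate_labels=["brain tumor"])
for detection in results:
    print(detection["label"], round(detection["score"], 3), detection["box"])
```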