docs/prompt_yolo_world.md · wondervictor/YOLO-World-Image at main

Prompt YOLO-World

1. Simple YOLO-World with Embeddings

For simplifying YOLO-World and get rid of the language model, we define a new basic detector SimpleYOLOWorldDetector:

The SimpleYOLOWorldDetector supports prompt embeddings as the input and doesn't not contain a language model anymore! Now, YOLO-World adopts embeddings as language inputs, and the embeddings support several kinds: (1) text embeddings from the language model, e.g., CLIP language encoder, (2) image embeddings from a vision model, e.g., CLIP vision encoder, and (3) image-text fused embeddings, and (4) random embeddings. The (1)(2)(3) supports zero-shot inference and (4), including (1)(2)(3) are designed for prompt tuning on your custom data.

The basic detector is defined as follows:

class SimpleYOLOWorldDetector(YOLODetector):
    """Implementation of YOLO World Series"""

    def __init__(self,
                 *args,
                 mm_neck: bool = False,
                 num_train_classes=80,
                 num_test_classes=80,
                 prompt_dim=512,
                 num_prompts=80,
                 embedding_path='',
                 freeze_prompt=False,
                 use_mlp_adapter=False,
                 **kwargs)

To use it in a zero-shot manner, you need to pre-compute the text embeddings (image embeddings) and save it as a numpy array (*.npy) with a NxD shape (N is the number of prompts and D is the dimension of the embeddings). Currently, we only support one prompt for one class. You can use several prompts for one class but you need to merge the results in the post-processing steps.

2. Prompt Tuning YOLO-World

We introduce prompt tuning for YOLO-World to maintain the zero-shot ability while improve the performance on your custom datasets.

For more details about writing configs for prompt tuning, you can refer to prompt tuning for COCO data.

Use random prompts

dict(type='SimpleYOLOWorldDetector',
             mm_neck=True,
             num_train_classes=num_training_classes,
             num_test_classes=num_classes,
             prompt_dim=text_channels,
             num_prompts=80,
             ...)

Use CLIP embeddings (text, image, or text-image embeddings)

the clip_vit_b32_coco_80_embeddings.npy can be downloaded at HuggingFace.

dict(type='SimpleYOLOWorldDetector',
             mm_neck=True,
             num_train_classes=num_training_classes,
             num_test_classes=num_classes,
             embedding_path='embeddings/clip_vit_b32_coco_80_embeddings.npy',
             prompt_dim=text_channels,
             num_prompts=80,
             ...)

Using CLIP model to obtains the image and text embeddings will maintain the zero-shot performace.

Model	Config	AP	AP50	AP75	APS	APM	APL
YOLO-World-v2-L	Zero-shot	45.7	61.6	49.8	29.9	50.0	60.8
YOLO-World-v2-L	Prompt tuning	47.9	64.3	52.5	31.9	52.6	61.3