
Fine-tuning YOLO-World

Fine-tuning YOLO-World is straightforward, and we provide examples for COCO object detection as a simple guide.

Fine-tuning Requirements

Fine-tuning YOLO-World is cheap:

It does not require 32 GPUs for multi-node distributed training; 8 GPUs or even 1 GPU is enough.
It does not require a long schedule, such as the 300 or 500 epochs used to train YOLOv5 or YOLOv8. 80 epochs or fewer is enough, given that we provide good pre-trained weights.

Data Preparation

The fine-tuning dataset should have a format similar to that of the pre-training dataset. We suggest you refer to docs/data for more details about how to build the datasets:

If you fine-tune YOLO-World for closed-set / custom-vocabulary object detection, using MultiModalDataset with a text json is preferred.
If you fine-tune YOLO-World for open-vocabulary detection with rich texts or grounding tasks, using MixedGroundingDataset is preferred.
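
As a rough illustration of the text json mentioned above, the sketch below builds a class-text file for a custom vocabulary. It assumes the nested-list layout described in docs/data (one inner list of text prompts per category, in category order). The class names, the helper names, and the output path are hypothetical.

```python
import json
from pathlib import Path


def build_class_texts(class_names):
    """Wrap each class name in its own list: one list of prompts per category."""
    return [[name] for name in class_names]


def write_class_texts(class_names, path):
    """Write the nested list as JSON and return the path that was written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(build_class_texts(class_names)))
    return path


# Hypothetical custom vocabulary; the order is assumed to match the
# category order of the annotation file.
custom_classes = ["helmet", "safety vest", "forklift"]

if __name__ == "__main__":
    # Hypothetical output location, chosen to sit next to the COCO texts.
    print(write_class_texts(custom_classes, "data/texts/custom_class_texts.json"))
```

The resulting file can then be referenced through class_text_path in the dataset config.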

Hyper-parameters and Config

Please refer to the config for fine-tuning YOLO-World-L on COCO for more details.

Basic config file:

If the fine-tuning dataset contains mask annotations:

_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')

If the fine-tuning dataset doesn't contain mask annotations:

_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py')

Training Schemes:

Reducing the epochs and adjusting the learning rate

max_epochs = 80
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 16
close_mosaic_epochs = 10

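train_cfg = dict(
    max_epochs=max_epochs,
    # Validate every 5 epochs during the main training stage.
    val_interval=5,
    # For the last `close_mosaic_epochs` epochs, switch to the validation
    # interval inherited from the base config.
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])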
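
To make the effect of dynamic_intervals more concrete, here is a minimal sketch that lists the epochs at which validation would run under the settings above. It is a simplified model of how an epoch-based training loop might interpret these values, not the actual MMEngine implementation. The function name is hypothetical, and val_interval_stage2 = 1 is an assumed value for the base config.

```python
import bisect


def validation_epochs(max_epochs, val_interval, dynamic_intervals):
    """List the 1-based epochs after which validation would run.

    `dynamic_intervals` is a list of (milestone_epoch, interval) pairs;
    from each milestone onward, the paired interval replaces the previous one.
    """
    milestones = [0] + [milestone for milestone, _ in dynamic_intervals]
    intervals = [val_interval] + [interval for _, interval in dynamic_intervals]
    epochs = []
    for epoch in range(1, max_epochs + 1):
        # Pick the interval of the last milestone that is <= the current epoch.
        interval = intervals[bisect.bisect(milestones, epoch) - 1]
        if epoch % interval == 0 or epoch == max_epochs:
            epochs.append(epoch)
    return epochs


max_epochs = 80
close_mosaic_epochs = 10
val_interval_stage2 = 1  # assumption: value inherited from the base config

schedule = validation_epochs(
    max_epochs,
    val_interval=5,
    dynamic_intervals=[(max_epochs - close_mosaic_epochs, val_interval_stage2)])
```

Under these assumptions, validation runs every 5 epochs up to epoch 65 and then after every epoch from epoch 70 onward.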

Datasets:

coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='data/coco',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
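
For a custom dataset in COCO format, the same structure can be reused with different paths. The helper below is only a sketch to show which fields would change. The function name, the paths, and the class names are hypothetical, and passing class names through metainfo is an assumption based on the usual MMDetection/MMYOLO convention for custom datasets. In an actual config file you would write the resulting dict inline, as in the COCO example.

```python
def make_multimodal_dataset_cfg(data_root, ann_file, img_prefix,
                                class_text_path, class_names, pipeline):
    """Return a MultiModalDataset config dict mirroring the COCO example."""
    return dict(
        _delete_=True,
        type='MultiModalDataset',
        dataset=dict(
            type='YOLOv5CocoDataset',
            data_root=data_root,
            ann_file=ann_file,
            data_prefix=dict(img=img_prefix),
            # Assumption: custom class names are declared via `metainfo`.
            metainfo=dict(classes=tuple(class_names)),
            filter_cfg=dict(filter_empty_gt=False, min_size=32)),
        class_text_path=class_text_path,
        pipeline=pipeline)


# Hypothetical custom dataset; the empty list stands in for `train_pipeline`.
custom_train_dataset = make_multimodal_dataset_cfg(
    data_root='data/custom',
    ann_file='annotations/train.json',
    img_prefix='images/train/',
    class_text_path='data/texts/custom_class_texts.json',
    class_names=['helmet', 'safety vest', 'forklift'],
    pipeline=[])
```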

Fine-tuning without RepVL-PAN or Text Encoder 🚀

For further efficiency and simplicity, you can fine-tune an efficient version of YOLO-World without RepVL-PAN and the text encoder. This efficient version has an architecture and layers similar to those of the original YOLOv8, but we provide weights pre-trained on large-scale datasets. The pre-trained YOLO-World has strong generalization capabilities and is more robust than YOLOv8 trained on the COCO dataset.

You can refer to the config for Efficient YOLO-World for more details.

The efficient YOLO-World adopts EfficientCSPLayerWithTwoConv, and the text encoder can be removed during inference or when exporting models.


model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='EfficientCSPLayerWithTwoConv')))

Launch Fine-tuning!

It's easy:

./dist_train.sh <path/to/config> <NUM_GPUS> --amp