
Real-time DEtection Transformer (RT-DETR) landed in @huggingface transformers 🤩 with an Apache 2.0 license 😍
Do DETRs Beat YOLOs on Real-time Object Detection? Keep reading 👀

video_1

Short answer: they do!
📖 notebook, 🔖 models, 🔖 demo
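
If you want to try the model outside the notebook, here is a minimal sketch using the RT-DETR classes in transformers. The checkpoint name `PekingU/rtdetr_r50vd` and the 0.3 score threshold are assumptions for illustration (neither comes from this post), and `format_detections` is a hypothetical helper added here only to make the output readable.

```python
def format_detections(scores, labels, boxes, id2label):
    """Turn post-processed outputs into readable 'label: score [box]' lines."""
    lines = []
    for score, label, box in zip(scores, labels, boxes):
        rounded_box = [round(float(v), 2) for v in box]
        lines.append(f"{id2label[int(label)]}: {float(score):.2f} {rounded_box}")
    return lines


def detect_objects(image, checkpoint="PekingU/rtdetr_r50vd", threshold=0.3):
    """Run RT-DETR on a PIL image and return formatted detections.

    The default checkpoint and threshold are illustrative assumptions.
    """
    # Heavy dependencies are imported lazily so the helper above
    # stays usable without them.
    import torch
    from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

    processor = RTDetrImageProcessor.from_pretrained(checkpoint)
    model = RTDetrForObjectDetection.from_pretrained(checkpoint)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); post-processing expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return format_detections(
        result["scores"].tolist(),
        result["labels"].tolist(),
        result["boxes"].tolist(),
        model.config.id2label,
    )
```

Calling `detect_objects(image)` on any PIL image should give one line per detected object. Note that there is no NMS step anywhere in this pipeline: the post-processing only filters by score.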

YOLO models are known to be super fast for real-time computer vision, but they have a downside: their speed and accuracy are sensitive to NMS (non-maximum suppression) post-processing 🥲
Transformer-based models, on the other hand, are not as computationally efficient 🥲 Isn't there something in between? Enter RT-DETR!
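
To make the NMS point concrete, here is a small sketch of generic greedy NMS in plain Python. It is not taken from any particular YOLO implementation, and the boxes, scores and thresholds are made up for illustration. The same raw predictions give a different number of detections depending on the IoU threshold you pick.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep the best box, drop later boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes (IoU = 2/3) and one far away (made-up values).
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]

print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]: the second box is suppressed
print(nms(boxes, scores, iou_threshold=0.7))  # [0, 1, 2]: the second box survives
```

If the two overlapping boxes were two people standing close together, the stricter threshold would silently merge them, and the looser one would risk duplicates elsewhere. A DETR-style model predicts a fixed set of boxes directly, so it does not need this step.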

The authors combined a CNN backbone and a multi-scale hybrid encoder (combining convs and attention) with a transformer decoder ⇓

image_1

In the paper, the authors also claim that one can adjust speed by changing the number of decoder layers, without retraining. They also conduct many ablation studies and try different decoders (see below)

image_2
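
The idea of trading accuracy for speed by running fewer decoder layers can be sketched with a toy example. This is not the RT-DETR code: `ToyDecoder` and `make_refinement_layer` are hypothetical names, and each "layer" here simply moves a box estimate halfway toward a fixed target. The sketch assumes, as is common in DETR-style training, that every layer's output is a usable prediction, so inference can stop early with the same weights.

```python
def make_refinement_layer(target, step=0.5):
    """Return a toy 'decoder layer' that moves a box estimate toward `target`."""
    def layer(box):
        return [b + step * (t - b) for b, t in zip(box, target)]
    return layer


class ToyDecoder:
    """Stack of refinement layers; inference may stop after the first k layers."""

    def __init__(self, layers):
        self.layers = list(layers)

    def __call__(self, box, num_layers=None):
        # Use every layer by default, or only the first `num_layers`.
        active = self.layers if num_layers is None else self.layers[:num_layers]
        for layer in active:
            box = layer(box)
        return box


def l1_error(box, target):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(b - t) for b, t in zip(box, target))


target = [10.0, 20.0, 110.0, 220.0]   # made-up ground-truth box
initial = [0.0, 0.0, 100.0, 200.0]    # made-up initial query box
decoder = ToyDecoder(make_refinement_layer(target) for _ in range(6))

# Same "weights", different depth: fewer layers cost less but refine less.
for k in (2, 4, 6):
    print(k, l1_error(decoder(initial, num_layers=k), target))
```

In this toy the error halves with every extra layer, while the cost grows linearly with the number of layers run. The real trade-off is the one measured in the paper's ablations.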

The authors find that the model beats the previous state of the art in both speed and accuracy 🤩

image_3

According to the authors' findings, it outperforms comparable YOLO detectors and also beats earlier DETR variants that use the same backbone

Resources:
DETRs Beat YOLOs on Real-time Object Detection by Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen (2023) GitHub
Hugging Face documentation

Original tweet (July 1, 2024)