YOLO-World was designed to solve a limitation of existing zero-shot object detection models: speed. Whereas other state-of-the-art models use Transformers, a powerful but typically slower architecture, YOLO-World uses the faster CNN-based YOLO architecture.
YOLO-World provides three models: small with 13M (re-parametrized 77M), medium with 29M (re-parametrized 92M), and large with 48M (re-parametrized 110M) parameters.
The YOLO-World team benchmarked the models on the LVIS dataset and measured their performance on an NVIDIA V100 GPU without any acceleration mechanisms such as quantization or TensorRT.
According to the paper, YOLO-World reached 35.4 AP with 52.0 FPS for the L version and 26.2 AP with 74.1 FPS for the S version. While the V100 is a powerful GPU, achieving such high FPS on any device is impressive.
Key Steps:
1. Vehicle Detection: Before we jump into speed estimation, we begin by detecting moving vehicles. I demonstrate this using YOLOv8, deployed through the Inference pip package.
2. Tracking with ByteTrack: For effective object tracking, ByteTrack is my tool of choice. It assigns a unique ID to each vehicle, which is essential for accurately monitoring the distance each car travels. This forms the cornerstone of our speed calculation process.
3. Distance Calculation Complexities: Calculating traveled distance can be tricky due to perspective distortion from the camera. A car moving at a constant speed will appear to move a different number of pixels in the image, depending on its distance from the camera.
4. Vehicle Positioning: We can accurately pinpoint each vehicle's position within our monitored area. By representing each vehicle with x and y coordinates in meters, we can compare its current and past positions, paving the way for calculating its speed.
5. Speed Calculation: We store each car's positions over the last second, calculate the offset between the oldest and newest position, and divide it by the time delta to get the local speed.
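
To make the last step concrete, here is a minimal sketch of the speed calculation. It assumes each vehicle's position has already been mapped from pixels to meters (step 4) and that the tracker supplies a stable ID per vehicle (step 2). The `SpeedEstimator` class, its method names, and the 25 FPS frame rate are illustrative assumptions of mine, not part of any library.

```python
import math
from collections import defaultdict, deque


class SpeedEstimator:
    """Hypothetical helper: keeps about one second of positions per tracked vehicle."""

    def __init__(self, fps, window_seconds=1.0):
        self.fps = fps
        # N + 1 samples span N frame intervals, i.e. the full time window.
        self.max_samples = int(fps * window_seconds) + 1
        self.history = defaultdict(lambda: deque(maxlen=self.max_samples))

    def update(self, tracker_id, x_m, y_m):
        """Record a position in meters and return the speed in km/h, or None."""
        points = self.history[tracker_id]
        points.append((x_m, y_m))
        if len(points) < 2:
            return None  # a single position gives no displacement yet

        (x_old, y_old), (x_new, y_new) = points[0], points[-1]
        distance_m = math.hypot(x_new - x_old, y_new - y_old)
        elapsed_s = (len(points) - 1) / self.fps
        return distance_m / elapsed_s * 3.6  # m/s -> km/h


# Usage: a car moving 0.5 m per frame along the road at an assumed 25 FPS.
estimator = SpeedEstimator(fps=25)
speed_kmh = None
for frame in range(30):
    speed_kmh = estimator.update(tracker_id=1, x_m=0.0, y_m=0.5 * frame)

print(f"{speed_kmh:.1f} km/h")
```

Measuring the offset across a one-second window rather than between consecutive frames smooths out frame-to-frame jitter in the detections, at the cost of reacting a little more slowly to real changes in speed.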