|
Age and gender recognition in real-world scenarios is challenging: environmental conditions vary, poses are complex, image quality differs, and the face may be partially or fully occluded. MiVOLO is a straightforward approach to age and gender estimation that builds on a recent vision transformer. It combines the two tasks in a unified dual-input/output model that uses full-body image data in addition to facial information. This improves the model's generalization and lets it produce satisfactory results even when the face is not visible in the image. In experiments on four popular benchmark datasets, the model achieves state-of-the-art performance while also running in real time. A new benchmark dataset based on images from the Open Images dataset is also introduced. Its ground-truth annotations were created by human annotators, and high accuracy was ensured through intelligent aggregation of their votes. The model's age recognition performance was further compared with human-level accuracy, and it clearly outperforms humans across most age ranges. Finally, the model is publicly available together with code for verification and inference, along with additional annotations for the datasets used and the new benchmark dataset. |