Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Abstract
Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code and models are available at https://github.com/bdaiinstitute/theia.
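The abstract's hypothesis about feature norm distributions can be made concrete with a small sketch. The abstract does not say how the entropy is computed, so the following is an assumption for illustration only: it takes the L2 norm of each token's feature vector, bins the norms into a histogram, and reports the Shannon entropy of that histogram. The function names `feature_norms` and `norm_entropy` and the `num_bins` parameter are hypothetical and do not come from the Theia codebase.

```python
import math
from typing import List, Sequence


def feature_norms(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Return the L2 norm of each token's feature vector."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def norm_entropy(norms: Sequence[float], num_bins: int = 10) -> float:
    """Shannon entropy (in nats) of an equal-width histogram of the norms.

    Assumption: a histogram estimate stands in for whatever estimator
    the paper actually uses.
    """
    lo, hi = min(norms), max(norms)
    if hi == lo:
        # All norms identical: the distribution is a single spike.
        return 0.0
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for n in norms:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((n - lo) / width), num_bins - 1)
        counts[idx] += 1
    total = len(norms)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy example: two tokens with very different norms, two bins.
tokens = [[3.0, 4.0], [0.0, 0.0]]
print(norm_entropy(feature_norms(tokens), num_bins=2))
```

Under this estimate, a representation whose token norms spread evenly across bins scores higher than one whose norms cluster in a few bins. Note that the bin count affects the value, so comparisons between models only make sense with the same `num_bins`.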
Community
Theia builds a robot vision foundation model by distilling existing vision foundation models, which improves downstream robot learning performance while using a smaller model size.