---
license: mit
language:
- en
---
|
# RDT-1B |
|
|
|
RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and three-view RGB image observations, RDT predicts the next 64 robot actions. RDT is inherently compatible with almost all modern mobile manipulators: single-arm or dual-arm, joint-space or EEF control, position or velocity commands, and even platforms with a mobile chassis.
|
|
|
All the [code](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer) and pre-trained model weights are licensed under the MIT license.
|
|
|
Please refer to our [project page](https://rdt-robotics.github.io/rdt-robotics/) and [paper]() for more information. |
|
|
|
## Model Details |
|
|
|
- **Developed by:** the RDT team from Tsinghua University
|
- **License:** MIT |
|
- **Language(s) (NLP):** en |
|
- **Model Architecture:** Diffusion Transformer |
|
- **Pre-training dataset:** a curated dataset aggregated from 46 robot datasets; see [here]() for details
|
- **Repository:** https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer
|
- **Paper:** [paper_url]
|
- **Project Page:** https://rdt-robotics.github.io/rdt-robotics/ |
|
|
|
## Uses |
|
|
|
RDT takes a language instruction, image observations, and proprioception as input, and predicts the next 64 robot actions as vectors in a unified action space.

The unified action space covers the main physical quantities of a robot (e.g., end-effector and joint positions and velocities, and base movement), so the same representation can be applied to a wide range of robotic embodiments.
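
For intuition, here is a minimal, illustrative sketch of how one embodiment's quantities might be packed into a fixed-size unified vector, with entries the robot does not have left as zero padding. The dimension and index ranges below are assumptions for exposition only; the authoritative layout is defined in the RDT repository.

```python
import numpy as np

# Illustrative only: the unified dimension and index ranges are assumptions,
# not the layout actually used by RDT.
UNIFIED_DIM = 128            # assumed size of the unified action space
JOINT_POS = slice(0, 14)     # e.g., 7 joint positions per arm for a dual-arm robot
GRIPPER = slice(14, 16)      # one gripper opening per arm
BASE_VEL = slice(16, 19)     # mobile-base linear/angular velocity

def pack_unified(joint_pos, gripper, base_vel=None):
    """Place the quantities this embodiment has into the unified vector;
    everything else stays zero (padding for quantities the robot lacks)."""
    vec = np.zeros(UNIFIED_DIM, dtype=np.float32)
    vec[JOINT_POS] = joint_pos
    vec[GRIPPER] = gripper
    if base_vel is not None:
        vec[BASE_VEL] = base_vel
    return vec

# A fixed dual-arm manipulator: joints and grippers filled, base entries left at zero
state_vec = pack_unified(np.zeros(14), np.zeros(2))
```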
|
|
|
The pre-trained RDT model can be fine-tuned for a specific robotic embodiment and deployed on real-world robots.

Here is an example of how to use the RDT-1B model for inference on a Mobile-ALOHA robot:
|
|
|
```python
# Clone the repository and install its dependencies first,
# then run this script from the repository root.
from typing import List

import torch
from PIL import Image

from scripts.agilex_model import create_model

# Names of cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,  # Max length of one episode
    'state_dim': 14,      # Dimension of the robot's state
    'chunk_size': 64,     # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"

# Create the policy with the specified configuration
policy = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    control_frequency=25,
)

# Load pre-computed language embeddings
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']

# Gather the current observations
images: List[Image.Image] = ...  # Images from the last two frames of each camera
proprio = ...                    # The current robot state

# Perform inference to predict the next chunk_size actions
actions = policy.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding,
)
```
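
Note that `text_embeds` are pre-computed language embeddings rather than raw text. As a rough, non-authoritative sketch of how such an embedding file could be produced (the encoding script shipped in the RDT repository should be preferred), an instruction can be encoded with a T5-family text encoder and saved in the `{'embeddings': ...}` format loaded above; the encoder checkpoint name here is an assumption.

```python
# Hedged sketch: pre-computing a language embedding file for the snippet above.
# The encoder checkpoint and the saved dictionary layout are assumptions.
import torch
from transformers import AutoTokenizer, T5EncoderModel

encoder_name = "google/t5-v1_1-xxl"  # assumed language encoder
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = T5EncoderModel.from_pretrained(encoder_name, torch_dtype=torch.bfloat16)

instruction = "Pick up the bottle and place it into the box."
tokens = tokenizer(instruction, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)

# Save in the format that torch.load(...)['embeddings'] above expects
torch.save({"embeddings": embeddings}, "your/language/embedding/path")
```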
|
|
|
RDT-1B supports fine-tuning on custom datasets, deployment and inference on real robots, and pre-training from scratch.
Please refer to [our repository](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md) for guides on all of the above.
|
|
|
|
|
## Citation |
|
|
|
|
|
|
**BibTeX:** |
|
|
|
[More Information Needed] |