metadata

license: mit
language:
  - en
pipeline_tag: robotics
library_name: transformers
tags:
  - robotics
  - pytorch
  - diffusers
  - multimodal
  - pretraining
  - vla
  - diffusion
  - rdt

RDT-1B

RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and RGB images from up to three views, RDT predicts the next 64 robot actions. RDT is compatible with almost all modern mobile manipulators: single-arm or dual-arm, joint-space or end-effector (EEF) control, position or velocity control, and even platforms with a mobile chassis.

All the code and pre-trained model weights are licensed under the MIT license.

Please refer to our project page and paper for more information.

Model Details

Developed by: The RDT team consisting of researchers from the TSAIL group at Tsinghua University
Task Type: Vision-Language-Action (language, image => robot actions)
Model Type: Diffusion Policy with Transformers
License: MIT
Language(s) (NLP): en
Multi-Modal Encoders:
- Vision Backbone: siglip-so400m-patch14-384
- Language Model: t5-v1_1-xxl
Pre-Training Datasets: 46 datasets, including the RT-1 Dataset, RH20T, DROID, BridgeData V2, RoboSet, and a subset of Open X-Embodiment. See todo for a detailed list.
Repository: [repo_url]
Paper: [paper_url]
Project Page: https://rdt-robotics.github.io/rdt-robotics/

Uses

RDT takes a language instruction, RGB images (from up to three views), the control frequency (if any), and proprioception as input, and predicts the next 64 robot actions, each expressed as a unified action space vector. The unified action space vector covers all the main physical quantities of a robot manipulator (e.g., end-effector and joint positions and velocities, and base movement). To deploy RDT on your robot platform, pick the quantities relevant to your robot from the unified vector. See our repository for more information.
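As a rough illustration of picking quantities out of the unified vector, the sketch below uses a made-up layout. The vector length (`UNIFIED_DIM`), the index ranges in `LAYOUT`, the 0/1 mask, and the helper names are all assumptions invented for this example, not the mapping RDT actually uses; the real index mapping is defined in the repository. The dual-arm robot with 6 joints and 1 gripper per arm is likewise only an assumed example of a 14-dimensional state.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical values for illustration only; the real layout lives in the RDT repository.
UNIFIED_DIM = 128
LAYOUT: Dict[str, range] = {
    "right_arm_joint_pos": range(0, 6),
    "right_gripper_pos": range(10, 11),
    "left_arm_joint_pos": range(50, 56),
    "left_gripper_pos": range(60, 61),
}


def robot_indices(fields: Sequence[str]) -> List[int]:
    """Flatten the index ranges of the selected fields, in the given order."""
    return [i for name in fields for i in LAYOUT[name]]


def embed_state(state: Sequence[float], fields: Sequence[str]) -> Tuple[List[float], List[int]]:
    """Place a robot-specific state into a zero-padded unified vector.

    Returns the unified vector and a 0/1 mask marking the filled slots.
    """
    idx = robot_indices(fields)
    if len(state) != len(idx):
        raise ValueError(f"expected {len(idx)} values, got {len(state)}")
    unified = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for i, value in zip(idx, state):
        unified[i] = float(value)
        mask[i] = 1
    return unified, mask


def extract_action(unified_action: Sequence[float], fields: Sequence[str]) -> List[float]:
    """Pick the robot-specific quantities back out of a unified action vector."""
    return [unified_action[i] for i in robot_indices(fields)]


# Assumed example robot: two arms, each with 6 joints and 1 gripper (14 values).
FIELDS = ["left_arm_joint_pos", "left_gripper_pos", "right_arm_joint_pos", "right_gripper_pos"]
state = [0.1 * k for k in range(14)]
unified, mask = embed_state(state, FIELDS)
recovered = extract_action(unified, FIELDS)
```

The same index list is used in both directions, so a state written into the unified vector and an action read back out of it stay aligned with the robot's own ordering.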

Out-of-Scope: Due to the embodiment gap, RDT cannot yet generalize to new robot platforms (not seen in the pre-training datasets). In this case, we recommend collecting a small dataset of the target robot and then using it to fine-tune RDT. See our repository for a tutorial.

Here's an example of how to use the RDT-1B model for inference on a robot:

# Clone the repository and install its dependencies first, so that `scripts` is importable
from typing import List

import torch
from PIL import Image

from scripts.agilex_model import create_model

# Names of cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,  # Max length of one episode
    'state_dim': 14,      # Dimension of the robot's state
    'chunk_size': 64,     # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"
# Create the policy with the specified configuration
policy = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    pretrained='robotics-diffusion-transformer/rdt-1b',
    control_frequency=25,
)
# Start inference process
# Load pre-computed language embeddings
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']
images: List[Image.Image] = ...  # The images from the last 2 frames
proprio = ...  # The current robot state
# Perform inference to predict the next chunk_size actions
actions = policy.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding,
)
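
Once a chunk has been predicted, its actions are typically sent to the robot one at a time at the control frequency. The following is a minimal sketch of that step, not code from the repository: `run_chunk` and `send_action` are hypothetical names, with `send_action` standing in for your own robot interface, and the loop is a naive fixed-rate one that ignores the time spent inside `send_action`. At 25 Hz, a chunk of 64 actions nominally covers 64 / 25 = 2.56 seconds.

```python
import time
from typing import Callable, List, Sequence


def run_chunk(
    actions: Sequence[Sequence[float]],
    send_action: Callable[[Sequence[float]], None],
    control_frequency: float = 25.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Send each action in the chunk at a fixed rate.

    Returns the nominal duration of the chunk in seconds.
    """
    period = 1.0 / control_frequency
    for action in actions:
        send_action(action)  # hypothetical robot interface call
        sleep(period)
    return len(actions) * period


# Dry run with a dummy chunk and no real sleeping or robot.
sent: List[Sequence[float]] = []
dummy_chunk = [[0.0] * 14 for _ in range(64)]
duration = run_chunk(dummy_chunk, sent.append, control_frequency=25.0, sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time; on a real robot you would likely replace it with the rate-keeping mechanism of your control stack.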

Citation

BibTeX:

[More Information Needed]