---
license: mit
language:
- en
---
# RDT-1B
RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and RGB observations from three camera views, RDT predicts the next
64 robot actions. RDT is inherently compatible with almost all modern mobile manipulators: single-arm and dual-arm, joint-space and EEF control, position and velocity control, and even robots with a mobile chassis.
All the [code](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer) and pre-trained model weights are licensed under the MIT License.
Please refer to our [project page](https://rdt-robotics.github.io/rdt-robotics/) and [paper]() for more information.
## Model Details
- **Developed by:** The RDT team from Tsinghua University
- **License:** MIT
- **Language(s) (NLP):** en
- **Model Architecture:** Diffusion Transformer
- **Pretrain dataset:** A curated pre-training dataset collected from 46 datasets. Please see [here]() for details
- **Repository:** [repo_url]
- **Paper:** [paper_url]
- **Project Page:** https://rdt-robotics.github.io/rdt-robotics/
## Uses
RDT takes a language instruction, image observations, and proprioception as input, and predicts the next 64 robot actions in the form of a unified action-space vector.
The unified action-space vector covers all the main physical quantities of a robot (e.g., end-effector and joint positions and velocities, base movement, etc.) and can therefore be applied to a wide range of robotic embodiments.
The pre-trained RDT model can be fine-tuned for a specific robotic embodiment and deployed on real-world robots.
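To make the idea concrete, below is a minimal, purely illustrative sketch of packing a dual-arm joint-position state into a padded unified vector together with a mask marking which quantities the robot provides. The dimension, slot indices, and helper names are assumptions for illustration only; the actual layout is defined in the repository.

```python
import numpy as np

# Purely illustrative: the real unified action space and its index layout are
# defined in the RDT repository; the sizes and slots below are assumptions.
UNIFIED_DIM = 128                      # assumed size of the unified vector
RIGHT_ARM_JOINT_POS = slice(0, 7)      # hypothetical slot: 7 right-arm joint positions
LEFT_ARM_JOINT_POS = slice(7, 14)      # hypothetical slot: 7 left-arm joint positions

def pack_state(right_joints: np.ndarray, left_joints: np.ndarray):
    """Scatter a dual-arm joint-position state into the padded unified vector."""
    state = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)  # marks which quantities this robot provides
    state[RIGHT_ARM_JOINT_POS] = right_joints
    state[LEFT_ARM_JOINT_POS] = left_joints
    mask[RIGHT_ARM_JOINT_POS] = mask[LEFT_ARM_JOINT_POS] = True
    return state, mask
```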
Here's an example of how to use the RDT-1B model for inference on a Mobile-ALOHA robot:
```python
# Clone the repository and install its dependencies first
from typing import List

import torch
from PIL import Image

from scripts.agilex_model import create_model

# Names of the cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,   # Max length of one episode
    'state_dim': 14,       # Dimension of the robot's state
    'chunk_size': 64,      # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"
# Create the model with specified configuration
model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    control_frequency=25,
)
# Start inference process
# Load pre-computed language embeddings
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']
images: List[Image.Image] = ...  # The images from the last 2 frames
proprio = ... # The current robot state
# Perform inference to predict the next chunk_size actions
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding,
)
```
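Once a chunk of 64 actions is predicted, a typical deployment replays it at the configured control frequency and then queries the policy again. Here is a minimal sketch of such a loop; `get_observations` and `execute_action` are hypothetical stand-ins for your robot's own interface, not functions from the repository.

```python
import time

CONTROL_FREQUENCY = 25  # Hz; should match the control_frequency passed to create_model

def control_loop(model, text_embedding, get_observations, execute_action):
    """Predict a chunk of actions, execute it, then re-plan from fresh observations."""
    while True:
        images, proprio = get_observations()     # hypothetical robot interface
        actions = model.step(
            proprio=proprio,
            images=images,
            text_embeds=text_embedding,
        )
        for action in actions:                   # replay the predicted chunk
            execute_action(action)               # hypothetical robot interface
            time.sleep(1.0 / CONTROL_FREQUENCY)
```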
RDT-1B supports fine-tuning on custom datasets, deployment and inference on real robots, as well as pre-training the model.
Please refer to [our repository](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md) for guides on all of the above.
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]