RT-1: Robotics Transformer
Introduction
Robotic learning has typically relied on collecting task-specific data, mirroring traditional supervised learning. Recent advances in other domains, such as NLP and computer vision, show the benefits of large, general-purpose models pre-trained on broad datasets. This raises the question of whether a single multi-task backbone model can be trained on diverse robotic data to generalize to new tasks zero-shot. The main challenges are assembling suitably broad and varied datasets and designing high-capacity models that can still run in real time.
RT-1 (Robotics Transformer 1) is a model trained on a large, diverse dataset of real-world robotic tasks. RT-1 shows significantly improved generalization and robustness compared to prior techniques, achieving high success rates on seen training instructions and generalizing better to new tasks and environments.
Method
1. Input Processing
- Images: a history of 6 RGB images at 300×300 resolution.
- Instruction: A natural language task description (e.g., “pick up the cup”).
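To make the input concrete, here is a minimal sketch of one observation the model consumes; the dictionary keys and dtypes are illustrative, not RT-1's exact data format:

```python
import numpy as np

HISTORY_LEN = 6  # number of past frames the model sees at each step

# A 6-frame history of 300x300 RGB images plus one language instruction.
images = np.zeros((HISTORY_LEN, 300, 300, 3), dtype=np.uint8)
instruction = "pick up the cup"

observation = {"images": images, "instruction": instruction}
```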
2. Image and Instruction Tokenization
- Image Encoding: An ImageNet-pretrained EfficientNet-B3 maps each 300×300 image to a 9×9×512 feature map, which is flattened into 81 visual tokens.
- Instruction Embedding: The instruction is embedded with the Universal Sentence Encoder, and FiLM layers inserted into the EfficientNet condition the image features on this embedding.
- Output: 81 task-conditioned vision-language tokens per image.
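The conditioning mechanism is FiLM (feature-wise linear modulation): the language embedding produces a per-channel scale and shift applied to the image features. Below is a minimal NumPy sketch; the shapes, identity-style initialization, and single-layer placement are illustrative simplifications of the layers RT-1 inserts throughout the EfficientNet:

```python
import numpy as np

def film(features, instruction_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: scale and shift image features
    per channel as a function of the language embedding."""
    gamma = instruction_emb @ w_gamma + b_gamma  # (C,) per-channel scale
    beta = instruction_emb @ w_beta + b_beta     # (C,) per-channel shift
    # (1 + gamma) leaves pretrained features intact while gamma starts near 0.
    return (1.0 + gamma) * features + beta

C, D = 512, 512                     # feature channels; sentence-embedding dim
rng = np.random.default_rng(0)
feats = rng.normal(size=(9, 9, C))  # stand-in for the EfficientNet-B3 feature map
emb = rng.normal(size=(D,))         # stand-in for the Universal Sentence Encoder output
out = film(feats, emb,
           rng.normal(0, 0.01, (D, C)), np.zeros(C),
           rng.normal(0, 0.01, (D, C)), np.zeros(C))
tokens = out.reshape(81, C)         # flatten the 9x9 map into 81 visual tokens
```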
3. Token Compression with TokenLearner
- An attention-based module reduces the 81 tokens to 8 per image (sketched below).
- For 6 images → 48 tokens total.
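The sketch below shows the TokenLearner idea in miniature: learn 8 spatial attention maps over the 81 input tokens and pool each map into one output token. The single linear projection stands in for the module's small learned MLP, so this is an illustrative simplification:

```python
import numpy as np

def token_learner(tokens, w):
    """tokens: (N, C) input tokens; w: (C, S) produces one attention map
    per output token. Returns (S, C) compressed tokens."""
    logits = tokens @ w                            # (N, S) spatial attention logits
    attn = np.exp(logits - logits.max(axis=0))
    attn = attn / attn.sum(axis=0, keepdims=True)  # softmax over the N positions
    return attn.T @ tokens                         # (S, C) attention-weighted pooling

rng = np.random.default_rng(0)
per_image = token_learner(rng.normal(size=(81, 512)), rng.normal(size=(512, 8)))
print(per_image.shape)  # (8, 512): 8 tokens per image; 6 images -> 48 total
```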
4. Transformer Backbone
- Decoder-only Transformer
- 8 self-attention layers
- 19M parameters
- Outputs a sequence of action tokens (see the Code section below for how the network is constructed).
5. Action Tokenization
Each action is discretized into 256 uniform bins across 11 dimensions (see the sketch after this list):
- Arm Movement (7): x, y, z, roll, pitch, yaw, gripper opening.
- Base Movement (3): x, y, yaw.
- Mode (1): arm/base/terminate.
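A minimal sketch of this discretization, using hypothetical per-dimension limits (the robot's real calibrated ranges are not reproduced here):

```python
import numpy as np

NUM_BINS = 256

def tokenize_dim(value, low, high):
    """Map a continuous value in [low, high] to an integer bin in [0, 255]."""
    value = np.clip(value, low, high)
    return int((value - low) / (high - low) * (NUM_BINS - 1))

def detokenize_dim(token, low, high):
    """Invert the mapping, recovering a continuous value from a bin index."""
    return low + token / (NUM_BINS - 1) * (high - low)

# Example: a 7-D arm action (x, y, z, roll, pitch, yaw, gripper opening).
arm_action = [0.12, -0.30, 0.45, 0.0, 0.1, -0.2, 0.8]
bounds = [(-1.0, 1.0)] * 7  # hypothetical limits, one (low, high) per dimension
tokens = [tokenize_dim(v, lo, hi) for v, (lo, hi) in zip(arm_action, bounds)]
print(tokens)  # [142, 89, 184, 127, 140, 102, 229]
```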
6. Loss Function
The model is trained with a categorical cross-entropy loss over the action bins, with causal masking in the Transformer.
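A minimal sketch of both pieces of the objective, with illustrative shapes (11 action tokens, 256 bins); the mask shown is the standard lower-triangular causal mask applied inside self-attention:

```python
import numpy as np

def causal_mask(seq_len):
    """1 where attention is allowed (position j <= i), 0 elsewhere."""
    return np.tril(np.ones((seq_len, seq_len)))

def cross_entropy(logits, targets):
    """logits: (T, 256) scores over action bins; targets: (T,) true bin indices."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = cross_entropy(rng.normal(size=(11, 256)), rng.integers(0, 256, size=11))
print(causal_mask(4))  # lower-triangular pattern; print(loss) for the scalar loss
```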
Inference Speed Optimizations
- TokenLearner: 2.4× faster inference.
- Token Reuse: 1.7× faster by caching computed tokens across overlapping history windows (sketched after this list).
- Total Inference Time: <100ms (~3Hz real-time control).
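The token-reuse speedup follows from the sliding window: consecutive 6-frame histories share 5 frames, so their per-frame tokens can be cached and only the newest frame encoded. A minimal sketch, where encode_image is a hypothetical stand-in for the EfficientNet + TokenLearner pipeline:

```python
from collections import deque
import numpy as np

def encode_image(image):
    """Stand-in for the expensive EfficientNet-B3 + TokenLearner encoding;
    returns the 8 compressed tokens for one frame."""
    return np.random.randn(8, 512)

token_cache = deque(maxlen=6)  # rolling window of per-frame token sets

def step(new_image):
    token_cache.append(encode_image(new_image))  # encode only the newest frame
    # Reuse cached tokens for the other frames: up to 6 x 8 = 48 input tokens.
    return np.concatenate(list(token_cache), axis=0)

tokens = step(np.zeros((300, 300, 3)))  # (8, 512) at first, (48, 512) once the window fills
```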
Key Statistics
- Total Parameters: ~35M
- Input Tokens: 48
- Output: Tokenized actions
- Inference Time: <100ms
The RT-1 architecture balances efficiency, real-time performance, and task generalization, making it highly effective for robotic manipulation tasks.
RT-1 YouTube Demo
Figure 2: Google's RT-1 demo video.
Code
```python
# Excerpt of the Transformer forward pass (a method of the RT-1 network class;
# the hyperparameters here are a small test configuration, not the paper's
# 8-layer backbone). Assumes the open-source robotics_transformer module layout.
from robotics_transformer import transformer

def forward(self, return_attention_scores):
    # Build the decoder-only Transformer backbone.
    network = transformer.Transformer(
        num_layers=2,
        layer_size=512,
        num_heads=4,
        feed_forward_size=256,
        dropout_rate=0.1,
        vocab_size=self._vocab_size,
        return_attention_scores=return_attention_scores)
    # Run it over the input tokens; no explicit attention mask is passed here.
    output_tokens, attention_scores = network(self._tokens, attention_mask=None)
    return output_tokens, attention_scores
```