RT-1: Robotics Transformer
Introduction
Robotic learning has typically relied on collecting task-specific data, mirroring traditional supervised learning. Recent advances in other domains, such as NLP and computer vision, show the benefits of large, general-purpose models pre-trained on broad datasets. This raises the question of whether a single multi-task backbone model can be trained on diverse robotic data to generalize to new tasks zero-shot. The main challenges are assembling suitably broad and varied datasets and designing high-capacity models that can still run in real time.
RT-1 (Robotics Transformer 1) is a model trained on a large, diverse dataset of real-world robotic tasks. RT-1 shows significantly improved generalization and robustness compared to prior techniques, achieving high success rates on seen training instructions and generalizing better to new tasks and environments.
Method
1. Input Processing
- Images: a history of 6 RGB images at 300×300 resolution.
- Instruction: A natural language task description (e.g., “pick up the cup”).
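To make the input concrete, here is a minimal sketch of one observation the model consumes; the dictionary keys and dtypes are illustrative, not RT-1's exact data format:

```python
import numpy as np

HISTORY_LEN = 6  # number of past frames the model sees at each step

# A 6-frame history of 300x300 RGB images plus one language instruction.
images = np.zeros((HISTORY_LEN, 300, 300, 3), dtype=np.uint8)
instruction = "pick up the cup"

observation = {"images": images, "instruction": instruction}
```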
2. Image and Instruction Tokenization
- Image Encoding: An ImageNet-pretrained EfficientNet-B3 maps each 300×300 image to a 9×9×512 feature map, which is flattened into 81 visual tokens.
- Instruction Embedding: The instruction is embedded with the Universal Sentence Encoder, and FiLM layers inserted into the EfficientNet condition the image features on this embedding.
- Output: 81 task-conditioned vision-language tokens per image.
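The conditioning mechanism is FiLM (feature-wise linear modulation): the language embedding produces a per-channel scale and shift applied to the image features. Below is a minimal NumPy sketch; the shapes, identity-style initialization, and single-layer placement are illustrative simplifications of the layers RT-1 inserts throughout the EfficientNet:

```python
import numpy as np

def film(features, instruction_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: scale and shift image features
    per channel as a function of the language embedding."""
    gamma = instruction_emb @ w_gamma + b_gamma  # (C,) per-channel scale
    beta = instruction_emb @ w_beta + b_beta     # (C,) per-channel shift
    # (1 + gamma) leaves pretrained features intact while gamma starts near 0.
    return (1.0 + gamma) * features + beta

C, D = 512, 512                     # feature channels; sentence-embedding dim
rng = np.random.default_rng(0)
feats = rng.normal(size=(9, 9, C))  # stand-in for the EfficientNet-B3 feature map
emb = rng.normal(size=(D,))         # stand-in for the Universal Sentence Encoder output
out = film(feats, emb,
           rng.normal(0, 0.01, (D, C)), np.zeros(C),
           rng.normal(0, 0.01, (D, C)), np.zeros(C))
tokens = out.reshape(81, C)         # flatten the 9x9 map into 81 visual tokens
```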
3. Token Compression with TokenLearner
- An attention-based module reduces the 81 tokens to 8 per image (sketched below).
- For 6 images → 48 tokens total.
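The sketch below shows the TokenLearner idea in miniature: learn 8 spatial attention maps over the 81 input tokens and pool each map into one output token. The single linear projection stands in for the module's small learned MLP, so this is an illustrative simplification:

```python
import numpy as np

def token_learner(tokens, w):
    """tokens: (N, C) input tokens; w: (C, S) produces one attention map
    per output token. Returns (S, C) compressed tokens."""
    logits = tokens @ w                            # (N, S) spatial attention logits
    attn = np.exp(logits - logits.max(axis=0))
    attn = attn / attn.sum(axis=0, keepdims=True)  # softmax over the N positions
    return attn.T @ tokens                         # (S, C) attention-weighted pooling

rng = np.random.default_rng(0)
per_image = token_learner(rng.normal(size=(81, 512)), rng.normal(size=(512, 8)))
print(per_image.shape)  # (8, 512): 8 tokens per image; 6 images -> 48 total
```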
4. Transformer Backbone
- Decoder-only Transformer
- 8 self-attention layers
- 19M parameters
- Outputs a sequence of action tokens (see the Code section below for how the network is constructed).
5. Action Tokenization
Each action is discretized into 256 uniform bins across 11 dimensions (see the sketch after this list):
- Arm Movement (7): x, y, z, roll, pitch, yaw, gripper opening.
- Base Movement (3): x, y, yaw.
- Mode (1): arm/base/terminate.
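A minimal sketch of this discretization, using hypothetical per-dimension limits (the robot's real calibrated ranges are not reproduced here):

```python
import numpy as np

NUM_BINS = 256

def tokenize_dim(value, low, high):
    """Map a continuous value in [low, high] to an integer bin in [0, 255]."""
    value = np.clip(value, low, high)
    return int((value - low) / (high - low) * (NUM_BINS - 1))

def detokenize_dim(token, low, high):
    """Invert the mapping, recovering a continuous value from a bin index."""
    return low + token / (NUM_BINS - 1) * (high - low)

# Example: a 7-D arm action (x, y, z, roll, pitch, yaw, gripper opening).
arm_action = [0.12, -0.30, 0.45, 0.0, 0.1, -0.2, 0.8]
bounds = [(-1.0, 1.0)] * 7  # hypothetical limits, one (low, high) per dimension
tokens = [tokenize_dim(v, lo, hi) for v, (lo, hi) in zip(arm_action, bounds)]
print(tokens)  # [142, 89, 184, 127, 140, 102, 229]
```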
6. Loss Function
The model is trained with a categorical cross-entropy loss over the action bins, with causal masking in the Transformer.
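A minimal sketch of both pieces of the objective, with illustrative shapes (11 action tokens, 256 bins); the mask shown is the standard lower-triangular causal mask applied inside self-attention:

```python
import numpy as np

def causal_mask(seq_len):
    """1 where attention is allowed (position j <= i), 0 elsewhere."""
    return np.tril(np.ones((seq_len, seq_len)))

def cross_entropy(logits, targets):
    """logits: (T, 256) scores over action bins; targets: (T,) true bin indices."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = cross_entropy(rng.normal(size=(11, 256)), rng.integers(0, 256, size=11))
print(causal_mask(4))  # lower-triangular pattern; print(loss) for the scalar loss
```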
Inference Speed Optimizations
- TokenLearner: 2.4× faster inference.
- Token Reuse: 1.7× faster by caching computed tokens across overlapping history windows (sketched after this list).
- Total Inference Time: <100ms (~3Hz real-time control).
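The token-reuse speedup follows from the sliding window: consecutive 6-frame histories share 5 frames, so their per-frame tokens can be cached and only the newest frame encoded. A minimal sketch, where encode_image is a hypothetical stand-in for the EfficientNet + TokenLearner pipeline:

```python
from collections import deque
import numpy as np

def encode_image(image):
    """Stand-in for the expensive EfficientNet-B3 + TokenLearner encoding;
    returns the 8 compressed tokens for one frame."""
    return np.random.randn(8, 512)

token_cache = deque(maxlen=6)  # rolling window of per-frame token sets

def step(new_image):
    token_cache.append(encode_image(new_image))  # encode only the newest frame
    # Reuse cached tokens for the other frames: up to 6 x 8 = 48 input tokens.
    return np.concatenate(list(token_cache), axis=0)

tokens = step(np.zeros((300, 300, 3)))  # (8, 512) at first, (48, 512) once the window fills
```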
Key Statistics
- Total Parameters: ~35M
- Input Tokens: 48
- Output: Tokenized actions
- Inference Time: <100ms
The RT-1 architecture balances efficiency, real-time performance, and task generalization, making it highly effective for robotic manipulation tasks.
RT-1 YouTube Demo
Figure 2: Google's RT-1 demo video.
Code
```python
# Excerpt of the Transformer forward pass (a method of the RT-1 network class;
# the hyperparameters here are a small test configuration, not the paper's
# 8-layer backbone). Assumes the open-source robotics_transformer module layout.
from robotics_transformer import transformer

def forward(self, return_attention_scores):
    # Build the decoder-only Transformer backbone.
    network = transformer.Transformer(
        num_layers=2,
        layer_size=512,
        num_heads=4,
        feed_forward_size=256,
        dropout_rate=0.1,
        vocab_size=self._vocab_size,
        return_attention_scores=return_attention_scores)
    # Run it over the input tokens; no explicit attention mask is passed here.
    output_tokens, attention_scores = network(self._tokens, attention_mask=None)
    return output_tokens, attention_scores
```