2025-04-24
Deep Learning

RT-1: Robotics Transformer

RT-1 Teaser

Introduction

Robotic learning often involves collecting task-specific data, similar to traditional supervised learning. Recent advances in other domains show the benefits of large, general models pre-trained on broad datasets. This raises the question of whether a single, multi-task backbone model can be trained on diverse robotic tasks for zero-shot generalization. Challenges include assembling broad, connected datasets and designing high-capacity, real-time models.

RT-1 (Robotics Transformer 1) is a model trained on a large, diverse dataset of real-world robotic tasks. RT-1 shows significantly improved generalization and robustness compared to prior techniques, achieving high success rates on training instructions and better generalization to new tasks and environments.

Method

1. Input Processing

  • Images: a history of the 6 most recent camera frames at 300×300 resolution.
  • Instruction: A natural language task description (e.g., “pick up the cup”).

RT-1 Demo

2. Image and Instruction Tokenization

  • Image Encoding: an ImageNet-pretrained EfficientNet-B3 maps each image to a 9×9×512 feature map, flattened into 81 visual tokens.
  • Instruction Embedding: the instruction is embedded with the Universal Sentence Encoder and injected into the image encoder via FiLM layers.
  • Output: 81 task-conditioned vision-language tokens per image.
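The FiLM conditioning above can be sketched in a few lines of numpy. This is a toy illustration, not the RT-1 implementation: the projection weights are random stand-ins, and in RT-1 the per-channel scale/shift parameters are produced inside the EfficientNet blocks rather than applied once at the output.

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM: feature-wise linear modulation. Each channel of the
    visual feature map is scaled and shifted by language-derived
    parameters (gamma, beta)."""
    # features: (H, W, C); gamma, beta: (C,) broadcast over H and W.
    return gamma * features + beta

rng = np.random.default_rng(0)
feat = rng.normal(size=(9, 9, 512))    # EfficientNet-B3 output map
text_emb = rng.normal(size=(512,))     # sentence embedding (stand-in)

# Hypothetical linear projections from the text embedding to
# per-channel FiLM parameters.
W_g = rng.normal(size=(512, 512)) * 0.01
W_b = rng.normal(size=(512, 512)) * 0.01
gamma, beta = text_emb @ W_g, text_emb @ W_b

conditioned = film(feat, gamma, beta)  # (9, 9, 512)
tokens = conditioned.reshape(81, 512)  # flatten to 81 visual tokens
```

The key property is that the same image yields different token activations depending on the instruction, which is what makes the tokens task-conditioned.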

3. Token Compression with TokenLearner

An attention-based module reduces the 81 tokens to 8 learned tokens per image, giving 48 tokens total for the 6-image history.
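A minimal numpy sketch of the TokenLearner idea: learn K spatial attention maps over the 81 token positions, then produce each output token as the attention-weighted average of the inputs. The single-matrix attention head here is a simplification of the actual module, and the weights are random stand-ins.

```python
import numpy as np

def token_learner(tokens, w):
    """Compress N spatial tokens to K learned tokens via attention.
    tokens: (N, C) = (81, 512); w: (C, K) with K = 8."""
    logits = tokens @ w                    # (81, 8): one map per output token
    attn = np.exp(logits - logits.max(axis=0))
    attn = attn / attn.sum(axis=0)         # softmax over the 81 positions
    return attn.T @ tokens                 # (8, 512) weighted pooling

rng = np.random.default_rng(0)
per_image = token_learner(rng.normal(size=(81, 512)),
                          rng.normal(size=(512, 8)) * 0.01)

# Stack the 6 history frames: 6 x 8 = 48 tokens for the Transformer.
history = np.concatenate([per_image] * 6)
```

Shrinking the sequence from 486 to 48 tokens is what makes the quadratic-cost self-attention cheap enough for real-time control.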

4. Transformer Backbone

  • Decoder-only Transformer
  • 8 self-attention layers
  • 19M parameters
  • Outputs sequence of action tokens.

5. Action Tokenization

Each of the 11 action dimensions is discretized into 256 uniform bins:

  • Arm Movement (7): x, y, z, roll, pitch, yaw, gripper.
  • Base Movement (3): x, y, yaw.
  • Mode (1): arm/base/terminate.
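The discretization above amounts to mapping each continuous action dimension into one of 256 uniform bins. A small sketch (the per-dimension ranges shown are hypothetical, not RT-1's actual limits):

```python
import numpy as np

def discretize(value, low, high, bins=256):
    """Map a continuous action value in [low, high] to a bin index
    in [0, bins - 1]."""
    value = np.clip(value, low, high)
    return int((value - low) / (high - low) * (bins - 1))

# Example: an x-translation in a hypothetical [-1, 1] range.
x_token = discretize(0.0, -1.0, 1.0)   # midpoint -> bin 127
lo_token = discretize(-1.0, -1.0, 1.0) # -> bin 0
hi_token = discretize(1.0, -1.0, 1.0)  # -> bin 255
```

Turning the action space into 11 categorical variables lets the model predict actions with a softmax over a 256-way vocabulary, exactly like language-model token prediction.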

6. Loss Function

Categorical cross-entropy with causal masking.
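Because actions are discretized into bins, the training objective is the standard per-token cross-entropy; the causal masking lives inside the Transformer's attention, not in the loss itself. A minimal numpy sketch:

```python
import numpy as np

def action_ce_loss(logits, targets):
    """Mean categorical cross-entropy over action tokens.
    logits: (T, V) scores over V bins; targets: (T,) bin indices."""
    logits = logits - logits.max(axis=1, keepdims=True)   # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
# 11 action dimensions, each predicted over a 256-bin vocabulary.
loss = action_ce_loss(rng.normal(size=(11, 256)),
                      rng.integers(0, 256, size=11))
```

With random logits the loss sits near log(256) ≈ 5.5, which is also a useful sanity check at the start of training.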

Inference Speed Optimizations

  • TokenLearner: 2.4× faster inference.
  • Token Reuse: 1.7× faster via overlapping windows.
  • Total Inference Time: <100ms (~3Hz real-time control).
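The token-reuse optimization exploits the sliding window: between consecutive control steps, 5 of the 6 history frames are unchanged, so only the newest frame needs to be re-encoded. A sketch with a hypothetical frame-id cache (not the actual RT-1 mechanism, which caches inside the model):

```python
import numpy as np

def tokenize_history(frame_ids, cache, encode):
    """Encode a 6-frame window, reusing cached tokens for frames
    already seen in a previous step."""
    out = []
    for fid in frame_ids:
        if fid not in cache:
            cache[fid] = encode(fid)   # expensive EfficientNet pass
        out.append(cache[fid])
    return np.concatenate(out)         # (48, 512)

calls = []
encode = lambda fid: (calls.append(fid), np.zeros((8, 512)))[1]
cache = {}
tokenize_history([0, 1, 2, 3, 4, 5], cache, encode)  # 6 encoder calls
tokenize_history([1, 2, 3, 4, 5, 6], cache, encode)  # only 1 new call
```

Amortizing the image encoder this way is where the reported 1.7× speedup comes from.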

Key Statistics

  • Total Parameters: ~35M
  • Input Tokens: 48
  • Output: Tokenized actions
  • Inference Time: <100ms

The RT-1 architecture balances efficiency, real-time performance, and task generalization, making it highly effective for robotic manipulation tasks.

RT-1 YouTube Demo

Figure 2: RT-1 demo video from Google.

Code

# Simplified from the robotics_transformer reference code: build the
# decoder-only Transformer and run it over the vision-language tokens.
def forward(self, return_attention_scores):
    network = transformer.Transformer(
        num_layers=2,
        layer_size=512,
        num_heads=4,
        feed_forward_size=256,
        dropout_rate=0.1,
        vocab_size=self._vocab_size,
        return_attention_scores=return_attention_scores)

    # self._tokens holds the compressed vision-language tokens; causal
    # masking is handled inside the Transformer itself.
    output_tokens, attention_scores = network(self._tokens, attention_mask=None)
    return output_tokens, attention_scores

If you are looking for a structured Data/AI learning path from scratch, join GeekieSeoul. We are always ready to accompany you on your journey to explore and master artificial intelligence, whether you are a complete beginner or shaping a career in this field.