FAST: A New Tokenizer for Efficient and Dexterous Robotic Control

Models & Research

The Engineer

17 Jan 2025

FAST breaks down robotic actions into manageable tokens, streamlining training for Transformers and paving the way for more agile and responsive robot control in real-world scenarios.

In a recent paper, researchers from Physical Intelligence introduced FAST (Frequency-space Action Sequence Tokenization), a tokenizer designed to enable efficient training of Transformers for robotic control tasks. This work addresses the challenge of representing robot actions in a way that is both computationally efficient and capable of handling complex, high-frequency movements.

The Challenge of Robotic Action Tokenization

Most foundation models today use the Transformer architecture, which processes data as sequences of discrete tokens. These tokens can represent anything from text to images, but for robotic control, they need to capture the intricate actions a robot performs. Existing vision-language-action (VLA) models typically use simple discrete binning to represent these actions, where each dimension of an action step is assigned to a discrete bin. While this approach works for basic tasks, it falls short on complex, dexterous skills that require high precision and a high control frequency. The paper attributes this to redundancy: at high frequencies, consecutive action tokens are strongly correlated, so predicting the next token teaches the model very little.
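To make the binning scheme concrete, here is a minimal Python sketch of per-dimension binning. The action range of [-1, 1], the 256 bins, and the 7-dimensional arm controlled at 50 Hz are illustrative assumptions, not values taken from any specific VLA.

```python
def bin_action(action, low=-1.0, high=1.0, num_bins=256):
    """Map each dimension of one action step to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        # Scale to [0, 1], then to a bin index in [0, num_bins - 1].
        frac = (clipped - low) / (high - low)
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_per_chunk(action_dim, control_hz, seconds=1.0):
    """Token count for one chunk: one token per dimension per time step."""
    return int(action_dim * control_hz * seconds)


# Hypothetical example: one action step, and a 1-second chunk at 50 Hz.
step_tokens = bin_action([-1.0, 0.0, 1.0])
chunk_token_count = tokens_per_chunk(action_dim=7, control_hz=50)
```

Under these assumptions the token count grows linearly with the control frequency, and neighboring time steps of a smooth motion tend to land in the same or adjacent bins, which is the redundancy a compression-based tokenizer aims to remove.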

Introducing FAST

FAST addresses these limitations by providing a more efficient and effective way to tokenize robot actions. Here’s how it works:

Continuous Compression: Inspired by methods used in image compression (like JPEG), FAST applies the discrete cosine transform (DCT) to continuous action sequences, quantizes the resulting coefficients, and compresses them into discrete tokens with byte-pair encoding. This approach reduces the computational overhead while preserving the essential information needed for dexterous control.
Action Chunks: FAST represents actions as "action chunks," which are short sequences of robot movements. These chunks can range from 3-5 actions for simpler tasks to 20-50 actions for more complex, high-frequency maneuvers.
Self-Supervised Learning: The tokenizer is fit in a self-supervised manner, meaning it learns how to compress action sequences from the raw sequences themselves, without needing labeled data. This makes it scalable and adaptable to various robotic systems.
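The compression idea in the list above can be sketched in a few lines of Python. This is a minimal illustration of transform-then-quantize on a single action dimension, not the authors' implementation: the function names, the `scale` value, and the made-up sine trajectory are all assumptions, and the later token-compression stage is omitted.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(norm * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(
            math.sqrt(2.0 / n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        signal.append(total)
    return signal


def compress(chunk, scale=10.0):
    """Scale and round DCT coefficients; small ones round to zero."""
    return [round(c * scale) for c in dct(chunk)]


def decompress(quantized, scale=10.0):
    """Undo the scaling and invert the transform."""
    return idct([q / scale for q in quantized])


# One action dimension over a 50-step chunk: a smooth, made-up trajectory.
chunk = [math.sin(2 * math.pi * t / 50) for t in range(50)]
quantized = compress(chunk)
restored = decompress(quantized)
nonzero = sum(1 for q in quantized if q != 0)
max_error = max(abs(a - b) for a, b in zip(chunk, restored))
```

For a smooth trajectory like this one, most quantized coefficients are zero, so the chunk is described by a short list of integers instead of 50 separate values. A larger `scale` keeps more coefficients and lowers the reconstruction error; a smaller one compresses harder.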

Key Benefits

Efficiency: FAST reduces the computational cost of training Transformers for robotic control. The paper reports training up to 5x faster than the diffusion-based π0 model, which makes training on large datasets more practical.
Dexterity: Unlike simple binning methods, FAST retains the precision needed for complex tasks, ensuring that robots can perform dexterous movements effectively.
Generalization: The model generalizes well to new environments and tasks, which is crucial for real-world applications.

Experimental Results

The researchers evaluated FAST on a variety of robotic systems, including single-arm and bimanual setups:

UR5e
Bimanual UR5e
Franka
Bimanual Trossen
Bimanual ARX
Mobile Trossen
Mobile Fibocom

The tasks included a range of dexterous activities such as folding shirts, bussing tables, folding laundry, bagging groceries, and removing toast from a toaster. FAST outperformed existing methods in terms of both efficiency and task performance.

Generalization to New Environments

One of the key strengths of FAST is its ability to generalize to new environments. The researchers tested the model in several different settings, including:

Berkeley Kitchen
Berkeley Counter
Stanford Lab
UW Lab

In all these settings, FAST demonstrated robust performance, indicating its potential for real-world deployment.

Architecture and Implementation Details

Transformer-based VLA Model: The π0-FAST model is an autoregressive Transformer that processes input from cameras and language instructions to predict the next action token.
Action Tokens: A sequence of action tokens encodes one compressed chunk of robot actions, which the tokenizer decompresses for execution.
Training: The model is trained with next-token prediction on large datasets of robotic demonstrations.
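As a rough illustration of the autoregressive loop described above, the Python sketch below generates action tokens one at a time until an end token appears. The `next_token` callable, the end-token convention, and the toy stand-in model are hypothetical and are not part of π0-FAST; a real policy would condition on camera images and a language instruction, and the generated tokens would then be passed to the tokenizer's decoder.

```python
def generate_action_tokens(next_token, prompt_tokens, end_token, max_new_tokens=64):
    """Greedy autoregressive loop: feed each predicted token back as input."""
    sequence = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == end_token:
            break
        generated.append(token)
        sequence.append(token)
    return generated


def toy_next_token(sequence):
    # Hypothetical stand-in for the model: emits a few tokens, then end token 0.
    return 0 if len(sequence) >= 5 else len(sequence) + 100


tokens = generate_action_tokens(toy_next_token, [1, 2], end_token=0)
```

The `max_new_tokens` cap is a defensive choice for the sketch: it bounds the loop even if the stand-in model never emits the end token.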

Conclusion

FAST marks a significant step forward in the field of robotic control. By providing an efficient and effective way to tokenize robot actions, it enables Transformers to handle complex, dexterous tasks with high precision and speed. This opens up new possibilities for training generalist policies that can perform a wide range of tasks in various environments.