
FAST breaks down robotic actions into manageable tokens, streamlining training for Transformers and paving the way for more agile and responsive robot control in real-world scenarios.
In a recent paper, researchers from Physical Intelligence introduced FAST (Frequency-space Action Sequence Tokenization), a tokenizer designed to enable efficient training of Transformers for robotic control tasks. This work addresses the challenge of representing robot actions in a way that is both computationally efficient and capable of handling complex, high-frequency movements.
Most foundation models today use the Transformer architecture, which processes data as sequences of discrete tokens. These tokens can represent anything from text to images, but for robotic control, they need to capture the intricate actions a robot performs. Existing vision-language-action (VLA) models typically use simple discrete binning to represent these actions, where each dimension of an action step is assigned to a discrete bin. While this approach works for basic tasks, it falls short when dealing with more complex and dexterous skills that require high precision and frequency.
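To make the binning baseline concrete, here is a minimal sketch of per-dimension discrete binning. The function names, the 256-bin resolution, the [-1, 1] action range, and the 50 Hz, 7-dimensional example chunk are all illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def bin_tokenize(actions, n_bins=256, low=-1.0, high=1.0):
    """Map every action value to an integer bin index in [0, n_bins - 1]."""
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)  # rescale to [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

def bin_detokenize(tokens, n_bins=256, low=-1.0, high=1.0):
    """Map bin indices back to the value at the center of each bin."""
    return low + (tokens + 0.5) * (high - low) / n_bins

# Hypothetical action chunk: 50 time steps (1 s at 50 Hz) of a 7-dimensional action.
t = np.linspace(0.0, 1.0, 50)[:, None]
chunk = 0.8 * np.sin(2 * np.pi * t + np.arange(7))  # shape (50, 7)

tokens = bin_tokenize(chunk)
recovered = bin_detokenize(tokens)
print("tokens per chunk:", tokens.size)
```

Under these assumptions every time step costs one token per action dimension, so the sequence length grows in direct proportion to the control frequency. That is one way to see why the scheme becomes unwieldy for high-frequency control.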
FAST addresses these limitations by providing a more efficient and effective way to tokenize robot actions.
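The article does not spell out the individual steps. As published, FAST builds on the discrete cosine transform (DCT) followed by byte-pair encoding. The sketch below illustrates only the frequency-space compression idea on a hypothetical action chunk. The helper names, the `scale` value, and the example signal are assumptions for illustration, the byte-pair encoding stage is omitted, and this is not the released tokenizer.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (its inverse is its transpose)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

def compress(chunk, scale=10.0):
    """Apply the DCT along the time axis of each action dimension, then round."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk
    return np.round(coeffs * scale).astype(np.int64)

def decompress(q, scale=10.0):
    """Undo the scaling and invert the DCT to get an approximate action chunk."""
    return dct_matrix(q.shape[0]).T @ (q / scale)

# Hypothetical smooth action chunk: 50 time steps of a 7-dimensional action.
t = np.linspace(0.0, 1.0, 50)[:, None]
chunk = 0.8 * np.sin(2 * np.pi * t + np.arange(7))

q = compress(chunk)
nonzero = int(np.count_nonzero(q))
error = float(np.max(np.abs(decompress(q) - chunk)))
print(nonzero, "nonzero coefficients out of", q.size, "| max abs error:", error)
```

For a smooth trajectory like this one, most high-frequency coefficients round to zero, so far fewer integers carry information than the number of raw action values. The `scale` parameter trades reconstruction accuracy against the number of surviving coefficients.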
The researchers evaluated FAST on a variety of robotic systems, including single-arm and bimanual setups.

The tasks included a range of dexterous activities such as folding shirts, bussing tables, folding laundry, bagging groceries, and removing toast. FAST outperformed existing methods in terms of both efficiency and task performance.
One of the key strengths of FAST is its ability to generalize to new environments. The researchers tested the model in several different kitchen setups.
In all these settings, FAST demonstrated robust performance, indicating its potential for real-world deployment.
FAST marks a significant step forward in the field of robotic control. By providing an efficient and effective way to tokenize robot actions, it enables Transformers to handle complex, dexterous tasks with high precision and speed. This opens up new possibilities for training generalist policies that can perform a wide range of tasks in various environments.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
17 January 2025