
Researchers trained a large Llama model using PyTorch’s fault-tolerance tools while injecting 2000 simulated node failures, roughly one every 15 seconds, without relying on checkpoints, showcasing resilience in unreliable environments.
We recently pushed the boundaries of fault-tolerant training by running a large-scale Llama model under extreme conditions. Using PyTorch’s torchft and torchtitan, we demonstrated that it's possible to train models in highly unreliable environments without relying on checkpoints. This is particularly relevant for real-world deployments where node failures are common.
The key technical advancements here are:
For practitioners, this means:
The training job is structured as follows:

Fault Tolerant HSDP:
LocalSGD/DiLoCo:
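As a rough illustration of the LocalSGD idea named above, here is a toy sketch in plain Python. It is not torchft or torchtitan code: the function `local_sgd`, the one-parameter quadratic objective, and every hyperparameter are assumptions made up for this example. Each simulated worker takes several SGD steps on its own data shard without communicating, and only then are the workers' parameters averaged. DiLoCo follows the same pattern of infrequent synchronisation but applies an outer optimizer to the averaged change instead of averaging directly; that part is not shown.

```python
import random


def local_sgd(num_workers=4, rounds=20, local_steps=5, lr=0.1, seed=0):
    """Toy LocalSGD on a 1-D problem (hypothetical example, not a library API).

    Worker i minimises (w - target_i)^2; the global optimum of the summed
    objective is the mean of the targets.
    """
    rng = random.Random(seed)
    # Each worker's data shard is modelled as a single target value.
    targets = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    global_w = 5.0  # shared starting parameter

    for _ in range(rounds):
        local_ws = []
        for t in targets:
            w = global_w  # every worker starts the round from the shared value
            for _ in range(local_steps):
                grad = 2.0 * (w - t)  # gradient of (w - t)^2
                w -= lr * grad        # local step, no communication
            local_ws.append(w)
        # Synchronise once per round: average the workers' parameters.
        global_w = sum(local_ws) / len(local_ws)

    optimum = sum(targets) / len(targets)
    return global_w, optimum


if __name__ == "__main__":
    w, opt = local_sgd()
    print(f"final parameter {w:.6f}, optimum {opt:.6f}")
```

The point of the pattern is that workers communicate once per round rather than once per step, which is what makes it attractive when workers come and go.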
We ran the training job on Crusoe L40S GPUs with the following setup:
The training loss graph shows that the model maintained its performance despite frequent worker recoveries. Each small spike marks a worker that is recovering and not yet participating in training; it distorts the reported metrics but does not affect the overall model accuracy.
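One way to build intuition for why a recovering worker need not hurt the model is the toy simulation below. It assumes, purely for illustration, that a failed worker is simply left out of that step's gradient average and rejoins on a later step; it does not reproduce how torchft actually handles recovery, and `train_with_failures` and all its parameters are hypothetical names.

```python
import random


def train_with_failures(num_workers=8, steps=300, fail_prob=0.3, lr=0.1, seed=0):
    """Toy data-parallel SGD where failed workers skip a step (illustrative only)."""
    rng = random.Random(seed)
    target = 3.0   # minimiser of the toy loss (w - target)^2
    w = 0.0        # shared model parameter
    skipped = 0    # how many worker-steps were lost to simulated failures

    for _ in range(steps):
        grads = []
        for _ in range(num_workers):
            if rng.random() < fail_prob:
                skipped += 1   # this worker is down and does not participate
                continue
            noise = rng.gauss(0.0, 0.1)  # stand-in for minibatch noise
            grads.append(2.0 * (w - target) + noise)
        if not grads:
            continue  # every worker failed this step; keep the current weights
        # Average only over the workers that participated.
        w -= lr * sum(grads) / len(grads)

    return w, skipped


if __name__ == "__main__":
    w, skipped = train_with_failures()
    print(f"final w = {w:.3f} after {skipped} skipped worker-steps")
```

In this sketch the surviving workers still supply a gradient on every step, so the parameter keeps moving toward the minimiser even though a sizeable share of worker-steps is lost.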
This experiment demonstrates the robustness of torchft and torchtitan in highly unreliable environments. For practitioners running training jobs where node failures are common, this setup offers a way to keep training going without relying on checkpoints for recovery.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
27 June 2025