
This experiment shows how parallel compute changes automated AI research: scaling model width outperformed any single hyperparameter adjustment, and the agent learned to make efficient use of heterogeneous hardware.
In a recent experiment, we used Claude Code and the autoresearch framework to test how much parallelism speeds up automated AI research. With access to 16 GPUs on a Kubernetes cluster, the agent ran approximately 910 experiments in 8 hours. Scaling model width proved more impactful than any single hyperparameter adjustment. The agent also learned to make better use of heterogeneous hardware, driving the val_bpb metric from 1.003 down to 0.974, a 2.87% improvement over the baseline.
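To put those figures in perspective, the short calculation below derives the implied throughput. It is a back-of-the-envelope sketch only: it assumes each experiment occupied exactly one GPU and that no GPU sat idle, neither of which the experiment description confirms, and it treats the approximate count of 910 as exact.

```python
# Figures quoted in the article (910 is approximate).
experiments = 910
hours = 8
gpus = 16

gpu_hours = gpus * hours                    # total GPU time available
per_hour = experiments / hours              # cluster-wide experiments per hour
per_gpu_hour = experiments / gpu_hours      # experiments per GPU per hour
# Average GPU-minutes per experiment, ASSUMING one GPU per run and no idle time.
minutes_per_experiment = 60 / per_gpu_hour

print(f"{gpu_hours} GPU-hours in total")
print(f"{per_hour:.2f} experiments per hour across the cluster")
print(f"{per_gpu_hour:.2f} experiments per GPU-hour")
print(f"{minutes_per_experiment:.2f} GPU-minutes per experiment on average")
```

Under those assumptions the budget works out to 128 GPU-hours, or a little under eight and a half GPU-minutes per experiment.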
Autoresearch is an automated framework designed to streamline AI model development by systematically exploring different configurations and hyperparameters. It operates in a loop: propose, run, evaluate, and refine. The goal is to find the best-performing model configuration with minimal human intervention.
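The propose, run, evaluate, refine loop can be sketched in a few lines. This is a minimal illustration, not the autoresearch implementation: `evaluate`, `propose`, and `research_loop` are hypothetical names, and the objective is a made-up function standing in for a real training run.

```python
import random

def evaluate(config):
    """Hypothetical stand-in for a training run; returns a score (lower is better)."""
    # Toy objective with its minimum at width=512, lr=0.01.
    return (config["width"] - 512) ** 2 / 1e5 + (config["lr"] - 0.01) ** 2 * 1e4

def propose(best, rng):
    """Propose a small perturbation of the best configuration found so far."""
    return {
        "width": max(64, best["width"] + rng.choice([-64, 0, 64])),
        "lr": max(1e-4, best["lr"] * rng.choice([0.5, 1.0, 2.0])),
    }

def research_loop(start, steps, seed=0):
    """Run the loop sequentially, one experiment at a time."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    for _ in range(steps):
        candidate = propose(best, rng)      # propose
        score = evaluate(candidate)         # run and evaluate
        if score < best_score:              # refine: keep only improvements
            best, best_score = candidate, score
    return best, best_score

start = {"width": 256, "lr": 0.04}
best, best_score = research_loop(start, steps=200)
print(best, best_score)
```

Because a candidate is kept only when it beats the current best, the final score can never be worse than the starting one; how close it gets to the optimum depends on the proposal rule and the step budget.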
On a single GPU, each experiment must finish before the next one can start, which forces a sequential, greedy hill-climbing approach. This is slow, and because only one change is tested at a time, it often misses interaction effects between parameters.
By providing the autoresearch agent with access to 16 GPUs, we transformed its strategy. Instead of running experiments sequentially, it could now execute multiple experiments in parallel. This change allowed for more efficient exploration of the parameter space.
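The difference between the two strategies can be shown on a toy problem. In the sketch below, `toy_val_bpb` is an invented objective (not the real metric) in which a wider model only helps if the learning rate is lowered at the same time, and a thread pool stands in for the 16 GPU jobs. All names and numbers are assumptions chosen for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

WIDTHS = [256, 512, 1024]
LRS = [0.04, 0.02, 0.01]

def toy_val_bpb(config):
    """Made-up objective (lower is better): width helps only if lr shrinks with it."""
    width, lr = config
    ideal_lr = 0.04 * 256 / width
    return 1.0 - 0.01 * math.log2(width / 256) + 0.05 * abs(math.log2(lr / ideal_lr))

def sequential_greedy(start):
    """Change one parameter at a time, keeping a change only if it improves the score."""
    best, best_score = start, toy_val_bpb(start)
    for width in WIDTHS:
        candidate = (width, best[1])
        if toy_val_bpb(candidate) < best_score:
            best, best_score = candidate, toy_val_bpb(candidate)
    for lr in LRS:
        candidate = (best[0], lr)
        if toy_val_bpb(candidate) < best_score:
            best, best_score = candidate, toy_val_bpb(candidate)
    return best, best_score

def parallel_grid(max_workers=16):
    """Evaluate every (width, lr) pair concurrently and return the best pair."""
    grid = list(product(WIDTHS, LRS))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(toy_val_bpb, grid))
    return min(zip(grid, scores), key=lambda pair: pair[1])

greedy_best, greedy_score = sequential_greedy((256, 0.04))
grid_best, grid_score = parallel_grid()
print("greedy:", greedy_best, round(greedy_score, 3))
print("grid:  ", grid_best, round(grid_score, 3))
```

On this toy objective the one-at-a-time search never leaves its starting point, because every single change looks worse in isolation, while the joint sweep finds the wide, low-learning-rate configuration. That is the kind of interaction effect a parallel batch can surface.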
The initial phase focused on exploring different hyperparameters. The agent tested various learning rates, batch sizes, and regularization techniques to identify the most promising configurations.
Next, the agent turned to model architecture. It experimented with different layer types, depths, and widths to find the best structure for the task.

With a promising base model identified, the agent fine-tuned it by adjusting specific parameters to further improve performance.
The agent then focused on optimizing the training process by experimenting with different optimizers and their settings.
As the number of experiments increased, the marginal gains started to diminish. However, the agent continued to refine its configurations for incremental improvements.
The best configuration achieved a val_bpb score of 0.974, down from the 1.003 baseline, a 2.87% improvement.
Parallel execution fundamentally altered how the agent approached experimentation: rather than climbing one experiment at a time, it could test many configurations at once.
The agent's ability to adapt to heterogeneous hardware led to an emergent strategy.
The cost of running 910 experiments over 8 hours using 16 GPUs was a key consideration. While the exact
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
20 March 2026