Scaling Autoresearch with 16 GPUs: A Deep Dive into Parallel Experimentation

The Engineer

20 Mar 2026 · 3 min read

This experiment shows how parallel compute changes automated AI research: given 16 GPUs, the agent found that scaling model width mattered more than any single hyperparameter change, and it learned to divide work across heterogeneous hardware.

In a recent experiment, we used Claude Code and the autoresearch framework to explore how parallelism changes automated AI research. With access to 16 GPUs on a Kubernetes cluster, the agent ran approximately 910 experiments in about 8 hours. This exploration revealed that scaling model width was more impactful than any single hyperparameter adjustment. The agent also learned to make use of heterogeneous hardware, driving the val_bpb metric from 1.003 down to 0.974, a 2.87% improvement over the baseline.
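As a quick sanity check on the scale, the headline figures can be turned into rough throughput numbers. The sketch below uses only the approximate values quoted above and assumes all 16 GPUs were available for the whole run, so the results are averages under that assumption, not measurements.

```python
# Rough throughput arithmetic from the headline figures (all approximate).
experiments = 910   # "approximately 910 experiments"
wall_hours = 8      # "about 8 hours"
gpus = 16

# Total GPU time, assuming every GPU was available for the full run.
gpu_hours = gpus * wall_hours
# Cluster-wide throughput.
experiments_per_hour = experiments / wall_hours
# Average GPU time per experiment, including any idle or scheduling time.
gpu_minutes_per_experiment = gpu_hours * 60 / experiments

print(f"{gpu_hours} GPU-hours in total")
print(f"{experiments_per_hour:.0f} experiments per hour across the cluster")
print(f"{gpu_minutes_per_experiment:.1f} GPU-minutes per experiment on average")
```

Under these assumptions the run comes to 128 GPU-hours, roughly 114 experiments per hour, and about 8.4 GPU-minutes per experiment.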

How Autoresearch Works

Autoresearch is an automated framework designed to streamline AI model development by systematically exploring different configurations and hyperparameters. It operates in a loop: propose, run, evaluate, and refine. The goal is to find the best-performing model configuration with minimal human intervention.
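The propose, run, evaluate, refine loop can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: `run_toy_experiment`, `propose`, and `autoresearch_loop` are hypothetical names, and the objective is a synthetic stand-in for a real training run.

```python
import random

def run_toy_experiment(config):
    """Stand-in for a training run; returns a synthetic score (lower is better)."""
    # Hypothetical objective with its minimum at lr=0.03, width=512.
    lr_term = 100 * (config["lr"] - 0.03) ** 2
    width_term = 0.05 * ((config["width"] - 512) / 512) ** 2
    return 1.0 + lr_term + width_term

def propose(best_config, rng):
    """Propose a candidate by perturbing one setting of the current best."""
    candidate = dict(best_config)
    if rng.random() < 0.5:
        candidate["lr"] = max(1e-4, best_config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    else:
        candidate["width"] = max(64, best_config["width"] + rng.choice([-128, -64, 64, 128]))
    return candidate

def autoresearch_loop(initial_config, steps, seed=0):
    """Propose, run, evaluate, and keep the candidate only if it improves the score."""
    rng = random.Random(seed)
    best_config = dict(initial_config)
    best_score = run_toy_experiment(best_config)
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best_config, rng)      # propose
        score = run_toy_experiment(candidate)      # run
        if score < best_score:                     # evaluate
            best_config, best_score = candidate, score  # refine
        history.append(best_score)
    return best_config, best_score, history

start = {"lr": 0.1, "width": 256}
best_config, best_score, history = autoresearch_loop(start, steps=50)
print("best config:", best_config, "score:", round(best_score, 4))
```

Because each candidate is evaluated before the next is proposed, this version is strictly sequential.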

The Bottleneck: One GPU, One Experiment

With a single GPU, each experiment must finish before the next one can start, which pushes the agent into sequential, greedy hill-climbing: change one setting, keep the change if it helps, and repeat. This is slow, and because only one setting changes at a time, it tends to miss interaction effects between parameters.

Giving the Agent Cloud GPUs

By providing the autoresearch agent with access to 16 GPUs, we transformed its strategy. Instead of running experiments sequentially, it could now execute multiple experiments in parallel. This change allowed for more efficient exploration of the parameter space.

Parallel Execution: The agent ran factorial grids of 10-13 experiments per wave.
Interaction Effects: Varying several parameters within one wave caught interaction effects that sequential search would miss.
Efficient Screening: The agent tested six model widths in a single wave, identified the trend quickly, and focused on the best configurations.
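A factorial wave like the ones described above might look like the following sketch. The objective `toy_loss` is synthetic, and the assumption that the best learning rate shrinks with width is made up to show an interaction; a thread pool stands in for dispatching jobs to GPUs. None of the numbers are results from the actual run.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import math

WIDTHS = [256, 512, 1024, 2048]
LEARNING_RATES = [0.01, 0.02, 0.04]

def toy_loss(width, lr):
    """Synthetic stand-in for a training run (lower is better)."""
    # Assumed interaction: the best learning rate shrinks as the model gets wider.
    best_lr = 0.04 * 256 / width
    return 2.0 - 0.1 * math.log2(width / 256) + 200 * (lr - best_lr) ** 2

def run_wave(grid, max_parallel=16):
    """Run every configuration in the grid concurrently, one worker per GPU slot."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        losses = list(pool.map(lambda cfg: toy_loss(*cfg), grid))
    return dict(zip(grid, losses))

# Full factorial grid: 4 widths x 3 learning rates = 12 experiments in one wave.
grid = list(product(WIDTHS, LEARNING_RATES))
results = run_wave(grid)

best_config = min(results, key=results.get)
best_lr_by_width = {
    w: min(LEARNING_RATES, key=lambda lr: results[(w, lr)]) for w in WIDTHS
}
print("best (width, lr):", best_config)
print("best lr per width:", best_lr_by_width)
```

In this toy setup, a one-at-a-time sweep that tuned the learning rate at width 256 would settle on 0.04, which is the worst of the three learning rates at width 2048. The grid exposes that dependence in a single wave.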

Results: ~910 Experiments, ~8 Hours, 16 GPUs

Phase 1: Hyperparameter Sweeps (~First 200 Experiments)

The initial phase focused on exploring different hyperparameters. The agent tested various learning rates, batch sizes, and regularization techniques to identify the most promising configurations.

Phase 2: Architecture Discovery (~Experiments 200-420)

Next, the agent turned to architecture discovery. It experimented with different layer types, depths, and widths to find the best structure for the task.

Phase 3: Fine-Tuning the Wider Model (~Experiments 420-560)

With a promising base model identified, the agent fine-tuned it by adjusting specific parameters to further improve performance.

Phase 4: Optimizer Tuning (~Experiments 560-700)

The agent then focused on optimizing the training process by experimenting with different optimizers and their settings.

Phase 5: Diminishing Returns (~Experiments 700-910)

As the number of experiments increased, the marginal gains started to diminish. However, the agent continued to refine its configurations for incremental improvements.

Best Configuration

The best configuration achieved a val_bpb of 0.974, down from the 1.003 baseline (lower is better).

How Parallelism Changed the Agent’s Research Strategy

Parallel execution fundamentally altered how the agent approached experimentation:

Factorial Grids: Running multiple experiments simultaneously allowed the agent to explore complex interactions between parameters.
Efficient Screening: The ability to test multiple configurations in parallel enabled rapid identification of promising models.
Heterogeneous Hardware Utilization: The agent learned to leverage different GPU types (H100s and H200s) effectively. It used cheaper H100s for initial screening and more powerful H200s for validation.

Emergent Research Strategies: Exploiting Heterogeneous Hardware

The agent's ability to adapt to heterogeneous hardware led to an emergent strategy:

Screening on H100s: Initial experiments were run on less expensive H100 GPUs to quickly filter out poor configurations.
Validation on H200s: Promising models were then validated on more powerful H200 GPUs to ensure high performance.
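The two-tier pattern can be sketched as a screen-then-validate routine. Everything here is hypothetical: the function names, the candidate widths, and the scores are illustrative, and modeling the screening tier as a noisier estimate of the validation score is an assumption made for the example, not a claim about the two GPU types.

```python
import random

def screen_then_validate(candidates, screen, validate, top_k):
    """Rank all candidates with a cheap screen, then re-run only the top_k with validate."""
    ranked = sorted(candidates, key=screen)   # lower score is better
    finalists = ranked[:top_k]
    validated = {c: validate(c) for c in finalists}
    winner = min(validated, key=validated.get)
    return winner, validated

# Hypothetical toy data: candidates are model widths with synthetic scores.
rng = random.Random(0)
validation_score = {w: 1.0 - 0.00002 * w for w in (256, 512, 768, 1024, 1536, 2048)}
# Assumption: the screening tier sees the same score plus some noise.
screening_score = {w: s + rng.gauss(0, 0.002) for w, s in validation_score.items()}

def run_on_h100(width):
    """Stand-in for a screening run on the cheaper tier."""
    return screening_score[width]

def run_on_h200(width):
    """Stand-in for a validation run on the more powerful tier."""
    return validation_score[width]

winner, validated = screen_then_validate(
    list(validation_score), screen=run_on_h100, validate=run_on_h200, top_k=2
)
print("finalists:", sorted(validated), "winner:", winner)
```

Only `top_k` candidates reach the validation tier, so the more powerful GPUs are spent on configurations that have already passed the screen.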

Cost

The cost of running 910 experiments over 8 hours using 16 GPUs was a key consideration. While the exact