Enhancing Coding Agents with Research-Driven Optimizations

Models & Research

The Engineer

10 Apr 2026 · 3 min read

Researchers at SkyPilot show that coding agents find better optimizations when they first read the existing literature and study competing projects, with substantial improvements arriving within hours.

In a recent experiment, researchers at SkyPilot demonstrated that coding agents produce more effective optimizations when they first conduct a literature search and study competing projects. By adding this research phase to the autoresearch loop (built with tools such as autoresearch and pi-autoresearch), they achieved measurable performance gains in about three hours.

Where Code-Only Context Works

Code-only context is effective for many tasks, especially when the problem domain is well understood and the codebase is relatively simple. As projects grow more complex, the limitations of this approach become apparent. For instance, optimizing an inference engine for large language models, such as llama.cpp, requires a deep understanding of both the underlying algorithms and the latest research in the field.

Where Code-Only Context Breaks Down

When dealing with advanced models and optimizations, code-only context often falls short:

Lack of Context: Without access to the latest research papers and competing projects, agents may miss out on critical insights.
Suboptimal Solutions: Agents might converge on local optima rather than discovering more efficient or innovative solutions.

Adding a Research Phase

To address these limitations, SkyPilot added a literature search phase to the autoresearch loop. This involves:

Automated Literature Search: The agent queries academic databases and GitHub repositories for relevant papers and projects.
Contextual Analysis: It analyzes the retrieved information to identify key optimizations and techniques.
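
The two steps above can be pictured as a loop that gathers ideas first and only then touches the code. The following is a minimal sketch under stated assumptions, not SkyPilot's implementation: the names Idea, run_loop, fake_research and fake_evaluate are hypothetical, the research step is a stub rather than a real database or GitHub query, and the scores are made-up throughput numbers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Idea:
    name: str    # short label for the optimization
    source: str  # where it came from, e.g. "paper" or "competing project"


@dataclass
class LoopResult:
    best_score: float
    kept: List[str] = field(default_factory=list)


def run_loop(
    research: Callable[[], List[Idea]],
    evaluate: Callable[[Idea], Tuple[bool, float]],
    baseline_score: float,
) -> LoopResult:
    """Try each researched idea; keep it only if tests pass and the score improves."""
    result = LoopResult(best_score=baseline_score)
    for idea in research():  # the research phase runs before any code change
        tests_ok, score = evaluate(idea)
        if tests_ok and score > result.best_score:
            result.best_score = score
            result.kept.append(idea.name)
    return result


# Hypothetical stand-ins for the literature search and the benchmark harness.
def fake_research() -> List[Idea]:
    return [
        Idea("fuse-two-passes", "paper"),
        Idea("unsafe-shortcut", "competing project"),
        Idea("slower-rewrite", "paper"),
    ]


def fake_evaluate(idea: Idea) -> Tuple[bool, float]:
    # (tests passed?, throughput) with made-up numbers for illustration only.
    table = {
        "fuse-two-passes": (True, 110.0),
        "unsafe-shortcut": (False, 130.0),  # fast but breaks the test suite
        "slower-rewrite": (True, 95.0),     # correct but slower than baseline
    }
    return table[idea.name]


demo = run_loop(fake_research, fake_evaluate, baseline_score=100.0)
print(demo.kept, demo.best_score)
```

The design choice worth noting is the gate: an idea is kept only when the test suite passes and the benchmark improves, so a fast but incorrect change is discarded.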

The Experiment Log

What the Research Turned Up

The literature search revealed several promising optimization techniques, including:

Softmax Fusion
RMS Norm Fusion
Adaptive from_float Parallelization
Graph-Level RMS_NORM + MUL Fusion
Flash Attention KQ Fusion

The Pivot: From Compute to Memory

One of the key insights was the importance of memory optimizations. Traditional approaches often focus on compute efficiency, but modern models can be heavily bottlenecked by memory access patterns.
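
A rough way to see why memory can dominate is the roofline model, which compares an operation's arithmetic intensity (FLOPs per byte moved) with what the hardware can sustain. The sketch below applies it to a matrix-vector product, the kind of operation that dominates token-by-token generation. The peak compute and bandwidth figures are hypothetical placeholders, not measurements from the experiment, and the function names are illustrative.

```python
def matvec_intensity(rows: int, cols: int, bytes_per_weight: float) -> float:
    """FLOPs per byte for y = W @ x, counting weights plus float32 vectors."""
    flops = 2.0 * rows * cols  # one multiply and one add per weight
    bytes_moved = rows * cols * bytes_per_weight + (rows + cols) * 4.0
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * bandwidth)


def is_memory_bound(intensity: float, peak_flops: float, bandwidth: float) -> bool:
    return intensity * bandwidth < peak_flops


# Hypothetical machine: 1 TFLOP/s peak compute, 50 GB/s memory bandwidth.
PEAK_FLOPS = 1e12
BANDWIDTH = 50e9

intensity = matvec_intensity(2048, 2048, bytes_per_weight=2.0)  # 16-bit weights
print(f"intensity: {intensity:.3f} FLOP/byte")
print(f"memory bound: {is_memory_bound(intensity, PEAK_FLOPS, BANDWIDTH)}")
```

With roughly one FLOP per byte, this operation sits far below the point where the hypothetical machine becomes compute limited, so reducing memory traffic pays off more than reducing arithmetic.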

Optimizations That Landed

Softmax Fusion: Combining the softmax operation with other layers to reduce redundant computations.
RMS Norm Fusion: Fusing RMS normalization with subsequent operations to minimize overhead.
Adaptive from_float Parallelization: Dynamically adjusting parallelization strategies for floating-point conversions based on input size and hardware capabilities.
Graph-Level RMS_NORM + MUL Fusion: Optimizing the computational graph by fusing RMS normalization and multiplication operations.
Flash Attention KQ Fusion: Enhancing the flash attention mechanism by fusing key-query computations.
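
To make the idea of fusion concrete, here is a minimal sketch of RMS normalization followed by an element-wise multiply, written first as two separate passes and then as a single fused pass. It illustrates the general technique only and is not the llama.cpp code; the function names and the eps value are assumptions chosen for the example.

```python
import math
from typing import List


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Unfused step 1: scale x by the reciprocal of its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]  # writes a full intermediate vector


def mul(x: List[float], w: List[float]) -> List[float]:
    """Unfused step 2: element-wise multiply, which re-reads the intermediate."""
    return [a * b for a, b in zip(x, w)]


def rms_norm_mul_fused(x: List[float], w: List[float], eps: float = 1e-6) -> List[float]:
    """Fused version: normalize and multiply in one output pass, no intermediate."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * wi for v, wi in zip(x, w)]


x = [1.0, -2.0, 3.0, -4.0]
w = [0.5, 1.0, 1.5, 2.0]
unfused = mul(rms_norm(x), w)
fused = rms_norm_mul_fused(x, w)
print(all(math.isclose(a, b) for a, b in zip(unfused, fused)))
```

Both versions compute the same values. The fused one skips writing and re-reading the normalized vector, which is where the saving comes from when the workload is limited by memory traffic.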

Results

The experiment, conducted using 4 cloud VMs over approximately 3 hours, produced the following results:

x86: 15% faster text generation with flash attention
ARM: 5% faster text generation with flash attention

These gains were measured on the TinyLlama 1.1B model.

What Didn’t Work

Not all experiments led to successful outcomes:

Failed Experiments: Some techniques that seemed promising in theory did not translate well into practice.
Benchmark Bug: An issue with the benchmarking setup initially skewed results.
Cloud VMs are Noisy: Variability in cloud environments can affect performance measurements.
Code Review: Rigorous code review is essential to ensure the quality and correctness of optimizations.
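
The noise problem noted above is usually handled by repeating each benchmark and comparing medians rather than single runs. The following is a minimal sketch of such a check, assuming throughput samples where higher is better. The function names, the 2% minimum gain, and the sample values are hypothetical, not taken from the experiment.

```python
import statistics
from typing import Sequence


def relative_spread(samples: Sequence[float]) -> float:
    """Range of the samples as a fraction of their median (a crude noise estimate)."""
    med = statistics.median(samples)
    return (max(samples) - min(samples)) / med


def is_real_improvement(
    baseline: Sequence[float],
    candidate: Sequence[float],
    min_gain: float = 0.02,
) -> bool:
    """Accept only if the median gain exceeds both a minimum and the observed noise."""
    base_med = statistics.median(baseline)
    cand_med = statistics.median(candidate)
    gain = (cand_med - base_med) / base_med
    noise = max(relative_spread(baseline), relative_spread(candidate))
    return gain > max(min_gain, noise)


# Made-up tokens-per-second samples from repeated runs.
baseline_runs = [100.0, 101.0, 99.0, 100.0, 102.0]
clear_win = [115.0, 116.0, 114.0, 115.0, 117.0]
within_noise = [101.0, 103.0, 100.0, 102.0, 101.0]

print(is_real_improvement(baseline_runs, clear_win))
print(is_real_improvement(baseline_runs, within_noise))
```

A range-based noise estimate is deliberately conservative. On very noisy machines it will reject small real gains, which is usually the safer failure mode for an automated loop.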

What This Means for Coding Agents

Integrating a research phase into coding agents' workflows can lead to more effective optimizations. With access to recent papers and competing projects, an agent can find techniques that the code alone would not suggest.

Try It on Your Own Project

The full setup is available for any project with a benchmark and test suite. Whether you're working on machine learning models, web applications, or other software projects, this approach can help you achieve better performance.