
A benchmark perspective on whether models can move beyond static answers and begin contributing to the iterative experimental loops that actually produce scientific and engineering progress.
💡Quick take https://autolab.moe/
- Most benchmarks test one-shot correctness. But real progress emerges from cycles of failure, diagnosis, and revision under limited compute and noisy feedback.
- AutoLab replaces the exam with the laboratory: no answer key, just a compute budget and a path that must be discovered empirically by running, measuring, and revising under real constraints.
- Across 23 tasks, the capability that separates success from failure is closed-loop resilience: the ability to survive negative empirical feedback, update hypotheses, and restructure the approach.

Frontier research progress is driven by loops, not isolated outputs.
Progress is not a single brilliant insight generated in a zero-shot prompt. It is ground out through trial, measurement, and revision.
Most benchmarks test whether a model can produce the right output on the first try: the right code, the right reasoning chain, the right answer. But frontier AI is not built from isolated answers. It is built from loops of trial, measurement, interpretation, and revision, often running continuously, day and night.
Training recipes are tuned through hundreds of small experiments. System performance is improved through countless interventions. Architecture choices are shaped through loops of modification and measurement, all under constraints: finite compute, limited time, incomplete information, and noisy feedback. Evaluating models on this kind of work asks whether they can gather the empirical evidence required to actually find the answer, not just produce one on the first try.
Beyond the answer engine
If a model only completes isolated subtasks, its impact is fundamentally bounded. By the Theory of Constraints [9], speeding up one isolated component hits a ceiling because the bottleneck simply moves elsewhere.
To achieve an order-of-magnitude improvement, you have to restructure the system itself. Version control is a clear example: branching and merging removed the constraint of sequential coordination, enabling parallel experimentation not by making editing faster, but by eliminating the bottleneck entirely.
When models enter the experimental loop, something analogous happens. The bottleneck in research has long been human iteration speed: the number of experiment cycles a researcher can run in a day is small and fixed. An agent can run an order of magnitude more, and it can run around the clock. This does not just speed up one step; it removes the human iteration bottleneck from the critical path entirely, changing the structure of the work itself.
That is a different claim from saying models are becoming more intelligent. The shift is that they are gaining the ability to function inside the iterative machinery of discovery: not just reason about experiments, but execute them.
Why we built AutoLab
AutoLab is a benchmark for participation in experimental loops, not just for static knowledge or isolated reasoning.
If agents are becoming structural participants in the research loop, running hundreds of experiments overnight and iterating continuously without human intervention, we need a way to measure exactly that capability. That is the motivation behind AutoLab.
Each task gives the agent a working but unoptimized program, a fixed compute budget, and a target metric. There is no answer key. The path forward must be discovered empirically: profiling, hypothesizing, modifying code, running experiments, and iterating under real constraints.
AutoLab tests whether a model can do the unglamorous but essential work of empirical improvement: not just thinking about change, but executing the loop through which change is found.
The emerging landscape
The idea of agents running continuous optimization loops is already producing results. Karpathy's autoresearch [1] ran an agent for two days straight: 700 experiments, 20 discovered optimizations, an 11% training speedup on a project he considered already well-tuned. AlphaEvolve [2] (DeepMind, 2025) takes this further: LLMs paired with evolutionary search discovered the first improvement to matrix multiplication beyond Strassen in 56 years, now recovering 0.7% of Google's worldwide compute.
But how do we measure this capability? RE-Bench [3] (METR, 2024) compares agents against human experts on 8-hour research tasks. KernelBench [4] (Stanford, 2025) evaluates LLMs on 250 GPU kernel optimization tasks. COFFE [5] (FSE 2025) introduced instruction counting to eliminate machine noise from efficiency measurement. The AI Scientist [6] (Nature, 2026) generates research papers end-to-end. AutoLab builds on ideas from several of these, particularly COFFE's instruction counting [5], and provides a standardized benchmark for the kind of continuous optimization loop that autoresearch [1] and AlphaEvolve [2] have demonstrated.

Design and task space
AutoLab contributes a facet that is currently underrepresented: tasks with large, under-explored solution spaces where the agent must discover its own path. A few examples illustrate the design intent:
Combinatorial search. Find a 16-input sorting network with the fewest comparators. The best known result (60) dates to the 1960s and has never been proven optimal.
Algorithmic reinvention. Optimize an attention kernel by re-deriving blocked tiling, online softmax, and SIMD vectorization from first principles under a single-threaded, C-only constraint.
Inverted objectives. Compress a 17,924-parameter game-playing network to achieve the same accuracy with as few parameters as possible. The reference solution reaches 913 parameters, an approach that emerges from rethinking the architecture, not from applying standard compression techniques.
Open-ended ML research. Select the best 5,000 training samples from a pool of 50,000 for instruction-following fine-tuning, or train a language model from scratch with the lowest possible perplexity under a fixed compute budget.
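The combinatorial-search task above also shows why verification stays cheap even when the search space is enormous: by the zero-one principle, a comparator network sorts every input if and only if it sorts all 2^n zero/one inputs. A minimal checker might look like this (a Python sketch; the benchmark's actual harness may differ):

```python
from itertools import product

def sorts_all(network, n):
    """Check a comparator network on all 2^n zero/one inputs.

    By the zero-one principle, sorting every binary vector implies
    sorting every input, so 2^16 checks suffice for the 16-input task.
    `network` is a list of (i, j) comparators with i < j.
    """
    for bits in product((0, 1), repeat=n):
        v = list(bits)
        for i, j in network:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
        if any(v[k] > v[k + 1] for k in range(n - 1)):
            return False
    return True

# The optimal 5-comparator network for 4 inputs passes the check.
optimal_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
```

Cheap, exhaustive verification is what makes the search agent-friendly: every candidate network gets an unambiguous pass/fail signal in milliseconds.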
The first release includes 23 tasks spanning systems engineering (SIMD vectorization in C, cache-aware data structures in C++, lock-free concurrency in Go, zero-copy parsing in Rust) and frontier AI research (compute-optimal pretraining, data selection, multi-source GRPO fine-tuning). What unifies them is the structure: every task presents an open search space where progress must be discovered through iterative experimentation.
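The attention-kernel task turns on re-deriving exactly this kind of trick. As a sketch of the online-softmax idea in Python (the task itself requires single-threaded C), a single pass keeps a running maximum and rescales the running sum whenever the maximum changes:

```python
import math

def online_softmax(xs):
    """Single-pass, numerically stable softmax.

    Maintains a running maximum m and a running sum s of exp(x - m);
    whenever a new maximum appears, the old sum is rescaled. This is
    the recurrence that lets attention scores be processed block by
    block without materializing the full row.
    """
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

Because the recurrence never exponentiates a large raw score, inputs like `[1000.0, 1000.0]` are handled without overflow, which is what makes the blocked formulation viable in the first place.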
The three axes of improvement
These tasks cluster along three dimensions. The objectives differ, but the loop is the same: propose, test, measure, revise.
Faster
Reduce latency through systems or architectural changes while preserving quality. Example: flash attention baseline → agent achieves 30× speedup via cache-aware tiling and SIMD vectorization.
Smarter
Improve task performance under a fixed compute budget. Example: data selection for IFEval, where the best agent reaches 46.2% strict accuracy vs. 37.8% random baseline.
Smaller
Find a more efficient solution without sacrificing correctness. Example: Connect-3 player compressed from 17,924 → 913 parameters via architectural restructuring.
Case study
The difference between solving a coding problem and doing research is the ability to survive the friction of empirical failure.
The concept is easier to see in practice. Below, we trace how frontier models approached two AutoLab tasks, not to rank which model is better, but to illustrate the specific behaviors that separate loop participation from one-shot generation.
Data Selection


The random baseline (selecting 5,000 samples uniformly) already achieves 37.8%. The gap between random and the best agent is just 8.4 percentage points. That narrow margin is the point: in data selection, most of the value comes not from having a clever initial idea, but from iteratively refining which samples matter based on empirical training signal.
The top-scoring agents wrote an initial selection heuristic, triggered a training run, and inspected the evaluation breakdown. When the model failed on specific constraint types (length limits, punctuation rules), they parsed the error distribution, identified which training samples addressed those failures, and rewrote the selection criteria. This diagnose-and-revise cycle, not the initial heuristic, is where the score differential was earned.
By treating each training run as a diagnostic, not just a pass/fail check, the agent mapped which constraint types the model struggled with and shaped the data to address those gaps specifically. The 8.4-point margin is thin, but it represents exactly the kind of empirically-grounded refinement that separates loop participation from one-shot generation.
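The shape of that loop can be sketched schematically. Everything below is a toy stand-in: `train_and_eval` here is a synthetic error model (a constraint "fails" in proportion to how rarely the selection covers it), and none of the names are AutoLab's API. But it captures the diagnose-and-revise structure:

```python
from collections import Counter

CONSTRAINTS = ["length", "punctuation", "format", "keyword"]

def train_and_eval(selected):
    """Toy stand-in for a real training run: a constraint type 'fails'
    in proportion to how rarely the selection covers it. In the real
    task this signal comes from the IFEval error breakdown."""
    coverage = Counter(c for s in selected for c in s["constraints"])
    return {c: 1.0 / (1 + coverage[c]) for c in CONSTRAINTS}

def select_loop(pool, k, rounds):
    """Schematic diagnose-and-revise loop for data selection.

    Each round: pick the k highest-weight samples, 'train', read the
    per-constraint error breakdown, then upweight pool samples that
    exercise the failing constraint types."""
    weights = {i: 1.0 for i in range(len(pool))}
    total_error = []
    for _ in range(rounds):
        ranked = sorted(weights, key=weights.get, reverse=True)
        selected = [pool[i] for i in ranked[:k]]
        errors = train_and_eval(selected)       # one run = one probe
        total_error.append(sum(errors.values()))
        for i, sample in enumerate(pool):       # revise the criteria
            for c in sample["constraints"]:
                weights[i] *= 1.0 + errors[c]
    return total_error
```

Even with this crude error model, a selection skewed toward one constraint type gets corrected after a single probe, which is the mechanism the top agents exploited.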
Connect-3 Parameter Golf

Unlike latency tasks where every percentage point counts, this task has a phase transition: the structural insight that unlocks a compact representation is discrete, producing a sharp 1.0-vs-0.0 split in scores. What makes the trajectories instructive is how the loop strategies diverged.
Claude 4.6 approached the constraint using standard engineering intuition. When its initial MLP failed to generalize, it noticed the horizontal flip symmetry of the game board and augmented the training data. A clever data-side fix, but it kept pushing raw features through a flat structure, never questioning the topology itself. The parameter count stayed too high.

Claude's data-side diagnosis was correct, but its subsequent loop stayed within the same structural paradigm: shrinking layer widths, adjusting activation functions, retraining. Each iteration produced incremental compression without the architectural pivot needed to cross the threshold. This is a revealing failure mode: the loop ran, but it explored a local neighborhood rather than restructuring.
GPT-5.4 made the critical shift. After recognizing that flat dense layers were fundamentally parameter-inefficient for evaluating multiple columns, it reorganized the network topology entirely, constructing a shared action scorer whose weights were reused across all columns simultaneously.

By reusing weights across potential actions instead of adding more layers, GPT-5.4 bypassed the capacity constraint with a structural insight, reaching 3,201 parameters at 95.17% accuracy. The difference was not intelligence or reasoning quality in isolation, but what happened at the decision point when incremental refinement stopped working: Claude kept optimizing within the existing frame, GPT-5.4 restructured the frame itself.
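The structural pivot can be sketched in a few lines of NumPy. This is a schematic of the weight-sharing idea only, not GPT-5.4's actual network; the layer sizes and per-column feature shapes are assumptions:

```python
import numpy as np

def param_count_flat(board_feats, hidden, n_cols):
    """Flat MLP: whole board -> hidden -> one score per column.
    Parameters grow with the number of columns."""
    return board_feats * hidden + hidden + hidden * n_cols + n_cols

def param_count_shared(col_feats, hidden, n_cols):
    """Shared scorer: the same (col_feats -> hidden -> 1) MLP is
    applied to every column, so parameters do not grow with n_cols."""
    return col_feats * hidden + hidden + hidden + 1

def shared_scorer(boards, W1, b1, w2, b2):
    """Score every column of a batch of boards with one shared MLP.

    boards: (batch, n_cols, col_feats) per-column features.
    Returns (batch, n_cols) action scores, using the same weights
    for every column."""
    h = np.maximum(boards @ W1 + b1, 0.0)  # (batch, n_cols, hidden)
    return h @ w2 + b2                     # (batch, n_cols)
```

Because the scorer's parameters are independent of the number of columns, capacity stops scaling with the action space; the two parameter-count functions make the asymmetry explicit.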
What we are actually measuring
These two tasks test very different skills: statistical intuition in one case, architectural reasoning in the other. But the capability that separated success from failure was the same: the ability to survive negative empirical feedback and change strategy in response. We call this closed-loop resilience.
Closed-loop resilience is distinct from the capabilities that most benchmarks measure. It is not about correctness on a first attempt, breadth of knowledge, or reasoning chain quality in isolation. It is the compound skill of staying oriented inside an iterative process: interpreting ambiguous signals, recognizing when a current approach has saturated, and deciding whether to refine within the existing frame or restructure entirely.
Three design choices make this capability measurable across the full benchmark:
Continuous scoring, not pass/fail. Reward is log-scaled: a 10× speedup earns credit even if 30× is possible. This captures partial progress, because every loop iteration should move the needle.
Reproducible metrics via instruction counting. Inspired by COFFE [5], many tasks use Valgrind/callgrind to count CPU instructions instead of wall-clock time, eliminating machine noise entirely.
Metric diversity beyond speed. Some tasks minimize model parameters, comparator counts, or memory allocations, inverting the typical optimization objective and testing whether the agent can adapt its loop strategy to fundamentally different reward signals.
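As a concrete illustration of the first design choice, a log-scaled reward might look like the following. The clipping behavior and the exact formula are assumptions for the sketch, not AutoLab's published scoring rule:

```python
import math

def speedup_reward(baseline, achieved, reference):
    """Log-scaled reward in [0, 1] for a latency task (lower is better).

    A 10x speedup against a 30x reference earns log(10)/log(30),
    roughly 0.68, rather than zero, so every loop iteration that
    moves the metric is rewarded. Constants are illustrative.
    """
    if achieved >= baseline:       # no improvement over the baseline
        return 0.0
    speedup = baseline / achieved
    ref_speedup = baseline / reference
    return min(1.0, math.log(speedup) / math.log(ref_speedup))
```

The key property is that the gradient of credit never goes flat short of the reference: partial progress is always distinguishable from no progress.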
The scientist does not disappear
Consider a concrete scenario: an agent runs fifty hyperparameter sweeps overnight and reports a configuration with the lowest validation loss. A human researcher glances at the loss curve, recognizes overfitting on a mislabeled data split, rejects the result, and redirects the search. That judgment call, knowing when a metric is misleading, is not something the loop itself can provide.

The broader question is whether research itself changes when machines begin to participate in experimental loops. AutoLab is an instrument for studying that transition: not a grand claim that autonomy has arrived, but a controlled environment where we can measure whether agents contribute to the loop under real constraints. The scientist does not disappear. The loop gets a new participant.
A living benchmark
The 23 tasks here are a starting point. AutoLab is designed for community contribution, and we are actively inviting researchers to submit new tasks across all three axes. If you have an optimization problem with a clear metric, an open search space, and no known optimal solution, it probably belongs here.

Submit tasks and run agents via Contribute a Task. Each task needs an instruction file, a Dockerfile, a test script that writes a reward, and optionally a reference solution. The benchmark grows with every contribution.
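To make the contribution format concrete, here is what a minimal test script could look like. The file names, the wall-clock fallback, the 30× reference, and the [0, 1] reward convention are assumptions for illustration, not the benchmark's actual contract:

```python
#!/usr/bin/env python3
"""Illustrative AutoLab-style test script: times the candidate against
the shipped baseline and writes a reward file for the harness to read.
File names and the reward convention are assumptions for this sketch."""
import json
import math
import subprocess
import time

def measure(cmd, runs=3):
    """Best-of-N wall-clock seconds for a command; noise-sensitive
    tasks would swap this for a callgrind instruction count."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best

def main():
    baseline = measure(["./baseline"])   # working but unoptimized program
    candidate = measure(["./solution"])  # the agent's current attempt
    speedup = max(baseline / candidate, 1.0)
    reward = min(1.0, math.log(speedup) / math.log(30.0))  # log-scaled credit
    with open("reward.json", "w") as f:
        json.dump({"reward": reward}, f)

# A real task script would end by calling main().
```

The contract is deliberately thin: anything that can run in a container and emit a single scalar reward can become a task.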

References
[1] Karpathy, A. (2026). autoresearch: AI agents running research on single-GPU nanochat training automatically. github.com/karpathy/autoresearch

[2] Novikov, A., Vu, N., Eisenberger, M., et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind. arXiv preprint arXiv:2506.13131. arxiv.org/abs/2506.13131
[3] Wijk, H., Lin, T., Becker, J., et al. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. ICML 2025. arxiv.org/abs/2411.15114
[4] Ouyang, A., Guo, S., Arora, S., et al. (2025). KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517. arxiv.org/abs/2502.10517
[5] Peng, Y., Wan, J., Li, Y., Ren, X. (2025). COFFE: A code efficiency benchmark for code generation. FSE 2025. ACM SIGSOFT Distinguished Paper Award. arxiv.org/abs/2502.02827
[6] Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., Ha, D. (2026). Towards end-to-end automation of AI research. Nature, 651, 914–919. doi.org/10.1038/s41586-026-10265-5
[7] Jimenez, C.E., Yang, J., Wettig, A., et al. (2024). SWE-bench: Can language models resolve real-world GitHub issues? ICLR 2024. arxiv.org/abs/2310.06770
[8] Huang, Q., Vora, J., Liang, P., Leskovec, J. (2024). MLAgentBench: Evaluating language agents on machine learning experimentation. ICML 2024. arxiv.org/abs/2310.03302

[9] Goldratt, E.M. (1984). The Goal: A Process of Ongoing Improvement. North River Press.