
Auto-Overfit: Measuring and Reducing Overfitting in Autoresearch

· 13 min read
@travispchen

This post is published from the autoresearch-overfit repository, which contains the full harness, replay code, and per-task results.

Karpathy's autoresearch is a simple, compelling pattern: an LLM agent edits train.py, runs a fixed training budget, accepts or rejects the edit based on a validation metric on a holdout slice, and repeats. You go to sleep and wake up to a better model.
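The loop can be sketched as follows. This is a minimal illustration, not the actual harness: `propose_edit` and `train_and_eval` are hypothetical stand-ins for the LLM proposer and the fixed-budget training run.

```python
import random

def propose_edit(code):
    # Hypothetical: the LLM proposes a modified train.py (stubbed as a no-op tweak).
    return code + f"\n# tweak {random.random():.3f}"

def train_and_eval(code):
    # Hypothetical: run training under the fixed budget, return the val metric (stubbed).
    return random.random()

def autoresearch_loop(code, n_steps=10):
    """Accept an edit iff it improves the validation metric; repeat."""
    best_val = train_and_eval(code)
    for _ in range(n_steps):
        candidate = propose_edit(code)
        val = train_and_eval(candidate)
        if val > best_val:  # the accept/reject decision conditions on the holdout
            code, best_val = candidate, val
    return code, best_val
```

Every pass through the `if` is one bit of information about the holdout flowing into the agent's trajectory, which is exactly where the trouble below comes from.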

The structural worry is adaptive holdout reuse¹: every accept/reject decision leaks information from the val set into the agent's trajectory, even though the val set is disjoint from the train set. After many experiments, val error starts acting like train error. The original autoresearch repository surfaced a crystal-clear example in autoresearch issue #131: the last accepted "win" in an overnight run turned out to be a random seed change.
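The leak is easy to reproduce in isolation: even when every candidate edit has zero true effect, accepting on the best val score inflates val relative to test. A minimal simulation (the numbers here are illustrative and unrelated to the study's tasks):

```python
import random
import statistics

random.seed(0)
n_trials, n_candidates, n_val = 200, 30, 500

gaps = []
for _ in range(n_trials):
    best_val, best_test = -1.0, 0.0
    for _ in range(n_candidates):
        true_acc = 0.5  # no candidate actually helps: true accuracy never moves
        val = sum(random.random() < true_acc for _ in range(n_val)) / n_val
        test = sum(random.random() < true_acc for _ in range(n_val)) / n_val
        if val > best_val:  # accept on val, exactly as the loop does
            best_val, best_test = val, test

    # Val score of the accepted "winner" minus its test score.
    gaps.append(best_val - best_test)

print(statistics.mean(gaps))  # positive: the accepted val score overstates test
```

With 30 pure-noise candidates and a 500-example holdout, the selected val score sits well above its test counterpart on average, which is the val-test gap this study measures.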

I noticed the same pattern in my own quantitative research projects. For instance, on a canonical basketball analytics problem like RAPM (predicting the points scored on a possession given the 5 offensive and 5 defensive players on the floor), an autoresearch agent started engineering features from pairs of players, a clear overfitting signal on a small dataset.

This study measures this effect on tabular ML tasks from OpenML² with an XGBoost baseline³.

Headline results:

  • The basic autoresearch loop overfits to validation "wins" that do not show up on test. Val improved +0.019 while test moved essentially 0 (-0.0006) on average. The val-test gap is +0.020 (paired p = 0.001).
  • Requiring a significant effect size to beat the validation noise floor fixes most of the problem. Accepting only changes with Δval > 1·σ_val cuts the mean overfit gap to +0.005 (p = 0.004) and flips mean Δtest from -0.0006 to +0.003. The Δtest improvement alone is not significant at n = 15 (one-sided p = 0.06) but the gap reduction is.
  • Asking the LLM to reflect on the source of overfitting did not help. Showing the proposer train/val diagnostics and asking it to avoid likely noise wins produced gap +0.021 (p = 0.75) and Δtest -0.003, neither statistically different from the plain val-gate baseline.
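The noise-floor gate from the second bullet can be sketched as follows. This is one plausible implementation, not necessarily the repo's: σ_val is estimated by bootstrapping per-example 0/1 correctness on the val set, and a change is accepted only if its Δval clears k·σ_val.

```python
import random
import statistics

def sigma_val(per_example_correct, n_boot=200, seed=0):
    """Bootstrap estimate of the val-metric noise floor: the std of the mean
    accuracy under resampling of the validation examples."""
    rng = random.Random(seed)
    n = len(per_example_correct)
    boot_means = [
        statistics.mean(rng.choices(per_example_correct, k=n))
        for _ in range(n_boot)
    ]
    return statistics.stdev(boot_means)

def sigma_gate(delta_val, per_example_correct, k=1.0):
    """Accept a proposed change only if its val improvement beats k * sigma_val."""
    return delta_val > k * sigma_val(per_example_correct)
```

On a 500-example holdout at ~50% accuracy, σ_val is roughly 0.022, so a +0.019 val "win" like the baseline's mean would be rejected by the 1·σ gate, which is the mechanism behind the gap shrinking to +0.005.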