I want this site to start with experiments, not opinions. The shape I am interested in is small and repeatable: give a model or coding agent a task that looks simple to humans, measure exactly where it succeeds and where it fails, and write down the surprising part.
The first probe I am considering follows four steps:
- Pick a task with a crisp pass/fail outcome.
- Run it across a few models or agent workflows.
- Keep the prompt, environment, and scoring public.
- Publish the failure cases, not just the best run.
The goal is not to build a grand benchmark. The goal is to create one useful measurement that other engineers can argue with, rerun, or extend.
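To make that shape concrete, here is a minimal sketch of what one probe might look like in code. Everything in it is an assumption for illustration: `Runner`, `RunRecord`, `run_probe`, and `failures` are hypothetical names, and the stub runners stand in for real model or agent calls so the sketch runs without any API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a "runner" is whatever calls a model or agent workflow
# and returns its raw output as text.
Runner = Callable[[str], str]


@dataclass
class RunRecord:
    runner_name: str
    prompt: str
    output: str
    passed: bool


def run_probe(prompt: str, runners: dict[str, Runner],
              check: Callable[[str], bool]) -> list[RunRecord]:
    """Run one prompt through every runner and score each output pass/fail."""
    records = []
    for name, runner in runners.items():
        output = runner(prompt)
        records.append(RunRecord(name, prompt, output, check(output)))
    return records


def failures(records: list[RunRecord]) -> list[RunRecord]:
    """Keep the failed runs so they can be published alongside the passes."""
    return [r for r in records if not r.passed]


# Stub runners (not real models) so the sketch executes on its own.
stub_runners = {
    "stub_correct": lambda prompt: "4",
    "stub_wrong": lambda prompt: "5",
}
records = run_probe(
    "What is 2 + 2? Answer with a number only.",
    stub_runners,
    check=lambda out: out.strip() == "4",
)
for record in failures(records):
    print(record.runner_name, repr(record.output))
```

The point of keeping `check` as a plain function is that the scoring rule stays small enough to publish next to the prompt, and every run, including the failed ones, ends up in the same list of records.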
Candidate Tasks
- Can an agent make a targeted multi-file refactor while preserving public behavior?
- Can a model infer a hidden rule from examples, then generate adversarial examples that break its own solution?
- Can an agent repair a spreadsheet model after only reading user-facing symptoms?
- Can an LLM write a deterministic parser for a tiny spec it has never seen before?
What I Want to Learn
I care less about whether a single model “wins” and more about the texture of the failure. Did the agent misunderstand the goal, lose track of constraints, overfit to examples, or make a tool-use mistake? Those details are where the useful engineering lessons are.
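One way to keep that texture visible is to label each failed run with a category and count the labels. The sketch below assumes the four categories named above are enough to start with; `FailureKind` and `tally` are hypothetical names, and the labels in the example are placeholders, not results.

```python
from collections import Counter
from enum import Enum


class FailureKind(Enum):
    MISUNDERSTOOD_GOAL = "misunderstood the goal"
    LOST_CONSTRAINTS = "lost track of constraints"
    OVERFIT_EXAMPLES = "overfit to examples"
    TOOL_USE_MISTAKE = "tool-use mistake"


def tally(labels: list[FailureKind]) -> Counter:
    """Count how often each failure category was assigned."""
    return Counter(labels)


# Placeholder labels for illustration only; these are not measured data.
labels = [
    FailureKind.OVERFIT_EXAMPLES,
    FailureKind.TOOL_USE_MISTAKE,
    FailureKind.OVERFIT_EXAMPLES,
]
counts = tally(labels)
for kind, n in counts.most_common():
    print(f"{kind.value}: {n}")
```

The labels would be assigned by hand while reading raw logs, so the categories are likely to change once real failures show up that fit none of them.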
This post is currently a working note. The finished version should include a runnable repo, raw logs, a small visualization, and a clear conclusion. That is the bar I want for the site: fewer generic takes, more concrete evidence.