A Small Capability Probe for Agentic Engineering

My plan for testing what agents can actually do, instead of writing another setup tour.

I want this site to start with experiments, not opinions. The shape I am interested in is small and repeatable: give a model or coding agent a task that looks simple to humans, record exactly where it succeeds and where it fails, and write down the surprising part.

The first probe I am considering follows four steps:

Pick a task with a crisp pass/fail outcome.
Run it across a few models or agent workflows.
Keep the prompt, environment, and scoring public.
Publish the failure cases, not just the best run.
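
As a rough illustration of those four steps, here is a minimal sketch of what the harness could look like. Everything in it is an assumption made for illustration: the names RunResult, run_probe, and failures are hypothetical, and the two runners are stand-in functions rather than real model or agent calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    runner: str   # which model or agent workflow produced this run
    passed: bool  # crisp pass/fail outcome
    output: str   # raw output, kept so failures can be published


def run_probe(
    prompt: str,
    runners: Dict[str, Callable[[str], str]],
    check: Callable[[str], bool],
) -> List[RunResult]:
    """Run the same prompt through every runner and score each output."""
    results = []
    for name, run in runners.items():
        output = run(prompt)
        results.append(RunResult(name, check(output), output))
    return results


def failures(results: List[RunResult]) -> List[RunResult]:
    """Keep the failing runs, since those are the ones worth publishing."""
    return [r for r in results if not r.passed]


# Stand-in runners (hypothetical): replace with real model or agent calls.
runners = {
    "stub-correct": lambda prompt: "4",
    "stub-wrong": lambda prompt: "5",
}

results = run_probe(
    "What is 2 + 2? Answer with a number only.",
    runners,
    check=lambda out: out.strip() == "4",
)

for r in results:
    print(r.runner, "PASS" if r.passed else "FAIL", repr(r.output))
```

Keeping the raw output on every result is the design choice that matters here: the failing runs can be published verbatim instead of being summarized away.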

The goal is not to build a grand benchmark. The goal is to create one useful measurement that other engineers can argue with, rerun, or extend.

Candidate Tasks

Can an agent make a targeted multi-file refactor while preserving public behavior?
Can a model infer a hidden rule from examples, then generate adversarial examples that break its own solution?
Can an agent repair a spreadsheet model after only reading user-facing symptoms?
Can an LLM write a deterministic parser for a tiny spec it has never seen before?
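
To make "crisp pass/fail" concrete for the first candidate, here is one possible scoring sketch: call the pre-refactor and post-refactor versions of a public function on the same inputs and report any disagreement. The helper names and the two slugify functions are hypothetical examples, not code from a real task, and differential checking like this only covers the inputs you supply.

```python
from typing import Any, Callable, Iterable, List, Tuple


def outcome(fn: Callable[[Any], Any], x: Any) -> Tuple[str, Any]:
    """Capture either the return value or the exception type name."""
    try:
        return ("ok", fn(x))
    except Exception as exc:
        return ("error", type(exc).__name__)


def behavior_mismatches(
    before: Callable[[Any], Any],
    after: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Any]:
    """Return the inputs on which the two implementations disagree."""
    return [x for x in inputs if outcome(before, x) != outcome(after, x)]


# Hypothetical "before" and "after" versions of a public function.
def slugify_before(s: str) -> str:
    return "-".join(s.lower().split())


def slugify_after(s: str) -> str:
    # Looks equivalent, but treats repeated spaces differently.
    return s.lower().replace(" ", "-")


mismatches = behavior_mismatches(
    slugify_before, slugify_after, ["Hello World", "a  b", "x"]
)
print("PASS" if not mismatches else f"FAIL on {mismatches!r}")
```

The refactor passes only if the mismatch list is empty, so the score is binary while the list itself still says where behavior drifted.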

What I Want to Learn

I care less about whether a single model “wins” and more about the texture of the failure. Did the agent misunderstand the goal, lose track of constraints, overfit to examples, or make a tool-use mistake? Those details are where the useful engineering lessons are.
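
One way to keep that texture from dissolving into anecdotes is to label each failed run with a category and count the labels. The sketch below assumes the four failure modes named above are the whole taxonomy, which is probably too tidy. The FailureMode enum is a hypothetical name and the labels are made-up placeholders, not results.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class FailureMode(Enum):
    MISUNDERSTOOD_GOAL = "misunderstood the goal"
    LOST_CONSTRAINT = "lost track of constraints"
    OVERFIT_EXAMPLES = "overfit to examples"
    TOOL_USE_ERROR = "tool-use mistake"


def tally(labels: Iterable[FailureMode]) -> Counter:
    """Count how often each failure mode was assigned."""
    return Counter(labels)


# Placeholder labels for illustration only; real ones would come from
# reading the raw logs of failed runs by hand.
labels = [
    FailureMode.LOST_CONSTRAINT,
    FailureMode.TOOL_USE_ERROR,
    FailureMode.LOST_CONSTRAINT,
]

counts = tally(labels)
for mode, n in counts.most_common():
    print(f"{n}  {mode.value}")
```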

This post is currently a working note. The finished version should include a runnable repo, raw logs, a small visualization, and a clear conclusion. That is the bar I want for the site: fewer generic takes, more concrete evidence.