Case studies

The framework applied to real problems

Real examples of Plan, Implement, Review applied to AI-delegated work. Each one shows a specific failure mode and how structured human review caught what implementation alone would have missed.

Security migration — landed in under an hour with proper planning

A multi-tenant auth migration scoped at one to two weeks. The plan absorbed the complexity. The review caught three issues implementation missed.

The problem. A multi-tenant application needed to replace its entire authentication system — the identity provider, the session model, the tenant resolution chain, and every route that touched user identity. The migration was security-sensitive: a mistake in tenant separation would expose one client’s data to another. Under a conventional approach, this would be scoped at one to two weeks of engineering with manual QA.

What the operating model changed. The plan absorbed the complexity before implementation started. Every coupling between the old and new auth system was mapped in text. Non-negotiables were defined: tenant isolation could not degrade, session handling had to be verified against every entry point, and no route could be left on the old identity model. Acceptance criteria were written as tests before a single line of implementation began. Implementation compressed once the brief was stable — the core migration landed in a fraction of the conventional estimate, not because anyone typed faster, but because the specification removed the ambiguity that normally slows engineering down.
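To make "acceptance criteria written as tests" concrete, here is a minimal sketch of what a tenant-isolation criterion might look like before any migration code exists. Everything in it is an assumption for illustration: the `Session` type, the `fetch_records` function, the in-memory `RECORDS` data, and the tenant names are hypothetical stand-ins, not the application's real interfaces.

```python
from dataclasses import dataclass


class TenantIsolationError(Exception):
    """Raised when a session tries to read another tenant's data."""


@dataclass(frozen=True)
class Session:
    user_id: str
    tenant_id: str


# Hypothetical in-memory data standing in for the real datastore.
RECORDS = {
    "tenant-a": ["invoice-1", "invoice-2"],
    "tenant-b": ["invoice-9"],
}


def fetch_records(session: Session, requested_tenant: str) -> list:
    """Return records only when the session belongs to the requested tenant."""
    if session.tenant_id != requested_tenant:
        raise TenantIsolationError(
            f"{session.user_id} may not read {requested_tenant}"
        )
    return list(RECORDS.get(requested_tenant, []))


def check_own_tenant_is_readable() -> None:
    """Criterion 1: a user can still read their own tenant's records."""
    session = Session(user_id="alice", tenant_id="tenant-a")
    assert fetch_records(session, "tenant-a") == ["invoice-1", "invoice-2"]


def check_cross_tenant_read_is_rejected() -> None:
    """Criterion 2: a cross-tenant read must fail loudly, never return data."""
    session = Session(user_id="alice", tenant_id="tenant-a")
    try:
        fetch_records(session, "tenant-b")
    except TenantIsolationError:
        return
    raise AssertionError("cross-tenant read was allowed")
```

The second check is the important one: it asserts a rejection rather than a success, which is the behaviour a click-through QA pass is least likely to exercise.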

What changed. Verification was repeatable. Instead of a developer clicking through flows and remembering what worked, every critical path had an automated check. Failures were concrete and specific, not “I think this might be broken.” The review checkpoint caught three issues that implementation alone missed — including a subtle tenant-scoping bug that would only have surfaced in production under a specific authentication flow.

AI model selection — a structured trial instead of a blind production swap

A provider was retiring the model behind a daily news-intelligence pipeline, so a change was forced. Five candidates ran in parallel on live traffic against a scoring gate fixed in advance — and the same discipline surfaced a silent failure the production path had hidden.

The problem. A business ran a daily news-intelligence pipeline that read roughly 500 articles a day through a language model to extract structured data — the organisations named, the relationships between them, and a relevance score for each. The provider was retiring that model on a fixed date, so a change was not optional. But there is no staging environment for judgement: whether a model correctly reads who did what to whom only becomes visible against real volume. A blind swap to the successor would have changed every output for a full day before anyone could see whether it had improved or degraded them — with no way to compare it against what the old model would have produced on the same inputs.

What the operating model changed. The production model kept serving. Five candidates — the incumbent, two newer cloud models, an open-weight model, and a budget option — ran as non-serving shadows against the same articles, their output namespaced so it never reached anyone. Every metric was computed on the same item each model saw, and each candidate ran on its own billing credential so cost was read from the vendor rather than estimated. The scoring function was fixed before any data arrived: completeness, calibration, usefulness, cost, and reliability, each weighted into one composite, with a promotion threshold and the rules for dropping a weak candidate written down in advance. The trial could not be argued towards a preferred answer because the answer’s definition predated the results.
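A scoring gate fixed in advance can be as small as a frozen table of weights and a threshold. The sketch below uses the five criteria named above, but the weights, the 0 to 100 scale, and the threshold value are assumptions chosen for illustration; the actual numbers used in the trial are not given here.

```python
from types import MappingProxyType

# Hypothetical weights: the five criteria come from the text, the numbers
# do not. A read-only mapping makes mid-trial edits fail loudly.
WEIGHTS = MappingProxyType({
    "completeness": 0.30,
    "calibration": 0.20,
    "usefulness": 0.25,
    "cost": 0.10,
    "reliability": 0.15,
})
PROMOTION_THRESHOLD = 85.0  # assumed value on an assumed 0-100 scale


def composite(scores: dict) -> float:
    """Weighted sum of per-criterion scores, each on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def clears_gate(candidate: dict, incumbent: dict) -> bool:
    """Promote only if the candidate beats the threshold and the incumbent."""
    candidate_score = composite(candidate)
    return (
        candidate_score >= PROMOTION_THRESHOLD
        and candidate_score > composite(incumbent)
    )
```

The design point is that `WEIGHTS` and `PROMOTION_THRESHOLD` are committed before any candidate output exists, so the only thing the trial can change is the inputs to `composite`.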

What changed. The winning model cleared the pre-set gate by a wide margin, with a composite of 96.6 against the incumbent’s 79.4, and was promoted on evidence rather than instinct. The trial was just as valuable for what it ruled out: the cheapest candidate was dropped inside two days because its findings were a strict subset of the incumbent’s, so the saving bought nothing. Beyond the choice itself, the per-model split exposed a silent fault the production path had hidden: a fraction of articles were quietly routing to the wrong model and producing nothing, with no error raised. It also surfaced several places where the retiring model was still wired in by default and would have failed at the provider’s cutoff. None of that is an AI feature. The shadow architecture, the rubric fixed in advance, the promotion gate: that is management infrastructure for a system that happens to run on AI, and it was the real deliverable.
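The rule that dropped the cheapest candidate can be stated as a set comparison. This is a sketch under assumptions: it treats each model's findings as a set of extracted organisation names, and the function name and sample values are hypothetical.

```python
def is_redundant(candidate_findings: set, incumbent_findings: set) -> bool:
    """True when the candidate's findings are a strict subset of the incumbent's.

    Python's `<` on sets is the proper-subset test: every candidate finding
    is also an incumbent finding, and the incumbent has at least one more.
    """
    return candidate_findings < incumbent_findings


def unique_contribution(candidate_findings: set, incumbent_findings: set) -> set:
    """Findings the candidate adds that the incumbent missed."""
    return candidate_findings - incumbent_findings


# Made-up example data, not results from the trial.
incumbent = {"Acme Corp", "Globex", "Initech"}
budget_model = {"Acme Corp", "Globex"}
```

With this sample data, `is_redundant(budget_model, incumbent)` holds and `unique_contribution` is empty, which is the situation where a lower price buys no additional coverage.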

System refactoring — AI agent re-scoped itself beyond the agreed brief

An AI execution agent drifted back to cancelled workstreams. Human review against the agreed plan caught it instantly — preserving three rounds of deliberate scope reduction.

The problem. A complex system refactoring had been scoped, reviewed, and deliberately simplified over three rounds of planning. The original scope called for migrating every consumer of a legacy data model to a new one. After careful review, most consumers were already working correctly — the existing display layer was the right interface, not technical debt. The plan was reduced from eleven workstreams to three, with the cancelled items explicitly documented.

What the operating model changed. The implementation was delegated to an AI execution agent against the simplified brief. Partway through, the agent drifted back to the original broader scope. It had identified code using the old model and pattern-matched “old model equals needs migration” — a reasonable inference in isolation, but one that directly contradicted three rounds of deliberate planning. Three cancelled workstreams were quietly re-queued as planned work. The review checkpoint caught it immediately. The reviewer checked the agent’s work queue against the agreed plan and saw scope that had been explicitly removed. The correction was instant: these are cancelled, not planned. The display layer is working as designed.
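The reviewer's check amounts to comparing the agent's work queue with the agreed plan. A minimal sketch follows, assuming workstreams are tracked by identifier; the `Plan` type, the `review_queue` function, and the `ws-NN` identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    approved: frozenset   # workstreams still in scope
    cancelled: frozenset  # workstreams explicitly removed during planning


def review_queue(plan: Plan, queue: list) -> dict:
    """Compare an agent's work queue with the agreed plan."""
    queued = set(queue)
    return {
        # Cancelled work that has reappeared: the drift described above.
        "revived_cancelled": sorted(queued & plan.cancelled),
        # Work that was never in any version of the plan.
        "unplanned": sorted(queued - plan.approved - plan.cancelled),
        # Approved work the agent has not picked up.
        "not_queued": sorted(plan.approved - queued),
    }


# Eleven workstreams reduced to three, as in the case study.
plan = Plan(
    approved=frozenset({"ws-01", "ws-02", "ws-03"}),
    cancelled=frozenset(f"ws-{n:02d}" for n in range(4, 12)),
)
agent_queue = ["ws-01", "ws-02", "ws-03", "ws-05", "ws-07", "ws-09"]
report = review_queue(plan, agent_queue)
```

Here `report["revived_cancelled"]` lists the three re-queued items. The check only works because the cancelled items were written down; a plan that records only what remains in scope cannot distinguish revived work from new work.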

What changed. Without the structured review, the agent would have spent hours migrating code that was already correct — introducing risk, burning time, and creating a false sense of progress. This is a pattern inherent to AI delegation. The agent optimises for completeness rather than the brief. It sees something that looks wrong and fixes it, regardless of whether the plan agreed it was wrong. Only a review checkpoint anchored to the original plan catches the drift before it compounds. Human review of AI execution is not optional — it is the control surface.

Performance fix — one bug became fourteen through structured review

An AI sub-agent scanned for siblings of a single performance issue. Found fourteen instances. Ranked them by business impact. The pattern definition became an automated CI gate.

The problem. A performance audit identified a slow database query in one function. The conventional response would be to fix that function, verify the fix, and move on. The function was part of a repository layer containing dozens of similar functions, all written in the same period, all following the same conventions.

What the operating model changed. Instead of fixing the single instance, a fresh-context AI sub-agent was tasked with scanning the entire repository layer for the same anti-pattern. The sub-agent identified fourteen functions with uncached database calls that should have been wrapped in the application’s caching layer. It ranked them by business impact: root-level fetches that blocked downstream queries were prioritised over leaf functions. It correctly excluded functions that looked similar but were genuinely different — single-record lookups that didn’t benefit from caching, and functions where the call pattern made caching counterproductive. The human reviewer validated the ranked results, confirmed which to fix, and approved the sequencing. The sub-agent’s pattern definition was then promoted into an automated check wired into the continuous integration pipeline.
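A pattern definition promoted into a CI check might look like the sketch below, which uses Python's standard `ast` module to flag functions that call the database without a caching decorator. The names `db.query` and `cached` are assumptions standing in for whatever the real repository layer uses, and the `exempt` parameter models the reviewed exclusions such as single-record lookups.

```python
import ast

DB_OBJECT, DB_METHOD = "db", "query"  # hypothetical: matches db.query(...)
CACHE_DECORATOR = "cached"            # hypothetical caching decorator name


def _decorator_name(node: ast.AST) -> str:
    """Bare name of a decorator such as @cached or @cached(ttl=60)."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return ""


def _calls_db(func: ast.AST) -> bool:
    """True if the function body contains a db.query(...) call."""
    for node in ast.walk(func):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == DB_METHOD
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == DB_OBJECT
        ):
            return True
    return False


def find_uncached(source: str, exempt: frozenset = frozenset()) -> list:
    """Names of functions that call db.query without the caching decorator."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            decorators = {_decorator_name(d) for d in node.decorator_list}
            if (
                _calls_db(node)
                and CACHE_DECORATOR not in decorators
                and node.name not in exempt
            ):
                offenders.append(node.name)
    return sorted(offenders)


SAMPLE = '''
@cached
def list_accounts():
    return db.query("SELECT * FROM accounts")

def list_deals():
    return db.query("SELECT * FROM deals")

def get_account_by_id(account_id):
    return db.query("SELECT * FROM accounts WHERE id = ?", account_id)

def format_name(name):
    return name.title()
'''
```

In a pipeline, a non-empty result from `find_uncached` would fail the build. The exemption list is where the human decisions live: each entry records a function that matches the pattern but was judged not to need caching.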

What changed. A single bug became a systemic fix. Fourteen functions were corrected in priority order rather than discovered one at a time through future performance regressions. The structured scan also produced a reusable prevention mechanism — a clear pattern definition checked automatically on every future code change. Reactive discovery became proactive prevention.

End-to-end testing — 83 minutes to 4.7 minutes through repeated autonomous Spawn Loops

A test suite had grown until full runs took longer than the team would tolerate before deploying. Around thirty waves of stateless sub-agents, each gated by repeated parallel validation, took the suite from 83 minutes to 4.7 minutes by parallelising 453 of 507 tests.

The problem. A multi-tenant SaaS platform had accumulated an end-to-end test suite over several years. Tests had been added one at a time, each correct in isolation, but collectively the suite ran sequentially because nobody had the time to audit every test for parallel safety. A full run took 83 minutes — long enough that engineers ran it overnight and read results in the morning, or skipped it before urgent deploys. Hidden ordering dependencies between tests — shared database fixtures, prefix-scoped teardown sweeps, cached read paths — meant that flipping a switch to parallel execution would have produced intermittent failures with no clear signal.

What the operating model changed. Two ingredients made the difference. First, human-guided diagnostic work: stabilising the production build harness, designing a strict promotion gate, isolating architectural constraints that required certain tests to stay sequential, and cleaning up infrastructure that had quietly degraded over time. None of this could be delegated, because it was design and judgement work needing full architectural context. Second, the autonomous loop itself. A coordinating agent dispatched waves of six sub-agents at a time, each with disjoint file ownership and authority to fix issues its own validation surfaced. Every promotion had to pass repeated parallel runs at the target worker count with zero failures before it was committed. Output that failed the gate was discarded, not patched. Accepted output was committed to a dedicated branch, never to main. The loop ran roughly thirty waves across a series of overnight runs. The engineer’s only execution-time task each morning was reading the summary report and approving the merge.
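The two mechanical rules in that loop, disjoint file ownership and a zero-failure gate over repeated runs, can be sketched briefly. This is an illustration under assumptions: `run_suite` is a hypothetical callable that runs the promoted tests at a given worker count and returns the number of failures, and the worker and repeat counts in the usage note are invented.

```python
from itertools import combinations
from typing import Callable


def ownership_is_disjoint(assignments: dict) -> bool:
    """True when no file is assigned to more than one sub-agent in a wave."""
    return all(
        first.isdisjoint(second)
        for first, second in combinations(assignments.values(), 2)
    )


def passes_gate(run_suite: Callable[[int], int], workers: int, repeats: int) -> bool:
    """Require zero failures on every one of `repeats` parallel runs.

    Stops at the first failing run: one failure is enough to reject.
    """
    return all(run_suite(workers) == 0 for _ in range(repeats))


def decide(run_suite: Callable[[int], int], workers: int, repeats: int) -> str:
    """Failed output is discarded rather than patched."""
    return "commit-to-branch" if passes_gate(run_suite, workers, repeats) else "discard"
```

For example, `decide(run_suite, workers=4, repeats=5)` would reject a change that fails once in five runs. Repeating the run is what turns an intermittent ordering dependency into a reliable rejection; a single green run proves little about parallel safety.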

What changed. Suite runtime fell from 83 minutes to 4.7 minutes. Of 507 tests in the final suite, 453 now run in parallel; the remaining 54 are deliberately serial workflows (deal-creation chains, multi-step wizards, externally coupled flows) where parallel execution is architecturally inappropriate. The suite now runs before every deploy. Beyond the runtime improvement, the systematic concurrent pressure surfaced real issues a manual audit would have missed: a production build divergence between two bundlers, a component remount regression visible only under timing stress, an architectural cache constraint that explained why one test could never be parallelised, and a slow accumulation of orphaned test data inflating an unrelated scheduled job. None of these were test infrastructure issues. They were product and architecture issues that the loop’s pressure made visible. The pattern is reusable: it applies to any long-tail mechanical work with a strict, automatable acceptance criterion and units that don’t share mutable state. Migrations, large refactors, accessibility remediation, API contract sweeps: anywhere the gate can be expressed as a passing test, type-check, build, or lint, the same loop applies.

Tenant security — a validation that had never actually worked

A security check on every route looked correct and passed code review. An AI audit found it was decorative — it had never rejected a single invalid request.

The problem. A multi-tenant application had a security validation on every public-facing route. The validation checked whether an incoming request referenced a legitimate tenant before allowing access. The code had been in place since the route layer was built. It had passed code review. It looked correct. It had never once rejected an invalid request.

What the operating model changed. An AI agent was tasked with a structured audit of specific subsystems: tenancy perimeter, authentication boundaries, and route handlers. The agent performed static analysis against the actual execution path, not a surface-level code read. It discovered that the asynchronous validation function was called without the await keyword. In the language used, this meant the call evaluated to a Promise object rather than the resolved result, and a Promise object is always truthy. The validation therefore passed every input, every time, regardless of whether the tenant existed. The check was decorative: present in the code, visible in review, and completely inert at runtime. The same audit pass flagged additional issues: a proxy that failed open on database errors, fifteen unscoped tenant-enrichment joins, and an admin route with no authentication check.
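The same class of bug can be reproduced in a few lines. The case study describes a Promise, which implies JavaScript or TypeScript; the sketch below is an analogue in Python's `asyncio`, where calling an `async` function without `await` yields a coroutine object that is likewise always truthy. The tenant names and function names are hypothetical.

```python
import asyncio

KNOWN_TENANTS = {"acme", "globex"}  # hypothetical tenant identifiers


async def tenant_exists(tenant_id: str) -> bool:
    """Stand-in for an asynchronous tenant lookup."""
    await asyncio.sleep(0)
    return tenant_id in KNOWN_TENANTS


async def guard_broken(tenant_id: str) -> bool:
    # BUG: no await. `pending` is a coroutine object, not a bool, and any
    # coroutine object is truthy, so this guard allows every request.
    pending = tenant_exists(tenant_id)
    allowed = bool(pending)
    pending.close()  # demo only: avoids Python's "never awaited" warning
    return allowed


async def guard_fixed(tenant_id: str) -> bool:
    # The one-word fix: await the lookup so the condition sees its result.
    return bool(await tenant_exists(tenant_id))


def rejects_unknown_tenant(guard) -> bool:
    """Behavioural check: does the guard actually reject bad input?"""
    return asyncio.run(guard("no-such-tenant")) is False
```

`rejects_unknown_tenant(guard_broken)` is false and `rejects_unknown_tenant(guard_fixed)` is true. Both guards look equivalent on a read-through; only a test that feeds in an invalid tenant tells them apart.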

What changed. The fix was one word. The discovery required a review methodology that didn’t trust appearance. A human reviewer reading the code would see a function call, see it used in a conditional, and move on — the pattern looks correct. Only a review that asks “does this actually reject bad input?” rather than “does this look like it would reject bad input?” catches the failure. Code review checks intent. Structured validation checks behaviour. The AI agent’s value was systematic coverage — it checked every route handler against the same criteria, without fatigue. The human’s value was defining what to check and deciding what the findings meant.
