The AI was perfect. For the first 46 files.

File 47: three localization keys missing. No EN, no DE. The AI got the pattern right, then drifted. And I’d never have checked file 47 manually.

That’s the verification gap. At scale, you can’t review every output. Specs describe what should happen—they don’t verify that it did. The gap that kills you isn’t missing specifications. It’s missing verification.

Why Tests Have New Value

My approach is pragmatic: try ideas, see what sticks, then pin the important stuff with tests. Not before coding—before scaling. Not 100% coverage—the things that matter at scale, the things AI overlooks when the work gets monotonous.

When I have a structural problem—the AI keeps making the same category of mistake—I write a test that catches that structure. Then I tell the agent: “Keep going until this test is green.”

Some tests stay forever. Some get deleted when the scaling task is done.

Insight
Tests for AI work like a linter—not just for code, but for behavior and content. Fast signal: you’re on track, or you’re not. The best “done” signal you can get.

The test doesn’t care how the AI solved it. It cares that the constraint is met. This is the Carbonara Rule fix: what’s implicit becomes probabilistic—tests make it explicit. Not more documentation. Executable constraints that pin your intent.

The Validation Stack

Five techniques, from simple to sophisticated. Use what fits your needs.

Level 1: Self-Check (Ask Again)

The simplest validation: ask the AI to review its own work.

Me: Generate a field mapping for this Excel to our data model.
AI: [produces mapping]
Me: Now check: Did you cover every field from the Excel?
    Did you name them correctly? Show me what you missed.
AI: Oops—missed 3 fields, typo in 2 names. Fixing...

After the second pass, most obvious errors are gone. If there’s still something wrong after two passes, you have a deeper problem.

When to use: Quick sanity checks, first-pass cleanup, obvious error categories.

Limitation: Same blind spots, same context pollution. If the AI misunderstood the task, it will confidently verify its misunderstanding.

Level 2: Fresh Context / Different Model

Break the echo chamber. New conversation, different model, or a subagent with isolated context.

Agent A: Generates code
Agent B: Reviews code (never saw the implementation discussion)

Agent B doesn’t know why certain shortcuts were taken. It just sees the code and asks: “This looks weird. Why is this here?”

Why different models help: Each model has different training distributions, different tendencies, different blind spots. GPT might miss what Claude catches. Gemini might spot what both missed.

When to use: Important outputs, cross-checking critical logic, breaking confirmation bias.

Level 3: Structural Tests (Linter-Style)

Tests that look at the output structure, not the execution.

# Does the generated code use only allowed color codes?
ALLOWED_COLORS = ['#1E293B', '#DC2626', '#64748B']
for color in extract_colors(generated_code):
    assert color in ALLOWED_COLORS, f"Invalid color: {color}"

This isn’t a unit test that runs the code. It’s inspection. You’re asking: “Does this output have the properties I require?”

Examples:

  • All required fields present in JSON
  • No TODO comments in production code
  • File follows naming convention
  • Schema validation passes

When to use: Format consistency, constraint enforcement, catching drift in scaled operations.
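Several of these checks fit in one small script. Here is a sketch of a structural gate for generated JSON files; the required field set, naming convention, and TODO rule are assumed examples, not a real schema:

```python
import json
import re

REQUIRED_FIELDS = {"id", "title", "locale"}        # assumed schema
NAME_PATTERN = re.compile(r"^[a-z0-9_]+\.json$")   # assumed convention

def check_output(filename: str, raw: str) -> list[str]:
    """Return a list of structural violations; an empty list means pass."""
    problems = []
    if not NAME_PATTERN.match(filename):
        problems.append(f"bad filename: {filename}")
    try:
        data = json.loads(raw)
    except ValueError:
        return problems + ["not valid JSON"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "TODO" in raw:
        problems.append("TODO left in output")
    return problems
```

The output never runs; it only gets inspected. Anything in the returned list blocks the pipeline or goes back to the agent as feedback.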

Level 4: TDD for Structural Problems

When you identify a recurring problem, codify it as a test.

Real example from my customs work: The AI kept using implicit any types in TypeScript. Every time I caught it, I corrected it. Every time I wasn’t looking, it happened again.

So I wrote a test:

// No implicit any in visualization components
const implicitAnys = findImplicitAnys(generatedFiles);
expect(implicitAnys).toHaveLength(0);

Now the AI can’t proceed until types are explicit. The test catches the pattern I already identified.
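For TypeScript, `tsc` with `noImplicitAny` is the authoritative check, and `findImplicitAnys` stands in for it. To show the shape of such a test, here is a rough regex heuristic in Python that flags unannotated function parameters — a sketch, not a type checker:

```python
import re

# Heuristic only: flags function parameters with no type annotation.
# The real gate is `tsc --noImplicitAny`; this just shows the test's shape.
FUNC_PARAMS = re.compile(r"function\s+\w+\(([^)]*)\)")

def find_untyped_params(source: str) -> list[str]:
    hits = []
    for match in FUNC_PARAMS.finditer(source):
        params = (p.strip() for p in match.group(1).split(","))
        hits += [p for p in params if p and ":" not in p]
    return hits

# The agent loops until this passes:
# assert find_untyped_params(generated_source) == []
```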

Insight
Traditional unit tests verify that implementation matches specification. Tests for AI verify that output matches constraints. The AI’s implementation is a black box—you only care about the properties of the output.

When to use: Recurring problems, scaled generation (100+ similar outputs), automated pipelines.

Level 5: Hooks and Automatic Guardrails

The holy grail: tests that run automatically and inject their results into the AI’s next prompt.

Claude Code has hooks—triggers that fire after tool use, after each turn, or at specific checkpoints:

// PostToolUse hook: After every file write
{
  "event": "PostToolUse",
  "tool": "Write",
  "command": "npm run lint -- $file"
}

If the linter fails, the AI sees the errors immediately. It fixes them before moving on. No human intervention.

The closed loop:

  1. AI produces output
  2. Hook runs validation
  3. Results injected into context
  4. AI responds to issues
  5. Loop until clean

When to use: Fully autonomous pipelines, complex workflows, operations that must self-correct.

The Evaluator-Optimizer Pattern

For complex validation, use two agents:

Generator → Output → Evaluator → Feedback → Generator → ...

The evaluator can be:

  • Deterministic (scripts, schemas, rules)
  • LLM-based (another model judging quality)
  • Hybrid (rules + LLM for edge cases)

Loop until success or max retries. This pattern handles tasks where “correct” isn’t binary—where quality is a spectrum.

Real implementation:

for attempt in range(MAX_RETRIES):
    output = generator.produce(context)
    score, feedback = evaluator.assess(output)

    if score >= THRESHOLD:
        return output

    # Generator tries again with feedback
    context.add_feedback(feedback)

raise RuntimeError(f"No passing output after {MAX_RETRIES} attempts")
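Filled out with stand-ins, the loop runs end to end. The evaluator below is deterministic (required keys) and the generator is a stub that improves once it sees feedback; both are hypothetical, chosen only to make the pattern executable:

```python
MAX_RETRIES = 3
THRESHOLD = 1.0

def evaluate(output: dict) -> tuple[float, str]:
    """Deterministic evaluator: score = fraction of required keys present."""
    required = {"title", "body", "locale"}
    present = required & output.keys()
    score = len(present) / len(required)
    feedback = "ok" if score >= 1 else f"missing: {sorted(required - present)}"
    return score, feedback

def generate(feedback_log: list[str]) -> dict:
    """Stub generator: adds one missing key per round, like a cooperative model."""
    out = {"title": "Duty rates 2026"}
    if len(feedback_log) >= 1:
        out["body"] = "..."
    if len(feedback_log) >= 2:
        out["locale"] = "de"
    return out

def run_loop() -> dict:
    feedback_log: list[str] = []
    for attempt in range(MAX_RETRIES):
        output = generate(feedback_log)
        score, feedback = evaluate(output)
        if score >= THRESHOLD:
            return output
        feedback_log.append(feedback)   # next attempt sees this feedback
    raise RuntimeError(f"no passing output after {MAX_RETRIES} attempts")
```

Swap the stubs for a real model call and a real scorer and the control flow stays identical.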

Tools That Exist (February 2026)

  • TDD Guard — hook-based TDD enforcement for Claude Code. Use case: enforcing test-first for AI coding.
  • DeepEval — 60+ metrics, pytest integration, LLM-as-judge. Use case: comprehensive AI output evaluation.
  • Promptfoo — matrix testing, red teaming. Use case: prompt robustness, edge-case discovery.
  • Ragas — RAG-specific evaluation. Use case: faithfulness, relevance, context precision.
  • Claude Code Hooks — pre/post tool triggers. Use case: automatic validation on file operations.

Brittle Tests Are Fine

Here’s something that surprised me: tests that sometimes fail are useful.

In traditional development, a flaky test is a bug. You fix it or delete it.

But with AI, I write tests that are more like warnings:

  • “This pattern looks like an anti-pattern—might be intentional, might not”
  • “This phrase appears in security-scanner word lists—probably fine, but check”
  • “This file is unusually large—could be correct, could be scope creep”

These aren’t blocking. They surface potential issues for human review. The AI can see them too and respond: “Yes, I’m aware this looks unusual. Here’s why it’s intentional…”

The pattern: Hard gates for must-not-violate constraints. Soft warnings for probably-should-check situations.
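In code, the split looks like this: hard gates raise and stop the pipeline, soft warnings are merely collected. The specific rules below (a credential check, a size limit, an `eval()` heuristic) are illustrative assumptions:

```python
def validate(output: str) -> list[str]:
    """Hard gates raise; soft warnings are returned for human (or AI) review."""
    warnings = []

    # Hard gate: a must-not-violate constraint
    if "password" in output.lower():
        raise AssertionError("hard gate: credential-like content in output")

    # Soft warnings: probably-should-check situations, never blocking
    if len(output) > 10_000:
        warnings.append("unusually large output -- could be scope creep")
    if "eval(" in output:
        warnings.append("eval() looks like an anti-pattern -- intentional?")

    return warnings
```

The agent can read the returned warnings and either fix the issue or justify it; only the hard gate forces a stop.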

Putting It Together

A realistic validation stack for a content generation pipeline:

Phase 1: Generate
  • Level 1: Self-check ("Review what you wrote")
  • Level 3: Structural tests (JSON schema, word count, required sections)
  • Gate: Pass? Continue. Fail? Fix and retry.

Phase 2: Review
  • Level 2: Fresh-context agent reviews
  • Level 4: Specific constraint tests (no forbidden phrases, link validity)
  • Gate: Pass? Continue. Fail? Escalate to human.

Phase 3: Finalize
  • Level 5: Hooks check format and publish readiness
  • Output

Each level catches different failure modes. Layered defense.

The Mindset Shift

Traditional testing: Verify that code behaves as written. AI validation: Verify that output has properties we require.

You’re not testing the AI’s implementation—that’s a black box. You’re testing the interface between AI output and your requirements.

Insight
The question isn’t “did the AI write correct code?” It’s “does the output meet our constraints?” That’s a fundamentally different question with fundamentally different tests.


Sources

  • TDD Guard — Hook-based TDD for Claude Code
  • DeepEval — LLM evaluation framework
  • Anthropic Engineering — Context engineering patterns
  • Real experience: 120+ customs visualizations with validation gates, 2025-2026
