The AI was perfect. For the first 46 files.

File 47: three localization keys missing. No EN, no DE. The AI got the pattern right, then drifted. And I’d never have checked file 47 manually.

That’s the verification gap. At scale, you can’t review every output. Specs describe what should happen—they don’t verify that it did. The gap that kills you isn’t missing specifications. It’s missing verification.

Why Tests Have New Value

My approach is pragmatic: try ideas, see what sticks, then pin the important stuff with tests. Not before coding—before scaling. Not 100% coverage—the things that matter at scale, the things AI overlooks when the work gets monotonous.

When I have a structural problem—the AI keeps making the same category of mistake—I write a test that catches that structure. Then I tell the agent: “Keep going until this test is green.”

Some tests stay forever. Some get deleted when the scaling task is done.

Insight
Tests for AI work like a linter—not just for code, but for behavior and content. Fast signal: you’re on track, or you’re not. The best “done” signal you can get.

The test doesn’t care how the AI solved it. It cares that the constraint is met. This is the Carbonara Rule fix: what’s implicit becomes probabilistic—tests make it explicit. Not more documentation. Executable constraints that pin your intent.

The Validation Stack

Five techniques, from simple to sophisticated. Use what fits your needs.

Level 1: Self-Check (Ask Again)

The simplest validation: ask the AI to review its own work.

Me: Generate a field mapping for this Excel to our data model.
AI: [produces mapping]
Me: Now check: Did you cover every field from the Excel?
    Did you name them correctly? Show me what you missed.
AI: Oops—missed 3 fields, typo in 2 names. Fixing...

After the second pass, most obvious errors are gone. If there’s still something wrong after two passes, you have a deeper problem.

When to use: Quick sanity checks, first-pass cleanup, obvious error categories.

Limitation: Same blind spots, same context pollution. If the AI misunderstood the task, it will confidently verify its misunderstanding.

Level 2: Fresh Context / Different Model

Break the echo chamber. New conversation, different model, or a subagent with isolated context.

Agent A: Generates code
Agent B: Reviews code (never saw the implementation discussion)

Agent B doesn’t know why certain shortcuts were taken. It just sees the code and asks: “This looks weird. Why is this here?”

Why different models help: Each model has different training distributions, different tendencies, different blind spots. GPT might miss what Claude catches. Gemini might spot what both missed.

When to use: Important outputs, cross-checking critical logic, breaking confirmation bias.

Level 3: Structural Tests (Linter-Style)

Tests that look at the output structure, not the execution.

# Does the generated code use only allowed color codes?
ALLOWED_COLORS = ['#1E293B', '#DC2626', '#64748B']
for color in extract_colors(generated_code):
    assert color in ALLOWED_COLORS, f"Invalid color: {color}"

This isn’t a unit test that runs the code. It’s inspection. You’re asking: “Does this output have the properties I require?”

Examples:

  • All required fields present in JSON
  • No TODO comments in production code
  • File follows naming convention
  • Schema validation passes

When to use: Format consistency, constraint enforcement, catching drift in scaled operations.
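Several of these checks fit in one small script. Here is a sketch of a structural gate for generated JSON files; the required field set, naming convention, and TODO rule are assumed examples, not a real schema:

```python
import json
import re

REQUIRED_FIELDS = {"id", "title", "locale"}        # assumed schema
NAME_PATTERN = re.compile(r"^[a-z0-9_]+\.json$")   # assumed convention

def check_output(filename: str, raw: str) -> list[str]:
    """Return a list of structural violations; an empty list means pass."""
    problems = []
    if not NAME_PATTERN.match(filename):
        problems.append(f"bad filename: {filename}")
    try:
        data = json.loads(raw)
    except ValueError:
        return problems + ["not valid JSON"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "TODO" in raw:
        problems.append("TODO left in output")
    return problems
```

The output never runs; it only gets inspected. Anything in the returned list blocks the pipeline or goes back to the agent as feedback.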

Level 4: TDD for Structural Problems

When you identify a recurring problem, codify it as a test.

Real example from my customs work: The AI kept using implicit any types in TypeScript. Every time I caught it, I corrected it. Every time I wasn’t looking, it happened again.

So I wrote a test:

// No implicit any in visualization components
const implicitAnys = findImplicitAnys(generatedFiles);
expect(implicitAnys).toHaveLength(0);

Now the AI can’t proceed until types are explicit. The test catches the pattern I already identified.
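For TypeScript, `tsc` with `noImplicitAny` is the authoritative check, and `findImplicitAnys` stands in for it. To show the shape of such a test, here is a rough regex heuristic in Python that flags unannotated function parameters — a sketch, not a type checker:

```python
import re

# Heuristic only: flags function parameters with no type annotation.
# The real gate is `tsc --noImplicitAny`; this just shows the test's shape.
FUNC_PARAMS = re.compile(r"function\s+\w+\(([^)]*)\)")

def find_untyped_params(source: str) -> list[str]:
    hits = []
    for match in FUNC_PARAMS.finditer(source):
        params = (p.strip() for p in match.group(1).split(","))
        hits += [p for p in params if p and ":" not in p]
    return hits

# The agent loops until this passes:
# assert find_untyped_params(generated_source) == []
```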

Insight
Traditional unit tests verify that implementation matches specification. Tests for AI verify that output matches constraints. The AI’s implementation is a black box—you only care about the properties of the output.

When to use: Recurring problems, scaled generation (100+ similar outputs), automated pipelines.

Level 5: Hooks and Automatic Guardrails

The holy grail: tests that run automatically and inject their results into the AI’s next prompt.

Claude Code has hooks—triggers that fire after tool use, after each turn, or at specific checkpoints:

// PostToolUse hook: After every file write
{
  "event": "PostToolUse",
  "tool": "Write",
  "command": "npm run lint -- $file"
}

If the linter fails, the AI sees the errors immediately. It fixes them before moving on. No human intervention.

The closed loop:

  1. AI produces output
  2. Hook runs validation
  3. Results injected into context
  4. AI responds to issues
  5. Loop until clean

When to use: Fully autonomous pipelines, complex workflows, operations that must self-correct.

The Evaluator-Optimizer Pattern

For complex validation, use two agents:

Generator → Output → Evaluator → Feedback → Generator → ...

The evaluator can be:

  • Deterministic (scripts, schemas, rules)
  • LLM-based (another model judging quality)
  • Hybrid (rules + LLM for edge cases)

Loop until success or max retries. This pattern handles tasks where “correct” isn’t binary—where quality is a spectrum.

Real implementation:

for attempt in range(MAX_RETRIES):
    output = generator.produce(context)
    score, feedback = evaluator.assess(output)

    if score >= THRESHOLD:
        return output

    # Generator tries again with feedback
    context.add_feedback(feedback)

raise RuntimeError(f"No passing output after {MAX_RETRIES} attempts")
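Filled out with stand-ins, the loop runs end to end. The evaluator below is deterministic (required keys) and the generator is a stub that improves once it sees feedback; both are hypothetical, chosen only to make the pattern executable:

```python
MAX_RETRIES = 3
THRESHOLD = 1.0

def evaluate(output: dict) -> tuple[float, str]:
    """Deterministic evaluator: score = fraction of required keys present."""
    required = {"title", "body", "locale"}
    present = required & output.keys()
    score = len(present) / len(required)
    feedback = "ok" if score >= 1 else f"missing: {sorted(required - present)}"
    return score, feedback

def generate(feedback_log: list[str]) -> dict:
    """Stub generator: adds one missing key per round, like a cooperative model."""
    out = {"title": "Duty rates 2026"}
    if len(feedback_log) >= 1:
        out["body"] = "..."
    if len(feedback_log) >= 2:
        out["locale"] = "de"
    return out

def run_loop() -> dict:
    feedback_log: list[str] = []
    for attempt in range(MAX_RETRIES):
        output = generate(feedback_log)
        score, feedback = evaluate(output)
        if score >= THRESHOLD:
            return output
        feedback_log.append(feedback)   # next attempt sees this feedback
    raise RuntimeError(f"no passing output after {MAX_RETRIES} attempts")
```

Swap the stubs for a real model call and a real scorer and the control flow stays identical.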

Tools That Exist (February 2026)

  • TDD Guard — hook-based TDD enforcement for Claude Code. Use case: enforcing test-first for AI coding.
  • DeepEval — 60+ metrics, pytest integration, LLM-as-judge. Use case: comprehensive AI output evaluation.
  • Promptfoo — matrix testing, red teaming. Use case: prompt robustness, edge-case discovery.
  • Ragas — RAG-specific evaluation. Use case: faithfulness, relevance, context precision.
  • Claude Code Hooks — pre/post tool triggers. Use case: automatic validation on file operations.

Brittle Tests Are Fine

Here’s something that surprised me: tests that sometimes fail are useful.

In traditional development, a flaky test is a bug. You fix it or delete it.

But with AI, I write tests that are more like warnings:

  • “This pattern looks like an anti-pattern—might be intentional, might not”
  • “This phrase appears in security-scanner word lists—probably fine, but check”
  • “This file is unusually large—could be correct, could be scope creep”

These aren’t blocking. They surface potential issues for human review. The AI can see them too and respond: “Yes, I’m aware this looks unusual. Here’s why it’s intentional…”

The pattern: Hard gates for must-not-violate constraints. Soft warnings for probably-should-check situations.
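In code, the split looks like this: hard gates raise and stop the pipeline, soft warnings are merely collected. The specific rules below (a credential check, a size limit, an `eval()` heuristic) are illustrative assumptions:

```python
def validate(output: str) -> list[str]:
    """Hard gates raise; soft warnings are returned for human (or AI) review."""
    warnings = []

    # Hard gate: a must-not-violate constraint
    if "password" in output.lower():
        raise AssertionError("hard gate: credential-like content in output")

    # Soft warnings: probably-should-check situations, never blocking
    if len(output) > 10_000:
        warnings.append("unusually large output -- could be scope creep")
    if "eval(" in output:
        warnings.append("eval() looks like an anti-pattern -- intentional?")

    return warnings
```

The agent can read the returned warnings and either fix the issue or justify it; only the hard gate forces a stop.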

Putting It Together

A realistic validation stack for a content generation pipeline:

Phase 1: Generate
  • Level 1: Self-check ("Review what you wrote")
  • Level 3: Structural tests (JSON schema, word count, required sections)
  • Gate: Pass? Continue. Fail? Fix and retry.

Phase 2: Review
  • Level 2: Fresh-context agent reviews
  • Level 4: Specific constraint tests (no forbidden phrases, link validity)
  • Gate: Pass? Continue. Fail? Escalate to human.

Phase 3: Finalize
  • Level 5: Hooks check format and publish readiness
  • Output

Each level catches different failure modes. Layered defense.

The Mindset Shift

Traditional testing: Verify that code behaves as written. AI validation: Verify that output has properties we require.

You’re not testing the AI’s implementation—that’s a black box. You’re testing the interface between AI output and your requirements.

Insight
The question isn’t “did the AI write correct code?” It’s “does the output meet our constraints?” That’s a fundamentally different question with fundamentally different tests.


Sources

  • TDD Guard — Hook-based TDD for Claude Code
  • DeepEval — LLM evaluation framework
  • Anthropic Engineering — Context engineering patterns
  • Real experience: 120+ customs visualizations with validation gates, 2025-2026
