I kept seeing “zweites Kind im Mai” (“second child in May”) pop up in code reviews. My family context was leaking into technical output. The problem wasn’t my prompts—it was context pollution: too much irrelevant information competing for attention.

LLMs have an attention budget. Every token competes for it. Load 50,000 tokens of “maybe useful” context, and the signal drowns in noise. The fix isn’t better prompts. It’s fewer, better-selected tokens.

Insight
Anthropic’s engineering team puts it directly: “Identify the smallest possible collection of high-signal tokens that maximize the probability of achieving desired outcomes.”
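
A minimal sketch of what “fewer, better-selected tokens” can look like in code. This is a hypothetical illustration: the signal scores and the crude token estimate are placeholders, not Anthropic’s method.

# Hypothetical selector: keep only the highest-signal snippets under a token budget
def select_context(snippets: list[tuple[float, str]], budget: int) -> list[str]:
    """snippets = (signal_score, text) pairs; spend the budget on high signal first."""
    chosen, used = [], 0
    for score, text in sorted(snippets, reverse=True):
        cost = len(text.split())  # crude token estimate; real tokenizers differ
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen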

What Context Engineering Actually Is

Prompt engineering: Finding the right words. Context engineering: Architecting what information the model receives, when, and how.

Most people obsess over prompt phrasing. That’s 10% of the story. The other 90% is the context window—what’s loaded, what’s not, and how it’s structured.

The Stiefel Problem

In German, we say someone “macht seinen Stiefel” when they just do their thing without adapting. Autopilot. Default mode.

Without strong direction, AI does its Stiefel. Ask for “a landing page” and get the same hero section every AI produces. Ask for “marketing copy” and get the same buzzword soup. The training data wins. You get the average.

The fix isn’t more instructions. Challenge the default:

  • “What makes this different from every other landing page you’ve written?”
  • “What would someone who disagrees say?”
  • “Is this still true in 2025, or just repeated wisdom?”

The Constraint Trap

Here’s what surprised me: too much constraint is also bad.

I used to write prompts like “CRITICAL: You MUST follow these 47 rules exactly.” The LLM tried desperately to satisfy everything—and outputs got worse, not better.

Anthropic has seen the same pattern when reviewing customer prompts: poor performance, and behind it, walls of aggressive instructions. The fix? “Dial back aggressive language.” Claude 4.x doesn’t need CRITICAL: YOU MUST. Normal language works.

LLMs follow everything you give them. Including contradictions. Including noise. They’ll try to satisfy constraints that conflict, producing weird compromises instead of good output. More rules don’t make better output—they verschlimmbessern it (German for making something worse by trying to improve it).

Dirty Input Is a Feature

Here’s what most people get wrong: perfecting your input is a waste of time.

I dictate. Messy. Typos. Incomplete sentences. The AI understands intent. I correct outputs, not inputs. That’s faster.

Most people polish prompts for 10 minutes, get output, throw it away, polish again. I stream consciousness, let AI interpret, course-correct in real-time. The difference: I’m not trying to get it right the first time. I’m iterating faster.

Frustration = Architectural Smell

When you’re frustrated with AI output, that’s not “AI is dumb.” That’s a signal.

Your frustration means your context architecture is wrong:

  • Wrong information loaded
  • Too much information (attention budget exhausted)
  • Conflicting constraints (verschlimmbessern)
  • No challenge points (Stiefel mode)

Debug your context, not your prompt.

Be a Conductor, Not an Author

Author Mode → Conductor Mode

  • Perfect the prompt → Stream intent, correct output
  • Wait for completion, then review → Interrupt and redirect in real-time
  • Explain rules → Show examples
  • Sequential tasks → Parallel threads
  • AI writes for me → AI extends my thinking

The conductor doesn’t play every instrument. They give direction, listen, adjust. The orchestra plays. That’s how I work with AI.


What Actually Works: Patterns from Daily Use

These patterns emerged from running Praxis—a context engineering system I use daily for development, meeting analysis, research, and writing.

1. Context as Map, Not Library

My CLAUDE.md used to be 38,000 characters. Everything loaded upfront. “Just in case.”

Now it’s ~20,000 characters (still working on it). Core identity + a knowledge map showing WHERE information lives. Context gets loaded on-demand when actually needed.

CLAUDE.md = Map (always loaded)
├── Who I am (100 lines max)
├── Where things are (folder structure)
└── How to load more (when relevant)

Context Files = Library (loaded on-demand)
├── team-dynamics.md (only for meetings)
├── code-patterns.md (only for dev)
└── domain-knowledge.md (only when relevant)
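
The tree above maps cleanly onto a tiny loader. A sketch, assuming the file names from the tree; the build_context helper is hypothetical, not the actual withPraxis internals.

from pathlib import Path

# Knowledge map: task type -> library file (mirrors the tree above)
CONTEXT_MAP = {
    "meeting": "context/team-dynamics.md",
    "dev": "context/code-patterns.md",
    "research": "context/domain-knowledge.md",
}

def build_context(task: str, core_identity: str) -> str:
    """Always include the small core; load library files only when relevant."""
    parts = [core_identity]  # the ~100-line map, always loaded
    if task in CONTEXT_MAP:
        parts.append(Path(CONTEXT_MAP[task]).read_text())  # on-demand load
    return "\n\n".join(parts)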

2. Challenge Points (Everywhere)

Not just for research—for everything. Challenge points break the Stiefel:

  • Temporal: “What might have changed recently? Is training data outdated here?”
  • Common Wisdom: “Is this just repeated knowledge that was never verified? Like ‘searing meat seals in the juices’—sounds true, widely believed, completely wrong.”
  • Gegenpositionen (counter-positions): “What’s the alternative view? Who disagrees?”
  • Blind Review: Agent reviews code without seeing why decisions were made. Fresh eyes.
  • Blind Testing: Agent writes tests from spec, never sees implementation. Can’t confirm bugs as features.

Without challenge, AI takes the path of least resistance.
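
In code, challenge points are just standing questions appended to the task. A minimal sketch (the CHALLENGES dict and with_challenges helper are hypothetical illustrations, not the Praxis implementation):

# Hypothetical challenge injection: force the model off the default path
CHALLENGES = {
    "temporal": "What might have changed recently? Is training data outdated here?",
    "common_wisdom": "Is this repeated knowledge that was never verified?",
    "counter": "What's the alternative view? Who disagrees?",
}

def with_challenges(prompt: str, kinds: list[str]) -> str:
    """Append selected challenge questions so the answer must address them."""
    lines = [prompt, "", "Before answering, address:"]
    lines += [f"- {CHALLENGES[k]}" for k in kinds]
    return "\n".join(lines)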

3. Phased Execution with Validation Gates

Complex tasks broken into phases, with explicit checkpoints:

Phase 0: Context → Validation Gate
Phase 1: Implementation → Validation Gate
Phase 2: Testing (implementation-blind) → Validation Gate
Phase 3: Review (context-blind) → Validation Gate
Phase 4: Finalize

Why blind testing? The testing agent doesn’t see implementation code. It writes tests from the spec alone. This prevents confirmation bias—it can’t accidentally verify implementation quirks as “correct.”

Why blind review? The review agent doesn’t see the chat context about why certain decisions were made. It reviews the code cold. Fresh eyes catch what polluted-context eyes miss.
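
The phase list above reduces to a simple loop: run a phase, then refuse to continue until its gate passes. A sketch under the assumption that phases and gates are plain callables (hypothetical, not the withPraxis internals):

from typing import Callable

# (name, run, gate): run produces output, gate validates it
Phase = tuple[str, Callable[[], str], Callable[[str], bool]]

def run_phases(phases: list[Phase]) -> None:
    for name, run, gate in phases:
        output = run()
        if not gate(output):  # validation gate: hard stop, no silent continue
            raise RuntimeError(f"Validation gate failed after {name}")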

4. Feedback Loops with Measurement

After sessions, I run retrospectives that track:

  • Which commands worked well / poorly
  • What patterns emerged (positive and negative)
  • Actual quotes as evidence
  • Mood and energy signals

This data feeds back into prompt iteration. Not “I think this is better”—“the satisfaction score dropped when I added X constraint.”
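
A sketch of what one retro record might look like; the field names are my hypothetical mapping of the bullets above, not the actual schema.

from dataclasses import dataclass, field

@dataclass
class RetroEntry:
    command: str                                         # which command ran
    worked_well: bool                                    # well or poorly
    patterns: list[str] = field(default_factory=list)    # positive and negative patterns
    quotes: list[str] = field(default_factory=list)      # actual quotes as evidence
    satisfaction: int = 3                                 # mood/energy signal, 1-5

Averaging satisfaction per command across sessions is what turns “I think this is better” into “the satisfaction score dropped when I added X constraint.”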

5. Conversation Anchors

Long sessions have attention limits. Earlier context competes with recent context. But distinctive phrases cut through:

Me: “If you change things we’re not discussing, ich ziehe dir die Ohren lang.” (“I’ll pull your ears.”)

(An hour later, AI starts drifting again)

Me: “Remember your ears.”

Claude: Und ich passe auf meine Ohren auf 🙉 (“And I’m watching out for my ears”)

Two words activated the full context. The phrase was unique enough to become an instant anchor—no re-explanation needed.

Why this works:

  • Emotional/absurd language sticks better than polite requests
  • The weirder, the more unique in the attention space
  • Short recall (“your ears”) activates the full context

This is the Fred Flintstone Method applied to conversations. Unique beats common—whether you’re searching documents or recalling context.

6. Scripts for Reliability

Where I need deterministic behavior, I don’t trust prompts alone. Scripts constrain the output space:

# Example: Emotion types in retro command
VALID_EMOTIONS = ['frustrated', 'stressed', 'tired', 'neutral',
                  'focused', 'accomplished', 'flow', 'euphoric']

def validate_emotion(emotion: str) -> str:
    # LLM can't invent new categories - only select from valid ones
    if emotion not in VALID_EMOTIONS:
        raise ValueError(f"Invalid emotion: {emotion!r}")
    return emotion

With deterministic settings (temperature zero), an LLM behaves almost like a side-effect-free function: same context + same parameters ≈ same result. Control the inputs, you control the outputs.

The Real Implementation: withPraxis

Everything here is packaged in withPraxis—an open-source context engineering framework for Claude Code:

  • Layered Context: Identity → Knowledge Map → On-Demand Loading
  • Slash Commands: /research, /dev:implement, /meeting:analyze—each with built-in validation gates
  • Pattern Recognition: Detects unsustainable work patterns and challenges them
  • Self-Maintaining: Retro commands, telemetry, iteration loops

It’s not a prompt template. It’s an operational system that knows your context and challenges you when needed.

Key Principles

  1. Less is more. Attention budget is real. High-signal tokens only.

  2. Normal language works. Claude 4.x doesn’t need aggressive prompting.

  3. Constraints can hurt. Too many rules = verschlimmbessern. Trust the model more.

  4. Challenge everything. Break the Stiefel. Temporal, common wisdom, gegenpositionen.

  5. Dirty input, clean output. Correct downstream, not upstream.

  6. Frustration is data. Debug your context, not your prompt.

  7. Conduct, don’t author. Direct the system, interrupt when wrong, iterate fast.
