More structure does not automatically create better agent systems. Sometimes it just creates slower ones.

I wrote about this a few weeks ago in “38 Pages Nobody Reads”. The pattern was already visible then: when AI systems disappoint, teams tend to react with more ceremony.

More specs. More phases. More orchestration. More meta-prompts. More process wrapped around the same underlying work.

The assumption is simple and deeply intuitive:

more structure = better results

That assumption keeps failing.

Someone actually tested it

Chase Levin recently ran a direct comparison in the video “GSD vs Superpowers vs Claude Code”.

Same task. Same broad requirements. Three approaches:

  • vanilla Claude Code
  • Superpowers
  • GSD

The job was not trivial. A landing page, a blog, and an internal generator that turns YouTube or article URLs into blog posts.

The interesting part was not that the heavyweight orchestration layers were bad. They were not. The interesting part was that the lightweight baseline held up much better than many people would expect.

The rough numbers from the test:

  • Claude Code: ~20 minutes, ~200k tokens
  • Superpowers: ~60 minutes, ~250k tokens
  • GSD: ~105 minutes, ~1.2M tokens

And the outputs?

Close enough that the time difference became the real story.

That is the part people keep underestimating. If the baseline system gets you 85 to 95 percent of the way there in one third of the time, the remaining difference is often cheaper to close through iteration than through up-front orchestration.
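To make that tradeoff concrete, here is a toy calculation using the rough numbers from the test above. The 10-minute cost per correction pass is my assumption, purely for illustration:

```python
# Back-of-envelope math: lightweight baseline plus iteration
# versus heavyweight orchestration up front.
# Timings are the rough numbers from the test; the per-iteration
# cost is an assumed figure, not a measured one.

baseline_minutes = 20     # vanilla Claude Code run
heavy_minutes = 105       # GSD run
iteration_minutes = 10    # assumed cost of one review-and-correct pass

# Even if the baseline needs several follow-up passes to close
# the remaining 5-15 percent gap, it stays ahead of the heavy run.
for passes in range(1, 6):
    total = baseline_minutes + passes * iteration_minutes
    print(f"{passes} iteration(s): {total} min vs {heavy_minutes} min up front")
```

With these assumptions, even five correction passes still come in well under the heavyweight total. The point is not the exact numbers; it is that iteration cost scales with how wrong the output is, while orchestration cost is paid regardless.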

The cost of structure

Structure is not free.

It costs:

  • time
  • tokens
  • attention
  • cognitive switching
  • slower feedback loops
  • more ritual before you see reality

That does not mean structure is useless. It means structure has to earn its cost.

This is where a lot of agent work slips into what I call Spec Theater.

We add:

  • more planning artifacts
  • more step boundaries
  • more command scaffolding
  • more role definitions
  • more prompt layers

and then mistake the presence of ceremony for the presence of rigor.

But rigor is not “we created six markdown files before starting.”

Rigor is: did the system produce something useful, legible, and correct quickly enough to matter?

Anthropic has hinted at the same problem

Anthropic has said versions of this for months.

When customers complain that agents produce poor results, the problem is often not too little instruction. It is too much.

Too many prompt constraints. Too much over-specification. Too much railroading. Too many instructions that collapse the model’s search space before it can do any real work.

You think you are increasing control. In practice, you may just be reducing the model’s room to solve the problem.

That is an uncomfortable idea because it clashes with the classic engineering reflex.

If something is unreliable, we want to tighten it. We want more rules, more gates, more explicitness. We want to pin everything down in advance.

But with agentic systems, overconstraint can be as harmful as underconstraint.

Too little structure creates chaos. Too much structure creates theater.

AI does not decide. It opens the space.

That is the other half people miss.

The most valuable thing in these systems is not that the AI “decides” for me. I do not want drive-through strategy or a machine that replaces judgment.

The AI opens a space. It explores. It proposes. It gives shape to possibilities.

My job is still judgment.

That means the system has to let me do four things well:

  • see the effect of my guidance quickly
  • validate outputs quickly
  • correct assumptions quickly
  • steer direction quickly

This is why speed matters so much.

Not because fast is fashionable. Because fast feedback is what allows judgment to stay alive in the loop.

If every interaction disappears into 90 minutes of orchestration overhead, I do not have a better system. I have a slower conversation with less usable feedback.

The design point is adaptive structure

So the question is not:

Should systems have structure or not?

That is too primitive.

The better question is:

When should a system increase structure, and when should it stay lightweight?

That is where I think the next design generation is going.

Not maximal orchestration. Not minimal orchestration.

Adaptive orchestration.

Or more broadly:

adaptive systems.

The baseline should be light. Fast. Cheap. Interactive. Easy to steer.

Then the structure should rise only when the work actually justifies it:

  • ambiguity is high
  • risk is high
  • approval matters
  • handoffs matter
  • the task is long-running
  • the context surface is large
  • the work needs durable checkpoints

In other words: progressive structure, not permanent ceremony.
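To sketch what “progressive structure” might look like in code: the task properties below mirror the escalation criteria above, but the Task fields, thresholds, and level names are hypothetical, not from any real framework.

```python
from dataclasses import dataclass

# A sketch of progressive structure: default to the lightweight mode,
# escalate only when the task's properties justify the overhead.
# All fields and thresholds here are illustrative assumptions.

@dataclass
class Task:
    ambiguity: float        # 0.0 (crisp spec) .. 1.0 (vague goal)
    risk: float             # 0.0 (throwaway) .. 1.0 (production-critical)
    needs_approval: bool    # does a human have to sign off?
    long_running: bool      # will the work span multiple sessions?
    has_handoffs: bool      # does another person or agent pick it up?

def orchestration_level(task: Task) -> str:
    """Pick the cheapest structure the task can get away with."""
    # Default: light, fast, interactive. Just run it and iterate.
    if not (task.needs_approval or task.long_running or task.has_handoffs):
        if task.ambiguity < 0.5 and task.risk < 0.5:
            return "lightweight"
    # Moderate structure: a plan artifact and durable checkpoints.
    if task.risk < 0.8 and not task.has_handoffs:
        return "planned"
    # Full ceremony only when memory, accountability, or handoff demand it.
    return "orchestrated"
```

The design choice worth noting: the decision runs per task, not per system. A quick, low-risk job gets `"lightweight"`; the same runtime escalates to `"orchestrated"` only when handoffs or high risk force it. Structure is a response to the work, not a fixed property of the tool.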

Why this matters beyond Claude Code

This is not really about one benchmark video.

It is a general warning for everyone building agent products, internal AI systems, and workflow runtimes.

If your system makes the common case slower in order to make the rare edge case look more controlled, users will route around you.

If your architecture assumes every task deserves the full ritual, you are building for your own comfort, not for the actual work.

And if your answer to every failure is “add another layer,” you are probably producing theater, not leverage.

The real challenge is harder and more interesting:

How do we build systems that are simple when the work is simple, structured when the work is risky, and explicit when the work needs memory, accountability, or handoff?

That is the direction I am designing toward now.

Not permanent orchestration. Not workflow cosplay.

Adaptive systems instead of Spec Theater.

More on that soon.


Sources

  • Chase Levin: “GSD vs Superpowers vs Claude Code” — practical comparison of heavyweight orchestration vs baseline Claude Code
  • s16e: “38 Pages Nobody Reads” — earlier argument against spec-driven ceremony in AI development