<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Thomas Sievering</title>
    <link>https://siever.ing</link>
    <description>Agentic engineer building reliable AI agents for real workflows.</description>
    <language>en</language>
    <atom:link href="https://siever.ing/feed.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Beyond CLAUDE.md</title>
      <link>https://siever.ing/harness-engineering-claude-skills-cli/</link>
      <guid>https://siever.ing/harness-engineering-claude-skills-cli/</guid>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <description>Most teams stop at a prompt file. Real harness engineering adds skills, custom CLIs, and feedback loops that let agents ship reliably.</description>
      <content:encoded><![CDATA[<p>Most teams discover <code>CLAUDE.md</code>, paste a few rules into it, and call it done. Better than nothing. Still not enough.</p>

<p>If you want agent output you can trust, you need a harness. Not a prompt. A harness.</p>

<p>Think of it like this: <code>CLAUDE.md</code> sets defaults. The harness defines behavior.</p>

<h2>What a harness actually is</h2>

<p>Harness engineering is the work around the model, not inside it. You shape what the agent can see, what it can run, and how it gets corrected when it drifts.</p>

<p>In practice, that usually means three layers:</p>

<ol>
  <li><strong>Project instructions</strong> in <code>CLAUDE.md</code> so every run starts with your standards.</li>
  <li><strong>Skills</strong> for repeatable workflows the model should execute the same way every time.</li>
  <li><strong>Custom CLIs</strong> that expose your systems as safe, narrow interfaces instead of free-form shell chaos.</li>
</ol>

<p>That stack is where reliability comes from. Not clever prompting.</p>

<h2>Layer 1: CLAUDE.md sets the ground rules</h2>

<p>Your <code>CLAUDE.md</code> should answer questions before the model asks them: code style, architecture boundaries, how to run tests, what not to touch, when to ask for review.</p>

<p>Good instruction files reduce ambiguity. They do not remove judgment. The model still needs structure around execution, which is where the next two layers matter.</p>
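
<p>For illustration, a skeleton in that spirit. Every rule here is a placeholder; the point is concrete commands and explicit boundaries:</p>

<pre><code># CLAUDE.md (sketch)

## Commands
- Test: npm test
- Lint: npm run lint

## Boundaries
- Never edit db/migrations/ without explicit approval
- Keep the public API in src/api/ backwards compatible

## Style
- TypeScript strict mode; no `any` in src/

## Review
- Stop and ask before touching auth or billing code</code></pre>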

<h2>Layer 2: skills stop you from re-explaining workflows</h2>

<p>Any process you repeat more than twice should become a skill. Bug triage. UI verification. Release notes. Dependency audits. Whatever your team runs every week.</p>

<p>Without skills, the agent improvises. Sometimes it improvises well. Sometimes it invents a new process at 2 a.m. that nobody can debug later.</p>

<p>Skills turn "please do this carefully" into a deterministic runbook.</p>

<h2>Layer 3: custom CLIs create safe boundaries</h2>

<p>Most enterprise systems were not designed for LLM-first workflows. That is why custom CLIs matter. You give the agent a stable command surface with explicit inputs, explicit outputs, and known failure modes.</p>

<p>Instead of asking an agent to poke APIs directly, you hand it commands like:</p>

<pre><code>t-linear issue SIE-27
t-linear comment SIE-27 "progress update"
t-linear update SIE-27 --state "In Review"</code></pre>

<p>Now it can move fast without guessing schema details, endpoint behavior, or auth flows. You moved complexity out of the prompt and into tooling where it belongs.</p>
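
<p>What does a command like that look like inside? Often just a thin script. A minimal sketch of one subcommand, assuming Linear's GraphQL endpoint and an API key in <code>LINEAR_API_KEY</code> (check Linear's docs for the exact query shape):</p>

<pre><code>#!/usr/bin/env bash
# t-linear (sketch): one narrow, predictable command per action.
set -euo pipefail

case "${1:-}" in
  issue)
    id="${2:?usage: t-linear issue ISSUE-ID}"
    # Query shape depends on Linear's schema; verify against their docs.
    curl -sf https://api.linear.app/graphql \
      -H "Authorization: $LINEAR_API_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"query\": \"{ issue(id: \\\"$id\\\") { title state { name } } }\"}"
    ;;
  *)
    echo "usage: t-linear issue ISSUE-ID" >&2
    exit 1
    ;;
esac</code></pre>

<p>Explicit input, JSON output, one failure mode. That is the whole trick.</p>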

<h2>Back pressure is part of the harness</h2>

<p>A harness without feedback is still a guessing machine. Add checks that force reality back into the loop: tests, linters, screenshots, and build verification.</p>

<p>This is the same pattern behind strong CI pipelines. Agentic workflows just make it more obvious: if your system does not push back, low-quality output accumulates fast.</p>

<h2>Where to start this week</h2>

<ol>
  <li>Rewrite <code>CLAUDE.md</code> to include concrete commands and non-negotiable boundaries.</li>
  <li>Extract one repeated workflow into a skill.</li>
  <li>Wrap one internal API into a focused CLI command the agent can call safely.</li>
</ol>

<p>Do those three things and your agent quality changes immediately.</p>

<h2>Want the full system?</h2>

<p>This post maps to Module 4 of my course concept: Harness Engineering. In the workshop, we build this setup end-to-end on a live project, not just slides.</p>

<p>If you want to go deeper, check the <a href="/workshops/course-concept/">full course concept</a> or see the <a href="/workshops/">workshop formats</a>.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Your Context Window Is Smaller Than You Think</title>
      <link>https://siever.ing/context-window-smaller-than-you-think/</link>
      <guid>https://siever.ing/context-window-smaller-than-you-think/</guid>
      <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
      <description>Why most people misunderstand context windows, and why /clear is the most underrated tool in your workflow.</description>
      <content:encoded><![CDATA[<p>Every API call starts from zero. The model doesn't remember your last message. It re-reads the entire conversation — system prompt, tool definitions, every message, every tool result — from scratch. Every single time.</p>

<p>That means your <a href="/lexicon/context-window/">context window</a> isn't memory. It's a budget. And you're spending it faster than you think.</p>

<p>Claude advertises 200k <a href="/lexicon/token/">tokens</a>. That's roughly 150,000 words. Sounds massive. But output tokens have their own cap — typically 8–64k — and features like <a href="/lexicon/compaction/">compaction</a> reserve space on top of that. The actual usable window is smaller than 200k before you type a single message.</p>
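
<p>Rough arithmetic, with illustrative numbers (your setup will differ):</p>

<pre><code>  200,000  advertised window
-  64,000  output cap, reserved up front
-  20,000  system prompt, tool definitions, MCP configs
-  15,000  compaction reserve
= 101,000  tokens of actual working room, before message one</code></pre>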

<h2>The <a href="/lexicon/dumb-zone/">dumb zone</a></h2>

<p>Researchers have consistently found that model performance follows a U-curve across the context window. Information at the beginning and end gets attention. Everything in the middle gets lost. The original Stanford paper called it <a href="/lexicon/lost-in-the-middle/">"lost in the middle"</a> in 2023 (<a href="https://arxiv.org/abs/2307.03172">paper</a>) — and <a href="https://research.trychroma.com/context-rot">Chroma Research confirmed it still holds</a> across 18 current models in 2025. It's not a bug being fixed. It's architectural.</p>

<p>In practice, this means your context window has quality zones. The first 40% or so is clean — the model follows instructions, output is precise. Between 40% and 70%, it starts cutting corners. Past 70%, you're in what <a href="https://ghuntley.com/agent/">Geoffrey Huntley</a> calls the dumb zone. Instructions get ignored. <a href="/lexicon/hallucination/">Hallucinations</a> increase. The model isn't broken. It's drowning in tokens.</p>

<p>A 200k window at 70% is 140k tokens. That sounds like a lot of runway before things go wrong. But system prompts, tool definitions, and MCP server configs eat a chunk before you type a single message. In a coding agent session with a few file reads and tool calls, you can hit 70% faster than you'd expect.</p>

<h2><a href="/lexicon/context-poisoning/">Context poisoning</a></h2>

<p>Here's what actually fills your context: noise. You try an approach, it fails. You try another. The first attempt doesn't disappear — it sits there, confusing the model. Old file contents, abandoned instructions, contradictory guidance from 20 messages ago. The model treats it all as equally valid. It can't tell current intent from stale context.</p>

<p>This is context poisoning. And it compounds — agent success rates measurably drop after about 35 minutes of continuous operation.</p>

<h2>Why compaction doesn't save you</h2>

<p>When context fills up, tools try to save you by summarizing older messages. Sounds reasonable. But summarizing loses specifics — a file path becomes "the auth module," an exact error becomes "a type error." And compressed noise is still noise.</p>

<p>Worse: compaction keeps you near the ceiling. You compact from 90% down to 70% and you're still in the dumb zone. You never get back to clean.</p>

<h2>The fix: <code>/clear</code> often</h2>

<p>The counterintuitive move: throw it away. Start fresh. A new context with just your CLAUDE.md and the current task puts you at maybe 5% of capacity — deep in the high-quality zone. Your project instructions reload at the top of the window, exactly where the model pays the most attention.</p>

<p>Clear between tasks. Clear when the agent starts repeating mistakes. Clear when you feel output quality dipping. It's free and it works better than any clever context management trick.</p>

<h2>The <a href="/lexicon/ralph-loop/">Ralph Loop</a></h2>

<p><a href="https://ghuntley.com/ralph/">Geoffrey Huntley</a> took this idea to its logical extreme with the Ralph Loop — a bash loop that runs a coding agent repeatedly, each iteration getting a fresh context with the full spec reloaded:</p>

<pre><code>while :; do cat PROMPT.md | claude-code ; done</code></pre>

<p>One task per iteration. Fresh context every time. The spec files are the durable part — code is disposable, reshaped every iteration. He documented completing a $50,000 contract for $297 in compute costs using this pattern.</p>

<p>It works because it sidesteps every problem above: no dumb zone, no poisoning, no compaction. Just a clean window, a clear spec, and one focused task.</p>

<h2>This is <a href="/lexicon/context-engineering/">context engineering</a></h2>

<p>The discipline of managing what goes into the window and what gets cut. There's more to it: <a href="/lexicon/back-pressure/">back pressure</a>, spec-driven workflows, sub-agents, tool budgets. But it all starts here — understanding that your context window is smaller than you think, and <code>/clear</code> is your best tool.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Let Your Agent See What It Builds in SAP BAS</title>
      <link>https://siever.ing/agent-sees-sap-bas/</link>
      <guid>https://siever.ing/agent-sees-sap-bas/</guid>
      <pubDate>Sun, 01 Mar 2026 00:00:00 GMT</pubDate>
      <description>How to give your coding agent screenshot access inside SAP Business Application Studio using wdi5 and the Headless Testing Framework.</description>
      <content:encoded><![CDATA[<p>If you let an <a href="/lexicon/agent/">agent</a> write code without seeing the result, it's guessing. <a href="/lexicon/back-pressure/">Back pressure</a> matters — the agent needs to see what it built, verify it works, catch what's off. Same idea as TDD: don't trust the output, check it.</p>

<p>So I wanted my agent to take screenshots of the UI5 app it was building in SAP Business Application Studio. Simple ask. Except BAS runs inside a Kubernetes container with no display server, no Chrome, and missing system libraries. Puppeteer fails. Playwright fails.</p>

<p>If you've worked in the SAP world, you know the feeling. Half the battle is figuring out what the platform even lets you do.</p>

<p>I spent way too long trying to make Puppeteer work before accepting the obvious. Sometimes the answer isn't "try harder," it's "wrong tool."</p>

<p>The answer is <a href="https://wdi5.dev">wdi5</a> — a WebdriverIO plugin built for UI5 apps — combined with a BAS plugin most people don't know about: the Headless Testing Framework.</p>

<h2>Enable the plugin first</h2>

<ol>
  <li>Go to SAP BTP Cockpit → your subaccount → Business Application Studio</li>
  <li>Stop your dev space</li>
  <li>Edit → Additional SAP Extensions → check "Headless Testing Framework"</li>
  <li>Save and restart</li>
</ol>

<p>This installs Firefox ESR and geckodriver into the container. That's your browser runtime. The only one you get.</p>

<p>Four steps. A checkbox. That's what stood between me and a working screenshot pipeline.</p>

<h2>Project setup</h2>

<p>Once the plugin is active:</p>

<pre><code>npm init wdi5@latest</code></pre>

<p>That scaffolds the config, test directory, and an npm script. Screenshots are one line:</p>

<pre><code>await browser.saveScreenshot("./screenshots/01-home-page.png");</code></pre>
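
<p>In a test file, that line lives in an ordinary WebdriverIO spec. A minimal sketch; file name and test title are placeholders:</p>

<pre><code>// webapp/test/e2e/screenshot.test.js (sketch)
describe("visual check", () => {
  it("captures the home page", async () => {
    // wdi5 waits for UI5 to be ready before the test body runs
    await browser.saveScreenshot("./screenshots/01-home-page.png");
  });
});</code></pre>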

<p>Now your agent can take screenshots inside BAS. It writes code, runs the tests, sees the result. Actual back pressure.</p>

<p>Next step: wrapping this into a Claude Code skill so the agent triggers it on its own. But that's a different post.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Context Reset Rhythm That Actually Works</title>
      <link>https://siever.ing/context-reset-rhythm-that-actually-works/</link>
      <guid>https://siever.ing/context-reset-rhythm-that-actually-works/</guid>
      <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
      <description>Long sessions degrade. A deliberate reset cadence keeps output quality stable without overthinking memory tricks.</description>
      <content:encoded><![CDATA[<p>I stopped trying to save every token in one endless session. Reset rhythm works better.</p>

<p>Models degrade as context grows. You see skipped instructions, repeated mistakes, and rising <a href="/lexicon/hallucination/">hallucination</a> risk.</p>

<h2>The cadence</h2>

<p>I split work into short chunks with explicit reset points:</p>

<ol>
  <li>Finish one concrete task.</li>
  <li>Write a short checkpoint note in repo files.</li>
  <li>Run <code>/clear</code> and reload only the current context.</li>
  <li>Start the next task with fresh state.</li>
</ol>

<p>This keeps sessions out of the <a href="/lexicon/dumb-zone/">dumb zone</a> and cuts prompt drift.</p>

<h2>What to keep between resets</h2>

<p>Keep only durable artifacts: specs, todo list, failing test output, and current branch state.</p>
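
<p>A checkpoint note covering those can be tiny. Everything in this sketch is made up; the shape is what matters:</p>

<pre><code># CHECKPOINT.md (sketch)
Done: extracted order validation into src/orders/validate.ts; tests green
Next: wire validation errors into the API error response
Gotcha: fixtures in test/orders/ still import the old path</code></pre>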

<p>Do not keep long chat debates. They are prime <a href="/lexicon/context-poisoning/">context poisoning</a>.</p>

<h2>Why this beats memory tricks</h2>

<p>People try to solve this with giant memory layers and compaction chains. That helps a bit, then falls apart again.</p>

<p>A clean reset with clear written state is simpler and more reliable in day-to-day coding.</p>

<h2>Use rhythm, not hero sessions</h2>

<p>Agent quality is not about one perfect prompt. It is about repeatable operating rhythm.</p>

<p>Small tasks, hard checks, frequent resets. Boring, and very effective.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Language Choice and Agent Effectiveness</title>
      <link>https://siever.ing/language-choice-agent-effectiveness/</link>
      <guid>https://siever.ing/language-choice-agent-effectiveness/</guid>
      <pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate>
      <description>Some stacks are easier for agents to navigate, test, and change safely. Language choice now has an automation multiplier.</description>
      <content:encoded><![CDATA[<p>I used to treat language choice as a team preference topic. With coding <a href="/lexicon/agent/">agents</a>, it is now also an execution topic.</p>

<p>Some stacks give cleaner errors, faster tooling, and easier static checks. Agents thrive there.</p>

<h2>What helps agents most</h2>

<ol>
  <li>Fast code search and predictable file layout.</li>
  <li>Strong type feedback from compiler or checker.</li>
  <li>Small, reliable test commands.</li>
  <li>Clear dependency and build scripts.</li>
</ol>

<p>None of this is new for humans. Agents just magnify the gap between disciplined and messy repos.</p>

<h2>Why static feedback matters</h2>

<p>Good type errors are immediate <a href="/lexicon/back-pressure/">back pressure</a>. They reduce fake confidence and shorten repair cycles.</p>

<p>In weakly checked stacks, the model can look correct for a long time before runtime proves otherwise.</p>

<h2>Tooling quality is part of language choice</h2>

<p>A language with slow or fragile tooling hurts agents more than it hurts humans. Every retry costs tokens and time.</p>

<p>I now score stack choices by automation friendliness, not only developer familiarity.</p>

<h2>No silver bullet, just trade-offs</h2>

<p>You can run agents in almost any language. But if you want high throughput, pick ecosystems with tight feedback loops.</p>

<p>Language still matters. The difference now is who benefits first: your automation pipeline.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Research, Plan, Implement: Ship Faster With Agents</title>
      <link>https://siever.ing/research-plan-implement-ship-faster/</link>
      <guid>https://siever.ing/research-plan-implement-ship-faster/</guid>
      <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
      <description>Splitting work into research, planning, and implementation sessions produces cleaner code and fewer retries.</description>
      <content:encoded><![CDATA[<p>Most agent sessions fail because we mix discovery and implementation in one pass. The model keeps changing direction mid-run.</p>

<p>I now split work into three steps: research, plan, implement. It looks slower. It ships faster.</p>

<h2>Step 1: research only</h2>

<p>Read files, map constraints, find existing patterns. No edits yet. This keeps the model from guessing architecture.</p>

<p>I ask for references with file paths and exact commands. Evidence first.</p>

<h2>Step 2: plan in writing</h2>

<p>Turn findings into a short execution plan with risk notes and test impact. This becomes durable context for the next run.</p>

<p>That document is part of <a href="/lexicon/context-engineering/">context engineering</a>. It reduces drift when sessions reset.</p>
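
<p>For scale, a plan doc rarely needs more than this. The contents here are hypothetical:</p>

<pre><code># PLAN.md (sketch)
Goal: add rate limiting to the export endpoint
Approach: reuse the middleware pattern from src/middleware/auth.ts
Steps: 1) middleware, 2) config flag, 3) integration test
Risks: shared Redis client; do not touch connection setup
Checks: npm test -- rate-limit; existing export tests stay green</code></pre>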

<h2>Step 3: implement with checks</h2>

<p>Now edit code, run tests, fix failures. Keep the loop tight: change, verify, commit.</p>

<p>If quality drops, clear context and continue from the written plan instead of arguing in chat.</p>

<h2>Why it works</h2>

<p>Separating thinking modes reduces contradictions in the prompt. The model does not have to infer whether it should explore or execute.</p>

<p>You get less thrashing, fewer reverts, and clearer commit history.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Build a Ralph Loop From Scratch</title>
      <link>https://siever.ing/build-a-ralph-loop-from-scratch/</link>
      <guid>https://siever.ing/build-a-ralph-loop-from-scratch/</guid>
      <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
      <description>A clean loop with fresh context on every iteration gives agents more consistent output than one long chat session.</description>
      <content:encoded><![CDATA[<p>The <a href="/lexicon/ralph-loop/">Ralph Loop</a> idea is simple: short autonomous runs, fresh context each time, strict verification in between.</p>

<p>It sounds almost too basic. It works because it avoids the long-session failure modes we keep seeing.</p>

<h2>Minimal setup</h2>

<p>You need three files:</p>

<ol>
  <li>A spec file with goal, constraints, and done criteria.</li>
  <li>A task input file for the current iteration.</li>
  <li>A verify script that can fail hard on bad output.</li>
</ol>

<p>Then run the loop, verify, and keep only the changes that pass.</p>
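
<p>A minimal sketch, in the spirit of Huntley's one-liner. Here <code>claude-code</code> stands in for whatever agent CLI you run, and <code>verify.sh</code> is your own hard-failing check script:</p>

<pre><code>#!/usr/bin/env bash
# One focused task per iteration, fresh context every time.
while :; do
  cat PROMPT.md | claude-code          # one unit of work, clean window
  if ./verify.sh; then                 # tests, lint, build; exit non-zero on bad output
    git add -A && git commit -m "ralph: verified iteration"
  else
    git checkout -- . && git clean -fd # discard anything that failed the checks
  fi
done</code></pre>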

<h2>Why this beats marathon sessions</h2>

<p>Each iteration starts clean, so <a href="/lexicon/context-poisoning/">context poisoning</a> does not accumulate. You also stay far from the <a href="/lexicon/dumb-zone/">dumb zone</a>.</p>

<p>The model does one focused unit of work, not twenty mixed goals in one thread.</p>

<h2>Where people get stuck</h2>

<p>Most failures come from weak verification. If your checks are soft, the loop just ships bad work faster.</p>

<p>Strong <a href="/lexicon/back-pressure/">back pressure</a> is the point. Tests, lints, build, optional screenshots. No green, no merge.</p>

<h2>Start small</h2>

<p>Run this pattern on one feature branch first. Keep iteration scope tiny. One task that fits in a short session.</p>

<p>Once that is stable, increase autonomy. Do not start with a giant all-day loop.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Lost in the Middle: A Practical Playbook</title>
      <link>https://siever.ing/lost-in-the-middle-practical-playbook/</link>
      <guid>https://siever.ing/lost-in-the-middle-practical-playbook/</guid>
      <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
      <description>The middle of long prompts gets ignored. This playbook keeps key instructions where models actually pay attention.</description>
      <content:encoded><![CDATA[<p>The <a href="/lexicon/lost-in-the-middle/">lost-in-the-middle</a> effect is not theory anymore. You can see it in daily agent sessions.</p>

<p>Important rules buried halfway down a long prompt get skipped. Then people blame the model. Most of the time, the problem is placement, not intelligence.</p>

<h2>Where instructions should live</h2>

<p>Put non-negotiables at the top: coding rules, forbidden paths, test commands. Repeat critical items near the end if needed.</p>

<p>Middle sections should hold disposable detail. If that part gets weaker attention, nothing critical breaks.</p>

<h2>My prompt layout</h2>

<ol>
  <li><strong>Top:</strong> constraints and completion bar.</li>
  <li><strong>Middle:</strong> context notes, references, optional detail.</li>
  <li><strong>Bottom:</strong> current task and explicit next command.</li>
</ol>

<p>That shape follows how attention behaves across a long <a href="/lexicon/context-window/">context window</a>.</p>
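
<p>Concretely, a task prompt in that shape. All contents are placeholders:</p>

<pre><code>## Non-negotiable
- Do not edit db/migrations/
- npm test must pass before you report done

## Context (disposable detail)
- Related modules: src/billing/, src/invoices/
- Prior decision: keep the legacy date format for now

## Current task
Fix the rounding bug in src/billing/tax.ts.
Next command: npm test -- tax</code></pre>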

<h2>When drift starts</h2>

<p>If output quality drops, I do not add more text. I reset and reload. Fresh context beats rescue prompting.</p>

<p>This is why <a href="/lexicon/compaction/">compaction</a> often disappoints. You stay near the noisy zone and hope for miracles.</p>

<h2>One practical rule</h2>

<p>If a rule matters, it should appear in a durable file and in the active task prompt. Redundancy on purpose.</p>

<p>That one habit cut repeated mistakes across my sessions more than any model upgrade.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Tool Overload: The Hidden Cost of MCP Servers</title>
      <link>https://siever.ing/tool-overload-hidden-cost-of-mcp-servers/</link>
      <guid>https://siever.ing/tool-overload-hidden-cost-of-mcp-servers/</guid>
      <pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate>
      <description>More tools do not automatically mean better agents. Every tool definition consumes tokens and increases failure surface.</description>
      <content:encoded><![CDATA[<p>Adding tools feels like progress. Until the agent starts slowing down and missing obvious instructions.</p>

<p>Every tool definition lives in the prompt budget. In a crowded <a href="/lexicon/context-window/">context window</a>, that cost is real.</p>

<h2>The silent tax</h2>

<p>People count tool capability, not tool overhead. I count both. A tool that gets used once a week should not sit in every session.</p>

<p>Unused tools are just token rent. You pay it on every turn.</p>

<h2>What too many tools break</h2>

<ol>
  <li>Selection quality drops. The model picks the wrong tool more often.</li>
  <li>Latency rises. Bigger prompts mean slower calls.</li>
  <li>Instruction focus drops in the middle of the window.</li>
  <li>Debugging gets harder because failure paths multiply.</li>
</ol>

<p>This is the same pattern as bloated microservice APIs. More surface, more mistakes.</p>

<h2>My trim policy</h2>

<p>I keep only tools tied to current work. Everything else stays disabled. If a tool cannot prove weekly value, it gets removed.</p>

<p>Then I wrap core actions in stable CLI commands with clear IO. Better <a href="/lexicon/tool-use/">tool use</a>, less improvisation.</p>
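
<p>In Claude Code, project-scoped servers live in <code>.mcp.json</code>, so trimming is mostly keeping that file short. A sketch; the server name and script are placeholders:</p>

<pre><code>{
  "mcpServers": {
    "issue-tracker": {
      "command": "node",
      "args": ["./tools/issue-tracker-mcp.js"]
    }
  }
}</code></pre>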

<h2>Small set, strong checks</h2>

<p>A lean tool set plus strong <a href="/lexicon/back-pressure/">back pressure</a> beats a giant toolbox every time. Fewer choices, better execution.</p>

<p>Agents are not blocked by missing tools. They are blocked by unclear systems.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Sub-Agents and Context Isolation</title>
      <link>https://siever.ing/sub-agents-and-context-isolation/</link>
      <guid>https://siever.ing/sub-agents-and-context-isolation/</guid>
      <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
      <description>One huge context is fragile. Splitting work into focused sub-agents keeps quality high and sessions predictable.</description>
      <content:encoded><![CDATA[<p>One giant agent session looks productive until it collapses. Too many file reads, too many failed attempts, too much stale intent in one thread.</p>

<p>I get better results with sub-agents. Each one gets a narrow scope, short runtime, and explicit output format.</p>

<h2>Main agent as scheduler</h2>

<p>The primary <a href="/lexicon/agent/">agent</a> should coordinate, not do everything. It delegates search, refactor prep, or test analysis to smaller workers.</p>

<p>This keeps the main <a href="/lexicon/context-window/">context window</a> clean. It also lowers the chance of <a href="/lexicon/hallucination/">hallucination</a> from old noise.</p>

<h2>Where sub-agents help most</h2>

<ol>
  <li>Large codebase search and summarization.</li>
  <li>Test failure triage with exact repro notes.</li>
  <li>Migration prep where one module at a time is safer.</li>
  <li>Documentation extraction from many files.</li>
</ol>

<p>These tasks create big token footprints. Isolation keeps that load out of the main flow.</p>

<h2>Isolation rules I use</h2>

<p>Give each sub-agent one question. One output contract. One stop condition. No open-ended "go explore." That is how sessions stay fast.</p>

<p>I also merge only verified outputs back into the main thread. Otherwise you import <a href="/lexicon/context-poisoning/">context poisoning</a> from failed runs.</p>
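
<p>In Claude Code, that contract fits in a sub-agent definition under <code>.claude/agents/</code>. A sketch; the scope, tool list, and limits are examples, not a standard:</p>

<pre><code>---
name: test-triage
description: Analyze one failing test run and report the minimal repro
tools: Read, Grep, Bash
---

Answer exactly one question: why does the named test fail?
Output contract: failing test name, exact repro command, suspected file.
Stop when you have a repro or after ten tool calls. Never edit files.</code></pre>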

<h2>The payoff</h2>

<p>You get fewer surprises, cleaner diffs, and easier review. The big win is not speed. It is predictability.</p>

<p>Context isolation feels like overhead at first. After a week, it feels like basic safety.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Specs Over Code: The PRD Is the Asset</title>
      <link>https://siever.ing/specs-over-code-prd-is-the-asset/</link>
      <guid>https://siever.ing/specs-over-code-prd-is-the-asset/</guid>
      <pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate>
      <description>Code changes every hour in agentic workflows. The durable value is the spec that survives resets and handoffs.</description>
      <content:encoded><![CDATA[<p>In agentic projects, code is temporary. The spec is durable.</p>

<p>I know that sounds backwards. We used to treat code as the source of truth. With <a href="/lexicon/agent/">agents</a>, code gets reshaped constantly. If your spec is weak, every run drifts.</p>

<h2>What survives a reset</h2>

<p>When I run <code>/clear</code>, the model loses chat history. It does not lose my repo files. So I put important decisions in spec docs, not in chat messages.</p>

<p>A good spec carries intent across sessions. It keeps quality stable even when <a href="/lexicon/context-poisoning/">context poisoning</a> starts showing up.</p>

<h2>What I put in the spec</h2>

<ol>
  <li><strong>Goal:</strong> one outcome, written in plain language.</li>
  <li><strong>Constraints:</strong> files to avoid, APIs to keep, failure conditions.</li>
  <li><strong>Checks:</strong> exact commands that must pass before done.</li>
  <li><strong>Done criteria:</strong> what "shipped" means in this repo.</li>
</ol>

<p>That is enough for reliable execution. Anything extra is usually noise.</p>
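
<p>Put together, a spec that small fits on one screen. Everything below is a made-up example:</p>

<pre><code># SPEC.md (sketch)

Goal: users can download their invoices as CSV from /invoices.

Constraints:
- Do not touch src/auth/ or db/migrations/
- Keep the existing /invoices JSON endpoint unchanged

Checks:
- npm run typecheck && npm test && npm run lint

Done: all checks pass and the export works for a 10k-row account.</code></pre>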

<h2>Prompt-driven work does not scale</h2>

<p>Prompt-only flows feel fast on day one. By day ten nobody remembers why a rule exists. Teams re-explain the same thing in every session and burn tokens doing it.</p>

<p>Spec-driven flow fixes that. The model reads stable docs, not random chat leftovers. That is core <a href="/lexicon/context-engineering/">context engineering</a>.</p>

<h2>Write for execution, not for slides</h2>

<p>Most PRDs are written for humans in meetings. Agent-readable specs are different: direct language, short sections, concrete commands, zero hand-wavy text.</p>

<p>If an agent can run your task from spec and pass checks, your spec is good. If not, rewrite it before touching code.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Back Pressure: How Agents Know They&#39;re Wrong</title>
      <link>https://siever.ing/back-pressure-how-agents-know-theyre-wrong/</link>
      <guid>https://siever.ing/back-pressure-how-agents-know-theyre-wrong/</guid>
      <pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate>
      <description>Agents do better work when the system pushes back with hard checks instead of polite guesses.</description>
      <content:encoded><![CDATA[<p>An <a href="/lexicon/agent/">agent</a> without feedback is just a fast guesser. It writes code, sounds confident, and misses the bug right in front of it.</p>

<p>The fix is <a href="/lexicon/back-pressure/">back pressure</a>. Force reality into the loop. Tests fail. Types fail. Lint fails. Screenshots fail. The agent adapts or it stops.</p>

<h2>Confidence is cheap</h2>

<p>LLMs are trained to continue text, not to prove truth. So if your setup only asks for a diff, you get polished nonsense surprisingly often.</p>

<p>I stopped asking agents for "clean code." I ask for passing checks. Same task, better output.</p>

<h2>Four pressure points that work</h2>

<ol>
  <li><strong>Types:</strong> static checks catch structural mistakes before runtime.</li>
  <li><strong>Tests:</strong> behavioral checks catch wrong assumptions.</li>
  <li><strong>Linters:</strong> style and risky patterns get flagged early.</li>
  <li><strong>UI evidence:</strong> screenshots show if the interface is actually right.</li>
</ol>

<p>This is just engineering hygiene. In an agent flow, it becomes non-negotiable.</p>
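
<p>Wired together, the gate is one command the agent must pass before claiming done. Script names assume a typical npm setup:</p>

<pre><code>npm run typecheck && npm run lint && npm test && npm run build</code></pre>

<p>The chain stops at the first failure. The agent gets the error, not a green light.</p>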

<h2>Back pressure beats longer prompts</h2>

<p>When output drifts, most people add more prompt text. I do the opposite. Keep instructions short, strengthen checks.</p>

<p>Prompt text can be ignored in a crowded <a href="/lexicon/context-window/">context window</a>. A failing test cannot be ignored. It blocks progress immediately.</p>

<h2>The practical loop</h2>

<p>My default run looks like this: task, diff, tests, patch, repeat. If the agent keeps failing, I <code>/clear</code>, reload spec, and run again with fresh <a href="/lexicon/context-engineering/">context engineering</a>.</p>

<p>That one change made agent sessions calmer and cheaper. Less debate with the model. More verifiable output.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
