<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Thomas Sievering</title>
    <link>https://siever.ing</link>
    <description>Agentic engineer building reliable AI agents for real workflows.</description>
    <language>en</language>
    <atom:link href="https://siever.ing/feed.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Beyond CLAUDE.md</title>
      <link>https://siever.ing/harness-engineering-claude-skills-cli/</link>
      <guid>https://siever.ing/harness-engineering-claude-skills-cli/</guid>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <description>Most teams stop at a prompt file. Real harness engineering adds skills, custom CLIs, and feedback loops that let agents ship reliably.</description>
      <content:encoded><![CDATA[<p>Most teams discover <code>CLAUDE.md</code>, paste a few rules into it, and call it done. Better than nothing. Still not enough.</p>

<p>If you want agent output you can trust, you need a harness. Not a prompt. A harness.</p>

<p>Think of it like this: <code>CLAUDE.md</code> sets defaults. The harness defines behavior.</p>

<h2>What a harness actually is</h2>

<p>Harness engineering is the work around the model, not inside it. You shape what the agent can see, what it can run, and how it gets corrected when it drifts.</p>

<p>In practice, that usually means three layers:</p>

<ol>
  <li><strong>Project instructions</strong> in <code>CLAUDE.md</code> so every run starts with your standards.</li>
  <li><strong>Skills</strong> for repeatable workflows the model should execute the same way every time.</li>
  <li><strong>Custom CLIs</strong> that expose your systems as safe, narrow interfaces instead of free-form shell chaos.</li>
</ol>

<p>That stack is where reliability comes from. Not clever prompting.</p>

<h2>Layer 1: CLAUDE.md sets the ground rules</h2>

<p>Your <code>CLAUDE.md</code> should answer questions before the model asks them: code style, architecture boundaries, how to run tests, what not to touch, when to ask for review.</p>

<p>Good instruction files reduce ambiguity. They do not remove judgment. The model still needs structure around execution, which is where the next two layers matter.</p>
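
<p>For illustration, a skeleton in that spirit. Every rule here is a placeholder; the point is concrete commands and explicit boundaries:</p>

<pre><code># CLAUDE.md (sketch)

## Commands
- Test: npm test
- Lint: npm run lint

## Boundaries
- Never edit db/migrations/ without explicit approval
- Keep the public API in src/api/ backwards compatible

## Style
- TypeScript strict mode; no `any` in src/

## Review
- Stop and ask before touching auth or billing code</code></pre>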

<h2>Layer 2: skills stop you from re-explaining workflows</h2>

<p>Any process you repeat more than twice should become a skill. Bug triage. UI verification. Release notes. Dependency audits. Whatever your team runs every week.</p>

<p>Without skills, the agent improvises. Sometimes it improvises well. Sometimes it invents a new process at 2 a.m. that nobody can debug later.</p>

<p>Skills turn "please do this carefully" into a deterministic runbook.</p>

<h2>Layer 3: custom CLIs create safe boundaries</h2>

<p>Most enterprise systems were not designed for LLM-first workflows. That is why custom CLIs matter. You give the agent a stable command surface with explicit inputs, explicit outputs, and known failure modes.</p>

<p>Instead of asking an agent to poke APIs directly, you hand it commands like:</p>

<pre><code>t-linear issue SIE-27
t-linear comment SIE-27 "progress update"
t-linear update SIE-27 --state "In Review"</code></pre>

<p>Now it can move fast without guessing schema details, endpoint behavior, or auth flows. You moved complexity out of the prompt and into tooling where it belongs.</p>
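
<p>What does a command like that look like inside? Often just a thin script. A minimal sketch of one subcommand, assuming Linear's GraphQL endpoint and an API key in <code>LINEAR_API_KEY</code> (check Linear's docs for the exact query shape):</p>

<pre><code>#!/usr/bin/env bash
# t-linear (sketch): one narrow, predictable command per action.
set -euo pipefail

case "${1:-}" in
  issue)
    id="${2:?usage: t-linear issue ISSUE-ID}"
    # Query shape depends on Linear's schema; verify against their docs.
    curl -sf https://api.linear.app/graphql \
      -H "Authorization: $LINEAR_API_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"query\": \"{ issue(id: \\\"$id\\\") { title state { name } } }\"}"
    ;;
  *)
    echo "usage: t-linear issue ISSUE-ID" >&2
    exit 1
    ;;
esac</code></pre>

<p>Explicit input, JSON output, one failure mode. That is the whole trick.</p>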

<h2>Back pressure is part of the harness</h2>

<p>A harness without feedback is still a guessing machine. Add checks that force reality back into the loop: tests, linters, screenshots, and build verification.</p>

<p>This is the same pattern behind strong CI pipelines. Agentic workflows just make it more obvious: if your system does not push back, low-quality output accumulates fast.</p>

<h2>Where to start this week</h2>

<ol>
  <li>Rewrite <code>CLAUDE.md</code> to include concrete commands and non-negotiable boundaries.</li>
  <li>Extract one repeated workflow into a skill.</li>
  <li>Wrap one internal API into a focused CLI command the agent can call safely.</li>
</ol>

<p>Do those three things and your agent quality changes immediately.</p>

<h2>Want the full system?</h2>

<p>This post maps to Module 4 of my course concept: Harness Engineering. In the workshop, we build this setup end-to-end on a live project, not just slides.</p>

<p>If you want to go deeper, check the <a href="/workshops/course-concept/">full course concept</a> or see the <a href="/workshops/">workshop formats</a>.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Your Context Window Is Smaller Than You Think</title>
      <link>https://siever.ing/context-window-smaller-than-you-think/</link>
      <guid>https://siever.ing/context-window-smaller-than-you-think/</guid>
      <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
      <description>Why most people misunderstand context windows, and why /clear is the most underrated tool in your workflow.</description>
      <content:encoded><![CDATA[<p>Every API call starts from zero. The model doesn't remember your last message. It re-reads the entire conversation — system prompt, tool definitions, every message, every tool result — from scratch. Every single time.</p>

<p>That means your <a href="/lexicon/context-window/">context window</a> isn't memory. It's a budget. And you're spending it faster than you think.</p>

<p>Claude advertises 200k <a href="/lexicon/token/">tokens</a>. That's roughly 150,000 words. Sounds massive. But output tokens have their own cap — typically 8–64k — and features like <a href="/lexicon/compaction/">compaction</a> reserve space on top of that. The actual usable window is smaller than 200k before you type a single message.</p>
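
<p>Rough arithmetic, with illustrative numbers (your setup will differ):</p>

<pre><code>  200,000  advertised window
-  64,000  output cap, reserved up front
-  20,000  system prompt, tool definitions, MCP configs
-  15,000  compaction reserve
= 101,000  tokens of actual working room, before message one</code></pre>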

<h2>The <a href="/lexicon/dumb-zone/">dumb zone</a></h2>

<p>Researchers have consistently found that model performance follows a U-curve across the context window. Information at the beginning and end gets attention. Everything in the middle gets lost. The original Stanford paper called it <a href="/lexicon/lost-in-the-middle/">"lost in the middle"</a> in 2023 (<a href="https://arxiv.org/abs/2307.03172">paper</a>) — and <a href="https://research.trychroma.com/context-rot">Chroma Research confirmed it still holds</a> across 18 current models in 2025. It's not a bug being fixed. It's architectural.</p>

<p>In practice, this means your context window has quality zones. The first 40% or so is clean — the model follows instructions, output is precise. Between 40% and 70%, it starts cutting corners. Past 70%, you're in what <a href="https://ghuntley.com/agent/">Geoffrey Huntley</a> calls the dumb zone. Instructions get ignored. <a href="/lexicon/hallucination/">Hallucinations</a> increase. The model isn't broken. It's drowning in tokens.</p>

<p>A 200k window at 70% is 140k tokens. That sounds like a lot of runway before things go wrong. But system prompts, tool definitions, and MCP server configs eat a chunk before you type a single message. In a coding agent session with a few file reads and tool calls, you can hit 70% faster than you'd expect.</p>

<h2><a href="/lexicon/context-poisoning/">Context poisoning</a></h2>

<p>Here's what actually fills your context: noise. You try an approach, it fails. You try another. The first attempt doesn't disappear — it sits there, confusing the model. Old file contents, abandoned instructions, contradictory guidance from 20 messages ago. The model treats it all as equally valid. It can't tell current intent from stale context.</p>

<p>This is context poisoning. And it compounds — agent success rates measurably drop after about 35 minutes of continuous operation.</p>

<h2>Why compaction doesn't save you</h2>

<p>When context fills up, tools try to save you by summarizing older messages. Sounds reasonable. But summarizing loses specifics — a file path becomes "the auth module," an exact error becomes "a type error." And compressed noise is still noise.</p>

<p>Worse: compaction keeps you near the ceiling. You compact from 90% down to 70% and you're still in the dumb zone. You never get back to clean.</p>

<h2>The fix: <code>/clear</code> often</h2>

<p>The counterintuitive move: throw it away. Start fresh. A new context with just your CLAUDE.md and the current task puts you at maybe 5% of capacity — deep in the high-quality zone. Your project instructions reload at the top of the window, exactly where the model pays the most attention.</p>

<p>Clear between tasks. Clear when the agent starts repeating mistakes. Clear when you feel output quality dipping. It's free and it works better than any clever context management trick.</p>

<h2>The <a href="/lexicon/ralph-loop/">Ralph Loop</a></h2>

<p><a href="https://ghuntley.com/ralph/">Geoffrey Huntley</a> took this idea to its logical extreme with the Ralph Loop — a bash loop that runs a coding agent repeatedly, each iteration getting a fresh context with the full spec reloaded:</p>

<pre><code>while :; do cat PROMPT.md | claude-code ; done</code></pre>

<p>One task per iteration. Fresh context every time. The spec files are the durable part — code is disposable, reshaped every iteration. He documented completing a $50,000 contract for $297 in compute costs using this pattern.</p>

<p>It works because it sidesteps every problem above: no dumb zone, no poisoning, no compaction. Just a clean window, a clear spec, and one focused task.</p>

<h2>This is <a href="/lexicon/context-engineering/">context engineering</a></h2>

<p>The discipline of managing what goes into the window and what gets cut. There's more to it: <a href="/lexicon/back-pressure/">back pressure</a>, spec-driven workflows, sub-agents, tool budgets. But it all starts here — understanding that your context window is smaller than you think, and <code>/clear</code> is your best tool.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Let Your Agent See What It Builds in SAP BAS</title>
      <link>https://siever.ing/agent-sees-sap-bas/</link>
      <guid>https://siever.ing/agent-sees-sap-bas/</guid>
      <pubDate>Sun, 01 Mar 2026 00:00:00 GMT</pubDate>
      <description>How to give your coding agent screenshot access inside SAP Business Application Studio using wdi5 and the Headless Testing Framework.</description>
      <content:encoded><![CDATA[<p>If you let an <a href="/lexicon/agent/">agent</a> write code without seeing the result, it's guessing. <a href="/lexicon/back-pressure/">Back pressure</a> matters — the agent needs to see what it built, verify it works, catch what's off. Same idea as TDD: don't trust the output, check it.</p>

<p>So I wanted my agent to take screenshots of the UI5 app it was building in SAP Business Application Studio. Simple ask. Except BAS runs inside a Kubernetes container with no display server, no Chrome, and missing system libraries. Puppeteer fails. Playwright fails.</p>

<p>If you've worked in the SAP world, you know the feeling. Half the battle is figuring out what the platform even lets you do.</p>

<p>I spent way too long trying to make Puppeteer work before accepting the obvious. Sometimes the answer isn't "try harder," it's "wrong tool."</p>

<p>The answer is <a href="https://wdi5.dev">wdi5</a> — a WebdriverIO plugin built for UI5 apps — combined with a BAS plugin most people don't know about: the Headless Testing Framework.</p>

<h2>Enable the plugin first</h2>

<ol>
  <li>Go to SAP BTP Cockpit → your subaccount → Business Application Studio</li>
  <li>Stop your dev space</li>
  <li>Edit → Additional SAP Extensions → check "Headless Testing Framework"</li>
  <li>Save and restart</li>
</ol>

<p>This installs Firefox ESR and geckodriver into the container. That's your browser runtime. The only one you get.</p>

<p>Four steps. A checkbox. That's what stood between me and a working screenshot pipeline.</p>

<h2>Project setup</h2>

<p>Once the plugin is active:</p>

<pre><code>npm init wdi5@latest</code></pre>

<p>That scaffolds the config, test directory, and an npm script. Screenshots are one line:</p>

<pre><code>await browser.saveScreenshot("./screenshots/01-home-page.png");</code></pre>
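
<p>In a test file, that line lives in an ordinary WebdriverIO spec. A minimal sketch; file name and test title are placeholders:</p>

<pre><code>// webapp/test/e2e/screenshot.test.js (sketch)
describe("visual check", () => {
  it("captures the home page", async () => {
    // wdi5 waits for UI5 to be ready before the test body runs
    await browser.saveScreenshot("./screenshots/01-home-page.png");
  });
});</code></pre>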

<p>Now your agent can take screenshots inside BAS. It writes code, runs the tests, sees the result. Actual back pressure.</p>

<p>Next step: wrapping this into a Claude Code skill so the agent triggers it on its own. But that's a different post.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Context Reset Rhythm That Actually Works</title>
      <link>https://siever.ing/context-reset-rhythm-that-actually-works/</link>
      <guid>https://siever.ing/context-reset-rhythm-that-actually-works/</guid>
      <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
      <description>Long sessions degrade. A deliberate reset cadence keeps output quality stable without overthinking memory tricks.</description>
      <content:encoded><![CDATA[<p>I stopped trying to save every token in one endless session. Reset rhythm works better.</p>

<p>Models degrade as context grows. You see skipped instructions, repeated mistakes, and rising <a href="/lexicon/hallucination/">hallucination</a> risk.</p>

<h2>The cadence</h2>

<p>I split work into short chunks with explicit reset points:</p>

<ol>
  <li>Finish one concrete task.</li>
  <li>Write a short checkpoint note in repo files.</li>
  <li>Run <code>/clear</code> and reload only the current context.</li>
  <li>Start the next task with fresh state.</li>
</ol>

<p>This keeps sessions out of the <a href="/lexicon/dumb-zone/">dumb zone</a> and cuts prompt drift.</p>

<h2>What to keep between resets</h2>

<p>Keep only durable artifacts: specs, todo list, failing test output, and current branch state.</p>
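
<p>A checkpoint note covering those can be tiny. Everything in this sketch is made up; the shape is what matters:</p>

<pre><code># CHECKPOINT.md (sketch)
Done: extracted order validation into src/orders/validate.ts; tests green
Next: wire validation errors into the API error response
Gotcha: fixtures in test/orders/ still import the old path</code></pre>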

<p>Do not keep long chat debates. They are prime <a href="/lexicon/context-poisoning/">context poisoning</a>.</p>

<h2>Why this beats memory tricks</h2>

<p>People try to solve this with giant memory layers and compaction chains. That helps a bit, then falls apart again.</p>

<p>A clean reset with clear written state is simpler and more reliable in day-to-day coding.</p>

<h2>Use rhythm, not hero sessions</h2>

<p>Agent quality is not about one perfect prompt. It is about repeatable operating rhythm.</p>

<p>Small tasks, hard checks, frequent resets. Boring, and very effective.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Language Choice and Agent Effectiveness</title>
      <link>https://siever.ing/language-choice-agent-effectiveness/</link>
      <guid>https://siever.ing/language-choice-agent-effectiveness/</guid>
      <pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate>
      <description>Some stacks are easier for agents to navigate, test, and change safely. Language choice now has an automation multiplier.</description>
      <content:encoded><![CDATA[<p>I used to treat language choice as a team preference topic. With coding <a href="/lexicon/agent/">agents</a>, it is now also an execution topic.</p>

<p>Some stacks give cleaner errors, faster tooling, and easier static checks. Agents thrive there.</p>

<h2>What helps agents most</h2>

<ol>
  <li>Fast code search and predictable file layout.</li>
  <li>Strong type feedback from compiler or checker.</li>
  <li>Small, reliable test commands.</li>
  <li>Clear dependency and build scripts.</li>
</ol>

<p>None of this is new for humans. Agents just magnify the gap between disciplined and messy repos.</p>

<h2>Why static feedback matters</h2>

<p>Good type errors are immediate <a href="/lexicon/back-pressure/">back pressure</a>. They reduce fake confidence and shorten repair cycles.</p>

<p>In weakly checked stacks, the model can look correct for a long time before runtime proves otherwise.</p>

<h2>Tooling quality is part of language choice</h2>

<p>A language with slow or fragile tooling hurts agents more than it hurts humans. Every retry costs tokens and time.</p>

<p>I now score stack choices by automation friendliness, not only developer familiarity.</p>

<h2>No silver bullet, just trade-offs</h2>

<p>You can run agents in almost any language. But if you want high throughput, pick ecosystems with tight feedback loops.</p>

<p>Language still matters. The difference now is who benefits first: your automation pipeline.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Research, Plan, Implement: Ship Faster With Agents</title>
      <link>https://siever.ing/research-plan-implement-ship-faster/</link>
      <guid>https://siever.ing/research-plan-implement-ship-faster/</guid>
      <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
      <description>Splitting work into research, planning, and implementation sessions produces cleaner code and fewer retries.</description>
      <content:encoded><![CDATA[<p>Most agent sessions fail because we mix discovery and implementation in one pass. The model keeps changing direction mid-run.</p>

<p>I now split work into three steps: research, plan, implement. It looks slower. It ships faster.</p>

<h2>Step 1: research only</h2>

<p>Read files, map constraints, find existing patterns. No edits yet. This keeps the model from guessing architecture.</p>

<p>I ask for references with file paths and exact commands. Evidence first.</p>

<h2>Step 2: plan in writing</h2>

<p>Turn findings into a short execution plan with risk notes and test impact. This becomes durable context for the next run.</p>

<p>That document is part of <a href="/lexicon/context-engineering/">context engineering</a>. It reduces drift when sessions reset.</p>
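
<p>For scale, a plan doc rarely needs more than this. The contents here are hypothetical:</p>

<pre><code># PLAN.md (sketch)
Goal: add rate limiting to the export endpoint
Approach: reuse the middleware pattern from src/middleware/auth.ts
Steps: 1) middleware, 2) config flag, 3) integration test
Risks: shared Redis client; do not touch connection setup
Checks: npm test -- rate-limit; existing export tests stay green</code></pre>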

<h2>Step 3: implement with checks</h2>

<p>Now edit code, run tests, fix failures. Keep the loop tight: change, verify, commit.</p>

<p>If quality drops, clear context and continue from the written plan instead of arguing in chat.</p>

<h2>Why it works</h2>

<p>Separating thinking modes reduces contradictions in the prompt. The model does not have to infer whether it should explore or execute.</p>

<p>You get less thrashing, fewer reverts, and clearer commit history.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Build a Ralph Loop From Scratch</title>
      <link>https://siever.ing/build-a-ralph-loop-from-scratch/</link>
      <guid>https://siever.ing/build-a-ralph-loop-from-scratch/</guid>
      <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
      <description>A clean loop with fresh context on every iteration gives agents more consistent output than one long chat session.</description>
      <content:encoded><![CDATA[<p>The <a href="/lexicon/ralph-loop/">Ralph Loop</a> idea is simple: short autonomous runs, fresh context each time, strict verification in between.</p>

<p>It sounds almost too basic. It works because it avoids the long-session failure modes we keep seeing.</p>

<h2>Minimal setup</h2>

<p>You need three files:</p>

<ol>
  <li>A spec file with goal, constraints, and done criteria.</li>
  <li>A task input file for the current iteration.</li>
  <li>A verify script that can fail hard on bad output.</li>
</ol>

<p>Then run the loop, verify, and keep only the changes that pass.</p>
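
<p>A minimal sketch, in the spirit of Huntley's one-liner. Here <code>claude-code</code> stands in for whatever agent CLI you run, and <code>verify.sh</code> is your own hard-failing check script:</p>

<pre><code>#!/usr/bin/env bash
# One focused task per iteration, fresh context every time.
while :; do
  cat PROMPT.md | claude-code          # one unit of work, clean window
  if ./verify.sh; then                 # tests, lint, build; exit non-zero on bad output
    git add -A && git commit -m "ralph: verified iteration"
  else
    git checkout -- . && git clean -fd # discard anything that failed the checks
  fi
done</code></pre>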

<h2>Why this beats marathon sessions</h2>

<p>Each iteration starts clean, so <a href="/lexicon/context-poisoning/">context poisoning</a> does not accumulate. You also stay far from the <a href="/lexicon/dumb-zone/">dumb zone</a>.</p>

<p>The model does one focused unit of work, not twenty mixed goals in one thread.</p>

<h2>Where people get stuck</h2>

<p>Most failures come from weak verification. If your checks are soft, the loop just ships bad work faster.</p>

<p>Strong <a href="/lexicon/back-pressure/">back pressure</a> is the point. Tests, lints, build, optional screenshots. No green, no merge.</p>

<h2>Start small</h2>

<p>Run this pattern on one feature branch first. Keep iteration scope tiny. One task that fits in a short session.</p>

<p>Once that is stable, increase autonomy. Do not start with a giant all-day loop.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Lost in the Middle: A Practical Playbook</title>
      <link>https://siever.ing/lost-in-the-middle-practical-playbook/</link>
      <guid>https://siever.ing/lost-in-the-middle-practical-playbook/</guid>
      <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
      <description>The middle of long prompts gets ignored. This playbook keeps key instructions where models actually pay attention.</description>
      <content:encoded><![CDATA[<p>The <a href="/lexicon/lost-in-the-middle/">lost-in-the-middle</a> effect is not theory anymore. You can see it in daily agent sessions.</p>

<p>Important rules buried halfway down a long prompt get skipped. Then people blame the model. Most of the time, the problem is placement, not intelligence.</p>

<h2>Where instructions should live</h2>

<p>Put non-negotiables at the top: coding rules, forbidden paths, test commands. Repeat critical items near the end if needed.</p>

<p>Middle sections should hold disposable detail. If that part gets weaker attention, nothing critical breaks.</p>

<h2>My prompt layout</h2>

<ol>
  <li><strong>Top:</strong> constraints and completion bar.</li>
  <li><strong>Middle:</strong> context notes, references, optional detail.</li>
  <li><strong>Bottom:</strong> current task and explicit next command.</li>
</ol>

<p>That shape follows how attention behaves across a long <a href="/lexicon/context-window/">context window</a>.</p>
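
<p>Concretely, a task prompt in that shape. All contents are placeholders:</p>

<pre><code>## Non-negotiable
- Do not edit db/migrations/
- npm test must pass before you report done

## Context (disposable detail)
- Related modules: src/billing/, src/invoices/
- Prior decision: keep the legacy date format for now

## Current task
Fix the rounding bug in src/billing/tax.ts.
Next command: npm test -- tax</code></pre>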

<h2>When drift starts</h2>

<p>If output quality drops, I do not add more text. I reset and reload. Fresh context beats rescue prompting.</p>

<p>This is why <a href="/lexicon/compaction/">compaction</a> often disappoints. You stay near the noisy zone and hope for miracles.</p>

<h2>One practical rule</h2>

<p>If a rule matters, it should appear in a durable file and in the active task prompt. Redundancy on purpose.</p>

<p>That one habit cut repeated mistakes across my sessions more than any model upgrade.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Tool Overload: The Hidden Cost of MCP Servers</title>
      <link>https://siever.ing/tool-overload-hidden-cost-of-mcp-servers/</link>
      <guid>https://siever.ing/tool-overload-hidden-cost-of-mcp-servers/</guid>
      <pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate>
      <description>More tools do not automatically mean better agents. Every tool definition consumes tokens and increases failure surface.</description>
      <content:encoded><![CDATA[<p>Adding tools feels like progress. Until the agent starts slowing down and missing obvious instructions.</p>

<p>Every tool definition lives in the prompt budget. In a crowded <a href="/lexicon/context-window/">context window</a>, that cost is real.</p>

<h2>The silent tax</h2>

<p>People count tool capability, not tool overhead. I count both. A tool that gets used once a week should not sit in every session.</p>

<p>Unused tools are just token rent. You pay it on every turn.</p>

<h2>What too many tools break</h2>

<ol>
  <li>Selection quality drops. The model picks the wrong tool more often.</li>
  <li>Latency rises. Bigger prompts mean slower calls.</li>
  <li>Instruction focus drops in the middle of the window.</li>
  <li>Debugging gets harder because failure paths multiply.</li>
</ol>

<p>This is the same pattern as bloated microservice APIs. More surface, more mistakes.</p>

<h2>My trim policy</h2>

<p>I keep only tools tied to current work. Everything else stays disabled. If a tool cannot prove weekly value, it gets removed.</p>

<p>Then I wrap core actions in stable CLI commands with clear IO. Better <a href="/lexicon/tool-use/">tool use</a>, less improvisation.</p>
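
<p>In Claude Code, project-scoped servers live in <code>.mcp.json</code>, so trimming is mostly keeping that file short. A sketch; the server name and script are placeholders:</p>

<pre><code>{
  "mcpServers": {
    "issue-tracker": {
      "command": "node",
      "args": ["./tools/issue-tracker-mcp.js"]
    }
  }
}</code></pre>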

<h2>Small set, strong checks</h2>

<p>A lean tool set plus strong <a href="/lexicon/back-pressure/">back pressure</a> beats a giant toolbox every time. Fewer choices, better execution.</p>

<p>Agents are not blocked by missing tools. They are blocked by unclear systems.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Sub-Agents and Context Isolation</title>
      <link>https://siever.ing/sub-agents-and-context-isolation/</link>
      <guid>https://siever.ing/sub-agents-and-context-isolation/</guid>
      <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
      <description>One huge context is fragile. Splitting work into focused sub-agents keeps quality high and sessions predictable.</description>
      <content:encoded><![CDATA[<p>One giant agent session looks productive until it collapses. Too many file reads, too many failed attempts, too much stale intent in one thread.</p>

<p>I get better results with sub-agents. Each one gets a narrow scope, short runtime, and explicit output format.</p>

<h2>Main agent as scheduler</h2>

<p>The primary <a href="/lexicon/agent/">agent</a> should coordinate, not do everything. It delegates search, refactor prep, or test analysis to smaller workers.</p>

<p>This keeps the main <a href="/lexicon/context-window/">context window</a> clean. It also lowers the chance of <a href="/lexicon/hallucination/">hallucination</a> from old noise.</p>

<h2>Where sub-agents help most</h2>

<ol>
  <li>Large codebase search and summarization.</li>
  <li>Test failure triage with exact repro notes.</li>
  <li>Migration prep where one module at a time is safer.</li>
  <li>Documentation extraction from many files.</li>
</ol>

<p>These tasks create big token footprints. Isolation keeps that load out of the main flow.</p>

<h2>Isolation rules I use</h2>

<p>Give each sub-agent one question. One output contract. One stop condition. No open-ended "go explore." That is how sessions stay fast.</p>

<p>I also merge only verified outputs back into the main thread. Otherwise you import <a href="/lexicon/context-poisoning/">context poisoning</a> from failed runs.</p>
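
<p>In Claude Code, that contract fits in a sub-agent definition under <code>.claude/agents/</code>. A sketch; the scope, tool list, and limits are examples, not a standard:</p>

<pre><code>---
name: test-triage
description: Analyze one failing test run and report the minimal repro
tools: Read, Grep, Bash
---

Answer exactly one question: why does the named test fail?
Output contract: failing test name, exact repro command, suspected file.
Stop when you have a repro or after ten tool calls. Never edit files.</code></pre>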

<h2>The payoff</h2>

<p>You get fewer surprises, cleaner diffs, and easier review. The big win is not speed. It is predictability.</p>

<p>Context isolation feels like overhead at first. After a week, it feels like basic safety.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Specs Over Code: The PRD Is the Asset</title>
      <link>https://siever.ing/specs-over-code-prd-is-the-asset/</link>
      <guid>https://siever.ing/specs-over-code-prd-is-the-asset/</guid>
      <pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate>
      <description>Code changes every hour in agentic workflows. The durable value is the spec that survives resets and handoffs.</description>
      <content:encoded><![CDATA[<p>In agentic projects, code is temporary. The spec is durable.</p>

<p>I know that sounds backwards. We used to treat code as the source of truth. With <a href="/lexicon/agent/">agents</a>, code gets reshaped constantly. If your spec is weak, every run drifts.</p>

<h2>What survives a reset</h2>

<p>When I run <code>/clear</code>, the model loses chat history. It does not lose my repo files. So I put important decisions in spec docs, not in chat messages.</p>

<p>A good spec carries intent across sessions. It keeps quality stable even when <a href="/lexicon/context-poisoning/">context poisoning</a> starts showing up.</p>

<h2>What I put in the spec</h2>

<ol>
  <li><strong>Goal:</strong> one outcome, written in plain language.</li>
  <li><strong>Constraints:</strong> files to avoid, APIs to keep, failure conditions.</li>
  <li><strong>Checks:</strong> exact commands that must pass before done.</li>
  <li><strong>Done criteria:</strong> what "shipped" means in this repo.</li>
</ol>

<p>That is enough for reliable execution. Anything extra is usually noise.</p>
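
<p>Put together, a spec that small fits on one screen. Everything below is a made-up example:</p>

<pre><code># SPEC.md (sketch)

Goal: users can download their invoices as CSV from /invoices.

Constraints:
- Do not touch src/auth/ or db/migrations/
- Keep the existing /invoices JSON endpoint unchanged

Checks:
- npm run typecheck && npm test && npm run lint

Done: all checks pass and the export works for a 10k-row account.</code></pre>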

<h2>Prompt-driven work does not scale</h2>

<p>Prompt-only flows feel fast on day one. By day ten nobody remembers why a rule exists. Teams re-explain the same thing in every session and burn tokens doing it.</p>

<p>Spec-driven flow fixes that. The model reads stable docs, not random chat leftovers. That is core <a href="/lexicon/context-engineering/">context engineering</a>.</p>

<h2>Write for execution, not for slides</h2>

<p>Most PRDs are written for humans in meetings. Agent-readable specs are different: direct language, short sections, concrete commands, zero hand-wavy text.</p>

<p>If an agent can run your task from spec and pass checks, your spec is good. If not, rewrite it before touching code.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Back Pressure: How Agents Know They&#39;re Wrong</title>
      <link>https://siever.ing/back-pressure-how-agents-know-theyre-wrong/</link>
      <guid>https://siever.ing/back-pressure-how-agents-know-theyre-wrong/</guid>
      <pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate>
      <description>Agents do better work when the system pushes back with hard checks instead of polite guesses.</description>
      <content:encoded><![CDATA[<p>An <a href="/lexicon/agent/">agent</a> without feedback is just a fast guesser. It writes code, sounds confident, and misses the bug right in front of it.</p>

<p>The fix is <a href="/lexicon/back-pressure/">back pressure</a>. Force reality into the loop. Tests fail. Types fail. Lint fails. Screenshots fail. The agent adapts or it stops.</p>

<h2>Confidence is cheap</h2>

<p>LLMs are trained to continue text, not to prove truth. So if your setup only asks for a diff, you get polished nonsense surprisingly often.</p>

<p>I stopped asking agents for "clean code." I ask for passing checks. Same task, better output.</p>

<h2>Four pressure points that work</h2>

<ol>
  <li><strong>Types:</strong> static checks catch structural mistakes before runtime.</li>
  <li><strong>Tests:</strong> behavioral checks catch wrong assumptions.</li>
  <li><strong>Linters:</strong> style and risky patterns get flagged early.</li>
  <li><strong>UI evidence:</strong> screenshots show if the interface is actually right.</li>
</ol>

<p>This is just engineering hygiene. In an agent flow, it becomes non-negotiable.</p>
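
<p>Wired together, the gate is one command the agent must pass before claiming done. Script names assume a typical npm setup:</p>

<pre><code>npm run typecheck && npm run lint && npm test && npm run build</code></pre>

<p>The chain stops at the first failure. The agent gets the error, not a green light.</p>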

<h2>Back pressure beats longer prompts</h2>

<p>When output drifts, most people add more prompt text. I do the opposite. Keep instructions short, strengthen checks.</p>

<p>Prompt text can be ignored in a crowded <a href="/lexicon/context-window/">context window</a>. A failing test cannot be ignored. It blocks progress immediately.</p>

<h2>The practical loop</h2>

<p>My default run looks like this: task, diff, tests, patch, repeat. If the agent keeps failing, I <code>/clear</code>, reload spec, and run again with fresh <a href="/lexicon/context-engineering/">context engineering</a>.</p>

<p>That one change made agent sessions calmer and cheaper. Less debate with the model. More verifiable output.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
