<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>grovestack</title>
    <link>https://www.grovestack.net</link>
    <description>Engineering leadership in the age of coding agents.</description>
    <language>en-us</language>
    <lastBuildDate>Tue, 21 Apr 2026 04:16:57 GMT</lastBuildDate>
    <item>
      <title>Your Org Chart Is About to Change</title>
      <link>https://www.grovestack.net/your-org-chart-is-about-to-change/</link>
      <guid isPermaLink="true">https://www.grovestack.net/your-org-chart-is-about-to-change/</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <description>The ratio of architects to implementers is inverting.</description>
      <content:encoded><![CDATA[<p>Your org chart still assumes humans are the implementation layer.</p>
<p>That assumption is about to get expensive.</p>
<p>Every software org was designed around one idea: people write the code. Now AI can.</p>
<h2>Why the old org chart looks the way it does</h2>
<p>Every org chart you have ever worked inside was drawn to solve one scarcity problem: implementation capacity. Humans type code slowly, so you needed a lot of them. You needed managers to coordinate the lot of them, tech leads to hold the interfaces together, and engineering managers to hold the tech leads together. Each box on the chart is a claim on scarce implementation hours.</p>
<p>That scarcity has collapsed. A coding agent with a well-written specification generates in an afternoon what used to take a two-week sprint. The engineers OpenAI described in their February 2026 case study — three, then seven — shipped roughly one million lines of production code in five months. They were not staffed for implementation at all. They were staffed for specification, delegation, and validation. Throughput grew as the team grew, in direct contradiction to Brooks&#39; Law, because the coordination cost sat in the harness rather than in meetings.</p>
<p>If implementation is no longer the bottleneck, the chart drawn around that bottleneck is the wrong chart.</p>
<p>The ratio of architects to implementers is inverting.</p>
<h2>What the new role actually does</h2>
<p>Call it the Architect-CEO. The name is the job description.</p>
<p>An Architect-CEO does three things, over and over, on every unit of work:</p>
<ul>
<li><strong>Specifies</strong> what should be built — the input contract, the output contract, the acceptance criteria, the edge cases.</li>
<li><strong>Directs</strong> the system that builds it — assigning the task to a coding agent, or a fleet of them, with the context and constraints the agent needs to succeed.</li>
<li><strong>Validates</strong> that what the system built is correct — through automated checks, not by hand-reading diffs.</li>
</ul>
<p>That is the loop. It is the same loop a CEO runs inside a company: define outcomes, delegate, verify. The coding agent is the executor. The architect is the one deciding what &quot;done&quot; means and who does what.</p>
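<p>The loop is small enough to sketch in Python. Everything below is illustrative: <code>run_agent</code> is a stand-in for whatever coding-agent API you use, stubbed with a hand-written implementation so the loop runs end to end.</p>

```python
import re

# Specify: a contract the output must satisfy, not a prompt.
spec = {
    "name": "slugify",
    "behavior": "turn a title into a lowercase, hyphen-separated slug",
    "acceptance": [
        ("Hello World", "hello-world"),
        ("  Spaces  ", "spaces"),
    ],
}

def run_agent(spec):
    # Direct: in practice this hands the spec to a coding agent.
    # Stubbed here with a hand-written implementation so the sketch runs.
    return lambda s: re.sub(r"\s+", "-", s.strip().lower())

def validate(impl, spec):
    # Validate: deterministic checks against the contract, not hand-read diffs.
    return all(impl(given) == expected for given, expected in spec["acceptance"])

impl = run_agent(spec)
assert validate(impl, spec)
```

<p>When validation fails, the fix goes into <code>spec</code> and the implementation is regenerated; the document stays the source of truth.</p>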
<p>The discipline that makes this role work is not typing speed. It is the same set of coordination primitives that structure any high-functioning team, compressed into one person&#39;s operating model: <strong>discovery</strong> (which agent, with which capability, is available?), <strong>messaging</strong> (what structured context crosses the boundary between you and that agent?), and <strong>delegation</strong> (who has the task, how long do they have, what happens when they fail?). Every multi-agent system must solve those three concerns. Every new-shape engineering org will too, whether the leadership names it that way or not.</p>
<h2>The structural implications — not cosmetic ones</h2>
<p>This is not a rebrand. The operating model actually changes.</p>
<p><strong>Specifications replace handoffs.</strong> The thing a junior engineer used to receive from a senior engineer — a Jira ticket, a paragraph in Slack, a whiteboard sketch — becomes a versioned Markdown file with explicit contracts. The agent reads the spec; the spec is the handoff. There is no drift between what was asked for and what was implemented, because the asked-for is the document the implementation is regenerated from.</p>
<p><strong>Validation replaces supervision.</strong> The thing a tech lead used to do — read a pull request, eyeball the code, approve it — becomes an automated pipeline: type checks, tests, security scans, maintainability reviews, performance profiles, each running in parallel on every diff. A human reviewer gates escalations. The pipeline gates everything else. The validation pipeline is the supervision layer.</p>
<p><strong>Orchestration replaces inter-team coordination.</strong> The thing a program manager used to do — wire three teams together, manage the dependencies, chase the status updates — becomes a workflow graph with typed messages crossing its edges. Fan-out to parallel agents. Reducers that merge their outputs. Correlation IDs that trace every handoff end-to-end. The orchestrator runs the coordination protocol. The protocol is the program manager.</p>
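<p>The coordination pattern itself is compact. This is a hypothetical shape, not any specific orchestrator&#39;s API: one correlation ID threads through a fan-out to parallel workers, and a reducer merges their outputs.</p>

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def make_task(correlation_id, topic, payload):
    # Each handoff carries an explicit ID, topic, and correlation ID.
    return {"id": str(uuid.uuid4()), "correlation_id": correlation_id,
            "topic": topic, "payload": payload}

def worker(task):
    # Stand-in for an agent working one edge of the workflow graph.
    return {"correlation_id": task["correlation_id"], "topic": task["topic"],
            "result": f"done:{task['payload']}"}

def reduce_results(results):
    # Reducer: merge parallel outputs into one report, keyed by topic.
    return {r["topic"]: r["result"] for r in results}

corr = str(uuid.uuid4())
tasks = [make_task(corr, t, t) for t in ("security", "tests", "docs")]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(worker, tasks))

# Every handoff in the run shares one correlation ID, so it traces end-to-end.
assert all(r["correlation_id"] == corr for r in results)
report = reduce_results(results)
```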
<p>Anthropic&#39;s multi-agent research made the mechanism concrete. A lead agent decomposed a complex task into subtasks, each specified with an explicit objective, output format, and boundary. Subagents executed in parallel and returned condensed findings. Compared with a single agent running the same breadth-first queries, the multi-agent system delivered a 90% performance uplift, but only when the handoffs between agents were typed and structured. When the same team tried unstructured natural-language handoffs, coordination quality collapsed. Structure was not overhead. Structure was the reason it worked.</p>
<p>None of this means fewer people. It means different people, doing different work, with dramatically more leverage per head.</p>
<h2>What &quot;hiring for the old org chart&quot; looks like</h2>
<ul>
<li>Job reqs counted in &quot;engineering years&quot; rather than specification and validation fluency.</li>
<li>Seniority ladders where the top rung is measured in code review volume or lines shipped.</li>
<li>Orgs hiring ten &quot;AI-assisted developers&quot; when they need one Architect-CEO and a harness.</li>
<li>Leadership expecting a productivity uplift from tools without touching the reporting structure, the definition of done, or the way work gets handed off.</li>
</ul>
<p>If any of those describe your program, you are buying the tool without buying the operating model. The first will not work without the second.</p>
<h2>What to do this week</h2>
<ol>
<li><p><strong>Pick one flow and redraw it as a coordination graph.</strong> Who specifies, who directs, who validates, and how does work move between them? If the answer is &quot;a senior engineer does all three by hand,&quot; that is the role to redesign.</p>
</li>
<li><p><strong>Stop hiring for implementation. Start hiring for the harness.</strong> Your next requisition should be written around the ability to specify behavior, design validation pipelines, and reason about delegation — not around languages and frameworks.</p>
</li>
<li><p><strong>Name the Architect-CEO role on a real team and measure what they ship.</strong> Not lines of code. Specifications authored. Workflows orchestrated. Validation pipelines tightened. Time-to-deployment for a typical feature. The metrics have to change for the operating model to change.</p>
</li>
</ol>
<h2>The operating model, not the tool</h2>
<p>Instead of assigning five engineers to build a feature, one person defines the spec, sets constraints, and evaluates outputs from AI systems.</p>
<p>Specifications replace handoffs. Validation replaces supervision. Orchestration replaces the coordination that used to happen between people.</p>
<p>This is not downsizing. It is an operating model shift.</p>
<p>Engineers who learn to specify, direct, and validate become dramatically more valuable. But only if leadership redesigns the system around that role.</p>
<p>AI doesn&#39;t change software economics on its own. The org chart does.</p>
<p>Are you still hiring for the old org chart?</p>
]]></content:encoded>
    </item>
    <item>
      <title>Why Your AI Pilot Failed</title>
      <link>https://www.grovestack.net/why-your-ai-pilot-failed/</link>
      <guid isPermaLink="true">https://www.grovestack.net/why-your-ai-pilot-failed/</guid>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <description>Seven out of ten AI pilots never make it to production. The pattern isn't random.</description>
      <content:encoded><![CDATA[<p>Seven out of ten AI pilots never make it to production.</p>
<p>The models keep getting better. The failure rate has not moved.</p>
<p>That is not a model problem. It is an engineering problem.</p>
<h2>Why most pilots skip the engineering step</h2>
<p>Every pilot playbook treats model access as the hard part. Procure the API key, enable the seats, run a kickoff, collect feedback. The technology is the point; the operating discipline is whatever the team figures out along the way.</p>
<p>That order is inverted. The model is a commodity. The discipline around it is not.</p>
<p>Coding agents amplify whatever you feed them. A vague request produces confident, plausible output that is subtly wrong in ways a human reviewer will miss on a tight deadline. A written specification with explicit input and output contracts, acceptance criteria, and boundary behavior produces output you can validate against something concrete. Same model, same week — different economics.</p>
<p>Pilots fail when teams treat prompts as instructions and outputs as deliverables. Prompts are ephemeral. They live in a chat window and disappear when the conversation ends. A specification is a versioned file, revised when requirements change, diffed when implementations drift, and reused every time the code is regenerated. One is a demo. The other is a system.</p>
<h2>What pilot-failure mode looks like from the outside</h2>
<ul>
<li>Demos are impressive. Second runs are inconsistent. Nobody can reproduce last week&#39;s good result.</li>
<li>Rework costs climb. Every agent output requires a human to clean it up before it lands.</li>
<li>The team blames the model. Every conversation ends with &quot;this will get better when the next release drops.&quot;</li>
<li>There is no artifact you could hand to a new engineer that would let them run the same loop tomorrow.</li>
<li>Leadership concludes the technology is not ready. What actually failed is the operating model around it.</li>
</ul>
<p>If any three of those describe your program, you are not running a pilot. You are running a demo on loop.</p>
<h2>What the harness actually looks like</h2>
<p>The harness is not a tool. It is the discipline that makes good output repeatable.</p>
<p>In February 2026, OpenAI published a case study on a team of three engineers — later seven — that shipped roughly one million lines of production code in five months without writing a line by hand. Every line was generated by coding agents. The engineers&#39; job was what they called harness engineering: designing the codebase structure, writing specifications, encoding architectural constraints as automated rules, and building validation pipelines that caught agent mistakes before they reached production.</p>
<p>Three things define that harness in practice.</p>
<p><strong>A specification as the source of truth.</strong> For every component the agent touches, there is a Markdown file in version control that defines the input contract, the output contract, the acceptance criteria, and the edge cases. When the output is wrong, you fix the specification, not the generated code. Fixing the code addresses the symptom. Fixing the specification addresses the cause — and every future generation respects the fix.</p>
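<p>For illustration only (the component, fields, and error name below are invented, not drawn from the case study), such a file might look like this:</p>

```markdown
# Spec: parse_order_csv

## Input contract
- UTF-8 CSV file with header row `order_id,amount,currency`
- `amount` is a decimal with two fraction digits; `currency` is an ISO 4217 code

## Output contract
- A list of records: `{"order_id": str, "amount": Decimal, "currency": str}`

## Acceptance criteria
- A valid file of N data rows returns N records, in file order

## Edge cases
- A file containing only the header row returns an empty list
- A malformed row raises `OrderParseError` naming the row number
```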
<p><strong>Deterministic validation on top of probabilistic generation.</strong> The agent&#39;s interpretation of the specification is probabilistic. The validation that follows is not. Test suites return pass or fail. Type checkers accept or reject. Linters flag or approve. A guard condition gates deployment: if any check fails, the implementation is rejected and the specification is revised. The spec defines intent. The tools enforce it.</p>
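<p>A sketch of the guard condition, with placeholder checks standing in for the real tools (a test runner, a type checker, a linter):</p>

```python
# Deterministic gate over probabilistic generation: every check returns
# pass or fail, and a single failure rejects the implementation.

def run_tests(code):  # placeholder for a real test runner
    return "assert" in code

def type_check(code):  # placeholder for a real type checker
    return "->" in code

def lint(code):  # placeholder for a real linter
    return "\t" not in code

CHECKS = [run_tests, type_check, lint]

def gate(code):
    failures = [check.__name__ for check in CHECKS if not check(code)]
    # Guard condition: any failing check rejects the implementation.
    return {"accepted": not failures, "failures": failures}

good = "def add(a: int, b: int) -> int:\n    assert True\n    return a + b"
assert gate(good) == {"accepted": True, "failures": []}
assert gate("def f(): pass")["failures"] == ["run_tests", "type_check"]
```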
<p><strong>Parallel, multi-dimensional review.</strong> Where a traditional team runs sequential human reviews bounded by attention and deadline pressure, a harnessed team fans out analysis agents in parallel: security review, performance profiling, test coverage, maintainability. A reducer merges the findings. Nothing skips the performance check because someone is tired on Friday. The review is more thorough than human review because it runs every analysis every time, with no fatigue and no shortcuts.</p>
<p>That is the difference between &quot;we have an AI pilot&quot; and &quot;we have a production AI system.&quot; It is not a model. It is a harness.</p>
<h2>What to do Monday</h2>
<ol>
<li><p><strong>Pick one component from your pilot and write a specification before anything else.</strong> Input contract, output contract, acceptance criteria, three example input-output pairs, two edge cases. Direct the agent to implement from the spec. Validate the output against the spec. That is one turn of the loop you should be running on every piece of work.</p>
</li>
<li><p><strong>Add automated validation that the agent must pass, not you.</strong> Tests, type checks, linters, a security scan. Make the validation pipeline the gate. If a human has to read every diff line by line before it lands, you have not industrialized anything.</p>
</li>
<li><p><strong>Track iterations-per-specification.</strong> Count how many times you revise a spec before its output passes. That number tells you whether your specifications or your validation are the weakest link, and it replaces &quot;how&#39;s the pilot going?&quot; with a metric leadership can actually read.</p>
</li>
</ol>
<h2>The discipline, not the demo</h2>
<p>Harness, not hope.</p>
<p>The organizations that scale AI are not the ones with the most pilot activity. They are the ones engineering the system that makes good output repeatable.</p>
<p>AI changes software economics only when leadership changes the operating model around how these systems are built, validated, and scaled.</p>
<p>Are you still hoping the next model will fix it?</p>
]]></content:encoded>
    </item>
    <item>
      <title>Your AI Is Only As Good As Its Context</title>
      <link>https://www.grovestack.net/your-ai-is-only-as-good-as-its-instructions/</link>
      <guid isPermaLink="true">https://www.grovestack.net/your-ai-is-only-as-good-as-its-instructions/</guid>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <description>They're still writing prompts. High-performing teams are assembling context.</description>
      <content:encoded><![CDATA[<p>Garbage in. Garbage out.</p>
<p>AI didn&#39;t change that. It multiplied it.</p>
<p>Same model. Same capabilities. Completely different results.</p>
<p>Why? Because AI doesn&#39;t &quot;understand.&quot; It predicts — based on the context you give it.</p>
<h2>It&#39;s not a model problem. It&#39;s a context problem.</h2>
<p>Most teams think bad AI output means the model failed. It didn&#39;t. Every large language model generates text by predicting the next token based on what&#39;s already in its context window. The attention mechanism inside the model assigns weights to each token by relevance, and those weights determine what the output reflects. What reaches the context window determines what gets attended to. What gets attended to determines what comes out.</p>
<p>That is not a metaphor. It is the mechanism.</p>
<p>Give the model vague input and it doesn&#39;t stop. It fills in the gaps with assumptions pulled from patterns in its training data, and it keeps going. The output is fluent. It is confident. It is also, often, wrong. That is where &quot;confidently wrong&quot; comes from: the model filling in missing context with plausible-looking inventions, then sounding sure about them.</p>
<p>Give the same model structured context — clear requirements, explicit constraints, precise definitions, examples of what correct looks like — and the attention mechanism concentrates. The output sharpens. Same AI. Different discipline. Different outcome.</p>
<h2>This isn&#39;t about better prompts. It&#39;s about better inputs.</h2>
<p>The word &quot;prompt&quot; is part of the problem. It implies a single request into a chat window. A prompt is one sentence. An agent&#39;s context is everything the model sees at the moment of generation.</p>
<p>Context is layered:</p>
<ul>
<li><strong>System instructions</strong> — the durable framing and rules that govern how the agent should behave.</li>
<li><strong>Data</strong> — documents, source code, prior outputs, relevant files retrieved from the codebase or a knowledge base.</li>
<li><strong>User input</strong> — the current request, the current task, the question being asked right now.</li>
<li><strong>State</strong> — accumulated results from prior steps in the workflow: what was decided, what was produced, what happened.</li>
</ul>
<p>Layer by layer. If those layers are vague, incomplete, or misaligned, the output won&#39;t just be wrong — it will be confidently wrong. If those layers are structured and intentional, the model locks in, and the output sharpens.</p>
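<p>The stack is concrete enough to write down. A minimal sketch, with invented layer labels rather than any particular framework&#39;s API:</p>

```python
# Context is assembled, not typed: each layer is explicit and ordered.
def assemble_context(system, data, state, user_input):
    return "\n\n".join([
        f"[system]\n{system}",
        f"[data]\n{data}",
        f"[state]\n{state}",
        f"[user]\n{user_input}",
    ])

context = assemble_context(
    system="You are a code reviewer. Flag violations of the style guide.",
    data="style guide: no bare except clauses; max function length 40 lines",
    state="step 3 of 5; previous step flagged two violations in utils.py",
    user_input="Review the attached diff for utils.py",
)
assert context.index("[system]") < context.index("[user]")
```

<p>If any layer is empty or ad hoc, the gap is visible in one place instead of being discovered in the output.</p>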
<p>Most teams are still treating AI like a search box — typing requests and hoping for good results. The teams pulling ahead are doing something different: they&#39;re engineering the context. They decide what the model sees, how it&#39;s structured, and what &quot;correct&quot; looks like before generation starts.</p>
<h2>State is context</h2>
<p>There is one layer of the context stack that most organizations underinvest in — and it is the layer that separates one-shot demos from production systems.</p>
<p>When a coding agent runs a real task, it rarely finishes in a single step. It reads a spec, generates code, runs tests, reads the output, revises, and repeats. Each step produces data the later steps need. The accumulated result — which files were written, which tests passed, which validations fired, which decisions were made — becomes the context for the next step.</p>
<p>In other words, the context at step seven is everything the workflow has done by step six. If that state is well-structured, correctly typed, and carries a provenance record of who produced what and when, the agent at step seven has a precise picture of the task so far. If state is a pile of untyped globals, temp files, and mystery strings, the context is degraded before the next step starts — and the failure surfaces three steps downstream with an unrelated error message.</p>
<p>This is why the harness treats state as a first-class artifact. Typed accessors catch type mismatches the moment they happen, not five steps later. Versioned state lets you query what a field contained at any point in the workflow. A mutation log records every change — which field, what value, which step, what time. Provenance is not compliance theater; it is the debugging story for agent workflows, and it is what makes the context reliable as the work compounds.</p>
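<p>A minimal sketch of that discipline (the class and method names are illustrative, not a specific library):</p>

```python
import time

class WorkflowState:
    """State as a first-class artifact: typed reads plus a mutation log."""

    def __init__(self):
        self._data = {}
        self.mutation_log = []  # (field, value, step, timestamp)

    def set(self, field, value, step):
        # Every change is recorded: which field, what value, which step, when.
        self.mutation_log.append((field, value, step, time.time()))
        self._data[field] = value

    def get(self, field, expected_type):
        value = self._data[field]
        if not isinstance(value, expected_type):
            # Fail at the read with a clear error, not three steps later.
            raise TypeError(f"{field}: expected {expected_type.__name__}, "
                            f"got {type(value).__name__}")
        return value

state = WorkflowState()
state.set("tests_passed", 12, step="run_tests")
assert state.get("tests_passed", int) == 12
```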
<p>Good state management is good context management. The two are the same discipline at different timescales.</p>
<h2>Prompts vs. assembled context</h2>
<p>Here is the operational shift most teams haven&#39;t made yet:</p>
<table>
<thead>
<tr>
<th>They&#39;re doing</th>
<th>The teams pulling ahead are doing</th>
</tr>
</thead>
<tbody><tr>
<td>Writing prompts</td>
<td>Assembling context</td>
</tr>
<tr>
<td>Optimizing wording</td>
<td>Structuring inputs</td>
</tr>
<tr>
<td>Hoping for good output</td>
<td>Validating against contracts</td>
</tr>
<tr>
<td>Treating state as session noise</td>
<td>Treating state as a versioned asset</td>
</tr>
<tr>
<td>Fixing bad output by re-prompting</td>
<td>Fixing bad output by improving context</td>
</tr>
</tbody></table>
<p>The teams in the right column ship systems. The teams in the left column ship demos.</p>
<h2>What to do this week</h2>
<ol>
<li><p><strong>Audit your most-used agent workflow for its context stack.</strong> What system instructions govern it? What data does it retrieve? What user input triggers it? What state does it accumulate? If any layer is missing or ad-hoc, that is where your &quot;confidently wrong&quot; output is coming from.</p>
</li>
<li><p><strong>Replace one untyped field in that workflow with a typed accessor.</strong> If a step reads a variable expecting a string and gets a number, the failure should happen at the read with a clear error — not three steps later with a mystery crash.</p>
</li>
<li><p><strong>Start a mutation log for the next iteration.</strong> Every change to workflow state: which field, what value, what step, what time. When the output is wrong, you&#39;ll trace it in minutes instead of re-running the workflow to reproduce.</p>
</li>
<li><p><strong>Stop rewarding clever prompts. Start rewarding well-engineered context.</strong> Promotion criteria, code review, engineering demos — all of it. What you measure is what you get.</p>
</li>
</ol>
<h2>The shift, one more time</h2>
<p>This isn&#39;t about better prompts. It&#39;s about better inputs.</p>
<p>Same AI. Different inputs. Different outcomes.</p>
<p>Engineer your context.</p>
]]></content:encoded>
    </item>
    <item>
      <title>They Execute. You Decide.</title>
      <link>https://www.grovestack.net/they-execute-you-decide/</link>
      <guid isPermaLink="true">https://www.grovestack.net/they-execute-you-decide/</guid>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <description>AI doesn't change software economics by writing code faster. It changes when leadership changes the operating model.</description>
      <content:encoded><![CDATA[<p>The goal isn&#39;t for coding agents to write the code faster.</p>
<p>The goal is to give them better direction.</p>
<p>The teams pulling ahead aren&#39;t using AI as a tool. They&#39;re using it as a system.</p>
<p>Each agent has a role. Each step has a contract. Each output is validated.</p>
<p>The work gets done. They execute. You decide.</p>
<h2>&quot;A system, not a tool&quot; is a load-bearing sentence</h2>
<p>Most organizations have landed on AI the way they land on any new tool: pilot it, train a few people, watch the demos, decide it&#39;s promising. A coding agent becomes one more thing on the stack — next to the IDE, next to the CI pipeline, next to the feature flag service.</p>
<p>That framing is why the gains stay small. A tool is something a person picks up to do a task. A system is something the organization operates, with defined roles, defined interfaces, and a protocol for how work moves through it. The first gives you a productivity bump for the person holding the tool. The second changes how work gets done.</p>
<p>The distinction is the entire game.</p>
<h2>Each agent has a role</h2>
<p>Start with the word <em>role</em>. In a tool view of AI, there is no such thing — the agent is a generic helper that does whatever the prompt asks for. In a system view, each agent is registered with a capability set: code review, test generation, database migrations, security scanning, infrastructure provisioning, whatever the work requires. The orchestrator queries an <strong>agent registry</strong> to find agents that match the task. If no agent has the capability, the delegation fails immediately, not three steps later with an unrelated error.</p>
<p>This looks simple in isolation, but it is decisive in aggregate. A registry turns &quot;which agent should I ask?&quot; from a judgment call into a lookup. It also turns &quot;is the agent still alive?&quot; from a hope into a heartbeat. Production multi-agent systems — the ones that actually ship rather than demo — run this pattern at the boundary of every delegation, because a registry that doesn&#39;t reflect reality is worse than no registry at all. An agent marked available that crashed five minutes ago is a landmine.</p>
<p>The role, in other words, is not a slide in a deck. It is a structured entry in a system the orchestrator queries on every task.</p>
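<p>A sketch of the lookup, with an invented registry API and an assumed 30-second liveness threshold:</p>

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # assumed threshold; tune to your environment

class Registry:
    def __init__(self):
        self._agents = {}

    def register(self, name, capabilities):
        self._agents[name] = {"caps": set(capabilities),
                              "last_seen": time.time()}

    def heartbeat(self, name):
        # Liveness is reported by the agent, not assumed by the orchestrator.
        self._agents[name]["last_seen"] = time.time()

    def find(self, capability):
        now = time.time()
        # Only agents that are both capable and recently alive qualify.
        return [name for name, agent in self._agents.items()
                if capability in agent["caps"]
                and now - agent["last_seen"] < HEARTBEAT_TIMEOUT]

registry = Registry()
registry.register("reviewer-1", ["code_review", "security_scan"])
registry.register("tester-1", ["test_generation"])
assert registry.find("code_review") == ["reviewer-1"]
assert registry.find("database_migration") == []  # fails at delegation time
```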
<h2>Each step has a contract</h2>
<p>Now the word <em>contract</em>. When a specification-grade role passes work to another role, it does not pass a paragraph of natural language and hope for the best. It passes a <strong>typed message</strong> with explicit fields: a unique ID, a topic, a sender, a correlation ID that links related messages across the workflow, and a payload validated against a schema registered for that topic.</p>
<p>If the payload is missing a required field or the types are wrong, the message is rejected at the boundary. The receiving agent never sees malformed data. The failure is loud, early, and traceable to the sender — not silent, late, and blamed on whoever was nearest when the crash happened.</p>
<p>The shift here is not pedantic. Anthropic&#39;s multi-agent research found that structured handoffs between agents produced a 90% performance uplift over a single agent running the same breadth-first task — but only when the handoffs were typed. When the same team tried unstructured natural-language handoffs, the coordination quality collapsed. Structure was not overhead. Structure was the reason it worked.</p>
<p>Every step having a contract also means every step is <em>debuggable</em>. Typed messages have IDs, timestamps, and correlation IDs. When something goes wrong in a multi-agent workflow, you trace the exact sequence of messages that led to the failure. Try doing that with unstructured text passed between chat windows. You can&#39;t. The difference is the difference between engineering and storytelling.</p>
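<p>The boundary check is a few lines. A sketch using a hand-rolled schema map; a production system would more likely use JSON Schema or a similar validator:</p>

```python
import uuid

# One schema per topic: required fields and their types.
SCHEMAS = {"code_review.request": {"repo": str, "diff": str, "priority": int}}

def send(topic, sender, correlation_id, payload):
    schema = SCHEMAS[topic]
    missing = [f for f in schema if f not in payload]
    wrong = [f for f, t in schema.items()
             if f in payload and not isinstance(payload[f], t)]
    if missing or wrong:
        # Loud, early, traceable to the sender; the receiver never sees it.
        raise ValueError(f"{sender}: missing={missing} wrong_types={wrong}")
    return {"id": str(uuid.uuid4()), "topic": topic, "sender": sender,
            "correlation_id": correlation_id, "payload": payload}

message = send("code_review.request", "orchestrator", "run-42",
               {"repo": "api", "diff": "@@ -1 +1 @@", "priority": 1})
assert message["correlation_id"] == "run-42"
```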
<h2>Each output is validated</h2>
<p>The third word is <em>validated</em>. Before any agent&#39;s output crosses a boundary — into production, into another agent&#39;s context, into a decision that affects a human — an automated pipeline runs deterministic checks: tests, type checks, linters, security scans, performance profiles, coverage analysis. Pass or fail. Not opinions.</p>
<p>In a tool view of AI, validation is a human reviewer eyeballing a diff. In a system view, validation is a pipeline that runs on every output, every time, with a gate condition: if any check rejects, the implementation is rejected and the step retries against a revised specification. Humans gate the escalations — destructive boundaries, production deploys, schema changes, anything irreversible. The pipeline handles everything else.</p>
<p>This is how probabilistic generation becomes deterministic behavior. The model can be creative. The validator is not. The creativity is caged by a contract the code must satisfy before it ships.</p>
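<p>The gate plus the escalation boundary fits in a few lines. A sketch: the destructive-step list and check names here are assumptions, not a standard:</p>

```python
DESTRUCTIVE = {"schema_change", "production_deploy"}  # assumed policy list

def checks_pass(output):
    # Stand-in for the real pipeline: tests, types, lint, scans.
    return output.get("tests") == "pass" and output.get("types") == "pass"

def gate(step, output, human_approved=False):
    if not checks_pass(output):
        return "rejected"   # revise the spec and regenerate
    if step in DESTRUCTIVE and not human_approved:
        return "escalated"  # irreversible steps wait for a human
    return "shipped"        # everything else flows through the pipeline

ok = {"tests": "pass", "types": "pass"}
assert gate("refactor", ok) == "shipped"
assert gate("schema_change", ok) == "escalated"
assert gate("refactor", {"tests": "fail"}) == "rejected"
```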
<h2>They execute. You decide.</h2>
<p>Put the three concepts together and you have the operating model the sentence is pointing at.</p>
<ul>
<li><strong>Roles</strong> make work <em>assignable</em> — a registry answers &quot;who can do X?&quot; in one query.</li>
<li><strong>Contracts</strong> make handoffs <em>reliable</em> — typed messages carry what the next step needs, validated before delivery.</li>
<li><strong>Validation</strong> makes outputs <em>gateable</em> — automated checks decide pass or fail before anything ships.</li>
</ul>
<p>The person in the center of this system is not typing faster. They are doing what a CEO does inside a company: defining outcomes, delegating, verifying. The coding agents are the executors. The human decides what &quot;done&quot; means, which validations gate which steps, and where the escalation boundaries sit.</p>
<p>That is the shift. Not &quot;AI writes code faster.&quot; <em>The organization writes software differently.</em></p>
<h2>What to do this week</h2>
<ol>
<li><strong>Name the roles in your most-used AI-assisted workflow.</strong> If every task goes to &quot;the AI&quot; rather than to a role with declared capabilities, you have a tool, not a system. The fix is to decompose: code review, test generation, spec authoring, security scan — each with a capability label the orchestrator can look up.</li>
<li><strong>Turn one handoff into a contract.</strong> Pick one step where work passes between humans (or between a human and the agent) via Slack messages, tickets, or ad-hoc paragraphs. Replace it with a typed payload: required fields, validated schema, correlation ID. Watch how much downstream rework disappears.</li>
<li><strong>Put a validation gate on one output.</strong> Not a review meeting. A pipeline. Pass/fail before the output crosses the next boundary. Measure the defect rate before and after. That is your operating-model-change metric.</li>
</ol>
<h2>The shift, one more time</h2>
<p>Each agent has a role. Each step has a contract. Each output is validated.</p>
<p>The work gets done. They execute. You decide.</p>
<p>AI doesn&#39;t change software economics by writing code faster. It changes when leadership changes the operating model.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Fix the Spec, Not the Code</title>
      <link>https://www.grovestack.net/fix-the-spec-not-the-code/</link>
      <guid isPermaLink="true">https://www.grovestack.net/fix-the-spec-not-the-code/</guid>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <description>AI doesn't misunderstand requirements. It fills in gaps. Silently. Confidently. At scale.</description>
      <content:encoded><![CDATA[<p>There&#39;s a name for how most teams are using AI right now.</p>
<p>Vibe coding.</p>
<p>You type something. It gives you something back. You look at it and think: &quot;This is wrong.&quot; So you fix it. Then you do it again. And again.</p>
<p>That loop feels like progress. It isn&#39;t. It&#39;s rework.</p>
<h2>The uncomfortable truth</h2>
<p>The problem usually isn&#39;t the code. It&#39;s the specification.</p>
<p>AI doesn&#39;t misunderstand requirements. It fills in gaps. Silently. Confidently. At scale.</p>
<p>When the output is wrong, most teams do the obvious thing: they fix the code. The teams pulling ahead do something different. They fix the spec.</p>
<p>That is the shift: from reacting to output → to controlling it.</p>
<h2>Specifications fail in predictable ways</h2>
<p>Bad AI output is not a mystery. It is a diagnostic signal. The agent did exactly what the context told it to do. The context was incomplete, ambiguous, or over-prescriptive. There are three failure modes, and after you have seen them named you will recognize them on every team you work with.</p>
<p><strong>Ambiguity.</strong> A sentence in the specification can be read two ways. The agent picks one. You intended the other. The code compiles, the tests you wrote pass, and the behavior is simply not what you expected. Words like <em>should</em>, <em>appropriate</em>, <em>handle</em>, and <em>properly</em> are the giveaway — each one delegates a decision to the agent without acknowledging that the decision is being delegated. &quot;The function should parse the date string and return a date object&quot; is the canonical example: which formats, which library, what happens on invalid input? The agent guesses. You lose.</p>
<p><strong>Underspecification.</strong> The specification never mentions a requirement at all. The agent does not misinterpret intent; it never receives it. The happy path works. The first edge case you try explodes. &quot;Read a CSV file, return a list of dictionaries&quot; sounds complete right up until the file is empty, or the row has an extra column, or a field contains a quoted comma. Underspecification is the failure mode you only notice at 10:47 PM on a Tuesday, because the missing requirement only matters when someone hits it.</p>
<p><strong>Overspecification.</strong> The specification dictates implementation instead of behavior. &quot;Use a dictionary called <code>_cache</code>. Iterate with a for loop. Check the cache before reading the file.&quot; The agent follows the instructions faithfully and produces code that is correct, fragile, and locked into decisions that should have stayed flexible. The damage is invisible until you try to modify, port, or extend the system. Then you discover that the spec constrained the mechanism, not the outcome.</p>
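<p>The ambiguity fix is mechanical once the failure mode is named. Taking the date-parsing sentence as the example, an illustrative before and after (the function name and rules are invented):</p>

```markdown
Before:
> The function should parse the date string and return a date object.

After:
> `parse_date(s: str) -> datetime.date`
> - Accepts exactly one format: ISO 8601 calendar dates (`YYYY-MM-DD`).
> - Raises `ValueError` for any other input, including the empty string.
> - Performs no locale, century, or timezone inference.
```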
<p>There is a fourth failure mode that only appears over time: <strong>specification drift</strong>. You edit the generated code to fix a bug but forget to update the specification. Or the agent produces output beyond what the spec asked for, and you accept it. The specification and the implementation gradually diverge. The next time you regenerate, a feature you had come to rely on quietly disappears — because the spec, the document that is supposed to be the source of truth, never knew about it.</p>
<h2>The diagnostic sequence</h2>
<p>Here is the loop the teams pulling ahead run every time validation fails:</p>
<ol>
<li><strong>Identify the failing behavior.</strong> What did the output do wrong? What did it omit? Be specific — &quot;it didn&#39;t handle empty input&quot; is a fact, &quot;it felt off&quot; is not.</li>
<li><strong>Locate the specification gap.</strong> No relevant sentence exists? Underspecification. A sentence exists but reads two ways? Ambiguity. A sentence dictates a mechanism? Overspecification. The gap classifies the fix.</li>
<li><strong>Classify and fix the spec.</strong> Add precision for ambiguity. Add the missing requirement for underspecification. Replace an implementation detail with a behavioral constraint for overspecification.</li>
<li><strong>Regenerate and revalidate.</strong> Confirm the fix resolved the failure without breaking anything that already passed.</li>
</ol>
<p>This is deterministic. You compare observed behavior against the specification, the gap tells you what to fix, and the fix goes into the document — not into the code. After a handful of cycles, your first-pass specifications become substantially more complete, and the rework loop collapses.</p>
<h2>What a real specification looks like</h2>
<p>A specification that a coding agent can implement reliably has a specific structure. It is not a paragraph of wishes. It is a contract: five components, every one of them present.</p>
<ul>
<li><strong>Input contract</strong> — what the component receives. Types, formats, valid ranges, source. If the input is a file, the file format. If the input is a function argument, the type and constraints.</li>
<li><strong>Output contract</strong> — what the component produces. Return types, structure of output files, format of displayed information, success vs. failure behavior. &quot;Returns a list of results&quot; is weak. &quot;Returns a JSON array where each element has <code>id</code> (string), <code>score</code> (float, 0.0–1.0), and <code>label</code> (string)&quot; is strong.</li>
<li><strong>Acceptance criteria</strong> — testable assertions. &quot;Given input X, the output must be Y.&quot; &quot;When the input is empty, exit code 1 and print &#39;Error: empty input&#39; to stderr.&quot; These are the bridge between the specification and the automated validation pipeline that gates the merge.</li>
<li><strong>Constraints</strong> — the boundaries the implementation must respect. Performance budgets. Dependency restrictions. Style requirements. Compatibility requirements. The things that are invisible in the functional description but non-negotiable in production.</li>
<li><strong>Examples</strong> — concrete input-output pairs. Examples disambiguate natural language and double as test cases. They are the single fastest way to tell the agent what you actually mean.</li>
</ul>
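<p>An output contract like the one in the bullet above is mechanically checkable. A minimal Python sketch, assuming the JSON-array shape described there; the function name and error format are mine:</p>

```python
def validate_output(results):
    """Check output against the contract: a list where each element has
    id (str), score (float in [0.0, 1.0]), and label (str).
    Returns a list of violations; an empty list means pass."""
    if not isinstance(results, list):
        return ["output must be a list"]
    errors = []
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            errors.append(f"[{i}]: not an object")
            continue
        if not isinstance(item.get("id"), str):
            errors.append(f"[{i}].id: expected string")
        score = item.get("score")
        # strict: the contract says float, so an int score is rejected
        if not isinstance(score, float) or not 0.0 <= score <= 1.0:
            errors.append(f"[{i}].score: expected float in [0.0, 1.0]")
        if not isinstance(item.get("label"), str):
            errors.append(f"[{i}].label: expected string")
    return errors
```

<p>A check like this belongs in the validation pipeline, where the empty list is the pass signal that gates the merge.</p>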
<p>A 32-line specification can produce hundreds of lines of correct implementation. That ratio — short specification, long implementation — is why this operating model wins.</p>
<h2>What to do this week</h2>
<ol>
<li><strong>Take one piece of code your team hand-edited after an AI generated it. Find the spec that produced it.</strong> If there isn&#39;t one, that is the finding. If there is, classify the failure: ambiguity, underspecification, overspecification, drift. You now know what to change — in the document, not the code.</li>
<li><strong>Add an &quot;edge cases&quot; section to every spec template on your team.</strong> Empty input. Oversized input. Malformed input. Missing fields. Extra fields. The failure mode you don&#39;t think about is the one the agent will fill in wrong.</li>
<li><strong>Write the next requirement change as a spec diff, not a prompt.</strong> When someone says &quot;it should also do X,&quot; update the specification, then regenerate. If you patch the code directly, you have just planted the seed of specification drift.</li>
<li><strong>Make the rule explicit: we do not hand-edit generated code.</strong> The specification is the artifact under version control. The code is derivative. Every accepted change lives in the spec first. Every regeneration is a diff against the previous generation, reviewed for silent expansion.</li>
</ol>
<h2>The shift, one more time</h2>
<p>Vibe coding feels fast. Reliable systems <em>are</em> fast — because they aren&#39;t spending the week re-prompting the same problem.</p>
<p>When output is wrong, fix the spec. Not the code.</p>
<p>Be honest: how much of your team&#39;s AI usage is still just fixing output?</p>
]]></content:encoded>
    </item>
    <item>
      <title>3 Engineers, 5 Months, 1,000,000+ Lines</title>
      <link>https://www.grovestack.net/three-engineers-one-million-lines/</link>
      <guid isPermaLink="true">https://www.grovestack.net/three-engineers-one-million-lines/</guid>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <description>Their job wasn't writing code. It was engineering the system that writes code.</description>
      <content:encoded><![CDATA[<p>3 engineers.</p>
<p>5 months.</p>
<p>1,000,000+ lines of production code.</p>
<p>Zero written by hand.</p>
<p>Not a demo. Not a prototype. Production systems.</p>
<h2>What most people miss</h2>
<p>They didn&#39;t just &quot;use AI.&quot; They built a harness — the infrastructure that makes AI output reliable.</p>
<ul>
<li>Specifications that defined exactly what to build.</li>
<li>Validation pipelines that caught errors automatically.</li>
<li>Workflows that enforced quality at every step.</li>
<li>Architectural constraints encoded as rules.</li>
</ul>
<p>Their job wasn&#39;t writing code. It was engineering the system that writes code.</p>
<p>And when they scaled from 3 engineers to 7, throughput increased. That breaks 50 years of software engineering assumptions.</p>
<h2>The case study, in more detail</h2>
<p>The team in question is OpenAI&#39;s Codex group, described in a February 2026 engineering retrospective. Three engineers began the project; seven finished it. Five months of elapsed time. Roughly one million lines of production code. None of it hand-typed.</p>
<p>The throughput curve is the part that should stop any engineering leader cold. Brooks&#39; Law, named for Fred Brooks&#39; 1975 observation that &quot;adding manpower to a late software project makes it later,&quot; has been the gravitational constant of software engineering for half a century. Communication overhead grows quadratically with team size. Coordination cost eats the gains from additional headcount. Every engineering leader has felt it.</p>
<p>OpenAI&#39;s team inverted it. Adding engineers did not slow them down because the coordination cost no longer sat between people. It sat in the harness. New engineers did not have to absorb tribal knowledge, interpret ambiguous tickets, or synchronize their mental models with four existing peers. They read the specifications, contributed to the validation pipelines, and started producing. The harness was the shared context.</p>
<p>That is not a productivity story. It is a structural story about where coordination cost is paid — and what happens when it moves from human-to-human communication into engineered artifacts.</p>
<h2>What the harness actually is</h2>
<p>A harness is not a tool you buy. It is infrastructure you build, from a small set of primitives that any capable coding agent exposes.</p>
<p><strong>Specifications as the source of truth.</strong> Every component the agent builds has a Markdown spec in version control: input contract, output contract, acceptance criteria, edge cases. When the agent&#39;s output is wrong, you fix the spec and regenerate. You do not hand-edit the generated code. The spec is the artifact; the code is derivative.</p>
<p><strong>Automated validation as the gate.</strong> Between agent output and production sits a pipeline of deterministic tools — test suites, type checkers, linters, security scanners, coverage analyzers, performance profilers. They return pass or fail, not opinions. A guard condition gates deployment: if any check rejects the diff, the implementation is rejected and the spec is revised. The spec defines intent. The validation tools enforce it.</p>
<p><strong>Workflows as the coordination layer.</strong> Real tasks are multi-step. Real systems are multi-component. A workflow engine decomposes work into a directed graph: fan-out nodes that run parallel agents, reducers that merge their results, conditional branches that route on outcomes, correlation IDs that trace every decision end-to-end. The workflow graph is your org chart, compressed into a single person&#39;s operating model.</p>
<p><strong>Architectural constraints as enforced rules.</strong> The shape of the codebase is not left up to the agent. Naming conventions, layering rules, dependency directions, forbidden imports, security boundaries — all of it expressed as automated checks. The rules run on every diff, not just on pull-request day. The agent cannot drift the architecture because the architecture is enforced by tooling, not by reviewer memory.</p>
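<p>Rules like forbidden imports are a few lines of tooling. A sketch using Python&#39;s <code>ast</code> module; the layer names and the rule table are invented for illustration:</p>

```python
import ast

# Hypothetical layering rule: UI code must not import the database
# or storage layers directly. Runs on every diff, not just PR day.
FORBIDDEN = {"ui": {"db", "storage"}}

def check_imports(layer, source):
    """Return the forbidden modules imported by `source`, given the
    layer the file belongs to. An empty list means the rule holds."""
    banned = FORBIDDEN.get(layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        violations += [n for n in names if n.split(".")[0] in banned]
    return violations
```

<p>Because the check is automated, the architecture holds even when no reviewer remembers the rule.</p>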
<p>These are not exotic capabilities. They are the primitives every modern coding agent already exposes: persistent memory that retains context across sessions, file-system access that lets agents read the codebase directly, typed tool calls that route side effects through controllable interfaces, subagent delegation that parallelizes independent work inside a ReAct loop. What the OpenAI team did was assemble those primitives into discipline.</p>
<h2>What your engineers actually do in a harnessed organization</h2>
<p>If the agent writes the code, the natural question — the one this post ends on — is what the engineers do.</p>
<p>The answer is not &quot;less work.&quot; It is categorically different work.</p>
<ul>
<li><strong>They author specifications.</strong> Testable, versioned, precise. The specification is the handoff; they own that handoff.</li>
<li><strong>They design validation pipelines.</strong> What invariants must hold? Which checks catch which classes of failure? What is the gate-fail policy?</li>
<li><strong>They build and maintain the harness itself.</strong> The workflow graphs, the architectural rules, the context assembly logic. The tool that makes the factory.</li>
<li><strong>They do human-in-the-loop review at destructive boundaries.</strong> Production deploys. Schema changes. Anything irreversible. Not line-by-line reading of every diff — reviewing the boundaries where the validation pipeline intentionally stops and escalates.</li>
<li><strong>They design for the novel.</strong> The places where the spec can&#39;t be written up-front yet — new algorithms, research-grade optimizations, unfamiliar domains — are where exploratory prototyping still happens. The harness handles the known patterns; engineers focus on the edges.</li>
</ul>
<p>That is a senior technical role. It demands more rigor, not less — requirements engineering, systems design, validation architecture, and security boundaries all compressed into the same job description. The people who can do it are the ones who will staff the next generation of shipping organizations.</p>
<h2>What to do this quarter</h2>
<ol>
<li><p><strong>Pick one output of your team — a feature, a service, a pipeline — and audit it against the four harness components.</strong> Is there a spec? Is there automated validation? Is there a workflow graph? Are the architectural constraints machine-enforced? For each &quot;no,&quot; you have a concrete place to invest.</p>
</li>
<li><p><strong>Invert the investment ratio.</strong> Most organizations are spending on tools (model subscriptions, copilots, IDE plugins) and near-nothing on harness. Flip it. The tools are a commodity; the harness is the moat.</p>
</li>
<li><p><strong>Write a job description for a harness engineer and hire one.</strong> Not &quot;AI-assisted developer.&quot; Not &quot;prompt engineer.&quot; The person who owns the specification patterns, the validation pipelines, the workflow topology, and the architectural rulebook. This is the role that compounds.</p>
</li>
<li><p><strong>Measure systems, not experiments.</strong> Defect rate on AI-generated diffs. Time from spec to deploy. Number of specifications under version control. Iterations per spec before acceptance. These are the metrics of a compounding system.</p>
</li>
</ol>
<h2>The difference, in one line</h2>
<p>Most organizations are investing in AI tools. The ones pulling ahead are investing in the harness.</p>
<p>That&#39;s the difference between:</p>
<ul>
<li>Experiments that stall</li>
<li>Systems that compound</li>
</ul>
<p>If AI wrote your code tomorrow, what would your engineers actually do?</p>
]]></content:encoded>
    </item>
    <item>
      <title>Can Your Team Direct AI?</title>
      <link>https://www.grovestack.net/can-your-team-direct-ai/</link>
      <guid isPermaLink="true">https://www.grovestack.net/can-your-team-direct-ai/</guid>
      <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
      <description>Vibe coding vs. reliable systems. One is typing into a chat window and hoping. The other is engineering.</description>
      <content:encoded><![CDATA[<p>The question has changed.</p>
<p>It&#39;s no longer whether AI can write code.</p>
<p>It can. That&#39;s settled.</p>
<p>The real question is: can your team direct it?</p>
<h2>The question most organizations are still asking</h2>
<p>&quot;Can AI write code?&quot; was a 2024 question. The answer came in, and it was yes. By now the demos have shipped, the pilots have happened, and every engineering leader has seen a coding agent produce a working feature in an afternoon.</p>
<p>So the 2024 question is settled — and the teams still asking it are two years behind. The operative question is structural, not technical: can your organization consistently get correct, shippable output from these systems on a Tuesday morning, on the thing you actually care about, with the people you actually have?</p>
<p>If the honest answer is &quot;sometimes, depending on who&#39;s driving,&quot; you do not have an AI strategy. You have an experiment.</p>
<h2>What &quot;direct it&quot; actually means</h2>
<p>Directing a coding agent is not prompting it. It is engineering the context around it.</p>
<p>Every large language model generates by predicting the next token based on what&#39;s in its context window. The transformer&#39;s attention mechanism weights the tokens that matter most for the current prediction. What reaches the context window determines what gets attended to, and what gets attended to determines what comes out. That is not a metaphor. It is the mechanism.</p>
<p>A chat interface fills that context window with a conversational sentence and a few system instructions — a small fraction of what the window can hold, and almost never the most useful information for the task. A coding agent operating inside a harness fills the context window deliberately: specifications, the relevant source files, type definitions, test cases, architectural constraints, prior decisions. Same model. Same week. The output quality is not slightly different. It is categorically different.</p>
<p>That is what &quot;directing&quot; means. You are not coaxing a chatbot. You are assembling the structured context that makes the model attend to the right things, the contract the output must satisfy, and the automated validation that decides whether the output is acceptable before it reaches production.</p>
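<p>Deliberate context assembly can be sketched in a few lines. Everything here is illustrative: the section ordering, the whitespace-word token estimate (a real harness would use the model&#39;s tokenizer), and the trim-lowest-priority-first policy.</p>

```python
def assemble_context(spec, source_files, constraints, budget=8000):
    """Deliberately fill the context window: spec first, then
    constraints, then relevant source, trimmed to a token budget.
    Tokens are crudely approximated as whitespace-separated words."""
    sections = [("SPECIFICATION", spec), ("CONSTRAINTS", constraints)]
    sections += [(f"FILE: {path}", text) for path, text in source_files]
    out, used = [], 0
    for title, body in sections:
        cost = len(title.split()) + len(body.split())
        if used + cost > budget:
            break  # highest-priority sections survive trimming
        out.append(f"## {title}\n{body}")
        used += cost
    return "\n\n".join(out)
```

<p>The point is not this particular function; it is that what reaches the window is a decision the harness makes, not an accident of what someone pasted.</p>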
<h2>Vibe coding vs. reliable systems</h2>
<p>This is the operating gap.</p>
<p><strong>Vibe coding.</strong> One engineer, one chat window, one prompt at a time. Context is whatever the human remembered to paste in. Acceptance is the human eyeballing the diff. Reproducibility is whether the same person can get lucky again next week. Good output is a story about an individual, not a property of the system.</p>
<p><strong>Reliable systems.</strong> Specifications in version control. Typed contracts that define what &quot;done&quot; means for every component. Automated validation — tests, type checks, linters, security scans, coverage analysis — running in parallel on every diff, gating every merge. Orchestration that routes work to the right agent, tracks every task, and recovers when something fails. Good output is a property of the system. The individual matters, but the system does not depend on heroics.</p>
<p>The difference is not polish. It is reproducibility. &quot;Might&quot; is not engineering. Repeatable results are.</p>
<h2>Signs your organization is still vibe-coding</h2>
<ul>
<li>Every good result has a specific person&#39;s name attached to it. The same work with a different person goes sideways.</li>
<li>Reviewers read generated diffs line-by-line because there is no automated pipeline they trust to do it for them.</li>
<li>Nobody can answer &quot;how many AI-generated diffs went to production last month, and what was the defect rate?&quot; with a number.</li>
<li>The specification for any given piece of work is a Slack message or a paragraph in a ticket, not a versioned artifact.</li>
<li>Your response to &quot;the output was wrong&quot; is to prompt the model differently, not to fix the specification and regenerate.</li>
</ul>
<p>If three of those describe your team, the question to ask is not &quot;which model should we use?&quot; It is &quot;what operating model are we asking these tools to live inside?&quot;</p>
<h2>What to do this quarter</h2>
<ol>
<li><p><strong>Name one flow where you need reliability.</strong> Not the most exciting one — the most reproducible-critical one. Onboarding, feature delivery, code review, incident response. Something that has to work the same way on Tuesday as it did on Friday.</p>
</li>
<li><p><strong>Write the specification before the prompt.</strong> Input contract, output contract, acceptance criteria, examples, edge cases. The spec is the direction you&#39;re giving; the prompt is just the vehicle.</p>
</li>
<li><p><strong>Put automated validation between the agent and production.</strong> If a human has to eyeball every diff before it lands, you have not industrialized anything. You have just added a new source of diffs for that human to read.</p>
</li>
<li><p><strong>Measure iterations-per-specification.</strong> How many times does a spec get revised before its output passes? That number is a management metric that tells you whether your specifications or your validation are the weakest link — and it beats &quot;how&#39;s the AI rollout going?&quot; every time.</p>
</li>
</ol>
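<p>The metric in step 4 is trivial to compute once you log spec events. A sketch, assuming a chronological event log with two outcome kinds; both names are mine:</p>

```python
from collections import defaultdict

def iterations_per_spec(events):
    """events: (spec_id, outcome) tuples in chronological order, with
    outcome in {"revised", "accepted"}. Returns revisions-before-
    acceptance per spec: the management metric described above."""
    revisions = defaultdict(int)
    result = {}
    for spec_id, outcome in events:
        if outcome == "revised":
            revisions[spec_id] += 1
        else:
            result[spec_id] = revisions[spec_id]
    return result
```

<p>A rising number points at weak specifications; a flat-but-high number points at validation that rejects for reasons the spec never states.</p>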
<h2>The shift, not the demo</h2>
<p>This is no longer a tooling decision. It&#39;s an operating model decision.</p>
<p>The teams pulling ahead aren&#39;t using better models. They&#39;ve built systems — specifications, validation, orchestration — that make AI output reliable and repeatable.</p>
<p>&quot;Might&quot; is not engineering. Repeatable results are.</p>
<p>Has your organization made the shift — or are you still asking the 2024 version?</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Agentic Stack</title>
      <link>https://www.grovestack.net/the-agentic-stack/</link>
      <guid isPermaLink="true">https://www.grovestack.net/the-agentic-stack/</guid>
      <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
      <description>If you cannot pinpoint the layer where a failure occurs, your architecture is incomplete.</description>
      <content:encoded><![CDATA[<p>Most &quot;agentic systems&quot; aren&#39;t truly systems. They&#39;re monoliths with agent names.</p>
<p>If your team struggles to identify whether a failure stems from a prompt issue, a memory problem, a routing error, or a model capacity limitation, you don&#39;t have a stack. You have chaos disguised as an organizational chart.</p>
<p>After building an agentic system from scratch, I&#39;ve concluded you need two hierarchies, not one.</p>
<h2>Two hierarchies, not one</h2>
<p><strong>Your environment is declarative.</strong> It encompasses compute, models, tools, skills, and agents. You must declare it, version it, and resolve it before anything runs. Think of it as infrastructure-as-code for your entire capability surface.</p>
<p><strong>Your operational stack is runtime.</strong> It involves routing, delivery, encoding, and workflow — executing against the resolved environment.</p>
<p>Many teams conflate the two. The declarative surface and the runtime surface live in the same codebase, the same deploy, the same mental model. A prompt template gets patched. A tool registration gets hand-edited. A model gets swapped. Nobody can say what version of <em>anything</em> produced yesterday&#39;s output. That is how debt accumulates — silently, during the weeks when everything seems to be working.</p>
<h2>The environment layer (declarative, E0–E4)</h2>
<p>The environment is what <em>exists</em>. It is a versioned, inspectable surface you can point at and say: this is the capability set the system can draw from. Five layers, bottom up.</p>
<ul>
<li><strong>E0 — Compute.</strong> Hardware, containers, GPUs. The substrate every agent ultimately runs on.</li>
<li><strong>E1 — Models.</strong> Inference endpoints and their context limits. A model is not a vibe; it is an endpoint contract with a token budget, a latency profile, and a cost.</li>
<li><strong>E2 — Tools.</strong> MCP servers, APIs, callables the agent can invoke. Every tool is a typed interface with explicit side effects. Changing a tool signature without versioning the declaration is the root cause of an entire category of production incidents.</li>
<li><strong>E3 — Skills.</strong> Composed capabilities and workflows — the reusable recipes loaded on demand. A skill is a named context pattern that encodes &quot;when you see X, do Y,&quot; without burning context every session.</li>
<li><strong>E4 — Agent.</strong> Persona, role, and permission grants. An agent is not a prompt; it is a declared identity with a capability list and a governance envelope around it.</li>
</ul>
<p>The principle across all five layers is the same: declare it, version it, resolve it. The resolution step is where the environment becomes real — the system loads E0 through E4, verifies everything is present, and binds the runtime against that snapshot. If resolution fails, the stack fails fast, before any work enters the pipeline. That is the <em>point</em>.</p>
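<p>Resolution can be sketched as a single fail-fast function. The layer keys, field names, and the one check shown (every agent&#39;s declared tools must exist in E2) are illustrative; a real resolver verifies all five layers.</p>

```python
# A hypothetical declarative environment; every name here is invented.
ENVIRONMENT = {
    "E1:models": {"coder": {"endpoint": "https://example.invalid", "context": 200_000}},
    "E2:tools": {"run_tests": {}, "read_file": {}},
    "E4:agents": {"builder": {"tools": ["run_tests", "read_file"]}},
}

def resolve(env):
    """Bind the runtime against a snapshot of the environment. Any
    missing declaration fails fast, before work enters the pipeline."""
    errors = []
    tools = env.get("E2:tools", {})
    for name, agent in env.get("E4:agents", {}).items():
        for tool in agent["tools"]:
            if tool not in tools:
                errors.append(f"agent {name}: undeclared tool {tool}")
    if errors:
        raise RuntimeError("; ".join(errors))
    return dict(env)  # the resolved snapshot the runtime binds to
```

<p>The snapshot is the thing you version: the answer to &quot;what produced yesterday&#39;s output&quot; is a snapshot ID, not an archaeology project.</p>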
<h2>The operational stack (runtime, L5–L8)</h2>
<p>The operational stack is what <em>happens</em>. It runs against the resolved environment. Four layers, bottom up.</p>
<ul>
<li><strong>L5 — Routing.</strong> Agent discovery and logical endpoints. A task arrives; the router identifies which agent in E4 has the declared capability to handle it. Without routing, every task goes to &quot;the AI&quot; — a destination so generic it is not a destination at all.</li>
<li><strong>L6 — Delivery.</strong> Count and ordering guarantees for messages between agents. At-least-once with acknowledgment, deduplication on the receiver, correlation IDs that thread every exchange end-to-end. Without delivery guarantees, retries are guesswork and ordering bugs hide until production.</li>
<li><strong>L7 — Prompt / Encoding.</strong> Templates, schemas, and the structured payloads that cross agent boundaries. This is where unstructured chat turns into typed contract. If the payload doesn&#39;t validate, the boundary rejects it.</li>
<li><strong>L8 — Workflow.</strong> DAGs, agent tasks, business logic. The graph that decomposes a real task into routable steps, fans out parallel work, reduces results, and enforces dependencies.</li>
</ul>
<p>Between the two hierarchies sits the <strong>environment resolution boundary</strong> — the phase transition from declarative to runtime. E4 binds to L5. Tools are registered. Permissions are confirmed. Memory is initialized. If any of that fails, the admission gate rejects the task before L5–L8 ever run.</p>
<h2>Three components almost every team is missing</h2>
<p>Sit with those two hierarchies for a minute and you start to see what is usually absent. Three components in particular show up everywhere and get built almost nowhere.</p>
<p><strong>1. Delivery guarantees between agents.</strong> Teams invest heavily in the prompts and the models and leave L6 as &quot;a message bus, I guess.&quot; Without at-least-once delivery with acknowledgment and receiver-side deduplication, you cannot distinguish &quot;the agent is slow&quot; from &quot;the message was lost.&quot; Retries become guesswork. A task that was processed once gets processed twice because the acknowledgment never came back. Ordering bugs lie dormant until the day they don&#39;t.</p>
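<p>Receiver-side deduplication is small enough to sketch inline. A minimal Python shape, assuming messages carry a unique <code>id</code>; the field names are mine:</p>

```python
class Receiver:
    """At-least-once delivery means the same message can arrive twice.
    The receiver deduplicates on message ID and acknowledges every
    arrival, so a lost ack triggers a harmless redelivery."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, msg):
        # msg: {"id": ..., "correlation_id": ..., "body": ...}
        if msg["id"] in self.seen:
            return "ack"  # duplicate: re-acknowledge, do not reprocess
        self.seen.add(msg["id"])
        self.processed.append(msg["body"])
        return "ack"
```

<p>At-least-once delivery plus this receiver gives effectively-once processing: a redelivery is re-acknowledged but never reprocessed, and the correlation ID keeps the exchange traceable end to end.</p>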
<p><strong>2. A session memory tier.</strong> Memory has three tiers: M1 window context (the ephemeral content inside a single prompt), M2 session state (the conversation and accumulated results across a multi-step task), and M3 durable memory (project knowledge, CLAUDE.md, persistent conventions). Most teams build M1 and M3 and skip M2 entirely. The result is agents that either forget everything between steps or drown in long-term retrieval that is stale by seconds. M2 is where decisions made three steps ago stay fresh and queryable for the agent making step seven — and without it, the agent operates on outdated or missing context and you wonder why it &quot;feels confused.&quot;</p>
<p><strong>3. An explicit admission gate before runtime.</strong> This is the checkpoint between environment resolution and L5. Tools are bound. Permissions are confirmed. Memory is initialized. The model endpoint is reachable. The agent&#39;s capability declaration matches the task&#39;s requirements. Everything is verified <em>before</em> any task runs. Without an admission gate, a misconfigured tool or a missing permission produces a failure mid-workflow — three steps in, with an unrelated error message, at the exact moment the system is least debuggable.</p>
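<p>The gate is a checklist, and checklists are cheap to encode. A hedged sketch; the task and environment shapes are invented for illustration:</p>

```python
def admission_gate(task, env):
    """Verify the preconditions listed above before any task runs:
    agent exists, tools bound, capability declared, memory initialized.
    Returns a list of failures; an empty list admits the task."""
    agent = env["agents"].get(task["agent"])
    if agent is None:
        return [f"unknown agent {task['agent']}"]
    failures = []
    for tool in task["tools_needed"]:
        if tool not in env["bound_tools"]:
            failures.append(f"tool not bound: {tool}")
    if task["capability"] not in agent["capabilities"]:
        failures.append(f"capability mismatch: {task['capability']}")
    if not env.get("memory_ready"):
        failures.append("memory not initialized")
    return failures
```

<p>Every failure this function returns is one that would otherwise have surfaced three steps into a workflow, wearing an unrelated error message.</p>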
<h2>Cross-cutting concerns, across both hierarchies</h2>
<p>Four concerns cut across the whole stack and do not belong to any single layer: <strong>Ontology</strong> (the schema contracts and contract continuity that let E2 and L7 speak the same typed language), <strong>Memory</strong> (the M1/M2/M3 tiers above, connecting L7 back to E3/E4), <strong>Governance</strong> (permission enforcement, memory provenance, schema enforcement — the policy envelope the admission gate enforces), and <strong>Observability</strong> (tracing, logging, session recording, provenance — the diagnostic surface the whole stack is inspectable through).</p>
<p>These are not bolt-ons. They are the properties that make the stack operable. An agentic system without observability is not a system; it is a slot machine.</p>
<h2>The diagnostic test</h2>
<p>Here is the question that separates a real stack from a monolith with agent names.</p>
<p>When your next production failure happens, can you point to the layer where it occurred?</p>
<ul>
<li>A prompt problem is L7.</li>
<li>A routing problem is L5.</li>
<li>A delivery problem is L6.</li>
<li>A tool-binding problem is the admission gate.</li>
<li>A stale-memory problem is M2 (or M3, if someone forgot to update it).</li>
<li>A capability-mismatch problem is E4 → L5 resolution.</li>
<li>A model-capacity problem is E1.</li>
</ul>
<p>If you can&#39;t answer, the architecture is incomplete. Not your agents. Not your prompts. Not your models. The architecture around them.</p>
<h2>What to do this quarter</h2>
<ol>
<li><strong>Draw your stack on one page.</strong> Environment layers E0–E4. Operational layers L5–L8. Resolution boundary. Cross-cutting concerns. If you can&#39;t draw it, you can&#39;t operate it.</li>
<li><strong>Pick one of the three missing components and build it first.</strong> Delivery guarantees, session memory tier, or admission gate. Whichever one your last production incident would have caught.</li>
<li><strong>Make every layer independently observable.</strong> Every agent, every tool, every message, every state mutation — traceable to a correlation ID. The day you need observability is the day it is too late to install it.</li>
<li><strong>Version the environment.</strong> Every E0–E4 artifact under source control with a resolved snapshot ID per deploy. If you cannot roll back the environment, you are not running infrastructure; you are running hope.</li>
</ol>
<h2>The shift, one more time</h2>
<p>Most agentic systems aren&#39;t systems. They&#39;re monoliths with agent names.</p>
<p>Declare the environment. Run the operational stack against it. Resolve the boundary between them with an admission gate. Instrument the cross-cuts.</p>
<p>If you cannot pinpoint the layer where a failure occurs, your architecture is incomplete.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Why 'Prompt Engineering' Is Already Obsolete</title>
      <link>https://www.grovestack.net/why-prompt-engineering-is-obsolete/</link>
      <guid isPermaLink="true">https://www.grovestack.net/why-prompt-engineering-is-obsolete/</guid>
      <pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate>
      <description>Instead of 'Why did it do that?' ask: 'What in the context made that the most likely outcome?'</description>
      <content:encoded><![CDATA[<p>There&#39;s a psychological trap in working with coding agents: we tend to anthropomorphize them.</p>
<p>When something goes wrong, it can feel like they weren&#39;t paying attention. Misunderstood something obvious. Got lazy. Got careless.</p>
<p>That trap is dangerous. No amount of venting changes how coding agents operate.</p>
<h2>A coding agent is not a junior developer</h2>
<p>A coding agent is not a person. It is a system operating on the context you provide. Every &quot;mistake&quot; usually stems from one of three things:</p>
<ul>
<li>Missing context</li>
<li>Ambiguous instructions</li>
<li>Conflicting signals</li>
</ul>
<p>That is not a failure of effort. It is a failure of specification.</p>
<p>This is why frustration never helps. The agent is not slacking off. It is computing a probability distribution over next tokens, weighted by the attention mechanism acting on whatever reached its context window. Give it incomplete context, it generates plausible output. Give it ambiguous context, it resolves the ambiguity the way the training distribution suggested — which may not be the way you intended. Give it conflicting context, it picks one signal and ignores the other, and the tie-breaking rule is not the rule you would have chosen.</p>
<p>None of that is a character defect. It is a mechanism. Once you see it as a mechanism, the impulse to argue with the model disappears. You argue with the context instead — which, helpfully, is something you can actually change.</p>
<h2>The shift from prompts to environment</h2>
<p>OpenAI demonstrated this at scale: a team of three engineers generating roughly a million lines of production code with coding agents in five months. Their success didn&#39;t come from clever prompting. It came from the harness they built around the agent.</p>
<p>That is the real shift. It is not <em>prompt engineering</em>. It is <em>environment design</em>.</p>
<p>Prompt engineering treats the agent like a person you have to coax. Environment design treats the agent like a component in a system whose inputs you are responsible for. The first asks, &quot;how do I phrase this better?&quot; The second asks, &quot;what does the agent need to see, in what order, with what constraints, to make the correct answer the path of least resistance?&quot;</p>
<p>The teams that crossed from pilots to production made exactly this shift. Every team still grinding prompt variations is asking a 2024 question in 2026.</p>
<h2>The four primitives of environment design</h2>
<p>In practice, environment design reduces to four primitives. Each of them is something you <em>build</em>, not something you <em>write once</em>.</p>
<p><strong>Memory.</strong> Persistent context the agent loads automatically at the start of every task. Project conventions. Architectural decisions. The pattern your team uses for error handling. The reason you chose Vitest over Jest. Memory is the difference between explaining yourself again every session and having the agent start already fluent in how your codebase actually works. Without memory, every session is a cold start, and cold starts are where &quot;confidently wrong&quot; output comes from. With memory, the agent&#39;s first token is already aimed at your team&#39;s answer, not at the training distribution&#39;s average answer.</p>
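<p>A sketch of what that loading step can look like. The filenames here (<code>AGENTS.md</code>, a <code>decisions/</code> directory) are illustrative assumptions, not a standard — the point is that persistent context is concatenated into every session automatically, not pasted in by hand:</p>

```python
from pathlib import Path

def load_memory(project_root: str) -> str:
    """Gather persistent context so every session starts warm.

    AGENTS.md and decisions/ are hypothetical names for this sketch;
    use whatever files your harness actually reads.
    """
    root = Path(project_root)
    sections = []
    agents_file = root / "AGENTS.md"   # conventions, error-handling patterns
    if agents_file.exists():
        sections.append(agents_file.read_text())
    decisions = root / "decisions"     # one small file per architectural decision
    if decisions.is_dir():
        for f in sorted(decisions.glob("*.md")):
            sections.append(f.read_text())
    return "\n\n".join(sections)       # prepended to every task prompt
```

<p>The mechanism is deliberately dumb. The leverage is in what the files say, not in how they are loaded.</p>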
<p><strong>Skills.</strong> Reusable patterns of reasoning, packaged as named capabilities. A commit skill. A code-review skill. A migration-generator skill. Each is a small Markdown file with a trigger condition and a set of instructions. The agent loads the skill on demand, executes its pattern, and moves on. Skills turn every workflow you <em>used</em> to re-explain into a callable primitive. The win is compounding: every skill you write is a piece of friction that never slows the team down again.</p>
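<p>Modeled in code, a skill is nothing more exotic than a named trigger plus instructions. The sketch below is hypothetical — real harnesses store these as Markdown files, and routing is usually done by the model itself rather than by substring matching — but it shows the shape:</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Skill:
    name: str
    trigger: str       # when the orchestrator should load it
    instructions: str  # the reasoning pattern the agent executes

SKILLS = {
    "commit": Skill(
        name="commit",
        trigger="the user asks to commit staged changes",
        instructions="Summarize the staged diff, then write a conventional-commit message.",
    ),
}

def match_skill(task: str) -> Optional[Skill]:
    # Naive substring routing, purely for illustration.
    for skill in SKILLS.values():
        if skill.name in task.lower():
            return skill
    return None
```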
<p><strong>Workflows.</strong> Plan → critique → implement → test. Or discover → specify → generate → validate. Or whichever decomposition your work actually requires. A workflow is a directed graph of steps, each with inputs, outputs, and validation. It does not run in a single prompt. It runs across a sequence of agent actions, with state passed between them. Without workflows, multi-step work is a single long conversation that degrades as context fills. With workflows, each step has its own clean context and its own gate.</p>
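<p>A minimal sketch of that idea, assuming nothing about any particular harness: a workflow is an ordered list of steps, each paired with a gate, and a failed gate halts the run instead of letting degraded output flow downstream:</p>

```python
from typing import Callable

# Each step: (name, run function, gate predicate). State is a plain dict
# passed between steps, standing in for the context handed to each agent call.
Step = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]

def run_workflow(steps: list[Step], state: dict) -> dict:
    for name, run, gate in steps:
        state = run(state)
        if not gate(state):              # the gate is the whole point:
            raise RuntimeError(f"gate failed after step: {name}")
    return state                         # bad output stops here, not downstream

# Illustrative two-step pipeline: plan, then implement.
demo_steps: list[Step] = [
    ("plan", lambda s: {**s, "plan": f"steps for {s['task']}"},
     lambda s: len(s["plan"]) > 0),
    ("implement", lambda s: {**s, "code": "def add(a, b): return a + b"},
     lambda s: "code" in s),
]
```

<p>Each gate can be as cheap as a key check or as expensive as a full test run. What matters is that every step has one.</p>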
<p><strong>Agents.</strong> Roles defined by constraints, not personalities. An agent is not a &quot;helpful assistant.&quot; It is a declared role with a capability list, a permission envelope, and a contract for the handoffs it participates in. &quot;Code review agent&quot; is a role with a clear input (a diff), a clear output (pass/fail with issue list), and boundaries on what it can touch. The moment you start defining roles instead of personas, the orchestrator knows where to route work and the agent knows what &quot;correct&quot; looks like for its scope.</p>
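<p>Declared as data, a role looks something like this sketch — the field names are assumptions, not any tool&#39;s schema. Note that the review role&#39;s write envelope is empty by construction:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    capabilities: tuple[str, ...]    # what the role may do
    writable_paths: tuple[str, ...]  # permission envelope for writes
    input_contract: str              # what a valid handoff in looks like
    output_contract: str             # what a valid handoff out looks like

CODE_REVIEWER = AgentRole(
    name="code-review",
    capabilities=("read_diff", "comment"),
    writable_paths=(),               # a reviewer touches nothing
    input_contract="a unified diff",
    output_contract="pass/fail plus an issue list",
)

def can_write(role: AgentRole, path: str) -> bool:
    """The orchestrator checks the envelope before routing any write."""
    return any(path.startswith(p) for p in role.writable_paths)
```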
<p>These four primitives compose. Memory feeds agents. Skills are invoked inside workflows. Workflows orchestrate agents. The whole thing is engineered, not coaxed.</p>
<h2>The diagnostic reframe</h2>
<p>Here is the single most useful change you can make to your team&#39;s debugging culture this month.</p>
<p>When an agent produces wrong output, replace this question:</p>
<blockquote>
<p>&quot;Why did it do that?&quot;</p>
</blockquote>
<p>With this one:</p>
<blockquote>
<p>&quot;What in the context made that the most likely outcome?&quot;</p>
</blockquote>
<p>The first question is about the agent. It has no useful answer, because the agent is a probability distribution, not a deliberating mind. The second is about the input. It has a concrete answer every time — the output reflects what the attention mechanism weighted most heavily, and what it weighted most heavily was something in the context window. A missing requirement. An ambiguous phrase. A conflict between a stated constraint and an in-memory convention. An example that pointed the wrong direction.</p>
<p>Find what in the context produced the wrong weight, fix that, and regenerate. The loop collapses. The frustration evaporates. You are no longer arguing with the model; you are editing the specification the model operated on.</p>
<h2>Reacting vs. engineering</h2>
<p>The distinction between reacting and engineering is the entire shift this playbook is built on.</p>
<p><strong>Reacting</strong> is the old loop. The output is wrong. You rephrase the prompt. The output is still wrong. You rephrase again. Eventually you get something acceptable, you ship it, and next week the exact same problem reappears because nothing about the environment changed. You did not solve the problem. You worked around it, in one conversation, at one moment, with one person who happened to find a phrasing that worked.</p>
<p><strong>Engineering</strong> is the new loop. The output is wrong. You classify the failure (missing context, ambiguity, conflict). You update the environment — the memory file, the skill definition, the workflow&#39;s validation gate, the agent&#39;s scope — so that the next run, with the next person, under the next variation of the same task, produces the right output by construction. The effort goes into the environment once and pays every subsequent invocation forever.</p>
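<p>The classification step is mechanical enough to write down. The keyword routing below is a stand-in — in practice a human or a reviewing model applies the same taxonomy — but the three-way split, and the fact that each class points at an environment artifact rather than a prompt, is the point:</p>

```python
def classify_failure(report: str) -> str:
    """Map a wrong-output report onto the three specification failures."""
    report = report.lower()
    if "conflict" in report or "contradict" in report:
        return "conflicting signals"
    if "ambiguous" in report or "two ways" in report:
        return "ambiguous instructions"
    return "missing context"          # the default, and the most common

# Each failure class maps to an environment update, not a rephrased prompt.
ENVIRONMENT_FIX = {
    "missing context": "add the requirement to the memory file",
    "ambiguous instructions": "tighten the wording in the skill or spec",
    "conflicting signals": "reconcile the constraint with the stored convention",
}
```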
<p>The irony that surprises most leaders: <em>the more you treat the agent as a system rather than a person, the more &quot;intelligent&quot; its output becomes.</em> Not because the model changed. Because the environment changed, and the environment is what determined the output in the first place.</p>
<h2>What to do this week</h2>
<ol>
<li><strong>Ban &quot;why did it do that?&quot; in retros and standups.</strong> Replace it, every single time, with &quot;what in the context made that the most likely outcome?&quot; The conversation will change within a week. What was blame becomes diagnosis.</li>
<li><strong>Turn one repeated explanation into a skill.</strong> If someone on your team has explained the commit format, the code-review protocol, or the test-naming convention more than twice this month, that is a skill waiting to be written. Markdown file. Trigger. Instructions. Commit it.</li>
<li><strong>Audit your most-used agent workflow for the four primitives.</strong> Memory — does it load automatically? Skills — are the repeated patterns named and callable? Workflows — is the work decomposed into gated steps, or is it one long prompt? Agents — are the roles constrained, or is everything going to &quot;the AI&quot;? Each missing primitive is a concrete build task, not a vague aspiration.</li>
<li><strong>Stop rewarding clever prompts. Start rewarding environment investments.</strong> What you measure is what you get. If a senior engineer saves the team a day by updating the memory file, that is the work to celebrate. Not the clever one-shot that got lucky.</li>
</ol>
<h2>The shift, one more time</h2>
<p>A coding agent is not a person. It is a system operating on the context you provide.</p>
<p>Reacting is asking the agent to change. Engineering is changing what the agent sees.</p>
<p>The more you treat the agent as a system rather than a person, the more intelligent it becomes.</p>
<p>That is not prompt engineering. That is environment design — and that is the discipline that compounds.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
