Most "agentic systems" aren't truly systems. They're monoliths with agent names.
If your team struggles to identify whether a failure stems from a prompt issue, a memory problem, a routing error, or a model capacity limitation, you don't have a stack. You have chaos disguised as an organizational chart.
After building an agentic system from scratch, I've concluded you need two hierarchies, not one.
Two hierarchies, not one
Your environment is declarative. It encompasses compute, models, tools, skills, and agents. You must declare it, version it, and resolve it before any operations commence. Think of it as infrastructure-as-code for your entire capability surface.
Your operational stack is runtime. It involves routing, delivery, encoding, and workflow — executing against the resolved environment.
Many teams conflate the two. The declarative surface and the runtime surface live in the same codebase, the same deploy, the same mental model. A prompt template gets patched. A tool registration gets hand-edited. A model gets swapped. Nobody can say what version of anything produced yesterday's output. That is how debt accumulates — silently, during the weeks when everything seems to be working.
The environment layer (declarative, E0–E4)
The environment is what exists. It is a versioned, inspectable surface you can point at and say: this is the capability set the system can draw from. Five layers, bottom up.
- E0 — Compute. Hardware, containers, GPUs. The substrate every agent ultimately runs on.
- E1 — Models. Inference endpoints and their context limits. A model is not a vibe; it is an endpoint contract with a token budget, a latency profile, and a cost.
- E2 — Tools. MCP servers, APIs, callables the agent can invoke. Every tool is a typed interface with explicit side effects. Changing a tool signature without versioning the declaration is the root cause of an entire category of production incidents.
- E3 — Skills. Composed capabilities and workflows — the reusable recipes loaded on demand. A skill is a named context pattern that encodes "when you see X, do Y," without burning context every session.
- E4 — Agent. Persona, role, and permission grants. An agent is not a prompt; it is a declared identity with a capability list and a governance envelope around it.
The principle across all five layers is the same: declare it, version it, resolve it. The resolution step is where the environment becomes real — the system loads E0 through E4, verifies everything is present, and binds the runtime against that snapshot. If resolution fails, the stack fails fast, before any work enters the pipeline. That is the point.
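The declare, version, resolve cycle can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Environment` dataclass, the dictionary shapes for each layer, the `resolve` function, and the 12-character snapshot ID are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass


class ResolutionError(RuntimeError):
    """Raised when a declared reference cannot be satisfied."""


@dataclass(frozen=True)
class Environment:
    # One field per declarative layer, E0 through E4 (field shapes are assumed).
    compute: dict
    models: dict
    tools: dict
    skills: dict
    agents: dict

    def snapshot_id(self) -> str:
        # Hash a canonical JSON form so identical declarations give the same ID.
        canonical = json.dumps(
            {"E0": self.compute, "E1": self.models, "E2": self.tools,
             "E3": self.skills, "E4": self.agents},
            sort_keys=True,
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def resolve(env: Environment) -> str:
    """Check that every agent's references exist, then return the snapshot ID."""
    problems = []
    for name, agent in env.agents.items():
        if agent.get("model") not in env.models:
            problems.append(f"agent {name!r}: unknown model {agent.get('model')!r}")
        for tool in agent.get("tools", []):
            if tool not in env.tools:
                problems.append(f"agent {name!r}: unknown tool {tool!r}")
        for skill in agent.get("skills", []):
            if skill not in env.skills:
                problems.append(f"agent {name!r}: unknown skill {skill!r}")
    if problems:
        # Fail fast, before any work enters the pipeline.
        raise ResolutionError("; ".join(problems))
    return env.snapshot_id()


env = Environment(
    compute={"runtime": "container"},
    models={"small-model": {"context_tokens": 8000}},
    tools={"search": {"version": "1.0.0"}},
    skills={},
    agents={"researcher": {"model": "small-model", "tools": ["search"]}},
)
print("resolved snapshot:", resolve(env))
```

The useful property is that the snapshot ID changes whenever any declaration changes, so yesterday's output can be tied to the exact environment that produced it.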
The operational stack (runtime, L5–L8)
The operational stack is what happens. It runs against the resolved environment. Four layers, bottom up.
- L5 — Routing. Agent discovery and logical endpoints. A task arrives; the router identifies which agent in E4 has the declared capability to handle it. Without routing, every task goes to "the AI" — a destination so generic it is not a destination at all.
- L6 — Delivery. Count and ordering guarantees for messages between agents. At-least-once with acknowledgment, deduplication on the receiver, correlation IDs that thread every exchange end-to-end. Without delivery guarantees, retries are guesswork and ordering bugs hide until production.
- L7 — Prompt / Encoding. Templates, schemas, and the structured payloads that cross agent boundaries. This is where unstructured chat turns into typed contract. If the payload doesn't validate, the boundary rejects it.
- L8 — Workflow. DAGs, agent tasks, business logic. The graph that decomposes a real task into routable steps, fans out parallel work, reduces results, and enforces dependencies.
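Of the four layers, L6 is the easiest to show in miniature. The sketch below assumes an in-process receiver and a hypothetical `drop_first_acks` transport that delivers every message but loses the first few acknowledgments. The names `Message`, `Receiver`, and `send_at_least_once` are invented for illustration and stand in for whatever bus or queue you actually run.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str      # stable across retries, used for deduplication
    correlation_id: str  # threads every exchange of one task end to end
    payload: dict


class Receiver:
    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, msg: Message) -> bool:
        """Process each message ID once; always return an acknowledgment."""
        if msg.message_id not in self.seen:
            self.seen.add(msg.message_id)
            self.processed.append(msg.payload)
        # Duplicates are acknowledged too, so the sender stops retrying.
        return True


def send_at_least_once(msg, receiver, transport, max_attempts=5) -> int:
    """Resend until an ack comes back. Returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if transport(receiver, msg):
            return attempt
    raise TimeoutError(f"no ack for {msg.message_id} after {max_attempts} attempts")


def drop_first_acks(n):
    """Hypothetical transport: delivers every message but loses the first n acks."""
    remaining = [n]

    def transport(receiver, msg):
        ack = receiver.handle(msg)
        if remaining[0] > 0:
            remaining[0] -= 1
            return False  # the ack was lost on the way back
        return ack

    return transport


receiver = Receiver()
msg = Message(str(uuid.uuid4()), "task-1", {"step": 1})
attempts = send_at_least_once(msg, receiver, drop_first_acks(2))
print(attempts, len(receiver.processed))
```

The sender retried because acknowledgments went missing, yet the receiver processed the payload only once. That separation is what turns a retry from guesswork into a safe default.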
Between the two hierarchies sits the environment resolution boundary — the phase transition from declarative to runtime. E4 binds to L5. Tools are registered. Permissions are confirmed. Memory is initialized. If any of that fails, the admission gate rejects the task before L5–L8 ever run.
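One way to picture the admission gate is as a set of named checks that must all pass before a task is handed to L5. The following is a sketch under assumptions: the check names, the `agent` and `task` dictionary shapes, and the `checks_for` helper are hypothetical, and a real gate would probe live systems rather than compare in-memory sets.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AdmissionResult:
    admitted: bool
    failures: List[str]


def admission_gate(checks: Dict[str, Callable[[], bool]]) -> AdmissionResult:
    """Run every check and collect all failures, not just the first one."""
    failures = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception as exc:  # a check that crashes counts as a failure
            failures.append(f"{name}: {exc}")
            continue
        if not passed:
            failures.append(name)
    return AdmissionResult(admitted=not failures, failures=failures)


def checks_for(agent: dict, task: dict, bound_tools: set,
               memory_ready: bool, model_reachable: Callable[[], bool]):
    """Build the checks for one task (dictionary shapes are assumed)."""
    return {
        "tools_bound": lambda: set(agent["tools"]) <= bound_tools,
        "permissions_confirmed":
            lambda: set(task["permissions"]) <= set(agent["permissions"]),
        "memory_initialized": lambda: memory_ready,
        "model_reachable": model_reachable,
        "capability_match":
            lambda: set(task["capabilities"]) <= set(agent["capabilities"]),
    }


agent = {"tools": ["search"], "permissions": ["read"], "capabilities": ["summarize"]}
task = {"permissions": ["read"], "capabilities": ["summarize"]}

result = admission_gate(checks_for(agent, task, {"search"}, True, lambda: True))
print(result)
```

Collecting every failure instead of stopping at the first means one rejected task tells you everything that is misconfigured, not just the first thing.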
Three components almost every team is missing
Sit with those two hierarchies for a minute and you start to see what is usually absent. Three components in particular show up everywhere and get built almost nowhere.
1. Delivery guarantees between agents. Teams invest heavily in the prompts and the models and leave L6 as "a message bus, I guess." Without at-least-once delivery with acknowledgment and receiver-side deduplication, you cannot distinguish "the agent is slow" from "the message was lost." Retries become guesswork. A task that was processed once gets processed twice because the acknowledgment never came back. Ordering bugs lie dormant until the day they don't.
2. A session memory tier. Memory has three tiers: M1 window context (the ephemeral content inside a single prompt), M2 session state (the conversation and accumulated results across a multi-step task), and M3 durable memory (project knowledge, CLAUDE.md, persistent conventions). Most teams build M1 and M3 and skip M2 entirely. The result is agents that either forget everything between steps or drown in long-term retrieval that is already stale for the task at hand. M2 is where decisions made three steps ago stay fresh and queryable for the agent making step seven. Without it, the agent operates on outdated or missing context and you wonder why it "feels confused."
3. An explicit admission gate before runtime. This is the checkpoint between environment resolution and L5. Tools are bound. Permissions are confirmed. Memory is initialized. The model endpoint is reachable. The agent's capability declaration matches the task's requirements. Everything is verified before any task runs. Without an admission gate, a misconfigured tool or a missing permission produces a failure mid-workflow — three steps in, with an unrelated error message, at the exact moment the system is least debuggable.
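The M2 tier in the second item is the least familiar of the three, so here is a minimal sketch of what it might look like. It assumes an in-memory store scoped to one task. The `SessionMemory` class and the `build_window` helper are hypothetical names, and a production version would need persistence and eviction that this example leaves out.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """M2: state scoped to one multi-step task."""
    session_id: str
    entries: list = field(default_factory=list)

    def record(self, step: int, key: str, value) -> None:
        self.entries.append({"step": step, "key": key, "value": value})

    def latest(self, key: str):
        """Return the most recent value recorded under key, or None."""
        for entry in reversed(self.entries):
            if entry["key"] == key:
                return entry["value"]
        return None


def build_window(durable: dict, session: SessionMemory, keys: list) -> dict:
    """Assemble M1 for the next step: session values override durable defaults."""
    window = {}
    for key in keys:
        value = session.latest(key)
        if value is None:
            value = durable.get(key)  # fall back to M3
        window[key] = value
    return window


durable = {"output_format": "markdown", "style_guide": "v2"}  # M3
session = SessionMemory(session_id="task-1")                  # M2
session.record(step=3, key="output_format", value="json")     # decided at step three

# Step seven asks for both keys and sees the step-three decision.
print(build_window(durable, session, ["output_format", "style_guide"]))
# prints {'output_format': 'json', 'style_guide': 'v2'}
```

The point of the override order is that a decision made inside the task wins over the project-wide default, without anyone rewriting durable memory mid-task.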
Cross-cutting concerns, across both hierarchies
Four concerns cut across the whole stack and do not belong to any single layer.
- Ontology. The schema contracts and contract continuity that let E2 and L7 speak the same typed language.
- Memory. The M1/M2/M3 tiers above, connecting L7 back to E3/E4.
- Governance. Permission enforcement, memory provenance, schema enforcement: the policy envelope the admission gate enforces.
- Observability. Tracing, logging, session recording, provenance: the diagnostic surface the whole stack is inspectable through.
These are not bolt-ons. They are the properties that make the stack operable. An agentic system without observability is not a system; it is a slot machine.
The diagnostic test
Here is the question that separates a real stack from a monolith with agent names.
When your next production failure happens, can you point to the layer where it occurred?
- A prompt problem is L7.
- A routing problem is L5.
- A delivery problem is L6.
- A tool-binding problem is the admission gate.
- A stale-memory problem is M2 (or M3, if someone forgot to update it).
- A capability-mismatch problem is E4 → L5 resolution.
- A model-capacity problem is E1.
If you can't answer, the architecture is incomplete. Not your agents. Not your prompts. Not your models. The architecture around them.
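One way to make the mapping above answerable in practice is to have every boundary raise an error that names its own layer. The sketch below is an assumption-laden illustration: the `Layer` enum, the `LayerError` class, and the `validate_payload` check are invented here, with the L7 payload boundary used as the worked case.

```python
from enum import Enum


class Layer(Enum):
    E1_MODELS = "E1"
    E4_L5_RESOLUTION = "E4->L5"
    ADMISSION_GATE = "admission gate"
    L5_ROUTING = "L5"
    L6_DELIVERY = "L6"
    L7_ENCODING = "L7"
    M2_SESSION = "M2"
    M3_DURABLE = "M3"


class LayerError(Exception):
    """An error that carries the layer it came from and the task it belongs to."""

    def __init__(self, layer: Layer, message: str, correlation_id: str):
        super().__init__(f"[{layer.value}] {message} (correlation_id={correlation_id})")
        self.layer = layer
        self.correlation_id = correlation_id


def validate_payload(payload: dict, required: dict, correlation_id: str) -> dict:
    """L7 boundary check: reject payloads with missing or mistyped fields."""
    for field_name, expected_type in required.items():
        if field_name not in payload:
            raise LayerError(Layer.L7_ENCODING,
                             f"missing field {field_name!r}", correlation_id)
        if not isinstance(payload[field_name], expected_type):
            raise LayerError(Layer.L7_ENCODING,
                             f"field {field_name!r} is not {expected_type.__name__}",
                             correlation_id)
    return payload


try:
    validate_payload({"summary": "ok"}, {"summary": str, "score": int}, "task-2")
except LayerError as err:
    print(err)
    # prints [L7] missing field 'score' (correlation_id=task-2)
```

With errors shaped like this, "which layer failed?" becomes a field you can group by in your logs rather than a question for a postmortem.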
What to do this quarter
- Draw your stack on one page. Environment layers E0–E4. Operational layers L5–L8. Resolution boundary. Cross-cutting concerns. If you can't draw it, you can't operate it.
- Pick one of the three missing components and build it first. Delivery guarantees, session memory tier, or admission gate. Whichever one would have caught your last production incident.
- Make every layer independently observable. Every agent, every tool, every message, every state mutation — traceable to a correlation ID. The day you need observability is the day it is too late to install it.
- Version the environment. Every E0–E4 artifact under source control with a resolved snapshot ID per deploy. If you cannot roll back the environment, you are not running infrastructure; you are running hope.
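The last two items meet in the trace record: every event should carry both the correlation ID of the task and the snapshot ID of the environment that handled it. A minimal sketch, assuming an in-memory log and a hypothetical `TraceLog` class (a real system would ship these records to a tracing backend):

```python
import time
from dataclasses import dataclass, field


@dataclass
class TraceLog:
    snapshot_id: str  # which resolved environment produced these events
    events: list = field(default_factory=list)

    def emit(self, correlation_id: str, layer: str, event: str, **details) -> None:
        self.events.append({
            "ts": time.time(),
            "snapshot_id": self.snapshot_id,
            "correlation_id": correlation_id,
            "layer": layer,
            "event": event,
            "details": details,
        })

    def trace(self, correlation_id: str) -> list:
        """Return (layer, event) pairs for one task, in the order they were emitted."""
        return [(e["layer"], e["event"])
                for e in self.events if e["correlation_id"] == correlation_id]


log = TraceLog(snapshot_id="abc123")  # the snapshot ID value is made up
log.emit("task-1", "L5", "routed", agent="researcher")
log.emit("task-2", "L5", "routed", agent="writer")
log.emit("task-1", "L6", "acked", attempts=1)

print(log.trace("task-1"))
# prints [('L5', 'routed'), ('L6', 'acked')]
```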
The shift, one more time
Most agentic systems aren't systems. They're monoliths with agent names.
Declare the environment. Run the operational stack against it. Resolve the boundary between them with an admission gate. Instrument the cross-cuts.
If you cannot pinpoint the layer where a failure occurs, your architecture is incomplete.