
Agent Capabilities Live Inside the Backend
On the SWE-bench Live variant, leading agent systems complete only 19.25% of dynamic tasks. That number matters because it is drawn from dynamic, real-world tasks, not the curated SWE-bench Verified benchmarks most vendor demos cite. The gap between those two scores is an architecture problem, and the teams closing it share a structural pattern: they treat the agent runtime as a capability inside a traditional backend, not as the backend itself.
First, the terms, because the search results for them are a mess. An AI agent is a software system that perceives input, reasons toward a goal, and acts through tools with limited step-by-step human direction. Agentic AI is the broader category: the design approach where software operates with that kind of autonomy instead of running a fixed script. The agent is the unit; agentic AI is the property. Everything below is about how to structure the unit so the property survives contact with production.
Production AI agent architectures converge on three tiers. A traditional backend with an API gateway sits upstream, handling auth, validation, and rate limiting. The agent runtime occupies the middle tier, hosting the LLM, the agent loop, tool invocation, RAG, and conversation memory. Downstream services (DynamoDB, queues, AWS Step Functions, and other workflow engines) handle durable state and business logic. Teams that collapse these tiers into a single agentic monolith ship demos that fail in production.
AWS makes this explicit in its reference architecture: "Your agent is not your backend. It's a capability inside your backend." The API Gateway or load balancer handles everything that should never touch the model. The agent runtime handles only what requires model reasoning. That boundary is the structural decision that determines whether a system holds up under production load, and it rests on foundational ideas that predate any current framework.
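A minimal sketch of that boundary, assuming a single-process stand-in for all three tiers. Every name here (`gateway`, `agent_runtime`, `persist`, `VALID_API_KEYS`, `MAX_PROMPT_CHARS`) is hypothetical and chosen for illustration; rate limiting is omitted for brevity, and the agent is a stub rather than a real model call.

```python
from dataclasses import dataclass

VALID_API_KEYS = {"demo-key"}  # hypothetical credential store
MAX_PROMPT_CHARS = 2_000       # hypothetical validation limit


@dataclass
class Request:
    api_key: str
    prompt: str


def agent_runtime(prompt: str) -> str:
    """Middle tier: stand-in for the LLM-backed agent loop."""
    return f"agent-answer({prompt})"


def persist(result: str, store: list) -> None:
    """Downstream tier: stand-in for durable state (a database or queue)."""
    store.append(result)


def gateway(request: Request, store: list) -> dict:
    """Upstream tier: auth and validation run before any model call."""
    if request.api_key not in VALID_API_KEYS:
        return {"status": 401, "body": "unauthorized"}
    if not request.prompt or len(request.prompt) > MAX_PROMPT_CHARS:
        return {"status": 400, "body": "invalid prompt"}
    # Only requests that pass deterministic checks reach the model.
    result = agent_runtime(request.prompt)
    persist(result, store)
    return {"status": 200, "body": result}
```

The point of the sketch is the ordering: a rejected request never reaches `agent_runtime`, so the model spends no tokens on inputs it should never see.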
We built that separation into Valkyrie, our internal AI platform. The model layer (open-source LLMs, diffusion models, and assistants) sits behind a single unified API and CLI, while orchestration, scaling across AWS, RunPod, and Hetzner, and infrastructure management run as separate concerns underneath. Developers call the capability; they never call the infrastructure. On earlier prototypes where we let that boundary blur, the same failure surfaced every time: the model layer started absorbing responsibilities that belonged in deterministic code.
Agent Runtime Separation vs. Framework Choice
The failure mode for teams that skip this separation has a name: the god agent. The god agent entangles all logic and state in one loop, which creates three compounding problems. Context bloat grows as the agent accumulates conversation history, tool outputs, and business rules in a single context window. Reasoning quality degrades as that context grows noisy. Debugging becomes opaque because there is no boundary between model behavior and system behavior. The model ends up compensating for missing structure that should live in deterministic code. This is the predictable outcome of treating the agent as the architecture rather than a component within it.
Anthropic's "Building effective AI agents" guidance points directly at this. Anthropic advises starting with one-shot tool calls and keeping deterministic control logic outside the model entirely, using Claude only to select tools at well-defined decision points. AWS operationalizes this with Step Functions as the deterministic workflow engine: the workflow invokes the agent at specific states, lets it reason, and reclaims control afterward. The deterministic flow owns the agent, not the other way around.
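The same ownership relationship can be sketched without any workflow engine. The example below is a plain-Python state machine, not Step Functions itself, and the state names and the `classify_with_agent` stub are assumptions made for illustration. The agent is called in exactly one state; every transition is decided by deterministic code.

```python
from typing import Callable


def classify_with_agent(ticket: dict) -> str:
    """Hypothetical stand-in for a model call that picks a route."""
    return "refund" if "refund" in ticket["text"].lower() else "general"


def run_workflow(
    ticket: dict, agent: Callable[[dict], str] = classify_with_agent
) -> list:
    """Run a fixed workflow and return the ordered list of visited states."""
    trace = []
    state = "VALIDATE"
    while state != "DONE":
        trace.append(state)
        if state == "VALIDATE":
            # Deterministic check: empty tickets never reach the agent.
            state = "CLASSIFY" if ticket.get("text") else "REJECT"
        elif state == "CLASSIFY":
            # The only state where the agent runs.
            route = agent(ticket)
            # The workflow, not the model, decides the next state.
            state = "REFUND_FLOW" if route == "refund" else "GENERAL_FLOW"
        else:
            # REJECT, REFUND_FLOW, and GENERAL_FLOW are terminal here.
            state = "DONE"
    return trace
```

Because the trace is produced by the workflow rather than the model, a failure can be located by state name even when the model's reasoning is opaque.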
The consequence of getting this wrong is measurable. Token costs spike as teams stuff business logic into prompts instead of routing it through code. Engineers reimplement auth and validation logic inside the model's context window rather than enforcing it at the gateway. Latency increases because the model now processes inputs it should never see. And when something fails, the call stack gives you nothing useful.
A sophisticated reader will push back here by citing Cognition AI's argument in "Don't Build Multi-Agents." Cognition's position is that a single, context-rich agent outperforms decomposed architectures for complex tasks like software engineering, which sounds like an argument against separation of concerns. But Cognition is arguing about agent decomposition within the runtime layer: one agent versus multiple agents working in parallel. That debate is about what happens inside the middle tier. The question of whether the agent should replace the backend is separate, and Cognition does not argue for that. Even Devin, Cognition's own flagship system, runs inside a larger system with deterministic infrastructure around it. The agent-as-capability pattern is orthogonal to the single-versus-multi-agent debate.
Both questions matter. They operate at different levels of the architecture, and conflating them is how teams end up building systems that are wrong in two directions at once.
The practical implication for engineering leaders: commit to the three-tier separation before you evaluate frameworks. LangChain, LangGraph, AutoGen, CrewAI, the OpenAI Agents SDK: none of them can substitute for a clean boundary between your agent runtime and your backend infrastructure. A framework running inside a god-agent architecture will fail for the same reasons the god agent always fails, just with more configuration files involved.
The boundary protects you in both directions. Upstream, your API gateway enforces auth, rate limits, and input validation without any model involvement. Downstream, your workflow engine manages durable state, retries, and business process logic without trusting the model to remember what it decided three steps ago. The agent handles what only a model can handle: reasoning over context, selecting tools, generating structured outputs at decision points where deterministic logic runs out. That is the pattern every credible production AI agent architecture builds on. The frameworks change. The three-tier separation does not.
The Perceive-Think-Act-Learn Loop
Once you accept the agent as a capability inside the backend, the next question is what happens inside the agent runtime itself.
Every credible production agent architecture, from Russell & Norvig's classical agent taxonomy to modern LLM-powered systems, implements a structured perceive-think-act-learn loop with explicit components for perception, planning, memory, tools, and output. Skipping any of these components is a leading cause of multi-step task failure, and the failure data bears that out.
A Stanford-reported analysis of agent runs found that failures cluster in steps 6 through 15 of multi-step tasks. Planning errors were the dominant source: 78 cases in the analyzed set, followed by reflection and memory failures. The compounding pattern is consistent. A wrong memory recall at step 3 produces a misguided plan by step 10. By step 20, the agent is taking actions that look coherent in isolation but are disconnected from the original task. The errors do not surface immediately. They surface after the window for cheap correction has closed.
The loop structure exists to prevent exactly this.
The four stages are perceive, think, act, and learn. In the perceive stage, the agent ingests inputs: user queries, sensor data, API responses, prior tool outputs. In the think stage, the LLM analyzes context and plans the next action. In the act stage, the agent calls tools, queries databases, or invokes external APIs. In the learn stage, the agent updates memory and refines its plan based on what came back. The core components that support this loop are the LLM as reasoning engine, working memory for session state, long-term memory in vector databases, RAG for knowledge beyond model weights, tools for external access, and an orchestration runtime to manage the loop itself. Each component carries a specific failure mode when absent.
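The four stages can be sketched as four functions and a driver. This is a toy under stated assumptions: the "environment" is a single integer, the think stage is a hard-coded rule standing in for an LLM call, and all names (`AgentState`, `perceive`, `think`, `act`, `learn`, `run_loop`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: int
    value: int = 0
    memory: list = field(default_factory=list)


def perceive(state: AgentState) -> dict:
    """Perceive: gather the inputs the next decision depends on."""
    return {"gap": state.goal - state.value}


def think(observation: dict):
    """Think: choose the next action (a toy rule in place of an LLM)."""
    if observation["gap"] == 0:
        return None  # goal reached, nothing left to do
    return ("add", 1 if observation["gap"] > 0 else -1)


def act(state: AgentState, action) -> int:
    """Act: apply the chosen action (in place of a tool or API call)."""
    _, delta = action
    state.value += delta
    return state.value


def learn(state: AgentState, action, result: int) -> None:
    """Learn: record what happened so later steps can use it."""
    state.memory.append((action, result))


def run_loop(state: AgentState, max_steps: int = 10) -> AgentState:
    for _ in range(max_steps):
        observation = perceive(state)
        action = think(observation)
        if action is None:
            break
        result = act(state, action)
        learn(state, action, result)
    return state
```

Calling `run_loop` with `max_steps=1` gives the one-shot case: a single perceive, think, and act, with nothing downstream that depends on what was learned. The `max_steps` bound is also the simplest guard against a loop that never converges.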
Russell & Norvig's classical taxonomy makes this concrete. The taxonomy classifies agents as simple reflex, model-based reflex, goal-based, utility-based, or learning agents, and it maps directly onto the decisions engineering teams make today. A customer service triage bot that routes tickets by keyword is a simple reflex agent. It perceives and acts. It does not plan or learn. That is appropriate for the task. A research agent maintaining a plan tree across dozens of tool calls is a goal-based agent. It needs all four loop stages operating correctly, with memory and planning components that can survive 20-plus steps without degradation. A trading or bidding agent that maximizes expected return is utility-based: its think stage runs an explicit objective function and ranks candidate actions against it before acting. A production agent that updates its retrieval indexes based on user feedback is a learning agent, and its learn stage is the primary mechanism of value. The same taxonomy applies whether the agent is software-only or embodied in hardware like a robot or vehicle; the loop is identical, only the sensors and actuators change.
The taxonomy forces a useful discipline: name what kind of agent you are building before you choose a framework. The loop components you need follow directly from that classification. A team building a simple reflex agent that installs full agentic orchestration with vector memory and multi-step planning is paying costs with no corresponding benefit. A team building a learning agent that skips the memory architecture because it feels complex will hit the Stanford failure pattern at scale.
The objection worth taking seriously comes from Anthropic's own guidance. Anthropic argues that many production use cases need only a single one-shot tool call, not a full agentic loop, which implies the perceive-think-act-learn structure is overkill for most real workloads. That argument is correct, and it does not contradict the loop framing.
A one-shot tool call is a degenerate case of the loop. The agent perceives the input, reasons about which tool to call, acts once, and returns. The learn stage collapses to nothing because there is no subsequent step. The architecture question is which loop stages can be collapsed safely given the task at hand, not whether to use the loop at all. When the task requires only one tool call, you collapse three stages and pay nothing for them. When the task requires 15 steps with branching and state, you need every stage designed deliberately.
The practical implication is direct: design the loop first, then decide which stages your specific workload requires. Teams that skip this step either over-engineer simple workflows or under-engineer complex ones. Both outcomes are expensive, but not equally so. Over-engineering a simple workflow wastes sprint capacity. Under-engineering a complex one produces the compounding failure pattern the Stanford analysis describes: a system that fails at step 12 in a way no one can debug.
The Russell & Norvig taxonomy is useful precisely because it makes the design question explicit before the build starts. What is the agent's decision model? Does it need to maintain a world model across steps? Does it optimize toward a goal or toward a utility function? Does it learn from outcomes? The answers determine which loop components are load-bearing.
Every team that has shipped a production agent system with more than a handful of steps has converged on the same conclusion: the loop is the specification. Build from the loop outward, and the component choices follow. Build from the framework inward, and you discover the missing components in production, which is the most expensive place to find them. For a detailed walkthrough of how this plays out in practice, see our step-by-step agent build process.
Memory, RAG, and Tool Design
Before you reach orchestration decisions, you need to get the lower layer right. Memory and tools define what a single agent can do, and the data shows that most agent failures originate there, not at the orchestration level.
The three decisions with the highest impact in any agent build are memory architecture, RAG and vector database design, and tool API design. These three determine your cost ceiling, your latency floor, your reliability under multi-step load, and your governance posture. They do this more than your choice of LLM or orchestration framework. A team that picks Claude over ChatGPT but ships a broken memory architecture will fail. A team that picks LangGraph over AutoGen but ships broad, stateful tools with no idempotency will fail. The model and the framework are visible choices. Memory and tool design are the invisible ones, and they are where production systems diverge from demos. Across the engineering teams we work with at Azumo, the pattern holds without exception: the visible choices get the debate, the invisible ones decide the outcome.
Tool API Design as Core Engineering Work
OpenAI's function calling documentation and Anthropic's agent design documentation converge on the same guidance: tools should be narrow and composable, not broad and stateful. The correct mental model is a Unix pipeline, not a monolithic service call. A tool called get_customer_profile that returns a structured record is composable. A tool called handle_customer_request that branches internally based on request type is not. The former is debuggable, testable, and cacheable. The latter hides branching logic inside a black box that the agent cannot inspect.
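A minimal sketch of what "narrow and composable" can look like. The schema below is a generic JSON-Schema-style description, not the exact wire format of any vendor's API, and the `CUSTOMERS` data and return shape are assumptions for illustration.

```python
# Hypothetical in-memory data standing in for a customer database.
CUSTOMERS = {"c-1": {"customer_id": "c-1", "name": "Ada", "tier": "gold"}}

# Illustrative tool description: one purpose, one required argument,
# and no extra arguments accepted.
GET_CUSTOMER_PROFILE = {
    "name": "get_customer_profile",
    "description": "Return the stored profile for one customer. Read-only.",
    "parameters": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}


def get_customer_profile(customer_id: str) -> dict:
    """Look up one record and return a structured result, with no branching."""
    profile = CUSTOMERS.get(customer_id)
    if profile is None:
        # Explicit error signal instead of an exception or an empty reply.
        return {"ok": False, "error": "not_found"}
    return {"ok": True, "profile": profile}
```

The tool does one read and returns one shape, which is what makes it straightforward to unit test and safe to cache.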
Both OpenAI and Anthropic specify strong JSON schemas for tool arguments, server-side input validation, idempotency on write operations, explicit timeouts, and retry logic with backoff. These requirements are the baseline for a tool that can survive a multi-step agent workflow where the same tool may be called multiple times with similar inputs, where network failures are expected, and where the agent cannot distinguish a tool timeout from a tool failure without explicit error signaling.
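A sketch of a write tool that follows this baseline, under several assumptions: `issue_credit`, `TransientError`, and the in-memory `_processed` map are hypothetical, the `backend` callable stands in for a real service, and the per-call timeout is assumed to be enforced inside `backend` (for example by an HTTP client) rather than shown here.

```python
import time


class TransientError(Exception):
    """Raised for failures worth retrying, such as a network timeout."""


# idempotency_key -> earlier result (in-memory stand-in for a durable store)
_processed: dict = {}


def validate(args: dict) -> None:
    """Server-side validation: never trust arguments produced by the model."""
    if not isinstance(args.get("customer_id"), str) or not args["customer_id"]:
        raise ValueError("customer_id must be a non-empty string")
    if not isinstance(args.get("amount_cents"), int) or args["amount_cents"] <= 0:
        raise ValueError("amount_cents must be a positive integer")


def issue_credit(
    args: dict,
    idempotency_key: str,
    backend,
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> dict:
    validate(args)
    # Idempotency: a repeated call with the same key returns the first result
    # instead of writing twice.
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    for attempt in range(1, max_attempts + 1):
        try:
            receipt = backend(args)
            result = {"ok": True, "receipt": receipt}
            _processed[idempotency_key] = result
            return result
        except TransientError:
            if attempt < max_attempts:
                # Exponential backoff: base_delay, then 2x, then 4x, and so on.
                time.sleep(base_delay * 2 ** (attempt - 1))
    # Explicit, structured failure so the agent can tell a timeout apart
    # from a rejected request.
    return {"ok": False, "error": "timeout", "retryable": True}
```

In production the idempotency map would need to live in durable storage, since an in-process dictionary does not survive a restart or span multiple workers.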
The Model Context Protocol illustrates how far this thinking has matured. Anthropic introduced MCP in late 2024, and through 2025 it gained broad adoption as a standardized interface for connecting agents to tools and data sources, with standardized tool schemas, plug-and-play integration, and cross-model compatibility. AWS reference architectures now show the agent runtime connecting to downstream services through an MCP client. The architectural implication is significant: tool design shifts from per-project glue code, written once and rarely revisited, to a contract-first discipline where the tool catalog itself becomes an architectural asset. A well-designed MCP-compatible tool catalog is reusable across models, testable in isolation, and replaceable without rearchitecting the agent loop. That is the difference between a tool layer that scales and one that accretes technical debt with every new integration.
Treating tool design as glue code is how teams end up with tools that have side effects, inconsistent schemas, no retry logic, and no idempotency guarantees. Each of those gaps becomes a production incident waiting for the right sequence of agent decisions to trigger it.
Bad memory design has a predictable failure signature. Teams stuff conversation history into the prompt because it is the path of least resistance. Token costs spike. Critical details from early in the conversation slide out of context as the window fills. Reasoning quality drops over long workflows because the model is reasoning over a degraded representation of what happened. In multi-agent setups, weak shared state causes agents to diverge, each operating from a different partial view of the world, producing outputs that are almost correct but misleading in ways that are hard to catch before they cause downstream errors.
We learned this the expensive way on Stovell, a predictive platform for asset managers and energy firms. The agents only produced decision-grade output because we held tightly curated domain data (historical market behavior, competitor pricing signals) in vector stores and retrieved against it, rather than stuffing it into the prompt. The retrieval layer was also what let us govern which data each agent could see. Context stuffing would have given us neither the cost profile nor the access control.
Anthropic's agent design documentation and OpenAI's memory guidance both prescribe the same structural answer: separate ephemeral conversation state from long-term knowledge. Session state, the in-progress context of the current interaction, lives in working memory. Long-term knowledge, the facts, documents, and prior outputs the agent needs across sessions, lives in a vector store: Pinecone, Weaviate, Qdrant, Milvus, or pgvector depending on your infrastructure constraints. Memory must be modular, auditable, and governed with PII handling at the retrieval layer, not just at ingestion. This separation matters for cost and for governance. Retrieving a relevant document from a vector store at query time costs a fraction of what it costs to keep that document in the context window for every turn of the conversation. At scale, that difference compounds. The discipline here overlaps closely with enterprise RAG system patterns, because the retrieval layer is doing the same job in both cases.
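The separation can be sketched with a toy in-memory store. This is an illustration only: the embeddings are hand-written two-dimensional vectors rather than the output of an embedding model, the linear scan stands in for a real vector database, and `Document`, `Session`, `retrieve`, and the role names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Document:
    """Long-term knowledge: persists across sessions."""
    text: str
    embedding: list
    allowed_roles: set


@dataclass
class Session:
    """Working memory: ephemeral state for one interaction."""
    role: str
    turns: list = field(default_factory=list)


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store, query_embedding, role, k=2, audit_log=None):
    """Return the top-k documents this role is allowed to see."""
    # Access control is applied at retrieval time, before ranking.
    visible = [doc for doc in store if role in doc.allowed_roles]
    ranked = sorted(
        visible,
        key=lambda doc: cosine(doc.embedding, query_embedding),
        reverse=True,
    )[:k]
    if audit_log is not None:
        # Record what was retrieved so each decision can be audited later.
        audit_log.append({"role": role, "retrieved": [d.text for d in ranked]})
    return ranked
```

Only the retrieved documents would be added to the prompt for a given turn. A document the role cannot see is filtered out before ranking, so it never enters the context window at all.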
A sophisticated skeptic will push back here. The objection is that long-context models, with context windows now exceeding 100,000 tokens for Claude and Gemini, eliminate the need for RAG entirely. If you can fit everything in context, why manage a retrieval index?
The objection holds force at small scale and fails at production scale for three reasons. First, cost: Princeton's "AI Agents that Matter" (2024) found that when agent evaluations are controlled for dollar cost, complex architectures that rely on large context stuffing frequently lose to simpler retrieval-based baselines. The cost curve alone justifies retrieval. Second, latency: a 100,000-token context window takes measurably longer to process than a 2,000-token prompt with three retrieved chunks. For synchronous user-facing workflows, that latency is a product problem. Third, auditability and access control: a retrieval layer lets you govern which documents an agent can access at query time, enforce row-level access controls, and audit what was retrieved for any given decision. Context stuffing gives you none of that. You cannot enforce access control on content that is already in the prompt.
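The cost argument is easy to check with back-of-the-envelope arithmetic. The price and the 20-turn conversation length below are hypothetical placeholders, not any vendor's actual rates, and the sketch ignores prompt caching, output tokens, and the cost of embedding and indexing.

```python
def input_cost_usd(tokens_per_turn: int, turns: int, usd_per_million_tokens: float) -> float:
    """Input-token cost when the same prompt size is sent on every turn."""
    return tokens_per_turn * turns * usd_per_million_tokens / 1_000_000


PRICE = 3.00  # hypothetical USD per million input tokens
TURNS = 20    # hypothetical conversation length

stuffed = input_cost_usd(100_000, TURNS, PRICE)  # full context on every turn
retrieved = input_cost_usd(2_000, TURNS, PRICE)  # short prompt plus retrieved chunks

print(f"context stuffing: ${stuffed:.2f} per conversation")
print(f"retrieval:        ${retrieved:.2f} per conversation")
```

Whatever the real price, the ratio between the two approaches is set by the token counts alone: 100,000 tokens per turn against 2,000 is a factor of 50 on input cost, multiplied across every session.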
RAG adds operational complexity. Index management, chunking strategy, and embedding model versioning are real costs that teams consistently underestimate. But those costs are bounded and manageable. The costs of context stuffing at scale are unbounded and compound with every additional user, session, and workflow step.
Memory and tool design are the load-bearing structure of an agent architecture. Get them right, and the framework choice above them becomes a secondary decision. Get them wrong, and no framework will save the system from itself.
Orchestration Patterns and Multi-Agent Systems
Memory and tools define what a single agent can do. Orchestration patterns define how multiple agents or agent steps coordinate, and most teams choose those patterns by accident rather than by design.
Engineering leaders should choose orchestration patterns (sequential, concurrent, group chat, or handoff) based on explicit tradeoffs in latency, cost, reliability, and governance. The default assumption that multi-agent systems are universally better than single-agent systems with strong tool design is not supported by production data.
The clearest illustration comes from comparing LangGraph and AWS Step Functions. LangGraph implements a graph-based state machine where each node can be an LLM call, a tool, or a decision function, with checkpoints and stateful flows. That flexibility makes it well-suited to multi-agent workflows where branching depends on model output. AWS Step Functions implements deterministic state machines where the agent gets invoked at specific states in the workflow. Step Functions logs every state transition, exposes every failure to inspection, and produces workflow behavior that is predictable independent of what the model does inside any single state.
The choice is a tradeoff: graph-based orchestration trades determinism for flexibility. You can express complex conditional logic in terms of model outputs, handle arbitrary branching, and build workflows that adapt at runtime. The cost is that debugging requires understanding both the graph structure and the model's reasoning at each node: two failure surfaces instead of one. Step Functions trades flexibility for auditability and operational tooling. The workflow is inspectable, the failure modes are bounded, and the agent's role is constrained to well-defined decision points. The cost is that you must anticipate the branching logic in advance, which is not always possible for genuinely open-ended tasks. Engineering leaders should pick based on which tradeoff the workload demands, not based on which framework their team has already used.
The single-versus-multi-agent debate compounds this decision. Anthropic's multi-agent research system uses specialized agents (a task decomposer, researchers, and a writer) and claims reliability gains for complex research workflows. Cognition AI's published argument, "Don't Build Multi-Agents," reaches the opposite conclusion: a single agent with coherent context is more reliable for production software engineering because multi-agent systems are hard to debug, prone to miscommunication when context fragments across agents, and miss nuance when tasks are subtly decomposed. Both companies are credible. The debate is unresolved. Any leader who tells you the answer is obvious is not paying attention to the evidence.
Princeton's "AI Agents that Matter" adds pressure to the multi-agent case from a different direction. On HumanEval, repeatedly sampling the base language model can outperform complex agent architectures while costing less. The authors found pervasive reproducibility shortcomings in WebArena and HumanEval evaluations, leading to inflated accuracy estimates for complex architectures. Many reported agent gains disappear once dollar cost is normalized. The practical implication is direct: complexity carries a cost, and that cost frequently exceeds the benefit when the task does not genuinely require it.
A counterargument holds partial force here. The proliferation of orchestration frameworks (LangGraph, AutoGen, CrewAI, the OpenAI Agents SDK, AWS Strands) suggests the field is too unstable to commit to a pattern. On that view, the reasonable response is to wait for consolidation before making architectural bets. The field has churned through a new framework generation roughly every year for the past three years, which makes the impulse understandable.
The problem with waiting is that the frameworks are unstable but the patterns beneath them are not. Sequential orchestration, where one agent step completes before the next begins, is stable. Concurrent orchestration, where multiple agents work in parallel on independent subtasks, is stable. Group chat patterns, where agents communicate through a shared message stream with a controller managing turns, are stable. Handoff patterns, where one agent transfers context and control to another at a defined boundary, are stable. These patterns exist in LangGraph, AutoGen, and CrewAI, expressed differently but structurally identical. A team that commits to a sequential pattern with clear handoff boundaries can switch frameworks without redesigning the architecture. A team that has coupled its logic to framework-specific abstractions faces a rewrite.
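To show how little of this depends on a framework, three of the four patterns can be sketched with nothing but the standard library. The group chat pattern is omitted for brevity, each "agent" is just a function from string to string, and the names `sequential`, `fan_out`, and `handoff` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Step = Callable[[str], str]  # an agent step: task text in, result text out


def sequential(steps: list, task: str) -> str:
    """Sequential: each step finishes before the next one starts."""
    for step in steps:
        task = step(task)
    return task


def fan_out(steps: list, task: str) -> list:
    """Concurrent: independent steps work on the same task in parallel."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input steps.
        return list(pool.map(lambda step: step(task), steps))


def handoff(first: Step, second: Step, should_hand_off, task: str) -> str:
    """Handoff: control passes to a second agent at a defined boundary."""
    draft = first(task)
    return second(draft) if should_hand_off(draft) else draft
```

If application logic is written against functions shaped like these, a framework becomes one possible implementation of each function rather than the place where the architecture lives.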
Firecrawl has observed that MCP's resurgence is driving cross-framework interoperability, further reducing the cost of switching frameworks. When your tool catalog is MCP-compatible and your orchestration pattern is expressed at the architectural level, the framework underneath becomes a replaceable implementation detail rather than a structural commitment.
The practical sequence for engineering leaders: pick the orchestration pattern first, based on your task's latency requirements, your debugging tolerance, and your governance constraints. Evaluate frameworks against the pattern you have chosen. Treat the framework as replaceable from day one, because the history of this space suggests it will be. The architectural patterns underneath have outlasted every framework generation so far, and they will outlast the next one too. Commit to the pattern. Keep the framework on a short leash.
Where to Draw the Line on Autonomy
The single architectural decision that separates production agents from demos is graduated autonomy: which tool calls require deterministic gating, which require human approval, and which the agent can execute freely.
Most teams treat this as a policy question, something for a product manager to document in a requirements brief. It belongs in the AI agent architecture itself. A policy that lives in a document has no runtime enforcement. A policy encoded in the authorization layer of your tool calls is enforced on every execution, including the edge cases no one anticipated during design.
The three-tier model maps cleanly onto this. Deterministic gating belongs at the API gateway layer, before the agent runtime ever sees the request. Actions with irreversible downstream effects (writing to production databases, sending external communications, modifying financial records) require human approval routed through an explicit checkpoint in the workflow, not a prompt instruction asking the model to be careful. Actions with bounded, reversible scope can be delegated to the agent to execute freely, because the cost of a mistake is recoverable.
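A minimal sketch of an autonomy gradient enforced in code. The tool names, the `Gate` levels, and the `approved_by` parameter are assumptions for illustration; in a real system the approval would come from a workflow checkpoint rather than a function argument.

```python
from enum import Enum


class Gate(Enum):
    FREE = "free"                      # bounded and reversible
    HUMAN_APPROVAL = "human_approval"  # irreversible or expensive
    DENY = "deny"                      # never callable by the agent


# Hypothetical policy table. Unknown tools fall through to DENY.
TOOL_POLICY = {
    "read_order_status": Gate.FREE,
    "draft_reply": Gate.FREE,
    "send_external_email": Gate.HUMAN_APPROVAL,
    "write_production_db": Gate.HUMAN_APPROVAL,
}


def authorize(tool_name: str, approved_by=None) -> bool:
    gate = TOOL_POLICY.get(tool_name, Gate.DENY)
    if gate is Gate.FREE:
        return True
    if gate is Gate.HUMAN_APPROVAL:
        # Requires a recorded approver, not a prompt instruction.
        return approved_by is not None
    return False


def execute(tool_name: str, tool, approved_by=None) -> dict:
    """Run a tool only if the authorization layer allows it."""
    if not authorize(tool_name, approved_by):
        return {"ok": False, "error": "not_authorized", "tool": tool_name}
    return {"ok": True, "result": tool()}
```

Defaulting unknown tools to `DENY` means a newly added tool is blocked until someone classifies it, which covers the edge cases no one anticipated during design.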
We encoded exactly this gradient on a midstream oil and gas alarms platform. The agent filtered false-positive pipeline alarms on its own, a bounded and reversible call, using anomaly-detection models like Isolation Forest and autoencoders. It sent SMS alerts directly. But a signal that looked like a real leak routed to a human before any crew was dispatched into the field, because that action was expensive and slow to reverse. The gradient lived in the workflow's authorization layer, not in a prompt asking the model to be careful. The agent also retrained on operator feedback, which is the learn stage from earlier doing real work.
Teams that skip this design end up in one of two failure modes. The first is a paralyzed agent that escalates everything to human review because no one drew the boundaries, and the model defaults to caution. The second is an unsafe agent that executes freely across all tool calls because the team assumed the model's judgment was sufficient and did not build gates. Both failures are architectural. Neither is fixable with a better prompt.
Pick the autonomy gradient before you pick the framework. The framework cannot make this decision for you, and the model definitely should not.
If you are designing an agent architecture and want engineers who have shipped these patterns in production, that is the work our AI agent development team does. The examples here are real engineering implementations, adapted for clarity and confidentiality.


