
Agent Architecture Patterns

Common patterns for building AI agents that work in production.

I’ve built four agent systems in the past year. The first one was a disaster. The second was mediocre. The third worked but was fragile. The fourth ran autonomously for three hours and produced the most thorough output I’ve gotten from any AI system.

The difference between these four wasn’t the model, the framework, or the budget. It was the architecture. The patterns I used to structure how agents think, act, coordinate, and recover from failure.

Agent architectures in early 2026 are still young. The industry is collectively figuring out which patterns actually work in production versus which ones look good in blog posts and conference talks. Having built systems on both sides of that line, I have opinions about where the real value is and where the hype exceeds the substance.

The landscape, honestly

There are now over a dozen agent frameworks competing for mindshare: LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, Google ADK, AutoGen, Semantic Kernel, Camel, and more. Each has its own abstractions, its own philosophy, and its own idea of what an “agent” is.

This proliferation has created confusion. Teams spend weeks evaluating frameworks before writing any agent logic. They pick a framework, get locked into its abstractions, and then discover that the framework’s opinions about how agents should work don’t match how their specific problem requires agents to work.

Here’s the current state of the major frameworks:

| Framework | Philosophy | Strengths | Weaknesses |
|---|---|---|---|
| LangGraph | Graphs and state machines | Fine-grained control, lowest latency, huge ecosystem (47M+ PyPI downloads) | Steep learning curve, verbose for simple cases |
| CrewAI | Teams of specialized agents | Intuitive “crew” metaphor, fastest-growing for multi-agent, beginner-friendly | Higher token usage, less control over execution flow |
| OpenAI Agents SDK | Minimal abstractions | Easiest to start, clean API, good docs | Tied to OpenAI ecosystem, limited multi-agent support |
| Claude Agent SDK | Extended thinking + tool use | Strong reasoning, native computer use, good for complex workflows | Newer ecosystem, less community tooling |
| AutoGen | Conversational agents | Strong research backing (Microsoft), good for debate-style patterns | Complex setup, heavyweight for production |

My take: the framework matters less than the patterns you implement within it. I’ve seen great agents built on LangGraph and terrible agents built on LangGraph. The framework provides scaffolding. The architecture provides structure.

If you’re just starting, OpenAI Agents SDK or CrewAI will get you to a working prototype fastest. If you need fine-grained control over execution flow, LangGraph is the right choice. If your agents need deep reasoning, Claude Agent SDK’s extended thinking is currently unmatched for that specific capability.

But this post isn’t about frameworks. It’s about the patterns that make agents work regardless of framework.

Pattern 1: ReAct (Reasoning + Acting)

ReAct is the foundational pattern for modern agents. The agent alternates between thinking (reasoning about what to do) and acting (executing a tool call or producing output). After each action, the agent observes the result and reasons about what to do next.

Thought: I need to find the user's recent orders to answer this question.
Action: query_orders(user_id="u_123", limit=5)
Observation: [Order #4521 - Jan 15, Order #4498 - Jan 3, ...]
Thought: The user has 5 recent orders. They asked about returns.
         Let me check the return policy for the most recent order.
Action: get_return_policy(order_id="4521")
Observation: { eligible: true, deadline: "Feb 15", method: "mail" }
Thought: I have all the information needed to answer.
Action: respond("Your most recent order #4521 is eligible for return
        until Feb 15. You can return it by mail...")

This pattern works because it separates reasoning from execution. The model thinks before it acts. This reduces the chance of premature or incorrect tool calls. It also makes the agent’s behavior interpretable: you can read the thought chain and understand why it did what it did.

When to use ReAct: Most single-agent tasks. It’s the default pattern for a reason.

When it breaks down: Long, multi-step tasks where the reasoning chain gets so long that the model loses track of earlier steps. Tasks requiring parallel execution (ReAct is inherently sequential). Tasks where the action space is very large and the model struggles to select the right tool.

Production considerations:

The thought tokens are expensive. In a production agent, you often want to limit thought length or use a cheaper model for the reasoning step and a more capable model for the final response. Some implementations separate the “planning” model from the “execution” model for exactly this reason.

Also: log the thoughts. They’re invaluable for debugging. When an agent does something unexpected in production, the thought chain tells you exactly where the reasoning went wrong. Without it, you’re guessing.
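
To make the loop concrete, here is a minimal framework-free sketch in Python. Everything in it is a stand-in, not any real SDK's API: `llm_step` represents your model call (returning a thought plus either a tool call or a terminal `respond` action), and `tools` is a dict of callables. Note that the loop appends every thought, action, and observation to `history`, which doubles as the log you'll want for debugging.

```python
def run_react(task, tools, llm_step, max_turns=10):
    """Minimal ReAct loop: think, act, observe, repeat until respond."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        # llm_step returns (thought, action_name, args); "respond" is terminal
        thought, action, args = llm_step(history)
        history.append(f"Thought: {thought}")
        if action == "respond":
            history.append(f"Action: respond({args!r})")
            return args, history
        history.append(f"Action: {action}({args})")
        observation = tools[action](**args)
        history.append(f"Observation: {observation}")
    return None, history  # turn budget exhausted; caller decides what to do
```

The `max_turns` cap is one of the execution guardrails discussed later: without it, a confused agent loops forever.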

Pattern 2: Tool Use

Tool use is not a pattern in itself. It’s a capability that enables other patterns. But how you design the tool interface matters enormously for agent reliability.

The current state of tool integration is converging on a standard: Model Context Protocol (MCP), originally released by Anthropic in November 2024 and donated to the Linux Foundation in December 2025. MCP provides a vendor-neutral standard for agent-tool integration. OpenAI, Google, and Microsoft have all adopted it. It’s becoming the USB-C of agent tools: a single interface that works across frameworks and models.

What MCP gets right is the separation between the tool’s interface (schema, description, parameters) and the tool’s implementation. The agent sees a standardized description of what the tool does. The implementation can be anything: an API call, a database query, a local function, a call to another AI model.

Here’s what I’ve learned about tool design that affects agent reliability:

Tool descriptions matter more than tool implementations. The model decides which tool to use based on the description. A vague description (“processes data”) leads to wrong tool selection. A precise description (“retrieves the 5 most recent orders for a given user ID, returns order ID, date, total, and status”) leads to correct selection.

Fewer tools is better. If your agent has access to 50 tools, it will make more selection errors than if it has access to 10. Prune the tool set to what’s actually needed for the current task. If you have domain-specific tool sets, load only the relevant set based on the user’s intent.

Tool outputs should be structured and bounded. A tool that returns a 10,000-token JSON blob wastes context and confuses the model. Return only the fields the agent needs. Paginate large results. Summarize when possible.

Validate tool inputs. The model will sometimes call tools with invalid arguments. Type checking, bounds checking, and schema validation on tool inputs prevent a huge class of errors that otherwise manifest as mysterious downstream failures.

# Bad: Agent calls tool with unvalidated input
result = search_orders(query=agent_output["query"])
# Could be: None, empty string, SQL injection, absurdly long string

# Good: Validate before execution
def search_orders(query: str) -> list[Order]:
    if not query or len(query) > 500:
        raise ToolInputError("Query must be 1-500 characters")
    query = sanitize(query)
    return db.search(query)

Pattern 3: Planning

Planning is the difference between an agent that stumbles through a task and one that executes methodically. A planning agent creates an explicit plan before taking action, then executes the plan step by step, revising when needed.

There are two major planning approaches:

Plan-then-execute

The agent creates a complete plan upfront, then executes each step sequentially. This works well when the task structure is known in advance.

Plan:
1. Retrieve user's account information
2. Check current subscription tier
3. Look up available upgrade options
4. Calculate price difference
5. Present options to user with comparison table

Executing step 1...
Executing step 2...
[continues sequentially]

Pros: Predictable execution. Easy to monitor progress. The user can see the plan before execution starts.

Cons: Brittle when early steps reveal information that invalidates later steps. The plan might be wrong, and you’ve committed to it.

Adaptive planning

The agent creates an initial plan but revises it after each step based on what it learned. This is more robust but harder to implement well.

Initial plan:
1. Check user permissions → 2. Fetch data → 3. Process → 4. Return

After step 1: User doesn't have permissions for the original approach.
Revised plan:
1. ✓ Check user permissions
2. Fetch limited dataset (adjusted for permission level)
3. Process with restricted scope
4. Return with note about limited access

Production recommendation: Use adaptive planning for anything non-trivial. The cost of revising the plan (a few extra tokens for the model to reconsider) is tiny compared to the cost of executing a wrong plan to completion.

The planning object pattern: Store the plan as a structured object that the agent can read and modify. Not as free-form text in the context window. A structured plan is easier to track, display to users, and recover from if the agent hits a context limit.

{
  "plan": [
    { "step": 1, "action": "check_permissions", "status": "completed",
      "result": "limited_access" },
    { "step": 2, "action": "fetch_data", "status": "in_progress",
      "params": { "scope": "limited" } },
    { "step": 3, "action": "process", "status": "pending" },
    { "step": 4, "action": "respond", "status": "pending" }
  ],
  "revision_count": 1,
  "revision_reason": "Permission level requires scope adjustment"
}
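
Here's a rough sketch of an executor for a plan object like the one above. The `revise` callback stands in for the model call that reconsiders pending steps after each result, and `executors` maps action names to real tool functions; both are assumptions for illustration, not any framework's API.

```python
def execute_plan(plan, executors, revise):
    """Execute pending steps in order, letting `revise` rewrite the
    remaining pending steps after each completed step."""
    while True:
        step = next((s for s in plan["plan"] if s["status"] == "pending"), None)
        if step is None:
            return plan  # nothing left to do
        step["status"] = "in_progress"
        step["result"] = executors[step["action"]](step.get("params", {}))
        step["status"] = "completed"
        # revise() is a model call in practice; it returns
        # (new_pending_steps, reason), or (None, None) to keep the plan.
        new_steps, reason = revise(plan, step)
        if new_steps is not None:
            kept = [s for s in plan["plan"] if s["status"] == "completed"]
            plan["plan"] = kept + new_steps
            plan["revision_count"] += 1
            plan["revision_reason"] = reason
```

Because the plan lives in a plain data structure, you can persist it to disk between steps and resume from it if the agent dies mid-task.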

Pattern 4: Memory architectures

Memory is where most agent systems are weakest. Current models have context windows ranging from 128K to 2M tokens, which sounds like a lot until your agent is 40 steps into a complex task and the early context has been pushed out or compressed.

Agent memory comes in three tiers:

| Tier | Scope | Implementation | Example |
|---|---|---|---|
| Working memory | Current task | Context window | Current conversation, tool results, plan state |
| Session memory | Current session | Structured state stored outside the context | User preferences learned during this session |
| Persistent memory | Across sessions | Vector DB, relational DB, key-value store | User history, past interactions, learned patterns |

Working memory management

The most immediate challenge is working memory. As an agent executes steps, the context window fills up with tool results, intermediate reasoning, and conversation history. Eventually, important early context gets pushed out.

Strategies for managing this:

Summarize and compress. After each major step, summarize the key findings and drop the raw tool output. Keep the conclusions, discard the evidence.

# Instead of keeping 2000 tokens of raw API response in context:
orders = api.get_orders(user_id)  # Returns ~2000 tokens of raw data

# Compress to essential findings:
summary = (
    f"User has {len(orders)} orders. Most recent: #{orders[0].id} "
    f"on {orders[0].date}, total ${orders[0].total}. "
    f"Return eligible: {orders[0].return_eligible}"
)
# Now ~50 tokens instead of 2000

Persist to disk, not to context. For long-running agents, write intermediate results to files and load them only when needed. This is the pattern I used in my multi-agent system: each agent writes to its own file, and the synthesis agent reads from all files when it’s time to compile the final output.

Sliding window with anchors. Keep the system prompt and the most recent N turns in full context. Everything else gets summarized. “Anchors” are critical pieces of information (user’s original request, key constraints, important findings) that stay in full context regardless of age.
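
A sliding-window context builder might look like the following sketch. The `summarize` callback is a placeholder for whatever compression you use (usually a cheap model call); everything else is plain string assembly.

```python
def build_context(system_prompt, anchors, turns, summarize, recent_n=5):
    """Sliding window with anchors: the system prompt and anchor facts
    stay verbatim regardless of age, the last N turns stay verbatim,
    and everything older is collapsed into one summary."""
    older, recent = turns[:-recent_n], turns[-recent_n:]
    parts = [system_prompt]
    parts += [f"[anchor] {a}" for a in anchors]  # pinned, never evicted
    if older:
        parts.append(f"[summary of {len(older)} earlier turns] {summarize(older)}")
    parts += recent
    return "\n".join(parts)
```

The anchors list is small and curated: the user's original request, hard constraints, key findings. Everything else is eligible for compression.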

Persistent memory

Persistent memory (information that survives across sessions) is still immature in most agent implementations. The common approach is to embed key information in a vector database and retrieve it at the start of each session, but the retrieval quality is inconsistent and the update mechanism is clunky.

What I’ve found works better for production:

Structured memory over vector memory. For information that has a clear schema (user preferences, past decisions, known constraints), use a relational database with explicit fields. Vector search is for fuzzy, unstructured recall. Don’t use vector search for information you can query directly.

Memory as a first-class tool. Give the agent tools to explicitly save and retrieve memories, rather than trying to automatically detect what’s worth remembering.

tools = [
    Tool("save_memory",
         description="Save a key fact or user preference for future sessions",
         params={"key": str, "value": str, "category": str}),
    Tool("recall_memory",
         description="Retrieve saved facts about the user or topic",
         params={"query": str, "category": str})
]

Pattern 5: Multi-agent coordination

Multi-agent systems are the most hyped and most misunderstood pattern in the current landscape. The promise is compelling: specialized agents working together should produce better results than a single generalist agent. The reality is that coordination overhead often eats the specialization benefit.

Here’s what I’ve observed: each agent adds coordination overhead. Ten agents aren’t automatically better than three. The goal is specialized efficiency, not agent count.

Hierarchical orchestration

This is the pattern that works best in production. One “lead” agent coordinates multiple “worker” agents. The lead agent owns the plan, assigns tasks, reviews outputs, and handles conflicts. Worker agents focus on specific capabilities and report back to the lead.

                    ┌──────────┐
                    │   Lead   │
                    │  Agent   │
                    └────┬─────┘
                         │
            ┌────────────┼────────────┐
            │            │            │
       ┌────▼────┐  ┌────▼────┐  ┌────▼────┐
       │Research │  │Analysis │  │ Writer  │
       │ Agent   │  │ Agent   │  │ Agent   │
       └─────────┘  └─────────┘  └─────────┘

The lead agent is the only one with the full picture. Workers have focused contexts and specific mandates. This reduces the coordination overhead because workers don’t need to communicate with each other; they communicate with the lead.

Key insight from production: Equal-status agent architectures (where all agents are peers and coordinate among themselves) consistently fail at scale. They end up in coordination loops, duplicating work, or producing contradictory outputs that nobody resolves. The hierarchical pattern, where one agent has authority and responsibility, works.

The file isolation pattern

When multiple agents work in parallel, they must not write to the same files. This sounds obvious, but it’s the most common source of multi-agent failures.

My rule: each agent writes to exactly one designated output file. The lead agent reads from all files and assembles the final output. No exceptions.

workspace/
├── lead_agent/
│   └── final_report.md        # Only lead writes here
├── research_agent/
│   └── research_notes.md      # Only research writes here
├── analysis_agent/
│   └── analysis.md            # Only analysis writes here
└── shared/
    └── evidence_matrix.md     # Append-only, structured format

The shared file (evidence_matrix) is append-only and uses a structured format that prevents conflicts. Each agent appends rows. Nobody edits existing rows.
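
A minimal version of that append-only discipline, sketched here with CSV rather than a markdown table for simplicity. This is an illustration of the rule, not my production code; real parallel writers would also want file locking around the append.

```python
import csv
import os

def append_evidence(path, agent, claim, source, strength):
    """Append one structured row to the shared evidence file.
    Rows are only ever appended, never edited, so parallel agents
    cannot clobber each other's entries."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["agent", "claim", "source", "strength"])
        writer.writerow([agent, claim, source, strength])
```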

Coordination rules that prevent chaos

From trial and error (mostly error), I’ve learned these coordination rules:

1. Single source of truth for tasks. Use a shared task list that every agent reads from. The lead agent is the only one that creates, modifies, or completes tasks. Workers read their assigned tasks and report results.

2. Structured reporting. Every agent, when finishing a task, reports exactly four things: what it did, key findings, evidence strength, and what it needs next. This structure makes the lead’s job possible. Without it, the lead has to parse unstructured prose to figure out what happened.

3. Persist continuously. Agents must write intermediate findings to disk as they work, not just at the end. If an agent hits its context limit and stops, the work survives.

4. No code without approval. If agents can write code, only designated agents should be allowed to do so, and only after explicit approval. Eager agents will rewrite your test suite and introduce bugs.
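
Rule 2 is easy to enforce mechanically. Here's a sketch of the report structure, with a validator that rejects anything unstructured before the lead sees it. The field names are my own shorthand for the four items above, not any framework's schema.

```python
from dataclasses import dataclass

@dataclass
class WorkerReport:
    """The four things every worker reports back to the lead agent."""
    did: str                # what it did
    findings: list[str]     # key findings
    evidence_strength: str  # e.g. "strong" | "mixed" | "weak"
    needs_next: str         # what it needs from the lead

def validate_report(raw: dict) -> WorkerReport:
    """Reject incomplete reports instead of letting the lead parse prose."""
    required = {"did", "findings", "evidence_strength", "needs_next"}
    missing = required - raw.keys()
    if missing:
        raise ValueError(f"Report missing fields: {sorted(missing)}")
    return WorkerReport(**raw)
```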

Model allocation

Not every agent needs the most expensive model. This is an engineering decision that affects cost, latency, and quality.

| Agent type | Recommended model tier | Why |
|---|---|---|
| Research / retrieval | Fast, cheap (Haiku-class) | High volume of queries, results are factual |
| Analysis / reasoning | Capable (Sonnet-class) | Needs to synthesize and evaluate |
| Planning / coordination | Most capable (Opus-class) | Decisions affect entire workflow |
| Code generation | Capable+ (Sonnet/Opus) | Errors are expensive to fix |
| Summarization | Fast, cheap (Haiku-class) | Compression is well-handled by smaller models |

Matching model capability to task complexity saves 60-80% on tokens without sacrificing quality on the tasks that require intelligence.
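
In code, this allocation can be as simple as a lookup table with a safe default. The model names below are placeholders for whatever tiers your provider offers, not real model identifiers.

```python
# Placeholder tier names; substitute your provider's actual models.
MODEL_TIERS = {
    "research": "fast-cheap-model",
    "retrieval": "fast-cheap-model",
    "summarization": "fast-cheap-model",
    "analysis": "capable-model",
    "code": "capable-model",
    "planning": "most-capable-model",
}

def model_for(agent_type: str) -> str:
    """Route each agent role to the cheapest adequate tier.
    Unknown roles fall back to the most capable tier: failing
    expensive is better than failing wrong."""
    return MODEL_TIERS.get(agent_type, "most-capable-model")
```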

Pattern 6: Guardrails and safety

This pattern is underinvested across the industry. Most agent tutorials show you how to build a capable agent. Almost none show you how to build a safe one.

Input guardrails

Check the user’s input before the agent sees it. Detect prompt injection, inappropriate content, and out-of-scope requests.

async def process_request(user_input: str) -> Response:
    # Guardrail: Input validation
    injection_check = await detect_injection(user_input)
    if injection_check.flagged:
        return Response("I can't process that request.")

    scope_check = await check_scope(user_input)
    if not scope_check.in_scope:
        return Response("That's outside my area. Try X instead.")

    # Only after guardrails pass does the agent see the input
    return await agent.run(user_input)

Output guardrails

Check the agent’s output before the user sees it. This catches hallucination, policy violations, and format errors.
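
Even a deterministic first pass catches a lot before any model-based check runs. A sketch, with a made-up banned-phrase list standing in for your actual policy rules:

```python
def check_output(draft: str, max_len: int = 2000,
                 banned_phrases=("guaranteed refund", "legal advice")):
    """Cheap deterministic output checks before the user sees the draft.
    Returns (ok, reason). Real deployments layer a model-based
    policy/hallucination check on top of these."""
    if not draft.strip():
        return False, "empty response"
    if len(draft) > max_len:
        return False, "response too long"
    lowered = draft.lower()
    for phrase in banned_phrases:
        if phrase in lowered:
            return False, f"policy violation: {phrase!r}"
    return True, "ok"
```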

Execution guardrails

Limit what the agent can actually do. Maximum number of tool calls per turn. Maximum tokens per response. Maximum cost per session. Timeouts on every external call.

agent_config = {
    "max_tool_calls_per_turn": 10,
    "max_turns": 50,
    "max_cost_per_session": 5.00,  # dollars
    "tool_timeout_seconds": 30,
    "require_approval_for": ["delete_*", "send_email", "modify_billing"]
}

The require_approval_for list is critical. Some actions should always require human confirmation. The SaaStr incident (autonomous agent dropping a production database) happened because nobody restricted what the agent could do. Execution guardrails would have prevented it.

The circuit breaker

If an agent starts failing repeatedly, stop it. Don’t let it burn through tokens retrying a broken approach.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before allowing a retry
        self.opened_at = None
        self.state = "closed"  # closed = operational

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "open"  # open = stopped
            self.opened_at = time.time()
            # Alert engineering
            notify("Agent circuit breaker tripped")

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def can_proceed(self) -> bool:
        # After the reset timeout, allow a single probe (half-open)
        if self.state == "open" and time.time() - self.opened_at > self.reset_timeout:
            self.state = "half-open"
        return self.state != "open"

Pattern 7: Human-in-the-loop

The most production-ready pattern is often the least exciting: keep a human in the loop for decisions that matter.

This doesn’t mean “the human approves every action.” It means designing specific checkpoints where the agent pauses for human review.

Checkpoint placement matters. Put checkpoints at high-impact decision points, not at every step. An agent that asks for approval 20 times per task will be abandoned. An agent that asks at the three critical junctures (before executing an irreversible action, before sending external communication, before making a decision that affects cost) will be trusted.

checkpoints = {
    "plan_approval": "Before executing, show the plan to the user",
    "external_action": "Before sending email/message/API call",
    "cost_threshold": "Before actions exceeding $10",
    "destructive_action": "Before delete/modify operations"
}

The goal is to graduate toward less oversight as the agent proves reliable. Start with human-in-the-loop for everything. Move to human-in-the-loop for high-stakes decisions only. Eventually, move to human-on-the-loop (monitoring but not approving) for well-understood workflows.

What’s overhyped

Let me be direct about what I think is overhyped in the agent space right now.

“Autonomous” agents

The marketing says “fully autonomous AI agents.” The reality is that fully autonomous agents are appropriate for narrow, well-defined, low-stakes tasks. For anything with real consequences, you want human oversight. The companies that are deploying agents successfully in production are the ones with the most guardrails and human checkpoints, not the least.

Agent count as a metric

“Our system uses 15 agents!” is not impressive. It’s a warning sign. More agents means more coordination overhead, more failure modes, and more cost. The question is: does each agent add value that couldn’t be achieved with a simpler architecture? In my experience, most tasks that teams decompose into 10+ agents would work better with 3-5 well-designed agents.

General-purpose agent frameworks

No framework will save you from thinking about your specific problem. Frameworks provide scaffolding, but the critical decisions (what tools to provide, how to structure the plan, what guardrails to implement, where to put human checkpoints) are all domain-specific. The framework can’t make these decisions for you.

“Agents replace software”

Agents don’t replace software. They extend it. An agent that accesses a database still needs the database. An agent that calls APIs still needs the APIs. An agent that generates code still needs the deployment pipeline. Agents are a new interface to existing software infrastructure, not a replacement for it.

What’s underrated

Context engineering

The most important skill in building agents isn’t prompt engineering or framework expertise. It’s context engineering: deciding what information the agent has access to at each step. What’s in the system prompt. What’s retrieved at query time. What’s in the tool descriptions. What’s persisted across turns.

An agent with perfect context and a mediocre prompt will outperform an agent with perfect prompts and bad context every time.

Observability

You can’t improve what you can’t see. Agent observability (logging every thought, action, observation, and decision with enough detail to reconstruct the full execution) is essential for debugging, improving, and trusting agents in production.

Frameworks like LangSmith and Langfuse provide tracing infrastructure for this. If you’re building production agents without observability, you’re flying blind.

Deterministic scaffolding around non-deterministic cores

The best agent systems use the AI model for the parts that require intelligence (reasoning, synthesis, generation) and traditional code for everything else (routing, validation, state management, error handling). The non-deterministic model is wrapped in deterministic code that constrains its behavior and handles its failures.

This sounds obvious, but most tutorials show the model doing everything: parsing inputs, selecting tools, validating outputs, managing state. In production, each of those should be a deterministic code layer that the model never touches.
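
As a concrete sketch: a deterministic wrapper that owns parsing, schema validation, and retries, so the model call is the only non-deterministic part. `model_call` is a stand-in for your provider's API, and the retry policy here is deliberately naive.

```python
import json

def call_with_scaffolding(model_call, prompt, schema_keys, retries=2):
    """Deterministic shell around a non-deterministic core: the code,
    not the model, handles parsing, validation, and retries."""
    last_error = None
    for _ in range(retries + 1):
        raw = model_call(prompt)
        try:
            data = json.loads(raw)                 # parsing: deterministic
        except json.JSONDecodeError as e:
            last_error = f"invalid JSON: {e}"
            continue
        missing = set(schema_keys) - data.keys()   # validation: deterministic
        if missing:
            last_error = f"missing keys: {sorted(missing)}"
            continue
        return data
    raise RuntimeError(f"Model output failed validation: {last_error}")
```

The model never sees the retry logic or the schema check; it just gets asked again. Failures surface as a typed exception your error handling can deal with, not as garbage flowing downstream.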

The architecture checklist

When designing an agent system, here’s what I run through:

| Decision | Options | Recommendation |
|---|---|---|
| Single vs multi-agent | Single, hierarchical multi, peer multi | Start single. Move to hierarchical if you need specialization. Avoid peer multi. |
| Planning approach | No planning, plan-then-execute, adaptive | Adaptive for non-trivial tasks |
| Memory strategy | Context only, context + session, full persistent | Context + session for most cases. Persistent for user-facing agents. |
| Execution model | Sequential, parallel, mixed | Mixed: parallel for independent tasks, sequential for dependent ones |
| Human oversight | None, HITL at checkpoints, full approval | Checkpoints for high-stakes. Full approval initially, graduating to less. |
| Tool integration | Direct API, MCP, custom | MCP where available, custom for internal tools |
| Observability | None, logging, full tracing | Full tracing from day one |
| Guardrails | Input only, output only, full stack | Full stack: input + output + execution |

If you’re building your first agent, start with: single agent, ReAct pattern, adaptive planning, context-only memory, sequential execution, HITL at checkpoints, direct tool integration, basic logging, and full guardrails. Evolve from there based on what you actually need, not what sounds impressive.

The agents that work in production aren’t the most sophisticated. They’re the most disciplined. Good architecture makes good agents. Framework choice is secondary. Pattern choice is primary.
