What Is an AI Agent? Anatomy, Types & When to Build One
An AI agent is a cognitive loop that independently pursues goals using tools — build one only when complex decisions, brittle rules, or unstructured data make simpler approaches provably inadequate.
"Most teams don't fail because they picked the wrong model. They fail because they built an agent when they needed a script."
Most AI agent projects don't fail because the model wasn't capable enough. They fail because the team built an agent for a problem that a 50-line deterministic script would have solved more reliably.
According to Composio's 2025 AI Agent Report, fewer than 1 in 8 agent pilots reach sustained production operation. The root causes aren't model failures — they're scope creep (34%) and over-engineering that happens before anyone runs a meaningful test.
The cost of that mistake is steep. A mis-scoped agent introduces nondeterminism where you needed determinism, adds latency where you needed speed, and creates debugging complexity where you needed auditability. Worse, it becomes the reason your team writes off "AI" entirely — when the actual problem was an architecture mismatch.
This chapter gives you the vocabulary and decision framework to determine which type of AI system a problem actually requires, before a line of code is written.
Not All AI Systems Are Agents
Here is the confusion that trips up most teams: they call everything an "agent."
A RAG pipeline that retrieves documents and answers a question is not an agent. A multi-step pipeline that classifies a support ticket and routes it to the right queue is not an agent either. Both are useful. Both involve LLMs. Neither is an agent in the technically meaningful sense.
There are three distinct tiers of AI system, and confusing them is the first failure mode.

Tier 1 — The Augmented LLM
An Augmented LLM enhances a single model call with external context. The control flow is fixed. The model answers; it does not act.
The canonical form is RAG — Retrieval-Augmented Generation. A user asks a question. Code retrieves relevant documents. The model synthesizes an answer from those documents. One call. One response. Done.
Who controls the flow: Code. Always code.
This is the right architecture for Q&A over documents, summarization, classification, and structured extraction — any task where the answer is the output, not a step toward a further action.
It's the wrong architecture for anything requiring multi-step execution, tool use, or adaptive judgment.
Tier 2 — The Deterministic Workflow
A workflow is a system where LLMs are components inside a predefined, hard-coded execution path. The model is invoked at specific points. The orchestration logic lives in code.
The canonical form: an LLM classifies a support ticket → code routes it to the right queue → a second LLM drafts a reply. Predictable. Auditable. Fast.
Who controls the flow: Code. The LLM is a function call within a larger script.
This is right for well-understood multi-step processes with known decision branches, compliance-critical pipelines, and high-volume structured tasks. It's wrong for tasks where the decision path can't be predicted in advance, or where adapting to new information mid-execution is required.
Tier 3 — The True AI Agent
An agent is a system where the LLM itself is the orchestrator. It perceives the environment, reasons about the next best action, selects and invokes tools, observes the results, and repeats this loop until a goal is reached. The control flow is dynamic and model-driven.
The canonical form: a research agent given "summarize all recent clinical trials for Drug X" — it plans a search strategy, calls a web search tool, reads results, decides to query PubMed next, synthesizes findings across sources, corrects itself when a source is paywalled, and produces a structured report. No fixed execution path exists.
Who controls the flow: The model. The code provides tools and a harness; the model determines the sequence.
This is right for open-ended goals, multi-system workflows, unstructured inputs, and tasks requiring adaptive replanning. It's the wrong choice for deterministic ETL, compliance validation with known schemas, latency-critical paths, or anything where "wrong answer" has unacceptable blast radius without a human in the loop.
The key insight that Anthropic, OpenAI, and Google all converge on: the only distinction that matters is who controls the execution path — code or model. [Anthropic, 2024; OpenAI, 2025; Google, 2024]
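A minimal sketch of that code-versus-model distinction, assuming a stubbed `fake_llm` in place of a real model call. Every name here is hypothetical and chosen for illustration; the point is only where the "what happens next" decision lives.

```python
from typing import Callable

# Hypothetical stand-in for a real model call; returns canned text so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("classify:"):
        return "billing" if "invoice" in prompt else "technical"
    if prompt.startswith("next_action:"):
        # A real agent would reason here; the stub finishes once it has seen one result.
        return "finish" if "result=" in prompt else "search"
    return f"draft reply for: {prompt}"

def workflow(ticket: str) -> tuple[str, str]:
    """Tier 2: code fixes the path (classify, then draft). The model is a function call."""
    queue = fake_llm(f"classify: {ticket}")
    reply = fake_llm(f"draft: {ticket}")
    return queue, reply

def agent(goal: str, tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> list[str]:
    """Tier 3: the model picks each next step; code only supplies tools and a step limit."""
    history: list[str] = []
    for _ in range(max_steps):
        action = fake_llm(f"next_action: goal={goal} history={history}")
        if action == "finish":
            break
        observation = tools[action](goal)
        history.append(f"{action} -> result={observation}")
    return history
```

In `workflow`, the sequence is readable from the source code. In `agent`, the sequence only exists in the returned history after the run.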
Think about it: Think of a process at your current company (or a project you know well) that someone has described as needing an "AI agent." Run it through the three-tier test. Which tier does it actually qualify for — Augmented LLM, Deterministic Workflow, or True Agent? What's the evidence either way?
Expert thinking
Most processes that get labeled "agent" in planning documents are actually Tier 1 or Tier 2. The fastest diagnostic is to ask one question: Does the system need to decide what to do next based on what it just observed, or does it just need to do a fixed thing well?
If the answer is "fixed thing well" — retrieving documents, classifying inputs, drafting a templated response — that's Tier 1. If the answer is "well-understood sequence of fixed steps with branching" — route → classify → draft → approve → send — that's Tier 2.
The Tier 3 signal is specific: the agent genuinely doesn't know at step 1 what step 3 will require. The path is discovered, not pre-specified. Only when that's true do the nondeterminism, latency, and debugging complexity of a real agent become worth it.
If you're unsure, that's actually informative: the process probably doesn't qualify. Genuine agent use cases usually feel obviously uncodeable. If you can imagine writing the workflow in Python, you probably should.
Self-assessment checklist:
- Did you identify who controls the flow at each step (code vs. model)?
- Did you check whether the execution path is discoverable at design time?
- Did you distinguish between "uses an LLM" and "the LLM orchestrates"?
- Did you notice if the process involves irreversible external actions (write tools)?
The Anatomy of an Agent
Every agent, regardless of domain or framework, is built from three components and one cognitive loop.

The Formal Structure
The agent state machine can be expressed precisely as:
$M_t = \mu(M_{t-1}, O_t, Z_{t-1}, E_{t-1})$
Where:
- $M_t$ = internal state at step $t$
- $O_t$ = current observation from the environment
- $Z_{t-1}$ = prior reasoning trace
- $E_{t-1}$ = prior execution result
This is more than academic notation. It tells you exactly what an agent is doing at any point: updating its internal model of the world based on what it observed, what it previously reasoned, and what it previously executed. Everything else in the system — retrieval, planning, tool use — serves this loop.
The cognitive cycle this produces:
Perceive → Remember → Reason/Plan → Act → Observe → (repeat until goal achieved)
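One way to read the state update is as a pure function that folds the new observation, the prior reasoning trace, and the prior execution result into the previous state. The sketch below assumes the state is nothing more than append-only logs; `AgentState` and `mu` are hypothetical names, and a real agent would store far richer state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentState:
    """Internal state at step t, kept here as simple append-only logs."""
    step: int = 0
    observations: tuple = ()
    reasoning: tuple = ()
    results: tuple = ()

def mu(prev: AgentState, observation: str, prior_reasoning: str, prior_result: str) -> AgentState:
    """Build the next state from the previous state plus the three new inputs."""
    return AgentState(
        step=prev.step + 1,
        observations=prev.observations + (observation,),
        reasoning=prev.reasoning + (prior_reasoning,),
        results=prev.results + (prior_result,),
    )

# Two turns of the loop with made-up inputs.
state = AgentState()
state = mu(state, "user asked for Drug X trials", "", "")
state = mu(state, "search returned 12 hits", "plan: search first", "web_search ok")
```

Keeping the state immutable makes each step of a trajectory inspectable after the fact, which matters once you are debugging agent runs rather than stack traces.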
Component 1: The Model (The Brain)
The LLM that powers the agent's reasoning. It interprets the goal, decides the next step, selects the tool to call, and synthesizes the results from previous steps.
One important design note: not every step requires the same model. Route simple classification tasks to a smaller, faster model. Reserve reasoning-heavy steps for more capable models. Start with the most capable model to establish a baseline. Optimize down only after the baseline is proven. [OpenAI, 2025]
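Per-step routing can be as small as a lookup table. This is a sketch under stated assumptions: the step kinds and model identifiers below are placeholders, not real model names.

```python
# Hypothetical model identifiers; substitute whatever models you actually deploy.
ROUTES = {
    "classify": "small-fast-model",
    "extract": "small-fast-model",
    "plan": "large-reasoning-model",
}
DEFAULT_MODEL = "large-reasoning-model"

def pick_model(step_kind: str) -> str:
    """Return the model for a step, falling back to the most capable one for unknown kinds."""
    return ROUTES.get(step_kind, DEFAULT_MODEL)
```

Defaulting to the capable model mirrors the advice above: establish the baseline first, then move individual step kinds down only once evaluation shows the smaller model holds up.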
Component 2: Tools (The Hands)
External functions and APIs the agent calls to interact with the world beyond its training data.
There are three categories:
| Tool Type | Examples | What it does |
|---|---|---|
| Data / Read | web_search, query_database, read_file | Retrieves information |
| Action / Write | send_email, create_ticket, execute_code | Changes external state |
| Orchestration | invoke_subagent, delegate_to_specialist | Calls another agent |
The critical constraint: every action tool has blast radius. A write operation is often irreversible. This distinction between read and write tools is not just taxonomy — it's where guardrails and human review gates must be placed.
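A sketch of how the three categories might be recorded in a tool registry so that the read/write boundary is explicit in code. `ToolType`, `ToolSpec`, and `needs_review_gate` are hypothetical names; the tool names come from the table above.

```python
from dataclasses import dataclass
from enum import Enum

class ToolType(Enum):
    READ = "data/read"
    WRITE = "action/write"
    ORCHESTRATION = "orchestration"

@dataclass(frozen=True)
class ToolSpec:
    name: str
    tool_type: ToolType

    @property
    def needs_review_gate(self) -> bool:
        # Assumption for this sketch: only write tools change external state directly.
        return self.tool_type is ToolType.WRITE

REGISTRY = [
    ToolSpec("web_search", ToolType.READ),
    ToolSpec("send_email", ToolType.WRITE),
    ToolSpec("invoke_subagent", ToolType.ORCHESTRATION),
]

gated = [tool.name for tool in REGISTRY if tool.needs_review_gate]
```

An orchestration tool is left ungated here for simplicity; if the subagent it calls holds write tools of its own, those need their own gates.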
Component 3: The Orchestration Layer (The Nervous System)
The cyclical process connecting brain to hands. This layer maintains state, manages memory, applies reasoning frameworks, handles tool call results, and decides when the goal is achieved.
The most common reasoning framework for production agents is ReAct (Reason + Act): the agent alternates between explicit reasoning verbalization and tool calls, and the observation from each tool call feeds back into the next reasoning step. It's debuggable, well-understood, and battle-tested. Start here.
Two alternatives exist for specific cases. Chain-of-Thought forces step-by-step reasoning traces — useful for debugging logical errors in a single turn. Tree-of-Thoughts explores multiple reasoning branches simultaneously — expensive, and only worth it when strategic backtracking is genuinely required.
Memory: The Fourth Dimension
Memory feeds all three components. It comes in four types:
- Working memory: The context window — what the agent can actively reason about right now
- Episodic memory: Past interactions, stored and retrievable over time
- Semantic memory: The knowledge base — facts, documents, domain knowledge
- Procedural memory: Tool schemas and skills the agent knows how to use
A production agent needs an explicit strategy for all four. The context window fills up. Episodic memory needs retrieval architecture. Semantic memory needs freshness policies. Procedural memory needs lifecycle management.
Think about it: You're building a travel booking agent. It has access to five tools: search_flights, search_hotels, book_flight, book_hotel, and send_confirmation_email. Classify each tool by type (Data/Read, Action/Write, or Orchestration). Which tools require explicit human approval gates before the agent can call them, and why?
Expert thinking
The read/write distinction here is stark:
- search_flights → Data/Read. No external state changes. Safe to call autonomously. The agent can call this as many times as needed.
- search_hotels → Data/Read. Same reasoning.
- book_flight → Action/Write. This is irreversible. A booking triggers financial commitments, sends confirmation emails to airlines, and may be non-refundable. This requires a human approval gate.
- book_hotel → Action/Write. Same reasoning.
- send_confirmation_email → Action/Write. Once sent, it cannot be unsent. Gate this.
The practical implication: an agent that can search freely but must pause for human confirmation before any booking is both useful and safe. The design pattern is "read freely, gate writes." This is the Principle of Least Autonomy applied to tool use.
A common mistake is treating all tools uniformly — either gating everything (approval fatigue, agent becomes useless) or gating nothing (catastrophic when the agent misbooks a $4,000 flight). The correct design surfaces the exact boundary where irreversibility begins.
For this agent: let the model search to its heart's content. Show the user a summary: "I found a Delta flight for $620 and a Marriott for $240/night. Confirm to book both?" That's the gate. The agent takes the action only after explicit confirmation.
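"Read freely, gate writes" can be sketched as a thin wrapper around tool dispatch. The `confirm` callback stands in for whatever approval UI you have, and the stub tools are placeholders; none of this is a real booking API.

```python
from typing import Callable

WRITE_TOOLS = {"book_flight", "book_hotel", "send_confirmation_email"}

def call_tool(
    name: str,
    args: dict,
    tools: dict[str, Callable[..., str]],
    confirm: Callable[[str], bool],
) -> str:
    """Run read tools immediately; run write tools only after confirm() returns True."""
    if name in WRITE_TOOLS:
        summary = f"About to call {name} with {args}. Confirm?"
        if not confirm(summary):
            return f"{name} skipped: user declined"
    return tools[name](**args)

# Stub tools so the sketch runs offline.
STUB_TOOLS = {
    "search_flights": lambda dest: f"flights to {dest}",
    "book_flight": lambda flight_id: f"booked {flight_id}",
}
```

The agent never sees a difference in how it requests a tool. The gate lives in the harness, so a prompt change cannot remove it.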
Self-assessment checklist:
- Did you classify all five tools correctly by type?
- Did you identify that write tools require human approval gates?
- Did you distinguish between "safe to retry" (read) and "irreversible" (write)?
- Did you avoid the mistake of gating read tools (which creates unnecessary friction)?
When Should You Actually Build an Agent?
The question most teams skip. They jump from "we have an LLM" to "let's build an agent" without a structured qualification. Here is the framework.
An agent is a strong candidate when the workflow meets at least two of the following three conditions. A single condition is usually insufficient.

Condition 1: The decisions are too complex to encode in rules
The process requires context-sensitive judgment that cannot be captured in a decision tree you'd be willing to maintain.
Threshold question: Could you write decision logic that covers 95% of cases without significant ongoing maintenance? If yes — use a workflow. If no — it's an agent candidate.
Qualifies: Approving a customer refund that combines sentiment from the conversation, loyalty tier from CRM, known product defects from a database, and active promotional policies. No fixed rule handles all combinations reliably.
Doesn't qualify: Categorizing a support ticket as billing, technical, or returns. That's classification. A simple LLM prompt with three categories is sufficient.
Condition 2: The current rule-based system is brittle and expensive to maintain
The existing system relies on extensive if-then-else logic or state machines that are costly to update and fragile at the edges.
Threshold question: How many lines of conditional logic? How often do edge cases break the rules? If the maintenance burden is measured in days per quarter, it's an agent candidate.
Qualifies: A vendor security review involving a 100-point checklist with complex conditional branches depending on vendor type, data handled, and region. Rule maintenance is a full-time job.
Doesn't qualify: A discount calculation applying percentage rules by customer tier. That's arithmetic with a lookup table. An agent is overkill.
Condition 3: The input is unstructured and can't be pre-parsed
The workflow requires interpreting natural language, extracting meaning from diverse document types, or reasoning semantically mid-task.
Threshold question: Can the input be reliably parsed into a schema before reasoning begins? If yes — use RAG or structured extraction. If input variance is high and semantic understanding is required mid-task — it's an agent candidate.
Qualifies: Processing a home insurance claim — reading a customer email, extracting details from a PDF police report, cross-referencing policy documents, initiating a workflow. Requires multi-modal understanding and sequential judgment.
Doesn't qualify: Parsing a structured JSON webhook from a payment processor. That's data transformation. Zero agent needed.
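The two-of-three rule is simple enough to write down as a check. The function and parameter names below are illustrative only, and the three booleans are the judgment calls described above; the code does not make them for you.

```python
def is_agent_candidate(
    decisions_too_complex_for_rules: bool,
    rule_system_brittle: bool,
    input_unstructured: bool,
) -> bool:
    """Return True when at least two of the three qualifying conditions hold."""
    conditions = [decisions_too_complex_for_rules, rule_system_brittle, input_unstructured]
    return sum(conditions) >= 2
```

For example, a ticket classifier that only meets the unstructured-input condition returns False, which matches the "simple LLM prompt is sufficient" verdict above.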
The 8-Dimension Scorecard for Borderline Cases
For situations that don't clearly qualify or disqualify, score each dimension 0–2. Build an agent if the total is ≥ 10. [Tech Lead Playbook, 2026]
| Dimension | What it measures |
|---|---|
| Adaptive planning required | Can't pre-specify the execution path |
| Tool dependence | External actions required mid-task |
| Ambiguity / exception rate | >20% of inputs are edge cases |
| Unstructured data as primary input | Natural language, PDFs, diverse docs |
| Latency tolerance | >3 seconds is acceptable to the user |
| Blast radius acceptable | Failures are recoverable, not catastrophic |
| Evaluation oracle exists | Can you measure success programmatically? |
| Governance friction acceptable | Can decisions be explained to stakeholders? |
Two dimensions are veto-level. If blast radius = 0 (failures are catastrophic and unrecoverable) — do not build an agent regardless of total score. If evaluation oracle = 0 (you cannot define what success looks like) — do not build. You cannot improve what you cannot measure.
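A minimal sketch of the scorecard with its two vetoes, assuming scores arrive as a dictionary keyed by dimension. The key names are invented for this example; the threshold of 10 and the veto rule come from the text above.

```python
DIMENSIONS = (
    "adaptive_planning",
    "tool_dependence",
    "ambiguity_rate",
    "unstructured_input",
    "latency_tolerance",
    "blast_radius_acceptable",
    "evaluation_oracle_exists",
    "governance_friction_acceptable",
)
VETO_DIMENSIONS = ("blast_radius_acceptable", "evaluation_oracle_exists")

def qualify(scores: dict[str, int], threshold: int = 10) -> tuple[bool, int, list[str]]:
    """Return (build_agent, total, vetoes) for a set of 0-2 scores."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    if any(not 0 <= scores[d] <= 2 for d in DIMENSIONS):
        raise ValueError("each dimension must be scored 0, 1, or 2")
    total = sum(scores[d] for d in DIMENSIONS)
    vetoes = [d for d in VETO_DIMENSIONS if scores[d] == 0]
    return (total >= threshold and not vetoes, total, vetoes)

# Hypothetical case: strong on everything except an unrecoverable blast radius.
example = {d: 2 for d in DIMENSIONS}
example["blast_radius_acceptable"] = 0
decision = qualify(example)
```

The example totals 14, well above the threshold, and is still rejected because a veto dimension scored 0.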
Think about it: A fintech startup wants to build an AI agent to assist with SMB loan application reviews. The current process: a loan officer manually reads the application PDF, checks financials against four spreadsheets, cross-references a credit bureau report, and applies judgment on exceptions (businesses in high-risk industries, seasonal revenue profiles, etc.). The rule-based checklist has 300 lines and breaks whenever lending regulations change. Review takes 3–5 business days. Score this use case on the 8-dimension qualification scorecard. What's your total, and what are the veto risks?
Expert thinking
Let's score it:
- Adaptive planning required: 2. No two loan reviews follow the same path — the officer decides which signals to weight based on what they discover mid-review.
- Tool dependence: 2. Needs to read PDFs, query financial databases, call credit bureau APIs. External actions are central.
- Ambiguity / exception rate: 2. The 300-line rule set breaking on regulations is a clear signal. Seasonal businesses, high-risk industries — these are >20% of the interesting cases.
- Unstructured data as primary input: 2. Business narratives, financial narratives, handwritten notes in PDFs. Definitely unstructured.
- Latency tolerance: 2. 3–5 day human review means a 30-second agent response is a massive improvement. Users are not expecting sub-second.
- Blast radius acceptable: 1 (at most). This is the dangerous dimension. A mis-approved loan is a real financial loss. A mis-rejected loan is a regulatory risk (fair lending laws). This is recoverable (the loan hasn't been disbursed yet), but the blast radius is not trivial. I'd score this 1, not 2.
- Evaluation oracle exists: 2. You can measure this: loan default rate on agent-approved loans vs. human-approved baseline. Resolution time. False positive / negative rate on approved applications that went to review.
- Governance friction acceptable: 1. Lending is heavily regulated. Any automated decision must be explainable to regulators. This is possible with proper reasoning traces, but it requires investment.
Total: 14 (with blast radius = 1)
This qualifies for an agent — but the blast radius dimension means the first version should keep humans as the decision-maker and position the agent as a recommendation tool. The agent surfaces findings, flags risks, suggests a decision. A human officer reviews and approves. Only after demonstrated quality at that tier should the agent be given any autonomous approval capability.
Self-assessment checklist:
- Did you score all 8 dimensions and reach a total?
- Did you flag the blast radius dimension as requiring careful handling even if non-zero?
- Did you note the governance/explainability requirement for regulated lending?
- Did you propose a "recommendation first, autonomous later" approach given the financial stakes?
The "Start Simple" Principle
The single most effective way to avoid wasted agent projects is to start at the lowest viable tier and earn your way up.

Always prototype as an Augmented LLM first. Can a well-engineered RAG system answer the question? If yes, ship it. The moment you add an execution loop, you add nondeterminism, debugging complexity, and latency. Every step up the ladder requires justification through evaluation — not intuition.
Prove the simpler tier is insufficient before advancing. Run the simpler system on 50–100 representative test cases. Measure task success rate. Only advance to the next tier if the simpler system fails to meet the threshold.
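In code, "prove the simpler tier is insufficient" is a success-rate measurement plus a comparison. The sketch below uses a keyword classifier as the stand-in simple system and an exact-match oracle; both are placeholders, and the threshold is whatever your use case demands, since the text does not fix one.

```python
from typing import Callable

def success_rate(
    system: Callable[[str], str],
    cases: list[tuple[str, str]],
    oracle: Callable[[str, str], bool],
) -> float:
    """Fraction of (input, expected) cases where the oracle accepts the system's output."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for inp, expected in cases if oracle(system(inp), expected))
    return passed / len(cases)

def should_advance(rate: float, threshold: float) -> bool:
    """Move up a tier only when the simpler system misses the threshold."""
    return rate < threshold

# Hypothetical simple system and a three-case test set (a real set would have 50-100).
def keyword_classifier(ticket: str) -> str:
    return "billing" if "invoice" in ticket.lower() else "technical"

CASES = [
    ("Invoice is wrong", "billing"),
    ("App crashes on login", "technical"),
    ("Refund my charge", "billing"),
]
rate = success_rate(keyword_classifier, CASES, oracle=lambda got, want: got == want)
```

The same `success_rate` function can score the next tier up, so the comparison between tiers uses one oracle and one test set.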
Treat autonomy as a budget, not a goal. More autonomy equals more risk. Each additional step an agent takes without human verification expands the blast radius of a mistake. Budget autonomy explicitly. In early deployments, every irreversible action — send email, write to database, post to external API — requires either human approval or a hard constraint check.
Harness quality matters more than prompt quality. The orchestration layer — the harness wrapping the model and tools — determines agent reliability more than the model or prompts. A brittle harness with no error handling, no retry logic, no step limits, and no state persistence will fail regardless of model capability. [Tech Lead Playbook, 2026] Spending 40 hours on prompt engineering before building any evaluation harness is the wrong order.
Define your evaluation oracle before you write code. If you cannot write a function that scores the agent's output, you cannot improve the agent. "We'll know it when we see it" is not an evaluation oracle.
Where Agents Break
Agents introduce specific failure modes that simpler architectures avoid. You need to understand them before you're in production dealing with them.
Nondeterminism at scale. The same input can produce different outputs across runs. For compliance-critical workflows or applications where users expect consistency, this is disqualifying without explicit mitigation.
Debugging complexity. A workflow failure shows a stack trace. An agent failure shows a trajectory — a sequence of decisions and observations across multiple tool calls. Most existing monitoring infrastructure is not built for this. Investing in trajectory-level observability is required before production.
Cost and latency compounding. Each ReAct step costs tokens. A 10-step agent loop at $0.01/1K tokens on a 2K-token prompt consumes at least 10× the token budget of a single LLM call, and usually more, because each step typically re-sends the accumulated history. At 2-second inference latency per step, 10 steps = 20 seconds minimum before tool call overhead. [Google Agents Companion, 2024]
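The arithmetic, worked through with the figures from the paragraph above. The 500-token growth per step is an assumption added for illustration, to show what happens when each step carries the history of the previous ones.

```python
PRICE_PER_1K_TOKENS = 0.01   # dollars, from the example above
PROMPT_TOKENS = 2_000
STEPS = 10
LATENCY_PER_STEP_S = 2.0

# Single call versus a loop that sends the same 2K prompt every step.
single_call_cost = PROMPT_TOKENS / 1_000 * PRICE_PER_1K_TOKENS
flat_loop_cost = STEPS * single_call_cost
loop_latency_s = STEPS * LATENCY_PER_STEP_S

# Assumption: every step appends 500 tokens of reasoning and tool output to the context.
GROWTH_PER_STEP = 500
growing_tokens = sum(PROMPT_TOKENS + GROWTH_PER_STEP * i for i in range(STEPS))
growing_loop_cost = growing_tokens / 1_000 * PRICE_PER_1K_TOKENS

print(f"single call: ${single_call_cost:.3f}")
print(f"flat loop:   ${flat_loop_cost:.3f} over {loop_latency_s:.0f}s")
print(f"growing:     ${growing_loop_cost:.3f} ({growing_tokens} tokens)")
```

Under the flat assumption the loop costs 10× a single call. Under the growing-context assumption it sends 42,500 tokens, a little over 21× the single call.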
Runaway loops. Without explicit step limits and exit conditions, agents can loop indefinitely. A ReAct loop without a max-turns constraint will run until the context window fills or an API rate limit triggers. This is a production incident, not a model behavior.
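A step limit belongs in the harness, not in the prompt. Below is a sketch of a loop that stops on an explicit exit condition or raises once it hits `max_turns`; the `step` callable stands in for one reason-act-observe turn, and all names are hypothetical.

```python
from typing import Callable, Optional

class MaxTurnsExceeded(RuntimeError):
    """Raised when the loop hits its step limit without reaching an exit condition."""

def run_agent(step: Callable[[list], Optional[str]], max_turns: int = 20) -> list:
    """Call step() until it returns None (goal reached) or max_turns is exhausted."""
    trajectory: list = []
    for _ in range(max_turns):
        outcome = step(trajectory)
        if outcome is None:  # explicit exit condition
            return trajectory
        trajectory.append(outcome)
    raise MaxTurnsExceeded(f"no exit condition reached after {max_turns} turns")

# Stub steps: one that finishes after three turns, one that never does.
def finishes(trajectory: list) -> Optional[str]:
    return None if len(trajectory) >= 3 else f"step {len(trajectory) + 1}"

def stuck(trajectory: list) -> Optional[str]:
    return "retry read_page"
```

Raising instead of silently returning makes the limit visible in monitoring: a spike in `MaxTurnsExceeded` is a signal to look at inputs, not only at token spend.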
Multi-agent coordination cost. Most production failures in multi-agent systems are coordination problems — not model quality, not prompt wording. Mismatched topology, poor state management, and missing guardrails account for the majority of multi-agent failures. [Technical Architecture and System Design, 2026]
The Core Insight
Before any architecture decision, before any framework selection, before any prompt engineering — ask three questions: Does this problem actually require an agent? Can I measure success? Are the failure modes recoverable?
Most problems don't require an agent. A script, a RAG pipeline, or a deterministic workflow will be more reliable, cheaper to run, and easier to debug. The job of this discipline is not to build the most impressive-looking system. It is to build the system that actually solves the problem.
Autonomy must be earned through evaluation — not assumed from a convincing demo.
What's Next: Advanced Practice
Ready to stress-test what you've learned? The advanced exercises below put you in real production scenarios — messy, ambiguous, and with no single clean answer. The first exercise: you're six months into an agent deployment. Error rates are acceptable, but your CTO is pushing to increase autonomy before the monitoring infrastructure is ready...
Advanced Applied Exercise preview: Your agent handles contract review for a legal operations team. It works well for standard NDAs (92% accuracy) but struggles with custom indemnification clauses (64% accuracy, 18% hallucination rate). The head of legal wants to add the agent to the M&A due diligence workflow. "It's the same kind of document," she says. "Why can't it just handle those too?"...
Real-World Implementation preview: In 2024, a major financial services firm rebuilt their loan underwriting assistance pipeline as an agentic system. The architecture decision that made or broke the project wasn't the model choice or the prompt design — it was how they handled the boundary between agent autonomy and compliance requirements in regulated markets...
Interview Reasoning preview: A VP of Engineering asks you, in a 10-minute conversation, to justify why the team is building an agent instead of a deterministic workflow. "Both use LLMs," she says. "What's the actual difference, and how do you know we need the more expensive, less predictable one?"...
Advanced Applied Exercises
Exercise 1: When an Agent's Ceiling Looks Like a Capability Problem
Scenario: You're the ML lead at a legal-tech startup. Six months ago you deployed an AI agent for contract review. The agent reads contracts (PDFs), flags risk clauses, suggests edits, and rates overall risk. It handles standard NDAs with 92% accuracy. Custom indemnification clauses and cross-border jurisdiction provisions are harder: 64% accuracy, 18% hallucination rate on specific clause references.
Your head of legal wants to extend the agent to M&A due diligence. "It's the same kind of document. We just need it to handle more clause types." Your CTO agrees and is pushing to expand scope. The agent's performance on M&A documents in a 50-document internal test: 58% accuracy.
You have two months before the next board meeting where the head of legal plans to demo the M&A capability to an acquirer's legal team.
What do you recommend, and how do you defend it?
Expert thinking
The 58% accuracy on M&A documents and the 18% hallucination rate on custom clause references are disqualifying for this expansion — regardless of timeline pressure.
The right recommendation: Do not expand to M&A due diligence at the current quality level. Here's the defense:
Why the accuracy gap matters more than it looks: In contract review, a missed risk clause or a hallucinated clause reference isn't a recoverable error in the normal sense. An M&A deal where your agent falsely certifies a provision as standard (when it isn't) is a legal liability event, not a user experience problem. Blast radius is extremely high.
What the 58% accuracy means: If the head of legal's team reviews 50 M&A documents using this agent and trusts its outputs, roughly 21 will have meaningfully wrong assessments. In an M&A context, each of those is a potential issue that surfaces at closing.
The right path: Two options that aren't "expand the scope":
- Treat the M&A pilot as a research phase, not a capability expansion. The agent runs in parallel with human review — it surfaces candidates for human attention, not final assessments. Measure agreement rate with human reviewers on 200 documents before any demo.
- Scope the M&A demo to the specific clause types where the agent's accuracy is above 85% (likely standard boilerplate). Demonstrate quality on what it does well, explicitly exclude what it doesn't.
The board demo with 58% accuracy on M&A documents is a reputational risk to the company, not a showcase. The correct recommendation is to present option 2 to the head of legal: "Here's what the agent can reliably do in M&A today, and here's what we need to reach the accuracy threshold for full due diligence."
Self-assessment checklist:
- Did you identify blast radius as the disqualifying factor, not just the accuracy number in isolation?
- Did you propose a concrete alternative path rather than just "no"?
- Did you recognize the evaluation oracle problem (58% on a 50-document test is statistically weak)?
- Did you resist the timeline pressure with a principled argument, not just caution?
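To put a number on "statistically weak": a 95% Wilson score interval for 58% accuracy on 50 documents (29 correct) can be computed directly. The Wilson interval is one standard choice for a binomial proportion, used here as an illustration rather than as the method behind the scenario's figures.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(successes=29, n=50)
print(f"95% interval for 29/50: {low:.2f} to {high:.2f}")
```

The interval comes out to roughly 0.44 to 0.71. The test cannot distinguish "worse than a coin flip" from "approaching usable", which is part of why 200 documents is the suggested bar before any demo.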
Exercise 2: Designing HITL Gates for Write Tools
Scenario: You're architecting a customer outreach agent for a B2B SaaS company. The agent has three tools:
- query_crm(filters) → returns matching customer records
- send_email(to, subject, body) → sends email from the sales rep's address
- update_crm_field(customer_id, field, value) → updates a field in the CRM
The sales team's ask: the agent should identify at-risk customers (based on usage drops), draft personalized outreach emails, and mark the CRM record as "at-risk — outreach sent" after sending.
A senior sales rep objects: "If this thing fires off emails at the wrong people, I'm done. It needs to show me before it sends anything."
Your CTO wants maximum automation: "If we stop every email for approval, we get no efficiency gain."
Design the HITL architecture. What gets gated, how, and under what conditions?
Expert thinking
Both the sales rep and the CTO are right about their respective concerns — and the correct design satisfies both.
Step 1: Classify the tools by blast radius:
- query_crm → Read. No blast radius. Never gate.
- send_email → Write. High blast radius. An email sent to 500 customers with an incorrect personalization, or to the wrong segment, is a reputational and relationship risk. Gate.
- update_crm_field → Write. Low-to-medium blast radius. CRM fields can be corrected. The "at-risk — outreach sent" flag is informational, not customer-facing. Gate with lower friction.
Step 2: Design the gates:
For send_email: Use a batch review queue, not per-email approval. The agent runs its full identification and drafting process, then surfaces a queue of N emails with preview. The rep reviews the queue (with one-click approve/edit/reject per row) and clicks "Send approved." This preserves quality control without creating per-email friction. The CTO gets efficiency because reviews happen in batches; the rep gets control because nothing sends without review.
For update_crm_field: Use post-action notification with undo window. The agent updates the field immediately (to avoid blocking the workflow) but sends a digest to the rep: "Marked 12 customers as at-risk today. [View and undo within 24 hours]." Most updates will be correct; the 24-hour undo window handles the rare errors.
Step 3: Define the conditions for non-gated send: Over time, after the rep has reviewed 500+ emails and the override rate falls below 3%, the gate can shift to an anomaly-only model — the agent sends autonomously but flags anything outside learned patterns for review. This is earned autonomy.
The principle: Match gate friction to blast radius and reversibility. Not every write tool gets the same gate. High blast radius + irreversible = hard gate. Low blast radius + reversible = soft gate with undo.
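The gate policy and the earned-autonomy condition can both be written as small functions. In this sketch the mixed cases (high blast radius but reversible, or low blast radius but irreversible) default to the hard gate; that default is an assumption, since the principle above only names the two extremes. The 500-review and 3% figures come from Step 3.

```python
from enum import Enum

class Gate(Enum):
    NONE = "no gate"
    SOFT = "post-action notification with undo window"
    HARD = "approval before execution"

def choose_gate(is_write: bool, high_blast_radius: bool, reversible: bool) -> Gate:
    """Match gate friction to blast radius and reversibility."""
    if not is_write:
        return Gate.NONE
    if high_blast_radius or not reversible:
        return Gate.HARD  # assumption: any mixed case gets the stricter gate
    return Gate.SOFT

def earned_autonomy(
    reviewed: int,
    overridden: int,
    min_reviews: int = 500,
    max_override_rate: float = 0.03,
) -> bool:
    """True once enough emails were reviewed and the override rate is below the limit."""
    return reviewed > 0 and reviewed >= min_reviews and overridden / reviewed < max_override_rate
```

With these rules, query_crm gets no gate, send_email gets the hard gate, and update_crm_field gets the soft gate, matching the design in Steps 1 and 2.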
Self-assessment checklist:
- Did you distinguish between gate types (batch review vs. per-action vs. post-action undo)?
- Did you propose a path to earned autonomy rather than permanent gating?
- Did you satisfy both the rep's control requirement and the CTO's efficiency requirement?
- Did you avoid gating the read tool (query_crm)?
Exercise 3: Single Agent vs. Multi-Agent Under Team Constraints
Scenario: You're the sole ML engineer at a 10-person startup. You're building an AI system to automate competitive intelligence briefings: the system should search the web for news about 5 named competitors, analyze their recent product changes and pricing moves, identify potential threats to your product, and generate a weekly briefing document.
You have two architectural options:
Option A — Single ReAct agent: One agent with 4 tools (web_search, read_url, analyze_text, generate_report). The agent plans and executes the entire workflow autonomously across all competitors.
Option B — Multi-agent pipeline: An orchestrator agent that delegates to 5 parallel specialist agents (one per competitor). Each specialist searches, reads, and analyzes its assigned competitor. The orchestrator aggregates and generates the final report.
You're a team of one. You have 3 weeks to ship v1.
Which architecture do you choose? What would make you switch to the other option?
Expert thinking
Choose Option A for v1, with a clear trigger to switch to Option B.
Here's why:
Option A is the right starting point because:
- Single agent = single system to debug. When something goes wrong (wrong URL scraped, missed a product announcement, hallucinated a pricing change), you have one trajectory to inspect.
- A multi-agent pipeline introduces inter-agent communication overhead, state synchronization complexity, and failure modes that are much harder to diagnose solo. If Specialist Agent 3 returns a malformed result, how does the orchestrator handle it? What's the retry logic? Who owns the shared state?
- 3 weeks is not enough time to build, test, and debug a multi-agent system to production quality as a solo engineer. Option B would likely ship with known bugs that you don't have bandwidth to fix.
What Option B offers that Option A doesn't: Parallelism (5 competitors simultaneously instead of sequentially), which cuts wall-clock time from ~25 minutes to ~5 minutes for a full briefing run. This is a latency argument, not a quality argument.
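The latency argument can be seen in a toy comparison, with `asyncio.sleep` standing in for one competitor's search-read-analyze cycle. The function names and the sleep duration are placeholders; nothing here models real research work or the coordination cost of Option B.

```python
import asyncio
import time

async def research(competitor: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for search + read + analyze
    return f"{competitor}: findings"

async def sequential(names: list[str], seconds: float) -> list[str]:
    return [await research(name, seconds) for name in names]

async def parallel(names: list[str], seconds: float) -> list[str]:
    return list(await asyncio.gather(*(research(name, seconds) for name in names)))

def timed(runner, names: list[str], seconds: float) -> tuple[list[str], float]:
    start = time.perf_counter()
    results = asyncio.run(runner(names, seconds))
    return results, time.perf_counter() - start

COMPETITORS = ["A", "B", "C", "D", "E"]
seq_results, seq_time = timed(sequential, COMPETITORS, 0.05)
par_results, par_time = timed(parallel, COMPETITORS, 0.05)
```

Both runs return the same findings in the same order; only the wall-clock time differs. That is the whole of Option B's advantage in this scenario.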
When to switch: If and only if (a) you've shipped Option A and it works reliably for 4 weeks, AND (b) the 25-minute runtime is actually a problem for your users (it probably isn't for a weekly briefing). Use the qualification scorecard — if Option A's sequential performance is insufficient after evaluation, move to Option B. Not before.
The meta-lesson: "Start simple" isn't just about architecture tier. It applies to complexity within tier too. A simple single agent is simpler than a complex multi-agent system, even when both are "agents."
Self-assessment checklist:
- Did you factor in team size as a first-class constraint?
- Did you identify a concrete trigger condition for switching to Option B?
- Did you recognize that Option B's benefit (parallelism) is a latency gain, not a quality gain?
- Did you resist the impulse to choose the more sophisticated-sounding architecture?
Exercise 4: Diagnosing an Agent that Works in Staging but Fails in Production
Scenario: You deployed a document summarization agent 2 weeks ago. In staging (500 documents), it worked well — 91% accuracy, clean trajectories, <30 seconds per document. In production, you're seeing:
- 72% accuracy on week 1 (dropped from 91%)
- Average trajectory length: 8.4 steps (staging: 3.2 steps)
- 14% of runs exceed the 20-step max-turns limit and exit with no output
- Token costs: 4.2× higher than staging budget
The documents in production are real customer uploads. Staging used a curated test set.
What's your diagnosis? What are the top three hypotheses, and how do you test each?
Expert thinking
Hypothesis 1: Distribution shift between staging and production inputs.
The most common cause of this pattern. The curated staging set probably over-represented well-formatted, clean documents. Production uploads include scanned PDFs, poorly OCR'd documents, documents in unexpected formats, or documents with layouts the agent's PDF parser doesn't handle well.
How to test: Sample 50 failing production documents. Compare their raw parsed text to 50 passing staging documents. Look for: unusual characters, encoding issues, empty sections, multi-column layouts that parsed incorrectly, or document types the agent wasn't designed for.
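The comparison of parsed text can be partly automated. The sketch below computes a few cheap signals that parsed text may be garbled. The signal names and the choice of signals are assumptions for illustration, and any pass/fail thresholds would need calibrating on your own documents.

```python
import unicodedata


def parse_quality_signals(text: str) -> dict[str, float]:
    """Cheap indicators of garbled parser output (illustrative, not exhaustive)."""
    total = len(text)
    if total == 0:
        # An empty parse is itself the strongest signal.
        return {"empty": 1.0, "replacement_ratio": 0.0,
                "control_ratio": 0.0, "alpha_ratio": 0.0}
    # U+FFFD marks bytes that failed to decode.
    replacement = text.count("\ufffd")
    # Control characters other than ordinary whitespace suggest binary debris.
    control = sum(
        1 for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
    )
    # A very low share of letters often indicates a failed OCR pass.
    alpha = sum(1 for ch in text if ch.isalpha())
    return {
        "empty": 0.0,
        "replacement_ratio": replacement / total,
        "control_ratio": control / total,
        "alpha_ratio": alpha / total,
    }


clean = parse_quality_signals("Quarterly revenue grew in all regions.")
garbled = parse_quality_signals("Qu\ufffdrt\ufffdrly \x00\x01 r3v")
```

Running this over the 50 failing and 50 passing samples and comparing the distributions gives a quick first read before anyone opens individual documents.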
Hypothesis 2: The agent is looping on unclear or corrupted inputs.
The 8.4-step average (vs. 3.2 in staging) and 14% max-turns exits strongly suggest the agent is retrying steps on inputs where it can't make progress. Common pattern: the agent reads a page, the content is garbled, it "tries again" with a slightly different tool call, fails again, loops.
How to test: Pull the full trajectories from the 14% of runs that hit max-turns. Look for repeated identical or near-identical tool calls. If you see read_page(7) called 4 times in a row, that's a loop — the agent doesn't know when to give up on a page.
How to fix: Add explicit loop detection in the harness. If the same tool is called with the same arguments twice consecutively, inject an observation: "This step has already been attempted. Try a different approach or acknowledge that the input section is unreadable."
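A minimal sketch of that check follows, assuming the harness sees each tool call as a tool name plus a JSON-serializable argument dictionary. The `LoopDetector` class and its `check` method are hypothetical names, not part of any particular agent framework.

```python
import json
from typing import Any, Optional

LOOP_OBSERVATION = (
    "This step has already been attempted. Try a different approach or "
    "acknowledge that the input section is unreadable."
)


class LoopDetector:
    """Flags a tool call that exactly repeats the previous one."""

    def __init__(self) -> None:
        self._last_call: Optional[str] = None

    def check(self, tool: str, args: dict[str, Any]) -> Optional[str]:
        # Canonicalize so that argument order does not hide a repeat.
        # Assumes args are JSON-serializable.
        key = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        repeated = key == self._last_call
        self._last_call = key
        # Return the observation to inject, or None to let the call proceed.
        return LOOP_OBSERVATION if repeated else None


detector = LoopDetector()
first = detector.check("read_page", {"page": 7})
second = detector.check("read_page", {"page": 7})
third = detector.check("read_page", {"page": 8})
```

This catches only exact consecutive repeats. Near-identical calls, such as the same page with a slightly different parameter, would need a looser comparison.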
Hypothesis 3: The staging accuracy metric was too easy.
91% in staging may have been inflated if the evaluation oracle was too forgiving — e.g., BLEU score against a reference summary, or a simple length check. In production, if you're measuring user-rated quality or task-relevant accuracy, the 72% might be the honest number, and the 91% was an artifact of a weak eval.
How to test: Re-run 50 staging documents through the production evaluation oracle. If their score also drops to ~72%, the staging eval was wrong.
Self-assessment checklist:
- Did you identify distribution shift as the primary hypothesis?
- Did you connect the trajectory length increase to looping behavior specifically?
- Did you propose a concrete test for each hypothesis (not just "investigate further")?
- Did you include the possibility that staging accuracy was measuring the wrong thing?
Real-World Implementations
Teardown 1: DraftKings — 40K QPS Sports Betting Orchestration
DraftKings processes upwards of 40,000 queries per second during peak sports events. Their agentic infrastructure routes pricing queries, risk management decisions, and promotional eligibility checks across a multi-agent system that must respond in <200ms.
The architecture decision that matters: They chose a strict read/write agent separation at the infrastructure level — not just as a design principle, but as a hard architectural boundary. Read agents (pricing, odds, eligibility lookups) run in one fleet with no write permissions to production state. Write agents (bet placement, account updates, promotional disbursement) run in a completely separate fleet with strict rate limits and audit logging.
Why this decision was right: At 40K QPS, a bug in the write agent fleet could disburse incorrect promotions to millions of users in minutes before a human can intervene. The read/write separation means that even if the write agent has a bug, it's physically isolated from the read-heavy pricing path. The blast radius of a write agent failure is bounded by the write fleet's rate limits.
Expert commentary: This is the Principle of Least Privilege applied at the infrastructure level, not just the code level. The interesting insight is that the separation was not primarily a security decision — it was a reliability decision. By isolating write operations, DraftKings reduced the probability that a read-path spike (peak game traffic) would cause resource contention with write operations, and vice versa. The security benefit was secondary.
What you'd do differently in hindsight: The team publicly noted that their initial system had too many agents trying to share state via a central Redis cluster. The coordination overhead under peak load created latency spikes. The fix: move toward agents with minimal shared state, using event-driven patterns (Kafka) for communication rather than shared mutable state. This is the consistency model lesson — agents that communicate by events rather than shared locks scale better under bursty traffic.
Teardown 2: GitHub Copilot — Tool Design for a Code Completion Agent
GitHub Copilot is one of the highest-scale agentic deployments in existence, serving tens of millions of developers. Its core tool design reveals a principled approach to agent tool taxonomy.
The architecture decision that matters: Copilot distinguishes between inline completion (single-step, no tool use; an Augmented LLM pattern) and Copilot Chat / Workspace (multi-step, tool-using agent). The team explicitly built and deployed the simpler tier first, proved it at scale, and layered agent capabilities on top.
The tool taxonomy for Copilot Workspace is strict: tools are classified as context readers (codebase search, file read, test output read) vs. code writers (file edit, file create). The agent has broad read permission, but write operations go through a diff-preview step that requires developer confirmation before any file is actually modified.
Why this decision was right: Editing code is a write operation with a high blast radius. A wrong edit to a production codebase can break builds, introduce security vulnerabilities, or corrupt data. By making the developer the approval gate on all writes, through a diff preview that makes the change explicit before it's applied, Copilot achieves high utility without ever acting autonomously on write operations at the file system level.
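The general shape of a diff-preview gate can be sketched with Python's standard `difflib`. This is an illustration of the pattern only, not Copilot's implementation. The `approve` callback is a hypothetical stand-in for the developer's confirmation step.

```python
import difflib
from typing import Callable


def propose_edit(path: str, old: str, new: str) -> str:
    """Render the proposed change as a unified diff for human review."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


def apply_if_approved(
    path: str, old: str, new: str, approve: Callable[[str], bool]
) -> str:
    """Return the new content only when the reviewer approves the diff."""
    diff = propose_edit(path, old, new)
    # The write happens only after explicit approval; otherwise nothing changes.
    return new if approve(diff) else old


old_src = "def add(a, b):\n    return a - b\n"
new_src = "def add(a, b):\n    return a + b\n"

preview = propose_edit("calc.py", old_src, new_src)
accepted = apply_if_approved("calc.py", old_src, new_src, lambda d: True)
rejected = apply_if_approved("calc.py", old_src, new_src, lambda d: False)
```

The agent can read freely, but the only path to a modified file runs through `approve`.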
Expert commentary: The diff-preview is not just a UI decision. It's an evaluation oracle. Every time a developer rejects or modifies a suggested diff, that's a labeled example of "agent output that wasn't good enough." The acceptance rate on diffs is a real-time quality metric. When acceptance rate drops for a specific code type or language, that's a signal that the model's performance on that input distribution has degraded.
The lesson for practitioners: design your HITL gate so it generates training data. The human's correction is a signal, not just a safety check.
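One way to turn approval decisions into a quality metric is sketched below. It assumes each review is logged as a dictionary with an `accepted` flag and a grouping field such as `language`. The event shape is hypothetical.

```python
from collections import defaultdict


def acceptance_rate_by_key(
    events: list[dict], key: str = "language"
) -> dict[str, float]:
    """Share of proposed diffs accepted, grouped by the given field."""
    accepted: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for event in events:
        total[event[key]] += 1
        if event["accepted"]:
            accepted[event[key]] += 1
    return {group: accepted[group] / total[group] for group in total}


# Illustrative log entries, not real data.
events = [
    {"language": "python", "accepted": True},
    {"language": "python", "accepted": False},
    {"language": "go", "accepted": True},
]
rates = acceptance_rate_by_key(events)
```

Tracking these rates over time, per group, is what lets a drop on one input distribution show up as a signal rather than as scattered complaints.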
Production Challenges
Challenge 1: The 3am Alert — An Agent Looping in Production
Context: It's 3am. You're on-call. You get paged: the customer onboarding agent is showing anomalous token spend — 40× the normal hourly rate. The agent's job is to process new customer sign-up forms, verify identity documents, and provision accounts. SLA is 5 minutes per customer; you're seeing customers stuck for 45 minutes.
What you see in the logs:
[Session abc-447] Step 1: read_document(customer_id=9921, doc_type="passport") → SUCCESS (extracted fields: name, DOB, country)
[Session abc-447] Step 2: verify_identity(name="Mehmet Öz", DOB="1988-04-15", country="TR") → ERROR: "Name encoding error: special character in name field"
[Session abc-447] Step 3: REASONING: "The identity verification failed. I should retry with a corrected name format."
[Session abc-447] Step 4: verify_identity(name="Mehmet Oz", DOB="1988-04-15", country="TR") → ERROR: "Name mismatch with document"
[Session abc-447] Step 5: REASONING: "The name mismatch suggests I extracted the name incorrectly. I should re-read the document."
[Session abc-447] Step 6: read_document(customer_id=9921, doc_type="passport") → SUCCESS (extracted fields: name, DOB, country)
[Session abc-447] Step 7: verify_identity(name="Mehmet Öz", ...) → ERROR: "Name encoding error"
[... pattern repeats for 31 more steps until max-turns reached]
Diagnose the failure. What's the root cause, what's the immediate fix, and what's the permanent architectural change?
Expert analysis
Root cause: The agent is stuck in a read → error → re-read → same error → re-read loop because:
- The verify_identity tool returns different error messages depending on whether the name has special characters or whether it matches the document, and these two errors require different fixes.
- The agent interprets "name mismatch with document" as "I read the document wrong" and re-reads it, which produces the correct special character name, which then hits the encoding error again.
- There is no harness-level loop detection. The agent can call read_document as many times as it wants with no consequence.
Immediate fix (now, while paged):
- Terminate all sessions stuck in this loop manually via the admin console.
- Route affected customers (anyone with a non-ASCII name field) to manual review queue as a temporary bypass.
- Deploy a hotfix: add a harness-level check that terminates the session with NEEDS_HUMAN_REVIEW status if the same tool is called with identical arguments twice in the same session.
Permanent architectural change:
- Harness-level loop detection should have been in v1. Any tool called with identical arguments twice in a session should trigger an explicit harness intervention: "This step has already been attempted with this input. The agent cannot resolve this automatically. Routing to human review."
- Tool error taxonomy. The verify_identity tool should return structured errors, not free-text strings. {"error_code": "ENCODING_ERROR", "field": "name"} vs. {"error_code": "NAME_MISMATCH", "field": "name"} are meaningfully different errors that require different agent responses. The agent is doing string matching on error messages, which is fragile.
- Human-review fallback as first-class exit. Any agent session that cannot complete due to a tool error should have a clean fallback path to human review, not just a max-turns exit with no output.
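Structured error codes let the harness choose the next step without parsing message text. The sketch below is one possible policy. The `REREAD_DOCUMENT` action and the rule "allow one re-read, then escalate" are assumptions for illustration.

```python
NEEDS_HUMAN_REVIEW = "NEEDS_HUMAN_REVIEW"
REREAD_DOCUMENT = "REREAD_DOCUMENT"  # hypothetical action name


def next_action(error: dict, seen_codes: set[str]) -> str:
    """Pick the next step from a structured tool error.

    Assumed policy:
    - NAME_MISMATCH: allow one re-read of the document, then escalate.
    - ENCODING_ERROR: the agent cannot fix the tool, so escalate at once.
    - Any unknown code: escalate, so no error ends without an exit path.
    """
    code = str(error.get("error_code"))
    first_time = code not in seen_codes
    seen_codes.add(code)
    if code == "NAME_MISMATCH" and first_time:
        return REREAD_DOCUMENT
    return NEEDS_HUMAN_REVIEW


seen: set[str] = set()
a1 = next_action({"error_code": "NAME_MISMATCH", "field": "name"}, seen)
a2 = next_action({"error_code": "NAME_MISMATCH", "field": "name"}, seen)
a3 = next_action({"error_code": "ENCODING_ERROR", "field": "name"}, seen)
```

Under this policy the session in the logs would have exited to human review at the first encoding error instead of running to max-turns.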
Self-assessment checklist:
- Did you identify the re-read loop as the root cause (not just "the tool is broken")?
- Did you propose both an immediate hotfix and a permanent architectural change?
- Did you identify the lack of loop detection in the harness as the structural failure?
- Did you recommend structured error codes instead of free-text error messages?
Challenge 2: Token Costs 4× Over Budget After a Prompt Update
Context: Two weeks ago, a team member updated the system prompt for the research summarization agent to "improve output quality." Before the update: average 3,100 tokens per run, $0.028 per run, 4,200 runs/day → $117/day. After the update: average 12,400 tokens per run, $0.112 per run, same volume → $470/day. The team member's change: added "Think carefully about each source before summarizing" and expanded the output format instructions.
The quality improvement is real (QA team rated output quality up 12%). But the budget is $150/day.
You need to cut costs by roughly 68% (from about $470/day to the $150/day budget) while retaining most of the quality gain. What's your approach?
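The size of the gap can be checked directly from the figures in the scenario. The variable names are illustrative.

```python
runs_per_day = 4_200
cost_per_run_before = 0.028  # dollars, before the prompt update
cost_per_run_after = 0.112   # dollars, after the prompt update
daily_budget = 150.0         # dollars

daily_before = runs_per_day * cost_per_run_before  # about $117.60
daily_after = runs_per_day * cost_per_run_after    # about $470.40

# Fraction of current spend that must be removed to meet the budget.
required_cut = 1 - daily_budget / daily_after

# Highest per-run cost that still fits the budget at this volume.
max_cost_per_run = daily_budget / runs_per_day

print(f"before: ${daily_before:.2f}/day, after: ${daily_after:.2f}/day")
print(f"required cut: {required_cut:.1%}, "
      f"max cost per run: ${max_cost_per_run:.4f}")
```

The budget implies a cut of roughly 68% from the new daily spend, or a per-run cost of about $0.036. That is still above the pre-update $0.028, so there is some room to keep part of the added reasoning.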
Expert analysis
The 4× token increase from "Think carefully" is almost certainly because the instruction triggered verbose chain-of-thought reasoning traces in the output — the model is now explaining its reasoning before each summary section, where before it was summarizing directly.
Diagnosis first: Pull 10 sessions from before and after. Compare token usage by section: system prompt, user messages, model output. If the output tokens increased 4× but input tokens didn't change, the model is generating much longer reasoning traces.
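A sketch of that comparison follows, assuming each session log exposes token counts per section. The field names and sample numbers are made up for illustration. Only the per-run totals of 3,100 and 12,400 tokens come from the scenario.

```python
from statistics import mean

SECTIONS = ("system", "user", "output")


def mean_tokens(sessions: list[dict]) -> dict[str, float]:
    """Average token count per section across sessions."""
    return {s: mean(session[s] for session in sessions) for s in SECTIONS}


def growth(before: list[dict], after: list[dict]) -> dict[str, float]:
    """Ratio of after/before average tokens for each section."""
    b, a = mean_tokens(before), mean_tokens(after)
    return {s: a[s] / b[s] for s in SECTIONS}


# Invented split of the per-run totals, for illustration only.
before = [{"system": 400, "user": 1500, "output": 1200}]
after = [{"system": 500, "user": 1500, "output": 10400}]

ratios = growth(before, after)
```

If the `output` ratio dominates while `system` and `user` stay near 1, the extra cost comes from what the model generates, not from the longer prompt.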
Option 1: Reasoning traces in scratch pad, not in output (highest impact)
The most direct fix: move the "Think carefully" step into a separate reasoning step that writes to a scratch pad (a hidden reasoning field, not the final output), so the final output carries only the summary. Note that reasoning tokens are generally still billed as output tokens even when they are hidden, so the saving comes from capping the reasoning budget and keeping the trace out of every summary section, not from hiding it. Anthropic's extended thinking (which takes a token budget) or a simple two-step pipeline (step 1: reason, step 2: summarize based on reasoning) can achieve this. Expected token reduction: 60–70%.
Option 2: Cascade to smaller model for reasoning, larger model for synthesis
If the 12% quality gain is real, it likely comes from better reasoning, not better summarization. Use Claude Haiku or GPT-4o Mini for the reasoning step (cheap, fast), then pass the reasoning output as context to Claude Sonnet for the final synthesis (more expensive, high quality). Expected cost reduction: 40–50% while retaining most quality gains.
Option 3: Constrain output format more tightly
Expanded output format instructions may be generating more tokens in the output. Tighten the output schema: if the output is a JSON object with defined fields and field-level length limits ("summary": string max 200 words), and the harness rejects or truncates anything over those limits, the output cannot exceed those bounds regardless of how much the model "thinks carefully."
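A length limit written into the prompt is only a request, so it helps to check it in the harness. The sketch below validates a JSON output against per-field word limits. The 200-word limit on `summary` comes from the example above. The function name and error messages are illustrative.

```python
import json

MAX_WORDS = {"summary": 200}


def validate_output(raw: str, max_words: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for field, limit in max_words.items():
        value = data.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: missing or not a string")
        elif len(value.split()) > limit:
            problems.append(
                f"{field}: {len(value.split())} words, limit {limit}"
            )
    return problems


ok = validate_output(json.dumps({"summary": "Short and direct."}), MAX_WORDS)
too_long = validate_output(
    json.dumps({"summary": " ".join(["word"] * 201)}), MAX_WORDS
)
```

A failed validation can trigger a retry or a truncation. Pairing this with a hard cap on maximum output tokens in the API call bounds the cost even when the model ignores the format.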
Recommended approach: Do Option 3 first (low risk, low effort). Then instrument with Option 1's scratch pad pattern. Measure quality and cost after each change. Don't revert the prompt — iterate toward the budget target while preserving quality.
Self-assessment checklist:
- Did you diagnose the source of the token increase (output reasoning traces, not just "the prompt is longer")?
- Did you propose a path that preserves most quality gains rather than reverting the change?
- Did you suggest cascading (smaller model for cheap tasks, larger for quality tasks)?
- Did you propose an output format constraint as the lowest-effort first step?
Interview-Style Reasoning Questions
Question 1: Explaining the Agent vs. Workflow Decision to a Skeptical Executive
Setup: You're presenting a proposal to build an AI agent for contract review. The VP of Engineering interrupts: "We already use LLMs for this — we have a pipeline that extracts key clauses and flags risks. Why do we need an 'agent'? What's actually different, and is it worth the added complexity?"
Respond to the VP in under 3 minutes. Make the distinction concrete, and acknowledge the legitimate concern about complexity.
Expert thinking
Strong answer structure:
"The distinction is about who controls what happens next. In our current pipeline, code controls every step — extract clause, classify risk, flag it. That's a workflow. It works well for contracts where the relevant clauses are well-defined and the classification is consistent.
An agent is different in one specific way: the model decides what to look at next, based on what it found. For standard NDAs, we don't need that. For M&A due diligence, we do — because a discovery in section 8 about change-of-control provisions might mean the agent needs to re-examine section 3 for related indemnification language. Our current pipeline can't do that; it reads top to bottom and exits.
You're right that agents are more complex. The reasons are real: nondeterminism, harder to debug, more expensive to run, more failure modes. So here's my answer to whether it's worth it: we should only build an agent if we can prove the current workflow fails on our target cases. What I'm proposing is a 2-week evaluation: run 100 real M&A documents through the current pipeline, measure where it fails, and only proceed with the agent architecture if the failure rate is above 20%. If the current pipeline handles 90% of M&A documents well, we shouldn't build an agent."
What makes this a strong answer:
- Defines the distinction concretely without jargon
- Acknowledges the legitimate concern (complexity is real)
- Doesn't advocate for agents unconditionally — proposes an evaluation gate
- Gives the VP a specific decision checkpoint, not a vague promise
Question 2: Designing an Autonomy Budget for a New Deployment
Setup: You're designing a customer refund processing agent for an e-commerce company. The agent will handle 5,000 refund requests per day. Average refund: $47. Some are $0.99. Some are $4,200.
Design an autonomy budget: under what conditions does the agent act autonomously vs. escalate to human review? How do you set the thresholds?
Expert thinking
Strong answer structure:
Start by identifying the blast radius tiers, not a single threshold:
Tier 1 — Fully autonomous (no human approval needed):
- Refund ≤ $X (your base threshold — start at $50, calibrate based on error rate data)
- Customer history: 0 prior disputed refunds
- Reason: standard (order not received, item damaged, wrong item sent — all validatable from order data)
- Product: in-stock, not final sale
These cases represent probably 60–70% of volume. The blast radius of a wrong auto-approval at $50 is meaningful but recoverable.
Tier 2 — Auto-approve with notification and undo window:
- Refund $50–$200
- Standard reason, but customer has 1 prior disputed refund
- The agent approves, sends a notification to the ops team, and logs the action with a 4-hour undo window.
Tier 3 — Human review queue:
- Refund > $200 (or any amount where total refunds to this customer in 30 days > $500)
- Flagged reason: fraud indicator, chargeback history, "received damaged" on perishable goods
- Multiple items in same order
Tier 4 — Immediate human escalation:
- Any refund involving a regulatory complaint, social media mention, or legal threat in the conversation
- Account suspected of abuse patterns
How to set the initial dollar threshold: Start conservative ($25–$50). After 2 weeks of data, analyze the cases that fell in Tier 1. What's the false positive rate (auto-approvals a human reviewer would have denied)? If it's below 2%, raise the threshold. Never raise the threshold without empirical evidence.
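The tiers can be written down as a single routing function. This is one reading of the rules above, with hypothetical field names. Ambiguous cases default to human review, and the dollar thresholds are the starting values from the text rather than calibrated numbers.

```python
from dataclasses import dataclass


@dataclass
class RefundRequest:
    amount: float
    prior_disputed_refunds: int = 0
    refunds_30d_total: float = 0.0   # total refunded to this customer in 30 days
    standard_reason: bool = True     # not received, damaged, wrong item
    flagged_reason: bool = False     # fraud indicator, chargeback history, etc.
    multiple_items: bool = False
    in_stock: bool = True
    final_sale: bool = False
    legal_or_regulatory: bool = False  # complaint, social mention, legal threat
    suspected_abuse: bool = False


def autonomy_tier(r: RefundRequest, base_threshold: float = 50.0) -> int:
    """Return 1 (autonomous) through 4 (immediate escalation)."""
    # Check the highest-risk tiers first so they can never be bypassed.
    if r.legal_or_regulatory or r.suspected_abuse:
        return 4
    if (r.amount > 200 or r.refunds_30d_total > 500
            or r.flagged_reason or r.multiple_items):
        return 3
    if (r.amount <= base_threshold and r.prior_disputed_refunds == 0
            and r.standard_reason and r.in_stock and not r.final_sale):
        return 1
    if r.standard_reason and r.prior_disputed_refunds <= 1:
        return 2  # auto-approve with notification and undo window
    return 3      # anything unclassified goes to human review


tiers = [
    autonomy_tier(RefundRequest(amount=0.99)),
    autonomy_tier(RefundRequest(amount=120.0)),
    autonomy_tier(RefundRequest(amount=4200.0)),
    autonomy_tier(RefundRequest(amount=30.0, legal_or_regulatory=True)),
]
```

Keeping `base_threshold` as a parameter makes the calibration step a configuration change backed by data rather than a code rewrite.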
Self-assessment checklist:
- Did you design multiple autonomy tiers, not a single binary gate?
- Did you specify an undo window for medium-risk actions?
- Did you propose a process for calibrating the threshold based on data?
- Did you include customer behavior signals (not just dollar amount) as a factor?
Question 3: Building an Evaluation Oracle from First Principles
Setup: You're about to build a research summarization agent. Your manager asks: "How will we know if it's working?" You need to define an evaluation oracle — a function that scores the agent's output — before you write a single line of agent code.
Design the evaluation oracle. What does it measure, how is it measured, and what's the minimum acceptable score?
Expert thinking
The question most teams answer badly: "We'll use BLEU score / semantic similarity / LLM-as-judge." These are implementation choices. The harder question is what you're actually trying to measure.
Step 1: Define the task success criteria in plain English
For a research summarization agent, success is: "The summary accurately represents the sources, covers the key claims relevant to the research question, does not introduce claims not supported by the sources, and is usable by a domain expert to make a decision without re-reading the originals."
This gives you four measurable dimensions:
- Factual accuracy — claims in the summary that can be verified against the source documents
- Coverage — key claims in the source documents that should be in the summary but aren't
- Faithfulness — claims in the summary that contradict or are absent from the sources (hallucinations)
- Usability — domain expert rating: "Would you make a decision based on this summary alone?"
Step 2: Choose how to measure each dimension
- Factual accuracy: LLM-as-judge (Claude evaluates each claim in the summary against the source) + random human spot-check (20% of cases). Target: ≥ 90% of claims verifiable.
- Coverage: Build a reference summary for each test document, marking 5–10 "must-have" key claims. Measure what percentage of those appear in the agent's summary. Target: ≥ 80% coverage.
- Faithfulness: Dedicated hallucination detector — for each claim in the summary, retrieve the most relevant source passage and evaluate whether the claim is supported. Target: hallucination rate < 5%.
- Usability: Domain expert survey on a weekly sample (10 summaries/week). 5-point scale. Target: ≥ 4.0/5.0 average.
Step 3: Define the minimum acceptable score before shipping
Primary gate: hallucination rate < 5%, the faithfulness metric (non-negotiable, because hallucinations undermine the entire value proposition). Secondary gate: coverage ≥ 80% (otherwise the summary is systematically incomplete). If either gate fails, do not ship.
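The two gates reduce to a short check once the per-claim judgments exist. The sketch below assumes an upstream step (LLM-as-judge or human review) has already labeled each summary claim as supported or not, and has identified which must-have claims appear in the summary. The function names are illustrative.

```python
def coverage(must_have: set[str], found: set[str]) -> float:
    """Share of must-have key claims that appear in the summary."""
    return len(must_have & found) / len(must_have)


def hallucination_rate(claim_supported: list[bool]) -> float:
    """Share of summary claims not supported by the sources."""
    return claim_supported.count(False) / len(claim_supported)


def ship_gate(hallucination: float, cov: float) -> bool:
    """Both gates must pass: hallucination rate < 5% and coverage >= 80%."""
    return hallucination < 0.05 and cov >= 0.80


cov = coverage({"c1", "c2", "c3", "c4", "c5"}, {"c1", "c2", "c3", "c4", "x9"})
rate = hallucination_rate([True] * 24 + [False])
decision = ship_gate(rate, cov)
```

In this example 4 of 5 must-have claims are covered and 1 of 25 claims is unsupported, so both gates pass, though only just on coverage.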
Self-assessment checklist:
- Did you define success in plain English before defining measurement?
- Did you include a hallucination-specific metric (not just "accuracy")?
- Did you include at least one human evaluation component?
- Did you define a minimum acceptable score as a shipping gate, not just "as high as possible"?
Question 4: Handling a Stakeholder Who Wants to Skip the Maturity Ladder
Setup: You present a phased deployment plan: ship an Augmented LLM first, run evaluation, then upgrade to a workflow, then potentially an agent. The head of product says: "We've been looking at competitor demos. They're already running full agents. If we start with RAG, we'll be behind in 6 months. Can't we just start with an agent?"
How do you respond?
Expert thinking
The competitive framing is a trap. Competitors' demos do not tell you what's in their production systems, what their failure rates are, or how much engineering debt they've taken on. The graveyard of "full agent" launches that quietly regressed to simpler architectures 6 months later is large.
Strong response structure:
"I take the competitive concern seriously — we can't be 6 months behind on capabilities. Here's why I don't think starting with an agent solves it, and what I'd propose instead.
Starting with an agent that isn't ready costs us more time, not less. If we build a full agent now, ship it with 65% task success rate, and watch quality complaints accumulate for 3 months, we lose 3 months fixing what we should have proven first. That's the Klarna scenario.
The Augmented LLM phase isn't a delay — it's a qualification gate. We'll know in 3–4 weeks whether the simpler system is sufficient or whether we genuinely need agent-level capability. If we need the agent, we start the build with real evaluation data that tells us exactly what the agent needs to handle. That's faster than building blind.
What I can commit to: we'll have an evaluation result in 3 weeks that tells us definitively whether we're building a workflow, an agent, or expanding the current RAG system. The competitor who ships an agent without that evidence is the competitor we'll beat in 6 months when they're firefighting quality issues."
What makes this strong:
- Doesn't dismiss the competitive concern
- Reframes "starting simple" as speed, not caution
- Names the specific risk of skipping the ladder (Klarna pattern)
- Commits to a concrete timeline for the evaluation gate
Question 5: The "We'll Know It When We See It" Problem
Setup: You're inheriting a production agent from another team. The documentation says the agent "summarizes customer feedback and identifies action items." When you ask the previous engineer how they evaluated it, they say: "We reviewed the outputs manually each week. It was doing well." There is no automated evaluation. No test set. No metric.
The agent costs $2,400/month. You need to decide whether to keep it, replace it, or improve it. How do you build the evaluation oracle from scratch, in two weeks, with no ground truth data?
Expert thinking
"We'll know it when we see it" means the system cannot be improved, because improvement requires measurement. The first task is establishing measurement, not deciding what to do with the agent.
Week 1: Build the evaluation dataset
Step 1: Pull the last 90 days of agent inputs and outputs. You have raw data.
Step 2: Sample 100 input-output pairs stratified by document type (if there is variation) and date (to check for drift).
Step 3: Have two domain experts rate each output on three dimensions (1–5 scale):
- Completeness: Does the summary cover the important feedback themes?
- Accuracy: Does anything in the summary misrepresent the source feedback?
- Action item quality: Are the action items specific, actionable, and tied to real feedback?
Step 4: Compute inter-rater agreement (Cohen's kappa). If agreement is below 0.6, the rating criteria are ambiguous — refine the rubric before proceeding.
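Cohen's kappa compares observed agreement with the agreement expected by chance. A minimal pure-Python sketch for two raters follows. It computes the unweighted form, which treats ratings as categories. For 1–5 ordinal scales a weighted kappa is often preferred, because it gives partial credit for near misses.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


perfect = cohens_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
chance = cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])
```

The second example shows why raw agreement is misleading: the raters agree on half the items, but that is exactly what chance predicts, so kappa is 0.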
Week 2: Build the automated oracle
Use the human-rated 100 examples as a calibration set. Build an LLM-as-judge that scores new summaries on the same three dimensions. Validate: does the LLM-as-judge agree with your human raters at kappa ≥ 0.7? If yes, you have a scalable oracle.
The decision:
Run the automated oracle over the last 30 days of production output. If average scores are ≥ 3.5/5.0 on all three dimensions with hallucination rate < 5%, keep the agent. If scores are below that threshold on any dimension, you need to fix it or replace it. The $2,400/month is only worth it if the agent produces reliable output.
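The decision rule can be made explicit so that it is applied the same way every month. The sketch below uses hypothetical dimension names matching the rubric, and assumes the hallucination rate is measured separately.

```python
def failing_dimensions(
    mean_scores: dict[str, float],
    hallucination_rate: float,
    min_score: float = 3.5,
    max_hallucination: float = 0.05,
) -> list[str]:
    """List what falls short; an empty list means keep the agent."""
    failures = [
        name for name, score in mean_scores.items() if score < min_score
    ]
    if hallucination_rate >= max_hallucination:
        failures.append("hallucination_rate")
    return failures


scores = {"completeness": 4.1, "accuracy": 3.8, "action_item_quality": 3.2}
failures = failing_dimensions(scores, hallucination_rate=0.03)
```

Returning the failing dimensions, rather than a bare yes or no, points directly at what to fix first.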
Self-assessment checklist:
- Did you propose building evaluation data from existing production logs (not starting from zero)?
- Did you include human raters as a calibration step before automating?
- Did you check inter-rater agreement (a signal that your criteria are well-defined)?
- Did you connect the oracle score back to the keep/fix/replace decision?