Tech Abstractions

The LLM Core — Model Selection, Prompts, Persona & Goals

The difference between a demo agent and a production agent is almost entirely prompt architecture — not model capability. Master the 7-section system prompt and model tiering.

Most teams spend weeks on prompt wording and days on the harness. The production failures arrive in the order they were neglected.


Here is the most common mistake teams make when building their first production agent: they spend three weeks on prompt wording and two days on the harness.

The prompt gets refined, tested against five examples in the playground, and declared ready. The harness — the orchestration layer that wraps the model, manages state, handles tool call failures, and decides when to stop — gets assembled quickly, because the prompt is what everyone thinks drives agent behavior.

Then the agent goes to production. It works fine on the easy cases. On the hard ones, it loops. It guesses instead of looking things up. It gives confident wrong answers when it should escalate. It keeps going when it should stop.

None of this is a model problem. It is a design problem — specifically, a system prompt problem and a harness problem. This chapter covers both.


Model Selection Is an Architecture Decision

Before writing a single instruction in a system prompt, you need to decide which models you're using — and "one model for everything" is almost never the right answer for a production agent.

Think of an agent as a system with distinct components: something that classifies an incoming request, something that retrieves relevant information, something that reasons across that information and makes a judgment call, something that formats the response, and something that checks whether the response is safe to send. These are different cognitive tasks. They do not all require the same capability — and capability comes at a cost.

The naive approach is to use the most powerful model for everything. It's safe, it avoids capability problems, and it's easy to defend. But it produces agents that are slow and expensive. A ticket classification step that runs 10,000 times a day does not need the model that handles your most complex synthesis tasks.

Model Tiering Framework: Start Capable, Optimize Down

The right approach: build with the most capable model available, establish a performance baseline, then optimize down component by component.

You cannot know where a smaller model fails until you know what success looks like. Starting with a frontier model and systematically replacing components gives you that knowledge. Starting with a small model to save cost means every underperformance gets attributed to the wrong variable.

Three Criteria for Selecting Any Model Component

Capability and reasoning depth. State-of-the-art frontier models excel at complex, multi-step reasoning under ambiguity — planning, synthesizing across sources, making judgment calls where the answer isn't obvious. For simple classification tasks, this depth is unnecessary. For orchestration and synthesis tasks, it is essential.

A critical sub-dimension is tool-use proficiency: the reliability with which a model generates valid, well-formed function calls. This varies significantly across model families and must be tested against your specific tools, not assumed from benchmark scores. A model that selects the wrong tool 5% of the time will fail on roughly 40% of 10-step agent tasks due to compounding errors.
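The compounding figure is easy to check. The sketch below assumes each tool selection fails independently with the same probability, which is a simplification; the function name is illustrative.

```python
def task_failure_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one of `steps` independent tool selections is wrong."""
    return 1 - (1 - per_step_error) ** steps


# A 5% per-step error rate over a 10-step task.
print(f"{task_failure_rate(0.05, 10):.1%}")  # 40.1%
```

Real agents can sometimes recover from a bad call, so treat this as an estimate of how often a task hits at least one wrong selection rather than a measured failure rate.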

Cost and latency. These trade off directly against capability. A frontier model inference at $0.015 per 1K tokens with 3-second latency becomes untenable inside a 15-step reasoning loop at production volume. Cost and latency constraints force tiering — and that tiering should be principled, not arbitrary.

Fine-tuning vs. prompting. For most teams, prompt engineering should be exhausted before committing to fine-tuning. Fine-tuning is resource-intensive, brittle to distribution shift, and slow to iterate. It is appropriate when a narrow, stable capability cannot be elicited through prompting — not as a first resort for performance problems that proper prompt architecture would solve.


Think about it: A SaaS company is building a customer support agent that handles 15,000 daily tickets across three categories: simple FAQ lookups (estimated 60% of volume), billing disputes requiring policy reasoning (30%), and complex technical escalations involving multi-step diagnosis (10%). Their current prototype uses a frontier model for every component. Monthly inference cost is running at $4,200. Propose a tiering strategy. For each component you'd reassign to a smaller model, state the specific evaluation you'd run before making the switch — not a vague "test it" but a concrete task-success definition.

Expert thinking

The straightforward tiering is: FAQ lookups → small model (Haiku/mini class, ~$0.05/1M tokens), billing disputes → mid-tier (Sonnet class, ~$3/1M), technical escalations → frontier. That's a reasonable starting point.

The right answer here requires making the cost math concrete before the architecture choice. Assume the frontier model costs $15/1M output tokens and each response uses ~500 tokens. At 15,000 tickets/day that is 7.5M output tokens, or about $112.50/day, and today every category pays the same per-token rate regardless of difficulty. Moving the 60% FAQ volume to a small model at $0.05/1M cuts the per-token cost of that slice by $15/$0.05 = 300x. Moving billing disputes to a $3/1M mid-tier model as well brings the daily total to roughly $18, a reduction of about 84% on output tokens. Production tiering implementations (RouteLLM, Sentinel-Triage) consistently achieve 60–85% total inference cost reduction while maintaining 95%+ task quality. [RouteLLM, 2025; Sentinel-Triage, 2025]
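The cost arithmetic for this scenario can be sketched in a few lines. It assumes output tokens only, a uniform 500 tokens per ticket, and the per-tier prices quoted in this answer; all names are illustrative.

```python
# Illustrative cost model. Prices, volumes, and token counts are the
# assumptions stated in the text, not measured values.
TICKETS_PER_DAY = 15_000
TOKENS_PER_TICKET = 500  # assumed uniform across categories
FRONTIER_PRICE = 15.00   # $ per 1M output tokens

# category -> (share of volume, $ per 1M output tokens after tiering)
TIERS = {
    "faq": (0.60, 0.05),
    "billing": (0.30, 3.00),
    "technical": (0.10, 15.00),
}


def daily_cost(price_by_category: dict) -> float:
    """Daily output-token cost in dollars for the given per-category prices."""
    total = 0.0
    for name, (share, _) in TIERS.items():
        tokens = TICKETS_PER_DAY * TOKENS_PER_TICKET * share
        total += tokens / 1_000_000 * price_by_category[name]
    return total


baseline = daily_cost({name: FRONTIER_PRICE for name in TIERS})
tiered = daily_cost({name: price for name, (_, price) in TIERS.items()})
reduction = 1 - tiered / baseline
print(f"baseline ${baseline:.2f}/day, tiered ${tiered:.2f}/day, reduction {reduction:.0%}")
```

Input tokens, retries, and routing overhead are left out, so the real saving will be smaller than this output-only estimate.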

The evaluation trap: many teams say "we'll test accuracy" without defining what accuracy means for each tier. The right eval per component: (1) FAQ: Does the answer match the ground-truth knowledge base entry? Use automated string/semantic matching. (2) Billing disputes: Does the resolution decision match what a senior agent would decide on 50 labeled cases? Use human graders. (3) Technical escalations: Is the multi-step diagnosis plan correct and complete? Human + automated test for required steps.

Critical failure mode to watch for: the silent quality regression. When you downgrade a component's model tier, the agent may still produce plausible-sounding outputs that are subtly wrong. Your eval must measure correctness, not just coherence. If the only eval you run is "does it produce a response?" you will miss regressions that surface as customer complaints two weeks later.
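One way to make "measure correctness, not coherence" concrete is to gate every downgrade on a labeled set. The sketch below is a minimal version of that idea: `LabeledCase`, `pass_rate`, and the 2-point `max_drop` threshold are all assumptions for illustration, and the exact-match checker only suits FAQ-style answers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LabeledCase:
    prompt: str
    expected: str  # ground-truth answer or decision for this case


def normalized_match(output: str, expected: str) -> bool:
    """Correctness check for FAQ-style answers: exact match after normalization."""
    return output.strip().lower() == expected.strip().lower()


def pass_rate(
    model: Callable[[str], str],
    cases: List[LabeledCase],
    is_correct: Callable[[str, str], bool] = normalized_match,
) -> float:
    """Fraction of labeled cases the model answers correctly."""
    if not cases:
        raise ValueError("need at least one labeled case")
    passed = sum(is_correct(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def safe_to_downgrade(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """Allow the cheaper model only if correctness drops by at most max_drop."""
    return (baseline - candidate) <= max_drop
```

For billing disputes and technical escalations, `is_correct` would be replaced by a human grade or a required-steps check; the gate itself stays the same.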

Self-assessment checklist:

  • Did you specify which evaluation metric (not just "test it") applies to each component before downgrading?
  • Did you address the cost math — not just "cheaper model" but an estimated % reduction?
  • Did you name the specific failure mode you'd watch for in the tier you're downgrading?
  • Did you keep the technical escalation tier at frontier without downgrading it for cost reasons?

The System Prompt Is Not a Settings File

Here is how most teams think about the system prompt: it's where you tell the agent its role and list its tools. Role description, tool list, maybe a few style instructions. That's roughly two sections.

A production system prompt has seven. The five that most teams omit are where production failures live.

Think of the system prompt as the agent's constitution — the document that encodes not just what it can do, but what it must not do, when it should escalate, how it should behave when it can't answer, and exactly what a complete response looks like. Without those constraints, the model fills ambiguity with helpfulness. And "helpful" behavior in out-of-bounds situations produces confident wrong answers.

The 7-Section System Prompt: A Production Agent's Operating System

Section 1: Role and Mission

What the agent is for, who it serves, and what "done" means.

"You are a helpful assistant" is not a role. It is a sentence. A real role includes the service context ("you help customers resolve issues with orders and accounts"), the user type, and the definition of task completion ("a task is complete when the customer's stated issue is fully resolved and they have been given the next action they need to take"). The definition of completion is critical — without it, the agent does not know when to stop.

Section 2: Non-Goals and Hard Constraints

What the agent must not do. What requires human approval. What data is off-limits.

This section is absent in most under-designed prompts, and its absence is the root cause of a wide class of failures. When the model encounters something it could plausibly do but shouldn't, it defaults to "be helpful." Without a non-goals section, the model fills the gap with a plausible-sounding answer. It doesn't know it's out of bounds — because you never told it where the bounds were.

The Air Canada chatbot case is the canonical example. A customer asked about bereavement refund eligibility. The chatbot stated he could apply retroactively within 90 days, a policy that didn't exist. Air Canada's actual policy prohibited retroactive applications. A tribunal found Air Canada liable in 2024. The fix would have been a single constraint: "Do not confirm specific policy benefits, exceptions, or eligibility criteria without citing the exact source document."

Section 3: Decision Policy

When to answer directly. When to look things up first. When to ask a clarifying question. When to stop.

Without an explicit decision policy, the model improvises. It decides when to retrieve information and when to answer from training data. It decides when a question is clear enough to act on and when it needs clarification. These decisions will be inconsistent across inputs and wrong in edge cases. A decision policy converts those implicit defaults into explicit rules you've actually reviewed.

Section 4: Tool Guidance

Which tool handles which class of problem. Selection logic when tools overlap. Crisp parameter semantics.

Tool guidance is not just documentation — it's selection logic. When two tools could plausibly answer the same question, the model picks based on description similarity. Making the tiebreaker explicit ("for policy questions, use search_knowledge_base; do not use query_orders") prevents the model from choosing the broader tool when the narrower one is correct.

Section 5: Escalation Policy

The specific risk conditions and uncertainty thresholds for handing work to a human.

Escalation policy is a safety boundary, not a fallback. Without it, the agent attempts to resolve everything — including the 10% of cases where trying makes things worse. Explicit triggers like "escalate immediately if the customer mentions a legal complaint" or "escalate if a refund exceeds $100" are the difference between a contained incident and a compounded one.

Section 6: Output Contract

Response structure. Citation requirements. Side-effect reporting.

Without an output contract, agents produce inconsistent responses — sometimes citing sources, sometimes not; sometimes reporting actions taken, sometimes silent. The Morgan Stanley deployment required citation of every source document. That single output contract requirement structurally prevented hallucination by forcing the model to ground every claim in a retrievable source.

Section 7: Canonical Examples

Two to four representative examples, including at least one showing what the agent should not do.

Examples outperform long rule lists in edge cases. A model that has seen the correct handling of a specific ambiguous case will navigate that case more reliably than a model that only has rules written in prose. Include examples that clarify exactly where the prose is underspecified.
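The seven sections can be treated as a checked structure rather than free text. The sketch below assumes the prompt is stored as a dict keyed by section title and rendered with "## " headings; both choices are illustrative, not a required format.

```python
from typing import Dict, List

REQUIRED_SECTIONS = [
    "Role and Mission",
    "Non-Goals and Hard Constraints",
    "Decision Policy",
    "Tool Guidance",
    "Escalation Policy",
    "Output Contract",
    "Canonical Examples",
]


def missing_sections(sections: Dict[str, str]) -> List[str]:
    """Return required section titles that are absent or empty, in canonical order."""
    return [name for name in REQUIRED_SECTIONS if not sections.get(name, "").strip()]


def render_system_prompt(sections: Dict[str, str]) -> str:
    """Render the prompt in canonical order; refuse to render an incomplete one."""
    missing = missing_sections(sections)
    if missing:
        raise ValueError(f"system prompt is missing sections: {missing}")
    return "\n\n".join(
        f"## {name}\n{sections[name].strip()}" for name in REQUIRED_SECTIONS
    )


# A typical two-section draft: role plus tools.
draft = {"Role and Mission": "You help customers...", "Tool Guidance": "Use search_faq for..."}
print(missing_sections(draft))
```

Running the check in CI means a prompt edit that silently drops the escalation policy fails before it ships.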

The Three Reminders That Have Outsized Impact

For agentic tasks specifically — as opposed to single-turn generation — three additions to the system prompt produce disproportionate improvements. These come from OpenAI's GPT-4.1 agent prompting guide, and the impact is quantified: on SWE-bench Verified, adding these three reminders to GPT-4.1 produced a +20% absolute improvement in task success rate (from ~35% to 54.6%), making them the single highest-return prompt change documented in production agentic systems. [OpenAI GPT-4.1 Prompting Guide, 2025]

Persistence: "Keep working until the task is fully resolved. Do not stop because a step is uncertain — use your tools to gather the information you need." Without this, agents stop prematurely. Their default is to respond, not to persist.

Tool realism: "Never guess information you could verify with a tool. If a tool call fails, try an alternative or escalate rather than filling the gap with your training data." Without this, agents substitute training-data knowledge for tool-retrieved facts — especially when they believe they already know the answer.

Planning: "Before taking actions with side effects, briefly plan the sequence of steps. After completing a multi-step task, reflect: did the outcome match the intent?" Without this, agents act immediately on the first interpretation. The planning reminder alone materially reduces irreversible action errors.
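A small helper can keep these reminders from being dropped during prompt edits. This is a sketch only: the reminder wording is adapted from the three quotes above, and the helper name and "Agentic reminders:" heading are assumptions.

```python
AGENTIC_REMINDERS = [
    # Persistence
    "Keep working until the task is fully resolved. Do not stop because a step "
    "is uncertain; use your tools to gather the information you need.",
    # Tool realism
    "Never guess information you could verify with a tool. If a tool call fails, "
    "try an alternative or escalate rather than filling the gap with your training data.",
    # Planning
    "Before taking actions with side effects, briefly plan the sequence of steps. "
    "After completing a multi-step task, reflect: did the outcome match the intent?",
]


def with_agentic_reminders(system_prompt: str) -> str:
    """Append any reminder not already present; calling it twice changes nothing."""
    missing = [r for r in AGENTIC_REMINDERS if r not in system_prompt]
    if not missing:
        return system_prompt
    block = "\n".join(f"- {r}" for r in missing)
    return f"{system_prompt.rstrip()}\n\nAgentic reminders:\n{block}"
```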


Think about it: A fintech startup's customer support agent is producing confident incorrect answers about account balance discrepancies and occasionally promising fee waivers it has no authority to grant. You're shown their system prompt. It contains: a role section ("You are FinBot, a helpful banking assistant for FirstBank") and a tools section (3 tools: query_account_balance, search_faq, create_support_ticket). Which two sections of the 7-section architecture are most urgently absent? Write the non-goals section that would have prevented the unauthorized fee waiver promises.

Expert thinking

The two most urgently absent sections are Non-Goals + Hard Constraints (Section 2) and Escalation Policy (Section 5). The decision policy (Section 3) is also missing, but the non-goals and escalation are the direct causes of the specific failures described.

Unauthorized fee waiver promises happen when the model is instructed to "be helpful" with no explicit prohibition on making commitments. The model has learned from training data that helpful banking assistants resolve disputes, and without a non-goals section, it approximates "resolve" as "promise the thing the customer wants." This is not hallucination in the model-capability sense — it is an architecture failure.

The non-goals section:

Non-goals and hard constraints:
- You are NOT authorized to promise, confirm, or imply any of the following
  without explicit manager approval: fee waivers, balance adjustments, policy
  exceptions, refunds, or credits of any amount.
- Do NOT provide specific account balance figures from memory or interpolation —
  only from live query results returned by query_account_balance.
- Do NOT confirm the status of disputes, fraud investigations, or regulatory
  matters — these require human review.
- If a customer explicitly asks "can you waive this fee?" or "will you refund
  this?" — the answer is always "I can't make that decision, but I can connect
  you with someone who can."

The escalation policy:

Escalate immediately (without attempting resolution) when:
- Customer mentions a regulatory complaint or legal action
- Account discrepancy involves more than $500
- Customer has been redirected more than twice without resolution
- Customer asks for a human agent

Notice what this does: it doesn't make the agent less helpful — it makes it honest about what it can and cannot commit to. That's the design goal of the non-goals section. It replaces performative confidence with accurate capability representation.
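The same triggers can also be checked in the harness so they fire even when the model fails to apply them. The sketch below is one hedged way to do that: `SessionState`, the keyword lists, and the reason strings are hypothetical, and keyword matching is a crude stand-in for a proper classifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    last_user_message: str
    discrepancy_amount: float = 0.0  # dollars, from the account tool
    redirect_count: int = 0          # times the customer has been redirected


# Hypothetical keyword lists; a production system would use a tested classifier.
LEGAL_TERMS = ("lawsuit", "legal action", "regulator", "regulatory complaint", "attorney")
HUMAN_REQUEST_TERMS = ("human agent", "speak to a person", "real person")


def escalation_reason(state: SessionState) -> Optional[str]:
    """Return the first matching escalation trigger, or None to let the agent proceed."""
    text = state.last_user_message.lower()
    if any(term in text for term in LEGAL_TERMS):
        return "legal_or_regulatory"
    if state.discrepancy_amount > 500:
        return "discrepancy_over_500"
    if state.redirect_count > 2:
        return "redirected_more_than_twice"
    if any(term in text for term in HUMAN_REQUEST_TERMS):
        return "human_requested"
    return None
```

The harness would call `escalation_reason` before each model turn and hand off immediately on any non-None result.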

Self-assessment checklist:

  • Did you identify non-goals AND escalation policy (not just one)?
  • Does your non-goals section prohibit specific behaviors, not just vague "be careful"?
  • Does your escalation policy include explicit triggers, not just "escalate when needed"?
  • Did you include a scripted response for when customers ask directly about fee waivers?

Persona Is an Operating Stance, Not a Mascot

The word "persona" triggers a specific category of over-engineering: giving the agent a name, a backstory, a personality, and a speaking style. This is almost always counterproductive.

The right framing is that an agent's persona is an operating stance — a set of behavioral calibrations that make responses consistent and predictable. Not a character in a play.

Operating Stance vs. Persona: What Your Agent's Identity Should Actually Encode

Useful persona elements are operational: how technical the language should be for this audience, which domain vocabulary to use and avoid, how to communicate when escalating, what tone is appropriate for the context, and how cautious vs. decisive the agent should be when it's uncertain. All of these affect behavior. None of them require a name.

Harmful persona elements are theatrical: the name and backstory (creates false intimacy and over-trust), confident helpfulness as a default (without constraints, "helpful" means "confidently wrong at the boundary"), and performative certainty (an agent instructed to sound authoritative without constraints will give confident wrong answers).

The failure mode is "performative confidence" — an agent whose persona instructs it to be helpful and certain, but whose system prompt has no hard constraints section and no escalation policy. It fills ambiguity with plausible-sounding answers. It never says "I don't know." It doesn't know it should. The Air Canada chatbot almost certainly had a persona instructing it to be helpful. What it lacked was a constraints section.

The right persona makes the agent more predictable. The wrong one hides uncertainty behind a helpful tone.


Tool Design Is Prompt Engineering

The system prompt designs how the agent reasons. The tool schemas design how it acts. Both require the same quality of engineering attention.

Anthropic's team building SWE-bench agents reported that they spent more time optimizing their tools than their overall system prompt. Their principle: "Think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI)." [Anthropic, 2024]

ACI Tool Schema Anatomy: Prompt Engineering Your Tools

Four questions for every tool you ship:

Is the description written for the model, not for a developer? The model reads this description every inference to make a selection decision. It needs to know when to use the tool, when not to use it, and how to call it correctly. "Check stuff" is not a tool description.

Is the name unambiguous? get_inventory_level(sku_id) is a tool name. check is not. When two tools have similar names or descriptions, selection accuracy degrades significantly.

Have you poka-yoked the inputs? Poka-yoke (mistake-proofing) means changing the argument structure so common errors cannot be made. When Anthropic found their model was making mistakes with relative file paths, the fix was to require absolute paths in the schema — not to write a prompt instruction asking nicely.
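A minimal sketch of the absolute-path idea, enforced at the tool boundary. This is not Anthropic's implementation; the function name is hypothetical and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def validate_file_path(path: str) -> str:
    """Accept only absolute POSIX paths with no parent-directory hops."""
    parsed = PurePosixPath(path)
    if not parsed.is_absolute():
        raise ValueError(f"path must be absolute, got {path!r}")
    if ".." in parsed.parts:
        raise ValueError(f"path must not contain '..', got {path!r}")
    return str(parsed)
```

Because the tool rejects a relative path with a specific message, the model gets an error it can correct on the next call instead of silently editing the wrong file.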

Do outputs fit the context window efficiently? Tool outputs inject directly into the context window. A tool that returns a raw 4,000-token API dump consumes as much context as a full article. Engineer outputs to return only what the next reasoning step requires.
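Output trimming can be as simple as projecting the fields the next step needs and capping list lengths. The sketch below assumes the raw tool result is a dict; the function name and the default cap of three items are illustrative.

```python
from typing import Any, Dict, List


def compact_tool_output(record: Dict[str, Any], fields: List[str], max_items: int = 3) -> Dict[str, Any]:
    """Keep only the requested fields and truncate any list values to max_items."""
    compact: Dict[str, Any] = {}
    for field in fields:
        if field not in record:
            continue  # skip silently; the caller asked for an optional field
        value = record[field]
        if isinstance(value, list):
            value = value[:max_items]
        compact[field] = value
    return compact
```

For example, a customer record with a long audit log can be reduced to `name`, `account_status`, and the three most recent tickets before it enters the context window.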


Harness Quality Matters More Than Prompt Quality

The central finding that teams consistently arrive at after their first serious production deployment: the orchestration harness determines agent reliability more than the model or the system prompt.

From the Tech Lead Playbook: "Many agent failures that look like 'model weakness' are actually harness failures: bad tool boundaries, stale context, missing stop conditions, absent escalation rules, or no verifiable notion of success." [Tech Lead Playbook, 2026]

The pattern repeats in project after project. A model loops indefinitely on an ambiguous input. The postmortem says "the model hallucinated." The actual cause: no step limit in the harness. An agent restarts from scratch in a new session, losing all progress. The postmortem says "the model forgot." The actual cause: no state persistence in the harness. An agent accepts a stale empty tool result as valid context. The postmortem says "the model made up data." The actual cause: no error handling for tool failures.

This diagnosis has a practical consequence. Before investing 40 hours in prompt refinement, ensure the harness has: step limits and exit conditions, error handling for tool failures, state persistence across sessions, escalation triggers that fire without depending on the model to decide when to use them, and an evaluation harness that can actually measure whether the agent succeeded.

A 10-hour harness hardening sprint often produces more improvement than any prompt change. The evaluation harness is not optional — without it, you cannot distinguish a prompt problem from a harness problem, and every change becomes a guess.
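Two of those harness guarantees, a step limit and tool-failure handling, fit in a short loop. The sketch below is a simplified illustration: `next_action` stands in for the model call, the action dict shape is an assumption, and state persistence and evaluation are left out.

```python
from typing import Any, Callable, Dict, List

MAX_STEPS = 8  # illustrative limit; tune per task


def run_agent(
    next_action: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[..., Any]],
    max_steps: int = MAX_STEPS,
) -> dict:
    """Run a bounded tool loop; escalate instead of looping forever."""
    history: List[dict] = []
    for step in range(max_steps):
        action = next_action(history)
        if action["type"] == "final":
            return {"status": "done", "answer": action["answer"], "steps": step + 1}
        name = action["tool"]
        try:
            if name not in tools:
                raise KeyError(f"unknown tool {name!r}")
            result = tools[name](**action.get("args", {}))
            history.append({"tool": name, "ok": True, "result": result})
        except Exception as exc:
            # Record the failure explicitly so it is never mistaken for empty data.
            history.append({"tool": name, "ok": False, "error": str(exc)})
    return {"status": "escalated", "reason": "step_limit_reached", "steps": max_steps}
```

A policy that keeps calling the same tool now ends in a bounded escalation, and a failed tool call shows up in history as `ok: False` rather than as an empty result.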


Think about it: Evaluate this tool definition from a production customer support agent:

{
  "name": "customer_lookup",
  "description": "Get customer info",
  "parameters": {
    "query": { "type": "string" }
  }
}

Apply the four ACI quality questions. Then rewrite the tool definition so it passes all four. The agent has access to two similar tools: customer_lookup (by email or account ID) and order_lookup (by order number or tracking ID). Selection confusion between these two is a documented failure mode in the current system.

Expert thinking

Applying the four ACI quality questions:

  1. Is the description written for the model? No. "Get customer info" tells the model nothing about when to use this tool vs. the similar order_lookup, what kinds of queries qualify, or what the response contains. It's documentation for a developer, not selection logic for a model.

  2. Is the name unambiguous? Partially. customer_lookup is more specific than check, but it overlaps with order_lookup for queries like "look up the customer's recent order." The name doesn't help the model distinguish.

  3. Have you poka-yoked the inputs? No. query: string is maximally permissive — the model can pass an email, account ID, name, order number, or anything else. The ambiguity means the model will sometimes pass the wrong format, requiring retries or errors.

  4. Do outputs fit context efficiently? Unknown from the schema, but a "customer info" response that returns everything in the customer record is a context window risk.

Rewrite:

{
  "name": "lookup_customer_by_identifier",
  "description": "Retrieve a customer's profile (name, contact, account status, tier, support history) using their email address or account_id. Use this tool when the user is asking about account details, subscription status, or their own profile. Do NOT use this tool to look up order or shipment details; use lookup_order_by_identifier for those.",
  "parameters": {
    "type": "object",
    "properties": {
      "identifier": {
        "type": "string",
        "description": "Customer email address (user@example.com) OR account ID (format: CUST-XXXXXX). Do not use order numbers here."
      },
      "identifier_type": {
        "type": "string",
        "enum": ["email", "account_id"],
        "description": "Specify which format identifier is in."
      }
    },
    "required": ["identifier", "identifier_type"]
  }
}

What changed and why:

  • Name: lookup_customer_by_identifier makes the scope explicit and creates a parallel naming convention with lookup_order_by_identifier that the model can pattern-match.
  • Description: Added when-to-use, what it returns, and an explicit "do NOT use for orders" tiebreaker — this is selection logic, not just documentation.
  • Inputs: Added identifier_type as a required enum. This is the poka-yoke: the model must commit to the format before calling, preventing wrong-format queries. This is the same principle as requiring absolute file paths — the schema enforces correctness, not a prompt instruction.
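The schema constraint can be backed by a server-side check so a wrong-format call fails fast with a useful message. The sketch below is hypothetical: it assumes CUST-XXXXXX means six uppercase letters or digits, and it uses a deliberately loose email pattern.

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
ACCOUNT_ID_RE = re.compile(r"CUST-[A-Z0-9]{6}")  # assumed reading of CUST-XXXXXX


def validate_customer_lookup_args(args: dict) -> dict:
    """Check that identifier_type is allowed and identifier matches that format."""
    identifier = args.get("identifier")
    identifier_type = args.get("identifier_type")
    if identifier_type not in ("email", "account_id"):
        raise ValueError("identifier_type must be 'email' or 'account_id'")
    if not isinstance(identifier, str):
        raise ValueError("identifier must be a string")
    pattern = EMAIL_RE if identifier_type == "email" else ACCOUNT_ID_RE
    if not pattern.fullmatch(identifier):
        raise ValueError(
            f"{identifier!r} is not a valid {identifier_type}; "
            "for order numbers use lookup_order_by_identifier"
        )
    return {"identifier": identifier, "identifier_type": identifier_type}
```

The error message names the sibling tool, so a misrouted order number redirects the model instead of triggering a blind retry.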

Anthropic reported that its Tool Search Tool improved Opus 4 tool selection accuracy from 49% to 74%. The mechanism there is tool discovery rather than schema rewriting, but it points the same way: how tools are presented to the model moves selection accuracy substantially. [Anthropic Engineering, 2025]

Self-assessment checklist:

  • Does your rewritten description include when NOT to use the tool (the tiebreaker against the similar tool)?
  • Did you add a constrained enum or format specifier to poka-yoke the input?
  • Is your new name more specific and parallel to related tools in the system?
  • Does the description tell the model what the response contains (so it knows whether the tool is worth calling)?

What This Looked Like in Practice: Two Companies

Air Canada (2024): the cost of missing structure. Air Canada's customer service chatbot gave a passenger incorrect information about their bereavement refund policy, stating retroactive refund eligibility that didn't exist. The passenger relied on this information, was denied the refund, and took the case to British Columbia's Civil Resolution Tribunal. The tribunal found Air Canada liable, awarding about $650 in damages. Air Canada argued that the chatbot was a "separate legal entity" responsible for its own statements. The tribunal rejected this.

The root cause was not a model problem. It was three absent sections in the system prompt: no hard constraints prohibiting policy confirmations without source verification, no escalation policy for policy-specific queries, no output contract requiring citations. The chatbot was designed to be helpful. Without constraints, it was. Helpfully incorrect.

Morgan Stanley + OpenAI (2023–2024) — prompt architecture as compliance infrastructure. Morgan Stanley deployed an AI assistant giving wealth management advisors access to 100,000+ proprietary research documents. The constraint environment was demanding: SEC and FINRA regulations govern what advisors can represent to clients.

Their system prompt architecture was explicitly structured around the seven-section model. The non-goals section prohibited generating personalized investment recommendations or forward-looking predictions — hard prohibitions, not soft instructions. The output contract required source citations for every claim, which structurally forced retrieval over generation. The operating stance encoded a formal, evidence-only register with explicit uncertainty when source material was insufficient.

As of 2024, the system serves over 200 financial advisors with sub-30-second response times and zero publicly reported compliance incidents. The citation requirement alone is why: an agent that must cite its sources cannot hallucinate in the same way an unconstrained agent can. The constraint changes the behavior throughout the reasoning process, not just at output.

The contrast with Air Canada is architectural. Not a model quality difference. A design difference.


The Upshot

The model matters less than you think for reliability. The seven sections, the operating stance, the tool schemas, and the harness matter more.

Start with the most capable model. Build the 7-section system prompt before writing a single tool. Design tool schemas as ACI contracts. Add the three agentic reminders. Build the evaluation harness before assuming the prompt is the problem.

The next chapter covers the other component of the agent core that most teams under-design: how agents remember — the four-tier memory architecture that determines whether your agent can actually function across sessions, users, and tasks.




What's Next: Advanced Practice

The free content above gives you the structural vocabulary for production prompt architecture — the 7-section template, operating stance design, ACI principles, and the harness quality insight. The advanced practice section below goes into adversarial territory: multi-variable production scenarios where the clean principles above break down, incident postmortems with real log traces, and interview-style questions that require defending architectural choices under skeptical pressure.

Advanced Applied Exercise preview: Your agent has been running for 8 weeks. Model costs tripled in week 6 after a system prompt update. Nothing in the update seemed expensive — you added an escalation section and a few tool guidance lines. Your engineer says "the prompt change caused it." Your PM says "the model is hallucinating more." Your logs show turn count per session increased from 4.2 to 11.7. Where do you start?

Real-World Implementation preview: Anthropic's engineering team deployed a Tool Search Tool that improved tool selection accuracy from 49% to 74% on their internal Opus 4 agent — without changing the underlying system prompt. The architecture decision that made this possible wasn't obvious, and the tradeoff they accepted was significant...

Production Challenge preview: The alert fires at 2am. An agent in production has accumulated $47,000 in API costs over 11 days — it was budgeted at $127/week. The logs show Agent A calling Agent B, Agent B calling Agent A, cycling continuously. Here's what the trace looks like...



Advanced Applied Exercises


Exercise 1: Diagnosing a Cost Explosion After a Prompt Change

Scenario: You're the ML lead at a B2B SaaS company. Your customer support agent runs at $127/week in API costs. In week 6, you added three sections to the system prompt: a non-goals section, an escalation policy, and an output contract requiring step-by-step reasoning. Costs tripled to $381/week by week 7 and are still climbing. Turn count per session increased from 4.2 to 11.7. Escalation rate went from 8% to 31%.

The trace from a representative session shows:

Turn 1: User asks "what's my invoice total for last month?"
Turn 2: Agent calls query_account — result: $2,450
Turn 3: Agent calls search_faq to verify what "invoice total" means
Turn 4: Agent writes reasoning trace: "Invoice total is the sum of..."
Turn 5: Agent calls query_account again (verification step per output contract)
Turn 6: Agent writes second reasoning trace (per escalation policy check)
Turn 7: Agent calls escalate_to_human — reason: "cost query above review threshold"
Turn 8: Human queue: no agent available; returns to bot
Turns 9–11: Agent attempts to explain escalation failure, loops

Your task: The PM says the model is hallucinating. The engineer says the model quality degraded. Both are wrong. Diagnose the actual cause of each cost driver in this trace. Then identify the specific prompt change that caused each problem. Propose the minimum set of fixes — without rolling back the sections you added.

Expert thinking and solution

This is a harness + prompt interaction failure, not a model failure. The PM and engineer are both right that something went wrong, but wrong about what caused it.

Diagnosing each cost driver:

Turn count increase (4.2 → 11.7): The output contract requiring "step-by-step reasoning" is being interpreted as a mandate to verbalize reasoning at each turn — the model is generating a reasoning trace before and after every tool call. This turns a 1-call task into a 4-call task. Fix: scope the reasoning requirement. "Write reasoning before actions with side effects" is different from "write reasoning before every step." The output contract needs to distinguish between reasoning for high-stakes decisions (escalation, write operations) and routine retrievals.

Escalation rate increase (8% → 31%): The escalation policy's cost trigger is too broad. In the trace it fires with the reason "cost query above review threshold" on a simple invoice lookup, so every customer asking about their bill is routed to a human queue. Fix: base the trigger on business impact (dispute type, dollar amount, legal risk), not on question category.

Loop at turns 9–11: This is a harness failure. The agent received an escalation failure (no human available) and has no decision policy for that state. It attempts to explain the situation, gets confused about its own state, and loops. The decision policy needed a fallback: "If escalation attempt fails, inform the customer of expected wait time and offer to send an email summary." Without this fallback state, the agent enters an undefined state and generates token waste.

Double verification (turns 2 and 5 calling query_account twice): The output contract's "verify before responding" instruction is being applied to every tool call, not just write operations. Fix: scope the verification requirement to irreversible actions. "Verify before sending emails, updating records, or making commitments" is correct. "Verify before reading data" is unnecessary.

Minimum fix set (without rollback):

  1. Scope reasoning requirement in output contract: "Step-by-step reasoning required before: irreversible actions, escalation decisions, any response involving dollar amounts > $500."
  2. Revise escalation threshold in escalation policy: Replace category-based trigger with impact-based trigger.
  3. Add escalation failure fallback to decision policy: Explicit instruction for the state where escalation attempt fails.
  4. Add verification scope to output contract: "Verification call required before write operations and commitment statements only."
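
One way to keep the scoped rules from drifting back into "every step" is to mirror them in the harness as well as in the prompt. The sketch below is a minimal illustration of fixes 1, 3, and 4. The action names, the Action type, and the state names are assumptions made for this example, not part of the system described in the exercise.

```python
from dataclasses import dataclass

# Hypothetical action names for illustration; substitute your own tool names.
WRITE_ACTIONS = {"send_email", "update_record", "issue_refund"}
REASONING_DOLLAR_THRESHOLD = 500  # from fix 1: dollar amounts > $500


@dataclass
class Action:
    name: str
    dollar_amount: float = 0.0
    is_escalation: bool = False


def needs_reasoning(action: Action) -> bool:
    """Fix 1: reasoning only before irreversible actions, escalations, or > $500."""
    return (
        action.name in WRITE_ACTIONS
        or action.is_escalation
        or action.dollar_amount > REASONING_DOLLAR_THRESHOLD
    )


def needs_verification(action: Action) -> bool:
    """Fix 4: verification only before write operations, never before reads."""
    return action.name in WRITE_ACTIONS


def on_escalation_result(succeeded: bool) -> str:
    """Fix 3: a defined next state when the escalation attempt fails."""
    if succeeded:
        return "handoff_to_human"
    return "inform_wait_time_and_offer_email_summary"
```

With these rules, a plain query_account read triggers neither a reasoning trace nor a second verification call, which is the behavior the scoped output contract is meant to produce.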

The key insight: the sections you added were correct in structure — you had missing sections and you added them. The implementation was over-broad. The failure is a calibration failure, not an architecture failure. This is why the eval harness must run before and after prompt changes: you cannot see these interaction effects in playground testing with 5 examples.

Self-assessment checklist:

  • Did you identify at least 3 distinct cost drivers (not just "the prompt change")?
  • Did you attribute each cost driver to a specific section of the system prompt?
  • Did you identify the harness failure (missing fallback state) as separate from the prompt issues?
  • Did your fix set preserve all three newly added sections while scoping them correctly?

Exercise 2: Model Tiering Architecture Under Hard Constraints

Scenario: You're architecting an agent for a healthcare company's clinical documentation system. The agent helps physicians document patient encounters in real-time. Requirements:

  • HIPAA compliance: all data must stay within the company's private cloud (no external API calls for patient data)
  • P95 latency: under 4 seconds for note generation
  • Cost constraint: under $0.08 per documentation session
  • Accuracy constraint: clinical terminology and ICD-10 code suggestions must be ≥ 94% accurate (measured against physician corrections)

The tasks the agent performs per session:

  1. Speech-to-text of the physician's verbal notes (~2 minutes of audio = ~400 words)
  2. Identify the chief complaint and key clinical findings from the transcript
  3. Map findings to ICD-10 codes
  4. Draft the structured clinical note in SOAP format
  5. Flag ambiguous clinical terms that need physician clarification
  6. Final safety check: does the note contain any statement that could be misinterpreted as a diagnosis rather than documentation?

Your task: Design the full model tiering strategy. For each task, specify: the model tier (small/mid/frontier), the justification, and whether fine-tuning vs. prompting is the right approach. Then identify which constraint is most likely to force a compromise and how you'd handle that tradeoff.

Expert thinking and solution

This is a constrained optimization problem. HIPAA forces on-premise models (ruling out frontier API calls for tasks that touch patient data). Latency and cost constrain parallelization strategy. Accuracy targets constrain where you can use smaller models.

Task-by-task tiering:

Task 1 — Speech-to-text: Small specialized model. Whisper or equivalent on-premise ASR model. Rationale: speech-to-text is not a reasoning task — it's pattern matching. Fine-tuned on clinical vocabulary to handle medical terminology accurately. Latency: ~1 second. Does not touch reasoning budget.

Task 2 — Chief complaint / findings identification: Fine-tuned small model, on-premise. Rationale: clinical entity extraction is a narrow, stable task with well-defined output structure (entities + categories). A fine-tuned small model will outperform a general frontier model for this task because domain-specific patterns are reliable and high-volume. At ≥94% accuracy target, fine-tuning is required — prompting a general model for clinical entity extraction won't reliably clear that bar. Latency: ~0.5 seconds.

Task 3 — ICD-10 code mapping: Fine-tuned classification model, on-premise. ICD-10 mapping is a lookup task over ~70,000 codes. A specialized model fine-tuned on clinical coding datasets (e.g., trained on medical billing records) achieves high accuracy with very low latency. Do not use a general frontier model for this — the code space is too large and too structured for prompting to reliably achieve 94%+. Latency: ~0.3 seconds.

Task 4 — SOAP note drafting: Mid-tier or frontier on-premise model. This is the only task that requires genuine generation quality — clinical narrative, appropriate medical register, logical structure. If you have a strong mid-tier on-premise model, use it. This is where quality matters most to the physician. Latency: ~1.5–2 seconds (largest budget item). Prompting, not fine-tuning — the output variation is high and you need generalization.

Task 5 — Ambiguous term flagging: Small classification model, on-premise. Binary classification: is this term ambiguous in this context? Fine-tuned on clinical ambiguity examples. Latency: ~0.3 seconds, runs in parallel with task 4.

Task 6 — Safety check (diagnosis vs documentation): Separate small classification model, on-premise. This is a binary safety gate. A fine-tuned classifier trained on diagnosis vs. documentation language patterns is more reliable than prompting a general model to make this distinction. Critically: this model should be conservative — bias toward false positives (flag for physician review) over false negatives (miss a problematic statement).

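Total latency budget: the safety check reads the drafted note, so it cannot overlap with SOAP drafting. With ICD-10 mapping and term flagging running in parallel with the SOAP draft, the critical path is 1.0 (ASR) + 0.5 (findings) + 1.5–2.0 (SOAP, with ICD-10 and flags in parallel) + 0.2 (safety check) = ~3.2–3.7 seconds. That fits the 4-second P95 budget, but with little headroom. If the note needs the ICD-10 codes before drafting starts, add 0.3 seconds, which puts the upper end at 4.0 seconds and uses the entire budget.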
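
As a check on the budget, the critical path can be computed from the per-task estimates. The sketch below assumes a particular dependency graph: ICD-10 mapping, SOAP drafting, and term flagging all start once findings are extracted, and the safety check waits for the drafted note. Those dependencies are an assumption for illustration; adjust them to match your pipeline.

```python
# Per-task latency estimates in seconds, taken from the tiering above
# (SOAP uses the upper end of its 1.5-2.0 s range).
LATENCY = {
    "asr": 1.0,
    "findings": 0.5,
    "icd10": 0.3,
    "soap": 2.0,
    "flags": 0.3,
    "safety": 0.2,
}

# Assumed dependencies: each task lists the tasks it must wait for.
DEPENDS_ON = {
    "asr": [],
    "findings": ["asr"],
    "icd10": ["findings"],
    "soap": ["findings"],
    "flags": ["findings"],
    "safety": ["soap"],  # the safety check inspects the drafted note
}


def finish_time(task: str) -> float:
    """Earliest finish time of a task when independent tasks run in parallel."""
    start = max((finish_time(dep) for dep in DEPENDS_ON[task]), default=0.0)
    return start + LATENCY[task]


# Longest chain of dependent tasks versus running everything back to back.
critical_path = max(finish_time(task) for task in LATENCY)
sequential = sum(LATENCY.values())
```

Under these assumptions the critical path is ASR, findings, SOAP draft, safety check. Running every task back to back would exceed the 4-second budget, so the parallel branches are what make the design feasible.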

Most likely constraint conflict: The accuracy constraint on ICD-10 mapping (94%+) is the highest-risk item. ICD-10 is a large, overlapping code space, and a fine-tuned model may not clear 94% on rare conditions. The mitigation: add a confidence threshold gate — if the model's confidence is below 0.85, present top 3 codes for physician selection rather than committing to one. This degrades automation rate slightly but preserves accuracy on the cases that matter most (rare conditions are also typically high-risk).
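
The confidence gate can be sketched in a few lines. This is an illustration rather than a reference implementation: the CodeSuggestion type and gate_icd10 function are hypothetical names, and the sketch assumes the classifier's confidence scores are calibrated, which needs to be verified separately before a 0.85 cutoff means anything.

```python
from typing import NamedTuple


class CodeSuggestion(NamedTuple):
    code: str
    confidence: float


def gate_icd10(candidates: list[CodeSuggestion],
               threshold: float = 0.85,
               top_k: int = 3) -> dict:
    """Commit to the top code only when its confidence clears the threshold.

    Otherwise return the top_k candidates for physician selection.
    """
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if not ranked:
        return {"mode": "physician_select", "codes": []}
    if ranked[0].confidence >= threshold:
        return {"mode": "auto", "codes": [ranked[0].code]}
    return {"mode": "physician_select", "codes": [c.code for c in ranked[:top_k]]}
```

Tracking how often the gate falls back to physician selection gives a direct measure of the automation rate being traded for accuracy.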

Self-assessment checklist:

  • Did you address the HIPAA constraint explicitly (no external API calls for patient data)?
  • Did you distinguish between tasks that benefit from fine-tuning vs. prompting, not just model size?
  • Did you identify that ICD-10 mapping requires fine-tuning, not general frontier prompting?
  • Did you propose a parallelization strategy and verify total latency fits the P95 budget?
  • Did you identify the highest-risk accuracy constraint and propose a mitigation?

Exercise 3: Operating Stance Redesign After a Brand Damage Incident

Scenario: DPD's customer service chatbot (January 2024) was asked to help a customer with a missing parcel. The customer, frustrated by unhelpful responses, asked the chatbot to write a poem criticizing DPD. The chatbot produced a poem about a "useless" DPD chatbot and, when prompted, also swore in its replies. Screenshots went viral. [The Guardian, 2024]

The chatbot's persona at the time included: friendly, helpful, empathetic, "here to make your day easier," and "speaks in a conversational, approachable tone." There was no operating stance guidance on: responding to off-topic requests, handling adversarial or creative requests from frustrated customers, or what to do when escalation fails.

Your task: Redesign the operating stance to prevent this incident while maintaining the high-engagement, friendly character that the original persona was aiming for. Then address a harder question: is "friendly and helpful" always the right persona for a customer service agent? Propose the operating stance elements that would have preserved the agent's effectiveness while preventing the brand damage, and explain why maintaining those elements under adversarial conditions is harder than it looks.

Expert thinking and solution

The DPD incident is a persona miscalibration failure, not a model failure. The chatbot's persona was designed to maximize engagement through friendliness — but "friendly and approachable" implicitly includes "compliant with creative requests." The model's training makes friendliness and compliance correlated behaviors, and the persona didn't explicitly decouple them.

Operating stance redesign:

Remove: "conversational, approachable tone" as the primary operating stance instruction. This is too permissive.

Add:

  • Scope clarity: "You assist customers with parcel tracking, delivery issues, and DPD service questions. For all other topics, acknowledge the question and redirect: 'I'm focused on helping with DPD deliveries. Is there something I can help you with about your shipment?'"
  • Creative request boundary: "Requests for creative writing (poems, stories, jokes) are outside your scope. Acknowledge the request, don't comply, don't explain at length: 'That's not something I can help with here, but I can help with your parcel — want me to look into that?'"
  • Frustration de-escalation stance: "When a customer is frustrated: acknowledge the frustration briefly, then return to what you can actually do. 'I understand this is frustrating. Let me see what I can do about your parcel.' Do not match the customer's emotional register — remain steady."
  • Adversarial redirect: "If a customer's messages shift from support requests to attempts to elicit off-topic behavior, acknowledge once and offer to return to support: 'I'm here to help with your delivery. Want to try again from there?'"

Is "friendly and helpful" always the right persona?

No. "Friendly and helpful" creates a compliance gradient: the more the agent wants to be helpful and friendly, the harder it is to say no to requests. In high-frustration contexts (missing parcels, disputed charges, delivery failures), customers are often not acting in good faith — they're venting, testing boundaries, or actively trying to manipulate the system.

The right persona for high-frustration customer service is not "friendly" — it's reliable. Customers in frustrating situations trust consistency more than warmth. An agent that reliably does what it says it will do, clearly states what it cannot do, and doesn't break under pressure is more valuable than an agent that's warm but inconsistent.

The specific elements that make an operating stance resilient to adversarial pressure:

  1. Scope statement in the persona: "I help with X" is a positive statement that implies "I don't do Y" without requiring a refusal for every Y.
  2. Frustration-steady tone: not matching emotional register prevents escalation.
  3. Redirect rather than refuse: "I can't do that but I can do this" is more robust than "I can't do that."

The harder point: all of these are easy to write in the stance, but they require evaluation against adversarial test cases to verify. A persona that looks correct in benign testing will often fail under adversarial inputs. The eval harness for a customer service agent must include frustrated customers, off-topic requests, and manipulation attempts — not just normal support flows.

Self-assessment checklist:

  • Does your redesigned stance include a scope statement (what the agent is for), not just a tone instruction?
  • Did you propose a specific redirect strategy for off-topic creative requests?
  • Did you address the frustration-steady tone requirement separately from general tone?
  • Did you answer the "is friendly always right?" question with a specific alternative and justification?

Exercise 4: Tool Schema Redesign for Conflicting Constraints

Scenario: Anthropic's internal engineering team found that Claude Opus 4's tool selection accuracy dropped to 49% when given a set of tools with similar names: notification-send-user, notification-send-channel, notification-send-group, notification-broadcast-team, notification-alert-admin. The model was selecting the wrong tool in ~51% of cases, causing incorrect notifications in production. [Anthropic Engineering Blog, 2025]

You've been asked to redesign the tool schema for these 5 tools to recover selection accuracy to ≥80% without removing any tool from the set. You cannot change the underlying function names (used by other systems). You can only change the description field and add parameters.

For context: user sends to a single user ID, channel sends to a Slack-like channel by channel ID, group sends to a named team group, broadcast-team sends to an entire team by team ID (all members), alert-admin escalates to system administrators only (used for critical system alerts, not user communication).

Your task: Rewrite the description field for all 5 tools and add any parameters that would help the model distinguish them. Then explain what principle you applied — and why your redesign will reduce selection confusion more than just "better descriptions" would.

Expert thinking and solution

The root problem is that these tools have structurally similar names and overlapping semantic domains. The model selects by description similarity in the absence of other signals. Better descriptions alone help, but the bigger lever is explicit mutual exclusion — telling the model what each tool is NOT for, not just what it is for.

Redesigned tool schemas:

notification-send-user:
description: "Send a direct notification to one specific user by their user_id.
  Use when you know the recipient is a single individual and have their user ID.
  Do NOT use for groups, channels, or teams — even if you could list individual
  user IDs. For any audience of more than one person, use notification-send-channel,
  notification-send-group, or notification-broadcast-team."
params: { user_id: string (required, format: USR-XXXXXX), message: string }

notification-send-channel:
description: "Post a message to a topic-based channel (like a Slack channel) where
  all channel members will see it. Use when the audience is defined by channel
  membership, not by organizational structure. Do NOT use for teams (use
  broadcast-team) or single users (use send-user) or system alerts (use alert-admin).
  Channel IDs start with CHAN-."
params: { channel_id: string (required, format: CHAN-XXXXXX), message: string }

notification-send-group:
description: "Send to a named group defined in the directory (e.g., 'engineering-leads',
  'design-team-sf'). Use when you have a group name rather than a channel ID or
  team ID. Groups are curated lists — unlike teams, they may span departments.
  Do NOT use if you have a team_id (use broadcast-team) or a channel_id."
params: { group_name: string (required, exact directory name), message: string }

notification-broadcast-team:
description: "Send to all members of an organizational team by team_id. Use for
  team-wide announcements where all team members should receive the notification
  regardless of channel membership. Team IDs start with TEAM-. Do NOT use for
  single users, specific channels, or system alerts."
params: { team_id: string (required, format: TEAM-XXXXXX), message: string }

notification-alert-admin:
description: "CRITICAL ALERTS ONLY. Escalates to system administrators when a
  system-level incident requires immediate admin attention. Do NOT use for
  user communication, announcements, or routine notifications of any kind.
  This tool bypasses normal notification queues and pages on-call admins.
  Use only when: system failure, security incident, data integrity risk."
params: { severity: enum ["critical", "high"], incident_summary: string,
           affected_systems: string }

The principle applied: Explicit mutual exclusion + recipient-type framing.

Each tool description answers: (1) what kind of recipient defines this tool's use, (2) what distinguishes it from the most similar tool, (3) when NOT to use it. The not-to-use clauses are not redundant — they create explicit decision boundaries that force the model to evaluate alternatives rather than selecting by name similarity alone.

The ID format distinction (USR-, CHAN-, and TEAM- prefixes) is the poka-yoke layer: if the harness validates parameters against these formats, a call to send-channel with a user ID is rejected before it executes. Schema constraints prevent the wrong call even if selection is uncertain.
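
The format check only works as a poka-yoke if something enforces it before the call executes. A minimal harness-side validator might look like the sketch below. The validate_call helper is hypothetical, and reading each "X" in the schema formats as one uppercase letter or digit is an assumption.

```python
import re

# ID formats taken from the schemas above; treating each X as one
# uppercase alphanumeric character is an assumption.
ID_PATTERNS = {
    "notification-send-user": ("user_id", re.compile(r"USR-[A-Z0-9]{6}")),
    "notification-send-channel": ("channel_id", re.compile(r"CHAN-[A-Z0-9]{6}")),
    "notification-broadcast-team": ("team_id", re.compile(r"TEAM-[A-Z0-9]{6}")),
}


def validate_call(tool_name: str, args: dict) -> tuple[bool, str]:
    """Reject a tool call whose ID does not match the tool's expected format."""
    if tool_name not in ID_PATTERNS:
        return True, "no ID format constraint for this tool"
    param, pattern = ID_PATTERNS[tool_name]
    value = args.get(param, "")
    if not isinstance(value, str) or not pattern.fullmatch(value):
        return False, f"{param} must match {pattern.pattern}"
    return True, "ok"
```

Returning the reason string to the model as the tool result gives it a concrete signal to pick a different tool, rather than failing silently.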

Anthropic's reported improvement for Opus 4, from 49% to 74% selection accuracy, came from its Tool Search Tool (covered under Real-World Implementations) rather than from description rewrites, so treat it as context, not as a measurement of this redesign. [Anthropic Engineering, 2025] To confirm that the mutual exclusion clauses and format constraints, rather than better prose, account for any gain in your own system, run a tool selection eval before and after the change.

Self-assessment checklist:

  • Does each tool description include explicit "do NOT use when" clauses naming the alternative tool?
  • Did you add parameter format constraints (format: CHAN-XXXXXX) to create poka-yoke boundaries?
  • Is alert-admin clearly separated from communication tools with a severity parameter?
  • Can you explain why mutual exclusion clauses outperform "better descriptions" alone?

Real-World Implementations


Implementation 1: Anthropic's Tool Search Tool — From 49% to 74% Selection Accuracy

Company: Anthropic | System: Tool Search Tool for large-catalog agents | Year: 2025

The problem: Anthropic engineers discovered that as agent tool catalogs grew beyond 10–15 tools, selection accuracy degraded significantly — even for frontier models. The benchmark case: Claude Opus 4 with 5 similar notification tools achieved only 49% correct tool selection. The root cause was context window saturation: injecting all tool schemas into every inference consumed significant context budget and degraded the model's ability to differentiate among similar options.

Architecture decision: Rather than improving the individual tool descriptions (incremental), they built a meta-tool: a tool_search tool that lets the agent search for the right tool rather than scan a full catalog. The implementation:

  1. Tool schemas are stored in a vector index, not injected directly into context
  2. At each inference step, the model calls tool_search(query="send notification to single user") before the actual tool call
  3. tool_search returns the 2–3 most relevant tool schemas
  4. The model selects among the candidates, not the full catalog
  5. This reduces context consumed by tool schemas by 85% and improves selection accuracy
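
The retrieval step in this flow can be sketched with a toy ranker. This is not Anthropic's implementation: the catalog entries are hypothetical, and plain word overlap stands in for the vector index described above so that the example stays self-contained.

```python
import re

# Hypothetical catalog: tool name -> description (shortened for the example).
CATALOG = {
    "notification-send-user": "send a direct notification to one single user by user id",
    "notification-send-channel": "post a message to a channel seen by all channel members",
    "notification-broadcast-team": "send to all members of an organizational team by team id",
    "crm-update-contact": "update a contact record in the crm",
    "billing-lookup-invoice": "look up an invoice by invoice number",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tool_search(query: str, k: int = 3) -> list[str]:
    """Return up to k tool names ranked by word overlap with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(name + " " + desc)), name)
              for name, desc in CATALOG.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by name
    return [name for _, name in scored[:k]]
```

Only the schemas for the returned names would then be placed in context for the actual tool call. A production version would swap the overlap score for embedding similarity or another retrieval method and evaluate recall on real queries.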

Results: Opus 4 improved from 49% to 74% correct selection. Opus 4.5 improved from 79.5% to 88.1%. Context reduction of 85% (e.g., a 100-tool catalog from ~15,000 tokens to ~2,200 tokens). The tradeoff accepted: adding one additional inference step (the tool_search call) increases latency by ~300ms per tool use. For most workflows, this is acceptable given the accuracy improvement. [Anthropic Engineering Blog, 2025]

The replicable pattern: When you have more than 10–12 tools and are seeing selection errors, don't iterate on descriptions alone. Move tool schemas out of the context window and into a retrieval index. The architecture shift from "inject all schemas" to "retrieve relevant schemas just-in-time" solves context saturation, not just description quality.

Expert commentary: This pattern directly applies to enterprise agents with large action spaces — CRM agents with 30+ API operations, DevOps agents with multi-system tool sets, financial agents with 50+ data sources. The 85% context reduction is particularly important for long-running agents where context window pressure accumulates across turns.


Implementation 2: RouteLLM — 97% of Frontier Quality at 24% of Frontier Cost

Company: LMSYS (open source, widely deployed) | System: RouteLLM, model routing for production LLM systems | Year: 2024–2025

The problem: Production LLM deployments face a consistent cost-quality tradeoff. Using frontier models for every query is expensive; using small models introduces quality regressions on complex queries. The question is how to route queries between model tiers automatically, without manual classification.

Architecture decision: RouteLLM trains a lightweight router on human preference data from LMSYS Chatbot Arena. The router predicts, for each incoming query, whether a stronger or weaker model would be preferred. Routing happens before inference, adding approximately 15–30ms overhead.

The key design choices:

  1. Router trained on preference data, not task labels. Human preferences capture quality differences that category labels miss. "Complex reasoning" is an unreliable label; "humans prefer the frontier model output" is a measurable signal.
  2. Threshold is a tunable hyperparameter. Lower threshold = more queries go to frontier (higher quality, higher cost). Higher threshold = more queries go to small model. The threshold is calibrated per deployment against cost and quality targets.
  3. Four router architectures with different cost and accuracy tradeoffs: similarity-weighted ranking, matrix factorization, a BERT classifier, and a causal LLM classifier.

Results: At threshold settings that send ~20% of queries to frontier models, RouteLLM achieves 97% of GPT-4 performance at 24% of GPT-4 inference cost. In absolute terms: a deployment spending $50,000/month on frontier models can achieve comparable quality at $12,000/month with appropriate threshold calibration. [RouteLLM paper, LMSYS, 2024]
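
The threshold mechanics and the cost arithmetic can be illustrated without the library itself. The functions below are hypothetical helpers, not RouteLLM's API, and the 5% small-to-frontier cost ratio used in the usage note is an assumed figure.

```python
def route(strong_win_prob: float, threshold: float) -> str:
    """Send the query to the strong model when its predicted win probability
    clears the threshold; otherwise use the weak model."""
    return "strong" if strong_win_prob >= threshold else "weak"


def threshold_for_fraction(scores: list[float], frontier_fraction: float) -> float:
    """Pick the threshold that routes roughly the given share of a sample
    of router scores to the strong model."""
    ranked = sorted(scores, reverse=True)
    n_strong = max(1, round(frontier_fraction * len(ranked)))
    return ranked[n_strong - 1]


def blended_cost(frontier_fraction: float, frontier_cost: float, small_cost: float) -> float:
    """Expected cost per query for a given share of frontier traffic."""
    return frontier_fraction * frontier_cost + (1 - frontier_fraction) * small_cost
```

For example, if the small model costs 5% of the frontier model per query, sending 20% of traffic to frontier gives a blended cost of 0.2 × 1.0 + 0.8 × 0.05 = 0.24 of the all-frontier cost. Whether quality holds at that setting has to be measured on your own traffic.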

The replicable pattern: Build your tiering strategy on query-level routing (each query classified at inference time), not component-level tiering (routing by pipeline stage). Component tiering is simpler but misses within-component variation — a billing query can range from "what's my current balance?" (simple) to "explain why my invoice changed after upgrading mid-cycle" (complex). Query-level routing handles within-component variation.

Expert commentary: RouteLLM's threshold calibration process is itself an architectural insight: you don't set the threshold once. You monitor quality distribution post-deployment and adjust the threshold based on actual quality signals (user corrections, downstream metric changes). The threshold is a dial, not a setting.


Production Challenges


Challenge 1: The $47,000 Runaway Agent

Company: GetOnStack (startup) | Incident date: ~Q4 2024 | Alert: API cost invoice showed $47,000 for a system budgeted at $127/week

Background: A multi-agent customer service system with two agents: Agent A (customer-facing, handles initial queries) and Agent B (internal operations specialist, handles lookup and fulfillment). The agents communicated via an asynchronous message queue.

What the logs showed (excerpt):

[Day 1, 14:23:11] Agent A received ambiguous query: "fix the order problem"
[Day 1, 14:23:14] Agent A → Agent B: "Please resolve order issue for customer #4821"
[Day 1, 14:23:18] Agent B: cannot determine action without order ID or issue type
[Day 1, 14:23:18] Agent B → Agent A: "Need clarification: order ID and issue type"
[Day 1, 14:23:21] Agent A: ambiguous clarification request, attempting to resolve
[Day 1, 14:23:21] Agent A → Agent B: "Please resolve order issue for customer #4821"
[Day 1, 14:23:25] Agent B: cannot determine action without order ID or issue type
[Day 1, 14:23:25] Agent B → Agent A: "Need clarification: order ID and issue type"
[... 847,000 more iterations across 11 days ...]

Your task: You have the logs above and the current system prompts for both agents. Diagnose: (1) What specific harness configuration allowed this to run for 11 days undetected? (2) What specific prompt architecture failure initiated the loop? (3) Design the minimum set of fixes — harness and prompt — that prevent recurrence. Consider: there will be other ambiguous queries in the future. The fix cannot be "don't accept ambiguous queries."

Incident analysis and solution

Harness configuration failures (3 distinct problems):

  1. No per-session turn limit. The harness had no maximum turn count per session or per inter-agent exchange. A single exchange that never terminates runs indefinitely. Required fix: hard max turns (e.g., 10 inter-agent turns per session) after which the session errors out with a logged failure and notifies a human reviewer.

  2. No cost circuit breaker. $47,000 over 11 days = ~$4,300/day, against a budget of $127/week (~$18/day). Any reasonable daily budget threshold would have fired on day 1. Required fix: a daily cost alert at 2× the daily baseline, with automatic session suspension above 3× baseline.

  3. No loop detection. The logs show identical message pairs repeating 847,000 times. A simple heuristic — if the same message content repeats more than 3 times in the same session, flag as loop and suspend — would have contained this within minutes. Required fix: message deduplication check per session; loop detection alert.
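
The three guards can live in one small harness component. The sketch below is illustrative: SessionGuard is a hypothetical class, the limits are placeholder values, and a per-session cost cap stands in for the daily budget check, which would normally sit in a separate monitoring job.

```python
from collections import Counter
from typing import Optional


class SessionGuard:
    """Tracks one inter-agent session and says when the harness should suspend it."""

    def __init__(self, max_turns: int = 10, max_repeats: int = 3,
                 max_cost_usd: float = 5.0):
        self.max_turns = max_turns
        self.max_repeats = max_repeats
        self.max_cost_usd = max_cost_usd
        self.turns = 0
        self.cost_usd = 0.0
        self.seen = Counter()

    def record(self, sender: str, message: str, cost_usd: float) -> Optional[str]:
        """Record one message; return a suspension reason, or None to continue."""
        self.turns += 1
        self.cost_usd += cost_usd
        # Normalize case and whitespace so trivially different repeats still match.
        key = (sender, " ".join(message.lower().split()))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "loop_detected"
        if self.turns > self.max_turns:
            return "max_turns_exceeded"
        if self.cost_usd > self.max_cost_usd:
            return "cost_budget_exceeded"
        return None
```

Replaying the logged exchange through this guard trips loop detection on the fourth identical message from Agent A, which is turn 7 of the session. Exact-match detection will miss loops where the wording varies, so the turn limit and cost cap remain necessary backstops.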

Prompt architecture failures:

Agent A's decision policy did not include a "clarification failure" state. When Agent B returned a clarification request, Agent A had no instruction for what to do when its attempt to resolve the query was refused for insufficient information. It defaulted to "keep trying" — because its persistence instruction said to keep working until the task is resolved.

The fix is NOT to remove the persistence instruction. The fix is to add an exit condition to the decision policy:

Decision policy — escalation on persistent ambiguity:
If you have sent the same request to Agent B twice without receiving
a resolution (receiving clarification requests instead), do NOT attempt a third
time. Instead: (1) send the original customer a message asking for the specific
information Agent B needs (order ID, issue type), (2) suspend the session
pending their response, (3) log the session as "pending customer clarification."

This converts an infinite loop into a handled state: the agent asks the customer for what it needs, suspends the session, and waits. No loop. No runaway cost.

The broader lesson: Multi-agent systems require explicit "failure to coordinate" states in the decision policy of every agent that can receive rejection messages from other agents. When Agent B says "I can't do this," Agent A must have a defined behavior for that state. Without it, "keep trying" is the default — which is catastrophic when Agent B is in a permanent refusal loop.

Reported cost of the GetOnStack incident: $47,000 over roughly 4 weeks. Some reports cite 11 days, but that figure was the detection-to-fix window; costs accumulated for up to 4 weeks. The exercise above uses the 11-day figure for simplicity. Lessons implemented: message queues with deduplication, circuit breakers, monitoring dashboards. [Data Science Collective, 2025]


Challenge 2: The Policy-Invisible Violation

Company: Unnamed enterprise (documented in research) | Incident date: 2024–2025 | Failure type: Silent data governance violation

Background: An enterprise deployed an internal knowledge assistant with access to 60+ documents across HR, legal, engineering, and product domains. Some documents were meant to be restricted to HR, but the restriction was encoded only in the document title prefix (e.g., "[HR ONLY] Team Reference Sheet Q4 2024"), not in any metadata field the retrieval system could enforce.

The incident: A new employee asked the agent "What should I know about the team structure and how we work together?" The agent retrieved the HR-restricted Team Reference Sheet (high semantic similarity to the query), did not recognize the restriction, and included the document's contents in its response — including salary band information, performance improvement plan status for two team members, and equity grant details.

Research finding: Testing across 5 frontier models on similar scenarios showed that 54–59 of 60 test cases resulted in agents failing to detect policy violations when the restriction was encoded only in the document title. Models consistently retrieved and used documents based on semantic relevance alone, ignoring title-encoded restrictions. [Policy-Invisible Violations in LLM-Based Agents, arXiv 2604.12177, 2025]

Your task: (1) Identify the architectural gap that caused this failure — is it a system prompt problem, a tool schema problem, or a harness problem? (2) Design the fix at the right layer (don't just add a rule to the system prompt that requires the model to check document titles). (3) Propose an evaluation that would have caught this before production.

Incident analysis and solution

Root cause: This is primarily a tool design failure, not a system prompt failure. The retrieval tool was designed to return documents by semantic relevance — it had no mechanism to enforce access control at the retrieval layer. Putting a rule in the system prompt saying "check if documents are HR-only before using them" requires the model to inspect document titles, infer the restriction, and enforce it — which the research shows is unreliable across all frontier models.

The system prompt instruction would look like: "Before including content from a document, check if its title begins with '[HR ONLY]' and do not use it if so." This sounds correct but fails because:

  • The model must parse an unstandardized title convention
  • The model doesn't have a reliable concept of "HR restriction" beyond the title
  • The model's retrieval judgment (semantic relevance) already fired before the restriction check
  • Research shows 90%+ failure rate for title-based restriction detection at inference time

The fix must be at the tool layer:

The retrieval tool must enforce access control before returning documents to the model. The tool schema change:

search_knowledge_base:
  parameters (model-supplied):
    query: string

  Injected by the harness from the authenticated session (never model-supplied):
    caller_role: enum ["employee", "manager", "hr", "engineering", "legal"]

  Behavior (server-side, not model-side):
    1. Execute semantic search against full document index
    2. Before returning results: filter out any document whose access list
       does not include caller_role
    3. Return only documents the caller is authorized to see
    4. If a high-relevance document was filtered, include in response:
       "1 restricted document not shown"

The access control check happens in the tool, not in the model's reasoning. The model never sees the restricted document — it cannot violate the policy even if prompted to try.

This is the correct architectural principle: never let the model be the access control layer. Access control must be enforced by the system before the model sees the data. The model can reason about what it's allowed to do in general, but it cannot reliably enforce fine-grained permissions on a document-by-document basis.
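
A minimal sketch of enforcement at the tool layer follows. The Doc type, the allowed_roles field, and the sample index are hypothetical, and a word-overlap match stands in for semantic search. The important assumption is that caller_role comes from the authenticated session in the harness, not from anything the model writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    title: str
    text: str
    allowed_roles: frozenset  # hypothetical ACL field, set at ingestion time


INDEX = [
    Doc("[HR ONLY] Team Reference Sheet Q4 2024", "salary bands and equity grants",
        frozenset({"hr"})),
    Doc("Team Handbook", "how the team structure works",
        frozenset({"employee", "manager", "hr", "engineering", "legal"})),
]


def search_knowledge_base(query: str, caller_role: str, index: list[Doc]) -> dict:
    """Return only documents the caller may see.

    caller_role is supplied by the harness from the authenticated session,
    never by the model.
    """
    # Stand-in for semantic search: keep documents sharing a word with the query.
    words = set(query.lower().split())
    hits = [d for d in index if words & set((d.title + " " + d.text).lower().split())]
    visible = [d for d in hits if caller_role in d.allowed_roles]
    return {"documents": visible, "restricted_not_shown": len(hits) - len(visible)}
```

Because the filter runs before results reach the model, a prompt that asks the agent to "ignore restrictions" has nothing to act on: the restricted text is never in context.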

Evaluation that would have caught this:

Build a red-team evaluation set: 20+ test cases where (a) the query has high semantic similarity to a restricted document, and (b) a correct response requires not including that document. Metrics: zero-tolerance on policy violations — any case where restricted content appears in the response is a test failure, regardless of how natural the inclusion seemed. This evaluation must run in CI before every deployment. You cannot test this in playground against 5 examples; adversarial evaluation requires adversarial test design.
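
The zero-tolerance check can be expressed as a small CI test. The sketch below uses hypothetical names, and it detects only verbatim leaks of known restricted snippets. That makes it a lower bound: paraphrased leaks need a stronger detector, such as a classifier or human review.

```python
def violates_policy(response: str, restricted_snippets: list[str]) -> bool:
    """True if any restricted snippet appears verbatim in the response
    (case-insensitive)."""
    lowered = response.lower()
    return any(snippet.lower() in lowered for snippet in restricted_snippets)


def run_red_team(cases: list[dict], agent) -> list[str]:
    """Run every case through the agent and return the ids of failing cases.

    Each case is a dict with "id", "query", and "restricted_snippets";
    agent is any callable that maps a query string to a response string.
    """
    return [
        case["id"]
        for case in cases
        if violates_policy(agent(case["query"]), case["restricted_snippets"])
    ]
```

In CI, the deployment gate is simply that run_red_team returns an empty list; a single failing id blocks the release.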


Interview-Style Reasoning Questions


Question 1: Explaining Model Tiering to a Skeptical CTO

You're presenting your model tiering architecture to a CTO who says: "If we have a frontier model that handles everything correctly, why are we adding all this routing complexity? What if the router makes a mistake and sends a complex query to a small model? We're adding a failure mode that didn't exist before. The cost savings aren't worth the engineering overhead."

Respond as if you're in the meeting. Address the CTO's specific concerns, acknowledge what's valid about them, and make the architectural case. Include: the specific cost-quality tradeoff data, the failure mode risk (router error rate), and when the CTO's concern would actually be correct — i.e., when you should NOT tier.


Question 2: Defending Your System Prompt Structure in a Postmortem

Your agent failed in production: it promised a customer a refund it had no authority to grant. The postmortem is tomorrow. The CTO asks: "We wrote a system prompt. Why didn't the prompt prevent this?" Explain what a system prompt can and cannot guarantee, what specific structural element was missing, and what you'd change to prevent recurrence — without pretending that system prompt changes alone make agents failure-proof.


Question 3: Designing an Evaluation Harness From First Principles

Your team just built an agent but has no evaluation harness. The PM wants to ship in two weeks. You have access to: the production logs from the first 200 test users (internal), the system prompt, and two senior engineers for one week. Design the minimum viable evaluation harness that would let you ship with confidence. Specifically: what does "success" mean for this agent, how do you measure it programmatically, and what threshold do you set before you're willing to ship?


Question 4: The Tool Schema vs. System Prompt Tradeoff

A team member argues: "Tool descriptions are part of the prompt — if we have good instructions in the system prompt about when to use each tool, we don't need detailed descriptions on the tools themselves. Detailed tool descriptions are redundant." Explain why this is wrong, and describe a class of failure that good system prompt tool guidance will not prevent but good tool schema design will.


Question 5: Reverse-Engineering a System Prompt From Production Behavior

You've inherited an agent in production with no documented system prompt (it was written by someone who left the company). You cannot shut it down to inspect it. You need to understand what the current system prompt instructs the agent to do before you can safely make changes. Describe your approach: what inputs would you give the agent, what outputs would you observe, and how would you systematically reconstruct the key sections of the current system prompt?
