The New Product Paradigm

Agency alone is not the moat — the question is whether the workflow justifies building an agent at all.

The enterprise software industry is not adding a new category — it is replacing the fundamental unit of software value.

The Shift Nobody Has Fully Priced In

For thirty years, enterprise software was a productivity amplifier. You hired humans, gave them tools, and the tools made each human more effective. The value was in the enhancement. The business model was the seat.

That model is ending.

The shift is not about AI being "powerful" — that framing is already stale. The shift is structural: software is transitioning from a productivity tool to an autonomous digital workforce. AI agents are now executing multi-step, multi-hour workstreams with distinct job titles, operational budgets, and management structures. When that happens at scale, the total addressable market stops being the IT budget ($400–600 billion annually) and starts being the $11 trillion global labor market.

This changes everything downstream: how products are priced, how they create moats, how they reach customers, and what makes them defensible. Teams that understand the old model and try to apply it to agentic AI are not just leaving money on the table — they are building businesses that are structurally unsound from the first pricing decision.

Figure: The paradigm shift from SaaS seats to labor outcomes

Sources: Bessemer Venture Partners 2025, Menlo Ventures AI Landscape 2025, McKinsey State of AI 2025

Systems of Record vs. Systems of Action

The most useful conceptual frame for understanding what's changed is the distinction between a System of Record and a System of Action.

A System of Record stores and retrieves data. Your CRM, your ERP, your database. These systems made the SaaS era possible. The value was in organizing information so humans could act on it.

A System of Action completes labor. It takes an input, makes decisions, uses tools, and produces an output that advances real work — not just a recommendation for a human to act on, but the action itself. Ticket → resolution. Issue → pull request. Contract → first draft. Knowledge query → sourced answer with follow-up task created.

The highest-value agentic products are ones that successfully transition a workflow from the former to the latter — where the agent is not a smarter interface on top of a System of Record, but a System of Action that owns the outcome delivery.

This is the frame you need when evaluating whether something is worth building. Not "does AI make this workflow better?" but "does this product become the default route through which work in this domain gets completed?"

Think about it: Pick an AI tool you currently use or have recently evaluated. Is it operating as a System of Record (smarter retrieval, better recommendations) or a System of Action (completing work autonomously)? What would need to change, in the product or in how it is deployed, for it to cross that line?

Expert thinking

Most AI tools people use daily are Systems of Record with better UI. GitHub Copilot completes code but the developer still owns the pull request. ChatGPT drafts text but the human still sends the email. These are legitimate products — but they are not the paradigm shift.

A System of Action looks different: Intercom Fin resolves the support ticket without a human touching it. GitHub's coding agent turns a GitHub issue into a draft pull request. The delta is not about sophistication; it is about who owns the outcome.

The practical implication: when evaluating a tool, ask whether the human is still the last required actor before the outcome is delivered. If yes, the tool is a productivity enhancer — valuable, but not a System of Action. The business model for a productivity enhancer (per-seat SaaS) and the business model for a System of Action (per-outcome pricing) are structurally different. Conflating them is one of the most common pricing mistakes in agentic AI.

Self-assessment checklist:

Did you identify the specific moment in the workflow where human action is still required?
Did you distinguish between a tool that makes humans faster (System of Record) vs. one that replaces the human action (System of Action)?
Did you note what would change — in product scope or workflow integration — to cross the line?

The Build/Don't-Build Filter

The first strategic decision in agentic AI product development is not which model to use, or how to architect the system. It's whether the problem deserves an agent at all.

This is not a philosophical question — it has a practical answer. The strongest agentic use cases share four structural properties:

Nuanced judgment or unstructured inputs. The problem cannot be solved by a deterministic rule without losing too much fidelity. If you could write an IF/THEN decision tree that handles 95% of cases correctly, you probably don't need an agent — you need a better automation.

Executable tools. The agent can reach APIs, databases, or external services to take real actions, not just produce text. An agent that can only write recommendations but cannot execute them is an expensive search engine.

Observable success criteria. Within a short cycle, outputs can be checked — by rules, tests, expert review, or downstream business metrics. If you can't score success quickly, you can't improve the system and you can't price by outcome.

Tolerance for an escalation path. The workflow can accommodate a human review, approval, or correction at defined points. Some workflows are too high-stakes or too ambiguous for any degree of automation without a fallback.

The empirical data on where agents actually work is sobering and important: on a benchmark of multi-step tasks, models achieved near-100% success on tasks humans complete in under four minutes — but under 10% success on tasks humans complete in four or more hours. Full autonomy is a product and risk decision, not a branding claim. The right question is not "can an agent eventually do this?" but "does this workflow, at this scope, already have the verifiability and tool access that make it a viable first wedge?"

Figure: Build Now vs. Don't Build matrix with the 4-property filter

Sources: Anthropic 2026 State of AI Agents, METR benchmark analysis

Think about it: Your VP of Product proposes building an agentic AI to automate the quarterly business review process — collecting data from 12 different internal systems, generating narrative sections, and distributing drafts for executive review. Score this workflow against the four-property Build/Don't-Build filter. Which properties pass cleanly? Which ones give you pause, and why?

Expert thinking

This is a workflow that looks agentic-ready but has a critical problem: it is low-frequency and high-variability, and its success criteria are ambiguous.

Nuanced judgment: Partial pass. The narrative writing requires judgment, but data collection from known systems is deterministic. Mixed.

Executable tools: Likely pass — the systems exist and most have APIs. Tool access is achievable.

Observable success criteria: This is the red flag. What does a "good" QBR look like? Different executives will have different definitions. There is no fast objective scoring function. You cannot improve the system without a labeled eval dataset that does not yet exist.

Escalation path tolerance: Partial pass. Executives reviewing drafts is a natural escalation path — but the escalation rate will be high (near 100%) until the system has learned what "good" means for this org.

The correct recommendation is not "don't build" — it is "build narrow first." Start with one section of the QBR (e.g., pipeline summary), one data source, and one executive as the reviewer and labeler. Build the eval corpus before expanding scope. At current state, building the full QBR automation would produce a system that is impressive for demos and unreliable in production.
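The per-property scoring above can be written down explicitly, which makes it harder to collapse the filter into a single pass/fail. The sketch below is a minimal illustration, assuming a hypothetical 0 to 2 scale (0 = red flag, 1 = partial pass, 2 = clean pass). The class name `WorkflowScore`, the scale, and the decision rule are all assumptions made for illustration; none of them comes from the cited sources.

```python
from dataclasses import dataclass, asdict


@dataclass
class WorkflowScore:
    """Hypothetical 0-2 score per property: 0 = red flag, 1 = partial, 2 = clean pass."""
    nuanced_judgment: int
    executable_tools: int
    observable_success: int
    escalation_tolerance: int

    def bottlenecks(self) -> list[str]:
        """Return the lowest-scoring properties (the ones to fix first)."""
        scores = asdict(self)
        lowest = min(scores.values())
        return [name for name, value in scores.items() if value == lowest]

    def recommendation(self) -> str:
        """Illustrative decision rule; an assumption, not a standard."""
        scores = asdict(self).values()
        if all(value == 2 for value in scores):
            return "build now"
        if any(value == 0 for value in scores):
            return "build narrow first"
        return "build with checkpoints"


# The QBR workflow as scored in the text above.
qbr = WorkflowScore(
    nuanced_judgment=1,      # partial pass
    executable_tools=2,      # likely pass
    observable_success=0,    # red flag
    escalation_tolerance=1,  # partial pass
)
print(qbr.bottlenecks())     # ['observable_success']
print(qbr.recommendation())  # build narrow first
```

Keeping the four scores separate is the point of the design: the lowest score names the work to do first, which here is building the eval corpus.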

Self-assessment checklist:

Did you score each of the four properties separately, not as a single pass/fail?
Did you identify observable success criteria as the key bottleneck?
Did you propose a narrower initial scope rather than a binary build/don't-build?

Not All Agents Are the Same Product

Before choosing a workflow, it helps to understand which type of agent you are actually building — because the architecture, pricing model, and governance requirements differ significantly by type.

Task agents execute defined workflows with human oversight at key checkpoints. Support agents, coding assistants, legal review agents. This is where most enterprise PMF currently lives, and where the highest-quality demos translate into real production deployments.

Autonomous agents plan and execute complex multi-step workflows with minimal human intervention. Engineering agents, research agents, data processing pipelines. Viable in production, but requires mature evaluation infrastructure and clear risk boundaries.

Agent networks involve multiple autonomous agents collaborating across organizational boundaries. Still largely aspirational for most enterprises in 2026 — the coordination overhead is substantial and most teams overestimate their readiness for it.

The diagnostic problem: Menlo Ventures analyzed 495 enterprise deployments and found that only 16% of enterprise "agent" deployments are true agents. Most are fixed-sequence automations with an AI branding layer. Knowing which type you are actually building — and which type is actually appropriate for the workflow — is the honest starting point.

Where Production PMF Has Already Appeared

Some categories have achieved real production PMF, and the pattern is consistent. They are not categories chosen because they demo well. They are categories with structural properties that make them excellent first wedges.

Customer support has clear SOPs, known escalation paths, ticket-level outcome metrics, and fast ROI measurement. Intercom Fin is the clearest case: a 67% average resolution rate across 7,000+ teams, priced at $0.99 per resolved conversation. The wedge is narrow, the outcome is verifiable, and the pricing maps directly to the customer's business value (avoided human support cost).

Software engineering is digitally native, tool-rich, and highly verifiable. GitHub coding agent turns issues into background work, pushes to draft PRs, and never merges automatically, so every output has a clear review artifact (the diff) that the human can evaluate. One Chicago study found 39% more PRs merged with an AI coding agent; a METR trial with experienced developers working in their own, familiar repositories found 19% slower performance with AI tools. The lesson: agentic products win when tasks are decomposable, the environment is instrumented, and verification is strong.

Legal and professional workflows benefit from domain-specific corpus, document-heavy processes, and human judgment at the decision edge. Harvey expanded from six to sixty jurisdictions, covering 500+ global legal sources, and counts 42% of the Am Law 100 as customers. Bridgewater used AI for large-scale agreement review with 95%+ time savings; vendor contract review dropped from two days to two hours.

Enterprise knowledge demonstrates how a search wedge becomes a System of Action moat over time. Glean serves as the clearest example — Confluent reported 15,000+ hours saved monthly; Duolingo reported 500+ hours saved and 5x ROI.

The pattern across all of these: a named outcome (resolution, pull request, first draft, sourced answer), a short feedback loop for verifying success, and a clear existing budget line for the pain being addressed.

Think about it: You're evaluating two potential agentic product ideas: (A) an agent that automatically processes and categorizes incoming vendor invoices, routing them to the correct approval queue based on department, amount, and vendor history; (B) an agent that helps marketing teams ideate and brainstorm campaign concepts by synthesizing competitor analysis, audience data, and past campaign performance. Using what you know about production PMF patterns, which would you bet on as a first wedge? What makes one structurally stronger than the other?

Expert thinking

Option A (invoice processing) is the better first wedge by a significant margin — not because it sounds more impressive, but because it passes every structural test.

Invoice processing has: high frequency (weekly or daily), explicit dollar cost (AP team time), clear success criteria (correctly routed with correct coding vs. not), real tool access (ERP, approval system, vendor database), authority to act (routing is a write action the agent can own), and reversibility (misrouted invoices can be corrected before payment). Score: 6/7.

Option B (marketing ideation) has: moderate frequency, vague success criteria ("good brainstorm" is unmeasurable), no tool writes (agent produces text the human still acts on), and subjective output quality. Score: 2–3/7.

The mistake teams make is evaluating by demo quality, not structural properties. A marketing brainstorming agent demos beautifully — the AI produces compelling ideas and the VP of Marketing is excited. But there is no way to score quality consistently, no improvement loop, and no outcome the product can own. Invoice processing demos less excitingly but converts to production because the ROI is auditable and the workflow is owned end-to-end.

Self-assessment checklist:

Did you score both options against the four Build/Don't-Build properties?
Did you identify verifiability as the key differentiator, not capability?
Did you consider whether the agent can own an outcome (writes/actions) vs. only produce recommendations (text)?

The Thin-Wrapper Trap

Here is a framing from Bessemer Venture Partners that every product team building agentic AI should internalize.

Supernovas are AI companies with extraordinary growth and approximately 25% gross margins. They captured market share quickly on a compelling demo and an outcome-based promise. But they haven't solved the inference cost problem. Revenue scales; margins don't. Customer retention is fragile because the unit economics only work when inference costs are subsidized.

Shooting stars are durable businesses with approximately 60% gross margins, strong PMF, and loyal customers. The difference isn't model quality — it's workflow depth, evaluation infrastructure, and the ability to route tasks to right-sized models rather than defaulting to frontier inference for everything.

The thin-wrapper risk looks like this: your product is primarily a wrapper around a foundation model API with a well-crafted prompt and a polished UI. You have no domain data advantage, no evaluation corpus, no SOP capture, and no structural workflow position. The foundation model labs (OpenAI, Anthropic, Google) are moving up the stack toward the application layer. They will build native interfaces for the highest-value workflows. When they do, products with no moat below the UI layer will be displaced.

The benchmark for durability: can you maintain 60%+ gross margins as you scale? If the answer requires inference costs to fall dramatically, or assumes frontier model performance at every step, or depends on customers not noticing the error rate — you don't yet have a real business. You have a supernova, and supernovas are spectacular but short-lived.
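The margin benchmark is simple arithmetic, and a small sketch shows why routing to right-sized models matters so much. Every number below is hypothetical and chosen only to illustrate the mechanics; the function names `gross_margin` and `blended_cost` are likewise illustrative.

```python
def gross_margin(price: float, cost_to_serve: float) -> float:
    """Gross margin as a fraction of price."""
    if price <= 0:
        raise ValueError("price must be positive")
    return (price - cost_to_serve) / price


def blended_cost(frontier_cost: float, small_cost: float, share_on_small: float) -> float:
    """Average cost per outcome when a share of tasks runs on a cheaper model."""
    if not 0.0 <= share_on_small <= 1.0:
        raise ValueError("share_on_small must be between 0 and 1")
    return share_on_small * small_cost + (1.0 - share_on_small) * frontier_cost


# All figures are hypothetical assumptions, not data from the text.
PRICE = 0.99          # price per delivered outcome
FRONTIER_COST = 0.74  # cost per outcome if every step uses a frontier model
SMALL_COST = 0.20     # cost per outcome on a right-sized model

everything_frontier = gross_margin(PRICE, FRONTIER_COST)
routed = gross_margin(PRICE, blended_cost(FRONTIER_COST, SMALL_COST, share_on_small=0.7))

print(f"frontier only: {everything_frontier:.0%}")  # 25%
print(f"70% routed:    {routed:.0%}")               # 63%
```

Under these assumed costs, the same product moves from supernova-like margins to shooting-star margins without any price change, purely by sending most tasks to a cheaper model.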

What's next: Advanced Practice

Ready to stress-test what you've learned? The advanced exercises below put you in real production scenarios — messy, ambiguous, no clean answers. The first exercise starts here:

Advanced Applied Exercise preview: You're three months into building an agentic product for contract review at a mid-size law firm. Resolution rate on standard NDAs is 78%, below the 90% threshold you'd need for outcome pricing. Your CTO wants to immediately expand to employment agreements and IP licensing. You have one customer reference. What do you do, and what data do you require before proceeding?

Real-World Implementation preview: In 2024, a major insurance company rebuilt their claims intake process as an agentic system. The architecture decision that made or broke the project wasn't the model choice or the integration depth — it was a single product scoping decision made in week two of the build...

Interview Reasoning preview: A board member asks: "If OpenAI builds a native support agent tomorrow that's technically better than our product, what's our 18-month moat?" Walk through your answer — including what you would have needed to have built in the prior 6 months for that answer to be credible.

Advanced Applied Exercises

Exercise 1: The Expansion Dilemma

You've been running an agentic billing reconciliation product for a B2B SaaS company for 4 months. Current state:

Resolution rate: 81% (below the 90% threshold for outcome pricing)
Average cost-to-serve per reconciled invoice: $0.38
Current pricing: $0.75/resolved invoice
Gross margin: 49%
Customer count: 8, monthly churn: 0%

Your largest customer (35% of ARR) is asking you to expand into purchase order matching — a related workflow with an estimated 3x larger volume but significantly more variance in document formats. Expansion would require 6 weeks of engineering investment and would temporarily depress resolution rate during the learning period.

Your Series A investor is pressuring for growth metrics before the next raise in 5 months.

What do you do? Structure your answer around: (a) the expansion decision, (b) what you'd negotiate with the investor, and (c) what you'd tell the customer.

Expert thinking

The right answer is: do not expand yet, but use the request to accelerate the path to expansion.

The 81% resolution rate is below the viability threshold for outcome pricing, and below the implicit reliability bar for enterprise expansion. If you expand into a new, higher-variance workflow now, you will depress resolution rate on both workflows simultaneously, compromise the core product's reliability, and have a worse investor story in 5 months, not a better one.

What to do instead:

Tell the customer: "We'll build PO matching, and here's the timeline — but we go into it only when billing reconciliation is at 90%+. That's the standard we hold ourselves to so you can trust the product." This is a strong answer because it demonstrates product discipline, not weakness.
Use the next 6 weeks to aggressively close the 81% → 90% gap: analyze the 19% that isn't resolving, identify the top 3 failure patterns, fix them. This is also the eval work you'll need for PO matching anyway.
For the investor: "We have a signed expansion LOI from our largest customer contingent on hitting our 90% resolution threshold, which we're targeting in 8 weeks." That is a better growth metric than premature scope expansion with declining reliability.

The wrong move: expand because the customer is large. Customer size is not a substitute for product readiness. A failed expansion into PO matching with your largest customer costs more than a delayed expansion.

Self-assessment checklist:

Did you resist expansion pressure from a large customer without meeting the reliability threshold?
Did you identify the 81% → 90% gap as the highest-leverage work before expansion?
Did you reframe the investor conversation around a credible near-term milestone rather than premature growth metrics?

Exercise 2: Competing Against the Lab

It's Q2 2026. Anthropic has announced "Claude Work" — a native enterprise workflow agent with deep integrations to Google Workspace, Slack, and Salesforce. Early benchmarks suggest it handles support, document generation, and research workflows at quality comparable to your product. Claude Work is priced at $50/seat/month. Your product is priced at $0.99/resolution.

You serve 300 enterprise customers in the B2B SaaS support workflow. Your resolution rate is 71%. Your top three customers have been with you for 14 months.

A board member is asking: "Do we have a moat, or are we a feature?" What is your answer, and what would you need to build in the next 90 days to make that answer credible?

Expert thinking

The honest answer to "feature or moat?" is: at 71% resolution rate with no eval corpus, no domain SOP capture, and no workflow position beyond Zendesk integration — you are closer to a feature. But you have the raw materials to become a moat in 90 days if you move on the right things.

What Claude Work cannot replicate in 90 days:

Your labeled outcome dataset from 300 customers' support workflows (every resolved/unresolved ticket is training data)
Your customer-specific SOP knowledge (what "resolved" means for a fintech customer vs. an e-commerce customer vs. a healthcare SaaS)
The eval corpus you've been building implicitly through production usage

What to build in 90 days:

Formalize the eval corpus: label your existing resolution data by workflow type, ticket complexity, and failure mode. This is the asset Claude Work doesn't have for your specific customers.
Fine-tune a smaller model on your domain corpus. Target: match 71% resolution rate at 80% lower inference cost. This is your margin moat.
Launch per-customer resolution benchmarks. Your 14-month customers should have resolution rate trends, not just current rate. Show compounding improvement. That's the story Claude Work cannot tell about your customers.

The $50/seat vs. $0.99/resolution comparison is actually in your favor once outcome pricing is understood. For a support team handling 2,000 tickets/month where 71% resolve without a human, your cost is about $1,406/month (1,420 resolutions × $0.99) vs. Claude Work at $50 × (size of team). For a 40-person support team, Claude Work at $2,000/month is 42% more expensive with no resolution guarantee.
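The pricing comparison can be checked with a few lines of arithmetic. The sketch below uses only the inputs stated in the exercise (2,000 tickets/month, 71% resolution, $0.99/resolution, 40 seats at $50); the function names and the break-even calculation are illustrative additions.

```python
def outcome_cost(tickets: int, resolution_rate: float, price_per_resolution: float) -> float:
    """Monthly bill under per-resolution pricing."""
    return tickets * resolution_rate * price_per_resolution


def seat_cost(seats: int, price_per_seat: float) -> float:
    """Monthly bill under per-seat pricing."""
    return seats * price_per_seat


outcome = outcome_cost(tickets=2000, resolution_rate=0.71, price_per_resolution=0.99)
seats = seat_cost(seats=40, price_per_seat=50.0)
premium = seats / outcome - 1.0

print(f"outcome pricing: ${outcome:,.2f}/month")  # $1,405.80/month
print(f"seat pricing:    ${seats:,.2f}/month")    # $2,000.00/month
print(f"seat premium:    {premium:.0%}")          # 42%

# Team size at which the two bills are equal, given the same ticket volume.
break_even_seats = outcome / 50.0
print(f"break-even team size: {break_even_seats:.1f} seats")  # 28.1 seats
```

The break-even line is worth keeping in mind: under these inputs the outcome-priced product is cheaper only for teams larger than roughly 28 seats, so the argument is strongest with larger support organizations.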

Self-assessment checklist:

Did you identify the specific assets (eval corpus, SOP capture, domain data) that are hard to replicate quickly?
Did you propose concrete work for the 90-day window rather than defensive positioning?
Did you correctly frame the pricing comparison in terms of outcomes, not seats?

Exercise 3: The Pricing Model Reset

You're three months post-launch with 45 enterprise customers on a $1,500/month flat fee (effectively seat pricing for teams of 5–20 agents). Monthly gross margin is 54%. Your largest customers are your highest-cost accounts — they have 15+ agents and generate 4x the inference cost of your smallest customers, for the same $1,500/month.

You want to move to outcome-based pricing ($1.20/resolved ticket), but your Head of Sales argues this will cause churn because customers can't predict their monthly bill. Your CFO argues the current model has good margin. Your CTO says the fine-tuning you need to drive resolution rate higher requires outcome-labeled data you don't currently capture.

Who is right? What do you do?

Expert thinking

The CTO is most right, the CFO is defending a false stability, and the Head of Sales is raising a real objection that has a solvable answer.

The CFO's mistake: 54% gross margin on average masks the fact that your highest-value customers (largest teams, most engaged) are your most expensive. As these customers grow, your costs grow, your revenue stays flat. This is the seat-pricing trap in slow motion. The 54% margin is already declining if large customers are your fastest-growing segment.

The Head of Sales' objection is real but solvable: "Customers can't predict their bill" is the most common objection to outcome pricing. The answer is hybrid pricing: $500/month platform fee (predictable base) + $0.90/resolution above 500 resolutions/month. The floor makes the bill predictable. The variable component captures upside as usage scales. CIOs can budget for it.
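A minimal sketch of that hybrid bill, assuming the $500 platform fee includes the first 500 resolutions and the $0.90 rate applies only to resolutions beyond them. That reading of the pricing, and the function name `hybrid_bill`, are assumptions.

```python
def hybrid_bill(
    resolutions: int,
    platform_fee: float = 500.0,
    included: int = 500,
    overage_price: float = 0.90,
) -> float:
    """Monthly bill: fixed platform fee plus a per-resolution charge above the included volume."""
    if resolutions < 0:
        raise ValueError("resolutions must be non-negative")
    overage = max(0, resolutions - included)
    return platform_fee + overage * overage_price


for volume in (200, 500, 1500, 3000):
    print(f"{volume:>5} resolutions -> ${hybrid_bill(volume):,.2f}")
# 200 and 500 both bill $500.00; 1500 bills $1,400.00; 3000 bills $2,750.00
```

The bill never drops below the platform fee, which is what makes it budgetable, while the overage term ties revenue to the heaviest-usage accounts that currently cost the most to serve.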

The CTO is pointing at the most important issue: Without outcome-labeled data, you cannot build the eval corpus that enables model fine-tuning, which is the path to higher resolution rates and lower inference costs — i.e., the path to 60%+ gross margin at scale. The current pricing model actively prevents you from building your moat.

Transition plan: announce pricing change 90 days out. Grandfather existing customers at a hybrid model. Capture resolution labeling data from day 1 of the new model. Use the 90-day runway to build the outcome-data pipeline before it's operationally required.

Self-assessment checklist:

Did you identify that the current model's "good margin" is a lagging indicator, not a leading one?
Did you propose a hybrid model to address the predictability objection?
Did you connect outcome pricing to eval corpus capture as the data flywheel mechanism?

Real-World Implementations

Implementation 1: Zendesk AI and the Outcome Pricing Adoption Curve

Zendesk launched AI-powered ticket resolution in 2023 and faced the same pricing architecture question as every support AI vendor. Their published approach revealed a hybrid model: a base automation fee plus per-resolution charges only above a volume threshold.

Architecture decision: Rather than pricing every resolution, Zendesk included a baseline resolution volume in the subscription (up to X% of ticket volume), then charged per resolution above that threshold. This design solves the predictability objection — most customers never exceed the included volume and see no variable charges — while capturing upside from customers with high automation success rates.

Expert commentary: This is a sophisticated pricing architecture that trades margin optimization for adoption speed. Including a resolution floor in the base fee means some customers get outcome-based value without paying outcome-based prices — margin-dilutive in the short term, but adoption-accelerating. The long-term bet is that high-resolution-rate customers who've experienced the outcome model will accept outcome pricing at renewal. Whether this bet pays off depends entirely on whether the resolution rate is high enough that customers see compelling ROI from the included volume before they ever pay a variable charge.

Implementation 2: ServiceNow's Agentic Platform — System of Action Transition

ServiceNow reported $355 million in annualized value from its internal agentic AI deployment, making it one of the largest published ROI cases for an enterprise using agentic AI on its own workflows.

Architecture decision: ServiceNow didn't deploy a standalone AI product. They embedded agentic workflows into existing ServiceNow workflow objects — incidents, requests, changes — so that every agent action occurred within the same ITSM context their team already used. The agent's outputs (resolutions, recommendations, escalations) were ServiceNow records, not a separate interface.

Expert commentary: This is the workflow surface distribution advantage made explicit. The agent succeeded not because of model quality but because every AI action was already in the context where a human would review it. Trust was built implicitly — the artifact was familiar. The decision to keep AI outputs as native workflow objects rather than building a separate AI interface is the key architectural call. It means zero change management for end users, immediate audit trail via the existing ITSM platform, and no distribution problem because the agent lives where work already happens.

Production Challenges

Challenge 1: The Resolution Rate Plateau

You've been running a support agent for 6 months. Resolution rate climbed quickly from 45% to 67% in the first two months, then plateaued. For the last four months, it has been 67% ± 2%. Your customer success team says customers are "fine with it" but your 90%-before-expansion milestone is stalling the product roadmap.

Diagnosis: you have access to full execution traces, conversation logs, and human-agent handoff reasons. Design a systematic approach to break through the plateau.

Expert analysis

The 67% plateau is a data problem, not a model problem. Here's the diagnostic framework:

Step 1: Segment the 33%. Break the non-resolutions into categories: (a) the agent attempted and failed (produced an answer the customer rejected or escalated), (b) the agent did not attempt (routed to human immediately), (c) the agent produced no answer (explicit uncertainty). Each category has a different root cause and fix.

Step 2: Find the top 3 failure patterns. In a typical support workflow, 3 failure categories account for 80%+ of non-resolutions: missing documentation coverage (customer asked about a feature with no help article), ambiguous question routing (the agent couldn't classify the intent correctly), tool call failure (the agent needed to look up an account but the API call failed or returned an unexpected schema).

Step 3: Apply a different fix to each category.

Missing documentation: this is a content gap, not an AI gap. Build the missing articles. Resolution rate will improve automatically.
Routing failure: add example conversations to the routing classifier. Fine-tune on 20–50 examples per failure class.
Tool call failure: add schema validation and better error handling. Log and alert on failed tool calls separately.

Step 4: Track leading indicators. Resolution rate is a lagging indicator. For each fix, track a leading indicator that shows impact before resolution rate catches up: coverage rate (% of intents with at least one help article), routing accuracy (% correctly classified), tool call success rate.

The sub-90% ceiling that most support agents hit is not about model capability. It is about documentation coverage and routing precision, and both are solvable with non-ML work.
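The segmentation and ranking steps of this framework amount to counting. The sketch below is a minimal illustration, assuming each handoff has already been labeled with a segment and a failure reason; the record format, the labels, and the sample data are all hypothetical.

```python
from collections import Counter

# Hypothetical labeled handoffs: (segment, failure_reason).
handoffs = (
    [("attempted_failed", "missing_documentation")] * 5
    + [("not_attempted", "routing_failure")] * 3
    + [("attempted_failed", "tool_call_failure")] * 2
    + [("no_answer", "other")] * 1
)


def segment_counts(records):
    """Count non-resolutions per segment (attempted/failed, not attempted, no answer)."""
    return Counter(segment for segment, _ in records)


def top_failure_patterns(records, n=3):
    """Return the n most common failure reasons with their count and share of the total."""
    counts = Counter(reason for _, reason in records)
    total = sum(counts.values())
    return [(reason, count, count / total) for reason, count in counts.most_common(n)]


def coverage_rate(intents_with_article: int, total_intents: int) -> float:
    """Leading indicator: share of intents that have at least one help article."""
    return intents_with_article / total_intents if total_intents else 0.0


print(segment_counts(handoffs))
for reason, count, share in top_failure_patterns(handoffs):
    print(f"{reason}: {count} ({share:.0%})")
```

In this sample the top three reasons cover 10 of 11 handoffs, which is the shape of distribution that makes a short, targeted fix list worthwhile.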

Challenge 2: The Refund Agent Incident

At 2:47am, a PagerDuty alert fires. Your refund processing agent — which handles 400 refund requests daily — has processed 47 refunds in the past 30 minutes totaling $83,400. Normal rate: 5–8 refunds/hour, average $180 each. The agent is still running.

The trigger appears to have been a bulk import of customer records from a new integration partner that went live at 2:15am. The import included 1,200 records with a "refund_pending" status flag.

Walk through your immediate response (next 60 minutes) and your postmortem action items.

Expert analysis

Immediate response (first 10 minutes):

Kill switch: disable the refund agent immediately via the feature flag. Not "pause" — disable. Every minute of delay is additional refund exposure.
Assess reversibility: contact the payment processor to determine whether the $83,400 in refunds has settled or is still pending. Reversal windows vary by processor and can be short, so confirm the cutoff and act before it closes if possible.
Scope the damage: query the full execution log for all refunds processed since 2:15am. You need a complete list before you can communicate with customers or finance.

Next 50 minutes:

Root cause hypothesis: the "refund_pending" flag in the imported records was likely interpreted by the agent as a trigger for action. The agent's tool schema or routing logic did not distinguish between "this record historically had a pending refund" vs. "this record requires a refund action now."
Communicate: finance and customer success need to know within the hour. Never let them discover an incident through customer complaints.

Postmortem action items:

Add schema validation for all new integration data before it reaches agent context. New data sources require explicit allowlisting, not implicit passthrough.
Implement a rate limit circuit breaker: if refund volume exceeds 2x normal rate for more than 5 minutes, auto-pause and alert. This is a monitoring gap.
Require human approval for bulk action triggers. Any agent action triggered by a bulk import should require an explicit approval step, not autonomous execution.
Add the 47-refund execution traces to your failure eval dataset. These are high-signal negative examples that prevent the same failure mode from recurring.
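The rate-limit circuit breaker from the action items can be sketched as a sliding-window counter. This is a simplified illustration, not a production design: it trips as soon as the number of refunds inside a 5-minute window exceeds what 2x the normal rate would produce, rather than waiting for the excess to persist. The class name and the default of 8 refunds/hour (the upper end of the normal rate in the scenario) are assumptions.

```python
from collections import deque


class RefundCircuitBreaker:
    """Trips when refunds in a sliding window exceed a multiple of the normal rate."""

    def __init__(self, normal_per_hour=8.0, multiplier=2.0, window_seconds=300.0):
        # Maximum refunds tolerated inside one window at `multiplier` x the normal rate.
        self.max_in_window = normal_per_hour * multiplier * window_seconds / 3600.0
        self.window_seconds = window_seconds
        self.timestamps = deque()
        self.tripped = False

    def allow(self, now: float) -> bool:
        """Record a refund attempt at `now` (seconds); return False once tripped."""
        if self.tripped:
            return False
        self.timestamps.append(now)
        # Drop attempts that have aged out of the window.
        while self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_in_window:
            self.tripped = True  # a real system would also alert on-call here
            return False
        return True


breaker = RefundCircuitBreaker()
# Incident pace: 47 refunds in 30 minutes is roughly one every 38 seconds.
allowed = sum(breaker.allow(i * 38.0) for i in range(47))
print(allowed, breaker.tripped)  # 1 True
```

With a base rate this low, the threshold works out to fewer than two refunds per window, so normal bursts could trip it too. In practice the window, multiplier, and a dollar-value limit would need tuning against real traffic, and a tripped breaker should require a human to reset it.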

Interview-Style Reasoning Questions

Question 1

A VP of Engineering says: "We have a 1,200-person support organization. If we deploy an AI agent with a 70% resolution rate, we can reduce headcount by 50%. Why would we pay $0.99/resolution when we could just fire 600 people and use the savings?"

Walk through your response. Don't defend the product — help the VP reason through the decision correctly.

Expert thinking

The VP's math has three errors worth correcting.

Error 1: 70% resolution rate ≠ 50% headcount reduction. Resolution rate and headcount are not 1:1. Humans don't spend 100% of their time on resolvable tickets. They also handle escalations, QA of AI outputs, complex cases, and work the agent generated (follow-up tasks, account adjustments, process exceptions). A realistic productivity model suggests 70% resolution rate supports 20–30% headcount reduction in the first 12 months, not 50%.

Error 2: Headcount reduction has a risk profile. Laying off 600 support employees eliminates institutional knowledge, creates morale risk in the remaining team, and is irreversible if the agent underperforms. The AI agent can be turned off; the layoff cannot be undone. The option value of keeping human capacity while the agent matures is substantial.

Error 3: The comparison should be risk-adjusted. At $0.99/resolution and 400 daily resolutions, annual cost is ~$144K. Cost to maintain 600 support employees (fully loaded, $70K average): $42M. The question isn't "$0.99/resolution vs. $0" — it's "$144K/year vs. $42M/year, with $42M being irreversible and $144K being cancellable."
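
The arithmetic behind Error 3 is easy to reproduce. The sketch below uses only the figures stated above and assumes the agent runs 365 days a year, which is what the ~$144K figure implies.

```python
PRICE_PER_RESOLUTION = 0.99     # from the VP's question
DAILY_RESOLUTIONS = 400         # volume assumed in the text
DAYS_PER_YEAR = 365             # assumption: the agent runs every day
HEADCOUNT = 600
FULLY_LOADED_COST = 70_000      # per employee per year

agent_annual = PRICE_PER_RESOLUTION * DAILY_RESOLUTIONS * DAYS_PER_YEAR
people_annual = HEADCOUNT * FULLY_LOADED_COST

print(f"Agent:         ${agent_annual:,.0f} per year")    # about $144.5K
print(f"600 employees: ${people_annual:,.0f} per year")   # $42M
```

Changing DAILY_RESOLUTIONS is the quickest way to test how sensitive the comparison is to volume.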

The right recommendation: deploy the agent, reduce new hiring, reallocate existing headcount to higher-value work (QA, complex case handling, product feedback), and measure resolution rate improvement over 6 months before making irreversible headcount decisions.

Question 2

A startup founder asks you: "We've been live for 8 months, 22 enterprise customers, $180K ARR, 74% resolution rate. Our Series A investor wants us to expand into three new industry verticals to show TAM. Our CTO says we should spend the next 6 months deepening the current product before expanding. Who is right, and how do you think about this decision?"

Expert thinking

The CTO is right, but the founder needs to be able to explain why to the investor in a way that doesn't sound like defensiveness.

Why the CTO is right: 74% resolution rate with no documented eval corpus, no domain SOP capture, and no published case studies is not a foundation for vertical expansion. Expanding into three new verticals simultaneously means three new failure surfaces, three new evaluation requirements, and diluted engineering focus. The likely outcome: 65% resolution rate across four verticals, no reference customer strong enough to close the next deal in any of them.

The asset the founder is undervaluing: 22 enterprise customers and 8 months of production data is a domain corpus, if it's been captured properly. The path to a credible Series A story is not TAM expansion — it is showing that this product compounds. Resolution rate improving from 74% to 85%+ over the next 6 months, with a published case study showing ROI, is worth more than theoretical TAM across 4 verticals.

How to frame it for the investor: "We're going deep before we go wide. Here's the bet: we get to 90% resolution rate in this vertical in the next 6 months, publish the case study, fine-tune a smaller model that gets us to 65% gross margins, and then we have a replicable playbook for vertical expansion — not a theory. The TAM story is the same; the execution risk is dramatically lower."

This is a better Series A story than "we're in four verticals" with mediocre metrics in all of them.

Question 3

You're in a board meeting. A board member raises: "The EU AI Act's high-risk obligations began applying in August 2026. Two of our largest customers are in the EU. Our agent is classified as high-risk under Article 6. We don't have a conformity assessment, we don't have the 10-year documentation retention infrastructure, and we don't have a designated responsible AI officer. What's our timeline and what does it cost us?"

Expert thinking

This is a real situation for many agentic AI products in 2026, and the answer requires both honesty about the gap and a structured remediation plan.

Immediate risk assessment:

High-risk classification under the EU AI Act means: mandatory conformity assessment, technical documentation, human oversight measures, transparency obligations, and registration in the EU database before further deployment to EU customers.
Not having this is not a theoretical risk — it is current non-compliance for an in-production system.

Timeline (realistic):

Weeks 1–2: Legal review to confirm classification and identify specific requirements for this system
Weeks 3–8: Technical documentation (system architecture, training data provenance, accuracy metrics, known limitations) — expensive to write if not already captured, fast if you have it
Weeks 8–16: Conformity assessment (internal or third-party) — typically 6–12 weeks for a new assessment
Parallel: implement 10-year audit log retention, human override mechanism documentation, and incident reporting procedure

Cost estimate (order of magnitude):

External legal: $50–150K depending on jurisdiction complexity
Conformity assessment: $30–80K for a third-party assessment
Technical infrastructure (audit logs, documentation system): $20–50K in engineering time
Ongoing compliance officer allocation: 0.5–1 FTE equivalent
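
Summing the one-time ranges above gives the order of magnitude for the remediation budget. The sketch uses only the figures in the list and leaves out the ongoing compliance officer allocation, since that is a recurring cost rather than a one-time one.

```python
# One-time cost ranges in thousands of USD, taken from the list above.
one_time_costs_k = {
    "external_legal": (50, 150),
    "conformity_assessment": (30, 80),
    "technical_infrastructure": (20, 50),
}

low_k = sum(low for low, _ in one_time_costs_k.values())
high_k = sum(high for _, high in one_time_costs_k.values())

print(f"One-time remediation: ${low_k}K to ${high_k}K")
```

The result, $100K to $280K, lines up with a rounded $100–300K program figure.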

Board framing: "This is a 4–6 month, $100–300K remediation program. The alternative is losing our two largest EU customers and exposure to fines of up to €15M or 3% of global annual turnover for non-compliance with high-risk obligations (the €35M or 7% tier applies to prohibited practices). We should start immediately and pause new EU deployments until conformity assessment is complete."

Question 4

An interviewer asks: "Design an agentic product for a mid-size accounting firm that handles tax preparation for small business clients. Walk me through your wedge selection, ICP definition, architecture choices, and first 6-month success metrics."

Expert thinking

Wedge selection: The worst first wedge is "help accountants prepare taxes": too broad, too seasonal, unclear verifiability. The best first wedge is the most repetitive, verifiable sub-workflow. In tax prep, that is document extraction and classification: taking the pile of documents clients submit (W-2s, 1099s, receipts, bank statements), then classifying, extracting, and organizing them into the right categories in the tax software. Scoring by criterion: Frequency (high during tax season, weekly during bookkeeping months), Pain (2–4 hours per client on average), Verifiability (correct category + extracted value, checkable against the source document), Tool access (QuickBooks, tax software APIs available), Reversibility (miscategorized documents are easy to correct before filing). Overall score: 6/7.

ICP: Not "accounting firms" — that's too broad. Target: accounting firms with 10–50 accountants, primarily small business clients ($200K–$5M revenue), using QuickBooks + Lacerte or Drake tax software, with a documented client document intake process. The system environment matters: the moat lives in the QuickBooks + Lacerte data shape, not in a generic document classifier.

Architecture:

Tool layer: QuickBooks API, document storage (S3/GCS), tax software import API
Context: per-client document history (what they submitted last year, what categories were used)
Model routing: document classification → fine-tuned smaller model; anomaly detection → frontier model
Eval: every extracted value is verifiable against the source document; build golden dataset from first 50 clients
Human handoff: confidence below threshold triggers accountant review queue
Observability: extraction accuracy per document type, correction rate by category
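
The model routing and human handoff items above can be sketched in a few lines. Everything named here is hypothetical: the model identifiers, the 0.85 confidence threshold, and the Extraction type are placeholders for illustration, and the threshold in particular would need tuning against the golden dataset.

```python
from dataclasses import dataclass

SMALL_MODEL = "small-finetuned"   # hypothetical model identifiers
FRONTIER_MODEL = "frontier"
REVIEW_THRESHOLD = 0.85           # assumed cutoff; tune on eval data


@dataclass
class Extraction:
    doc_type: str
    field_name: str
    value: str
    confidence: float


def pick_model(task: str) -> str:
    """Route high-volume classification to the small model, anomalies to the frontier model."""
    routes = {"classify": SMALL_MODEL, "anomaly": FRONTIER_MODEL}
    if task not in routes:
        raise ValueError(f"unknown task: {task}")
    return routes[task]


def split_for_handoff(extractions, threshold: float = REVIEW_THRESHOLD):
    """Return (auto_accepted, accountant_review_queue) based on confidence."""
    auto, review = [], []
    for e in extractions:
        (review if e.confidence < threshold else auto).append(e)
    return auto, review


items = [
    Extraction("W-2", "wages", "52000.00", 0.97),
    Extraction("receipt", "total", "18.40", 0.61),
]
auto, review = split_for_handoff(items)
print(len(auto), "auto-accepted;", len(review), "sent to accountant review")
```

Keeping the routing table and the threshold as plain data makes both easy to change as the eval results come in.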

6-month metrics:

Document processing time: baseline → 30-minute target per client
Extraction accuracy: 95%+ on standard document types (W-2, 1099-NEC, 1099-DIV)
Accountant correction rate: below 8%
Gross margin: 55%+ (model right-sizing to smaller model after fine-tuning milestone)
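
Two of these metrics, extraction accuracy and accountant correction rate, can be computed directly from review logs. The sketch below assumes a hypothetical log format with one dictionary per extracted value; the key names are illustrative, and the 95% and 8% targets come from the list above.

```python
from collections import defaultdict


def extraction_metrics(records):
    """records: dicts with 'doc_type', 'extracted', 'golden', 'corrected' (bool)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    corrections = 0
    count = 0
    for r in records:
        count += 1
        totals[r["doc_type"]] += 1
        if r["extracted"] == r["golden"]:
            correct[r["doc_type"]] += 1
        if r["corrected"]:
            corrections += 1
    accuracy = {t: correct[t] / totals[t] for t in totals}
    correction_rate = corrections / count if count else 0.0
    return accuracy, correction_rate


def meets_targets(accuracy, correction_rate, min_accuracy=0.95, max_correction=0.08):
    """True only if every document type hits the accuracy target and corrections stay below 8%."""
    return (
        bool(accuracy)
        and all(a >= min_accuracy for a in accuracy.values())
        and correction_rate < max_correction
    )
```

Reporting accuracy per document type rather than as one blended number keeps a weak category, such as handwritten receipts, from hiding behind strong W-2 results.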

Question 5

Your agentic product has been running in production for 9 months. Resolution rate: 82%. Gross margin: 51%. Churn: 3%. Your engineering team wants to rebuild the architecture to add multi-agent coordination (a specialized agent for each workflow type). The argument: it will push resolution rate to 90%+. Evaluate the proposal.

Expert thinking

This is a case where the technically impressive approach is likely the wrong one.

The cost of multi-agent coordination: Adding multiple specialized agents to replace a single agent with tool calls introduces: (a) coordination overhead — each inter-agent handoff is a new failure mode, (b) debugging complexity — a failure in a multi-agent workflow is dramatically harder to trace than a failure in a single-agent workflow with tool calls, (c) context fragmentation — each specialized agent sees only a subset of the session context, increasing the risk of lost information at handoff boundaries.

The resolution rate argument: The claim that multi-agent coordination will push resolution rate from 82% to 90% is speculative. The more likely path to 90% is: (a) analyzing the 18% non-resolutions to identify the top 3 failure categories, (b) fixing them with targeted data and prompt work, (c) fine-tuning a smaller model on the domain-specific failure dataset. This path is lower risk, lower engineering cost, and produces a better cost structure as a byproduct.
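
Step (a) of that path is mostly a counting exercise once non-resolved sessions carry a failure label. A minimal sketch, assuming each session is logged as a dictionary with a 'resolved' flag and a 'failure_category' label; both key names and the example categories are hypothetical.

```python
from collections import Counter


def top_failure_categories(sessions, k=3):
    """Return the k most common failure categories as (category, count, share_of_failures)."""
    failures = [s for s in sessions if not s["resolved"]]
    if not failures:
        return []
    counts = Counter(s.get("failure_category") or "uncategorized" for s in failures)
    return [(cat, n, n / len(failures)) for cat, n in counts.most_common(k)]


# Illustrative data only: 82 resolved sessions and 18 failures.
sessions = (
    [{"resolved": True, "failure_category": None}] * 82
    + [{"resolved": False, "failure_category": "missing_tool"}] * 9
    + [{"resolved": False, "failure_category": "ambiguous_policy"}] * 5
    + [{"resolved": False, "failure_category": "stale_context"}] * 3
    + [{"resolved": False, "failure_category": None}] * 1
)
for category, count, share in top_failure_categories(sessions):
    print(f"{category}: {count} sessions ({share:.0%} of failures)")
```

If the top three categories cover most of the failures, targeted fixes are likely to move the resolution rate. If the failures are spread thinly across many categories, that is evidence worth bringing back to the architecture discussion.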

When multi-agent IS the right answer: Multi-agent coordination is justified when different workflow segments have fundamentally different context, tool sets, or model requirements that genuinely cannot be served by one agent — not when the single agent isn't performing as hoped. "It will get us to 90%" is not a valid architectural justification.

Recommendation: Reject the rebuild. Commit to a 60-day data-driven path to 90%: full failure analysis of the 18% non-resolutions, targeted fixes, eval measurement. If 90% is achievable by that path, you have avoided 3–6 months of architectural risk. If it isn't, you now have real data showing what the architectural constraint actually is.
