GTM, UX, and Governance
PLG lands the wedge; enterprise controls expand it — governance is not compliance overhead, it's a moat layer.
Distribution is not a function of your category label — it's a function of where work starts. Govern like your competitors are watching, because they are.
The Three Failures Nobody Diagnoses
Agentic AI products fail in market for three distinct reasons that have nothing to do with the model:
Wrong GTM motion. Sales-led enterprise GTM with no PLG self-serve creates 18-month cycles with no proof of value outside a controlled pilot. Customers cannot evaluate the product without deploying it — and they will not deploy without proof.
Wrong UX model. Interface built for feature demonstration rather than workflow integration. The agent lives in a destination app the user must remember to open. Work does not route through it by default.
Missing governance. Product shipped to regulated enterprise customers without audit trails, permission controls, or incident response procedures. The first production incident triggers contract review — or cancellation.
These failures compound each other. A weak GTM produces customers without enough usage to validate the wedge. A wrong UX surface produces usage that does not compound. Absent governance produces one incident that unwinds months of relationship-building.
All three are structural. None are fixable with tactical adjustments after launch.

Sources: Menlo Ventures 2025 AI Landscape, a16z AI GTM analysis, EU AI Act enforcement guidance, Anthropic 2026
The Market Has Shifted to Real Buying
The pilot-to-portfolio transition in enterprise AI is now measurable:
- 76% of enterprise AI use cases were purchased (not built internally) in 2025 — the build vs. buy debate has largely resolved for application-layer AI
- 47% of AI pilots converted to production — versus 25% for traditional enterprise software
- AI spending is moving out of innovation budgets into recurring IT and business-unit lines
- Enterprise leaders expect approximately 75% growth in LLM budgets over the next 12 months
This changes the GTM calculus. Products that are easy to evaluate, fast to deploy, and obviously better than DIY earn production conversion. Products that require long deployment cycles, heavy integration, or significant workflow redesign to show value do not convert at the same rate.
The bar is not "can we demonstrate value in a pilot." The bar is "can a buyer's team, without vendor assistance, reach a meaningful result in the first week."
PLG Is the Lead Motion — But Not the Only Motion
27% of AI application spend flows through PLG motions — nearly 4x the rate for traditional enterprise software. The pattern: individual users or small teams adopt the product, prove value through usage, and pull formal enterprise contracts forward.
Strong prosumer brands frequently generate enterprise demand without a traditional sales motion — users bring tools into organizations, usage proves value, IT formalizes the contract. Glean, Cursor, and GitHub Copilot all followed versions of this pattern.
This does not mean sales is optional. The modern motion is two-track: PLG to land, enterprise controls to expand.
The Land / Expand / Scale GTM Motion
| Motion | What it looks like | Who drives it | Entry criteria |
|---|---|---|---|
| Land (PLG) | Fast time to value, low-friction onboarding, narrow workflow wedge; individual or team starts without sales involvement | Product + self-serve | Wedge is demonstrable in <1 week; pricing transparent; no IT approval required |
| Expand (Enterprise) | SSO, RBAC, audit trails, analytics, budget controls, procurement readiness; formal contract replaces informal usage | Sales + solutions engineering | Existing usage in >1 team; champion inside IT or finance; clear ROI metric from PLG usage |
| Scale (Regulated) | Private deployment, security documentation, solution engineering, governance review, compliance certification | Enterprise sales + security | Industry compliance requirement (HIPAA, FedRAMP, EU AI Act); negotiated SLA and audit rights |
Think about it: A B2B SaaS company is launching an AI agent for financial analysts that can query internal databases, generate reports, and flag anomalies. They have a $30,000/year enterprise contract as their target. Should they launch PLG-first or enterprise sales-first? What information would you need to make this decision, and what would each choice require of the product?
Expert thinking
The answer depends on three factors — and the wrong default is almost always enterprise sales-first.
Factor 1: Can the value be demonstrated in a short, unassisted trial? Financial report generation and anomaly flagging can be experienced in a single session by an individual analyst without IT involvement — if the product has a sandbox environment with sample data or easy database connection. If yes, PLG is viable. If onboarding requires a data engineering team to build connectors before a single analyst sees value, PLG is not viable.
Factor 2: Who has the authority to start using it? Individual financial analysts at most companies cannot connect new tools to production databases without IT approval. But they can evaluate the product using sample data or a read-only connection. The PLG motion here is: analyst evaluates in sandbox → generates a proof-of-concept report → brings it to manager → IT approves the integration → enterprise contract follows. This is a three-step pull-forward, not self-serve with immediate deployment.
Factor 3: What is the discovery mechanism? At $30,000/year, the product is likely targeting mid-market or enterprise — which means individual analysts are unlikely to find it through organic search or app stores. Discovery is more likely through financial analyst communities, LinkedIn thought leadership, or content marketing. If discovery is community-driven, the PLG free tier is the proof asset for the community conversation.
The recommendation: Launch with a PLG sandbox tier (sample data, no IT approval required) that lets individual analysts generate sample reports. Use this as the qualification filter — analysts who generate 3+ reports in the sandbox are warm leads. Use sales to close the enterprise contract once the analyst brings the proof internally. This is the two-motion pattern applied correctly.
What you should NOT do: Build a 6-month enterprise pilot process as the primary evaluation mechanism, with no way for an individual analyst to verify the product's value independently. This is the 14-18 month sales cycle trap.
Self-assessment checklist:
- Did you identify that PLG requires a self-serve proof experience, not necessarily self-serve deployment?
- Did you distinguish between the analyst's evaluation and the IT team's deployment decision?
- Did you recognize that PLG and enterprise sales are not mutually exclusive — the question is sequencing?
Distribution Follows Workflow Surface
The best agentic products embed where work already begins. A separate destination app can work, but it requires the user to consciously redirect workflow to a new interface — which most users will not do consistently.
| Workflow | Native surface | Product | Moat |
|---|---|---|---|
| Coding | IDE / GitHub issue | GitHub coding agent, Cursor | Work starts in repo; agent is already there |
| Support | Ticket queue | Intercom Fin | Inbound ticket is the trigger; agent is already in the queue |
| Enterprise knowledge | Search bar / Slack | Glean | Knowledge lookup begins in search; agent retrieves and acts |
| Legal | Document environment | Harvey | Contract lives in SharePoint/DocuSign; agent reviews inline |
| Sales | CRM / email | Salesforce Agentforce | Activity starts in CRM; agent enriches and routes |
The implication for product architecture: the agent's primary interface should be the existing workflow tool — not a new SaaS product. This requires a different build strategy (integrations first, UI second) but produces structural distribution advantages that a destination app cannot replicate.
The destination app exception: a destination app can win when it becomes the new System of Action — when the workflow legitimately needs to be redesigned around the agent's capabilities. But this requires the product to be good enough that users want to change their workflow, not just add a tool.
Think about it: You're building an AI agent for procurement teams that negotiates vendor contracts. Your PM wants to build a dedicated web app with a clean interface. Your eng lead argues for building directly into the existing Slack and email workflow where procurement managers already live. What criteria determine which is right, and what would you need to validate your choice?
Expert thinking
The central question: does the workflow require a new primary artifact, or does it enhance an existing one?
The case for dedicated app: Contract negotiation requires a structured artifact — a document with tracked changes, version history, counterparty comments, approval workflows, and audit trail. None of this lives naturally in Slack or email. If the agent's output is a contract negotiation ledger (offer → counter → redline → approval), that artifact is better served by a purpose-built interface than by a Slack thread. The destination app exception applies here.
The case for Slack/email integration: The inputs to contract negotiation — "vendor X sent a revised MSA," "approve this redline," "what was the final price on the Acme deal" — all happen in email and Slack today. If the agent can intercept these trigger moments and provide its output in the existing context, procurement managers don't change their behavior to capture value. A Slack notification "I've reviewed the new MSA from Acme — 3 high-risk clauses flagged, here's the redline" is more likely to get used than "open ContractAgent.io to review your latest MSA."
The right answer for most procurement agents: Hybrid. Build a dedicated app as the primary artifact interface (contract workspace, version control, approval workflow) but distribute through existing surfaces (Slack trigger on new email attachment, email thread analysis on existing contract threads). The native app is for work; Slack/email is for discovery and notification.
What to validate: Put 5 procurement managers through a prototype session. Observe where they try to share the output of the agent — do they copy-paste into email, or do they want a link to the contract workspace? The answer tells you where the primary artifact lives.
Self-assessment checklist:
- Did you distinguish between the input surface (where work starts) and the output artifact (what the agent produces)?
- Did you identify that hybrid surface strategy is often the right answer for complex workflows?
- Did you propose a validation method rather than just theorizing?
Build vs. Buy — Where the Line Is
| Situation | Recommendation | Why |
|---|---|---|
| Workflow is common across industries, speed-to-value is priority | Buy packaged application | No differentiation in building; vendor continuous improvement compounds faster |
| Workflow is highly company-specific, data uniquely sensitive | Build internal | Domain advantage IS the moat — you need to own it |
| Infrastructure (orchestration, governance, eval tooling) is commodity | Buy or bundle | Competing on infrastructure is capital-intensive and rarely differentiating |
| Workflow logic on top of that infrastructure | Build | This is where differentiation lives — SOPs, routing logic, eval corpus |
Enterprise buyers increasingly purchase packaged AI applications when those applications continuously evaluate, optimize, and integrate faster than internal teams can. The build case is strongest when internal workflow data is the moat — not when the desire is simply to avoid vendor dependency.
UX — Designing for Human-Agent Collaboration
The Fundamental Shift
Traditional UIs are designed for human operation: navigational hierarchies, explicit commands, sequential steps. Agentic AI demands Intent-Driven UX — the user declares a desired outcome; the system orchestrates dynamically to achieve it.
The most important design principle: build around the native artifact of the workflow, not around the conversation.
| Workflow | Native artifact | Wrong design | Right design |
|---|---|---|---|
| Coding agent | Diff, test result, PR | Chat log of what the agent did | Reviewable diff with test status |
| Support agent | Resolved case, procedure execution | Chat log of conversation steps | Resolution summary with audit trail |
| Legal agent | Draft clause, citation-backed answer | Response text | Structured document with linked citations |
| Enterprise search | Sourced answer, retrieved evidence | Freeform text response | Answer with source attribution and confidence |
Users trust agents more when they can review the artifact that matters to their job — not the invisible reasoning that produced it.
Autonomy Controls Are Variable, Not Binary
Autonomy is not a binary on/off state. Production-grade products design for three operational modes:
- Watch Mode: User observes the agent navigating systems in real time. No actions taken without explicit approval. Best for onboarding and high-risk workflows.
- Assist Mode: Agent maps a plan and suggests next steps. Each step requires explicit approval before execution. Best for medium-risk workflows where user wants oversight without full manual execution.
- Autonomous Mode: Agent executes within defined budgetary and operational limits. Exceptions and confidence-below-threshold cases are escalated. Best for high-frequency, well-verified workflows.
The best feature for many agentic products is not more autonomy — it is better pause points.
Plans before execution, visible progress, reviewable diffs, simulation modes, session logs, replay capability, escalation paths, safe defaults, and reversible actions consistently appear in the best operator-grade products. These features calibrate trust without destroying the value of agency.
Think about it: You've deployed an accounts payable agent in Watch Mode for a 60-day pilot. The AP team loves the accuracy — 94% of invoices processed correctly. The CFO wants to move to Autonomous Mode immediately to eliminate the human review step. What's your recommendation, and what criteria would you use to validate that Autonomous Mode is safe to enable?
Expert thinking
94% accuracy sounds high — but the implications depend entirely on what the 6% failure rate means in practice.
The analysis the CFO is missing: A 6% failure rate on invoice processing has two very different risk profiles depending on the failure type. If 6% of failures are minor data extraction errors (wrong line item quantity) that are caught by the downstream approval workflow, Autonomous Mode may be safe. If 6% of failures result in incorrect payment amounts or payments to wrong vendors, Autonomous Mode at scale means hundreds of incorrect payments per month.
What to measure before enabling Autonomous Mode:
- Failure type distribution: Categorize the 6% failure cases by type and severity. How many are reversible (wrong GL code — fix in the accounting system) vs. irreversible (payment already processed to wrong account)?
- Failure rate at p90: What is the failure rate on the hardest 10% of invoices — unusual formats, split PO numbers, multi-currency transactions? Average failure rate hides tail risk.
- Rollback capability: Is the system capable of halting a payment that has been initiated but not settled? If yes, what is the window for reversal?
- Exception handling coverage: Does the agent have a defined escalation path for every category of ambiguous invoice? An agent that fails silently — processing an ambiguous invoice as best-guess — is more dangerous than one that fails loudly and escalates.
The recommendation: Move to Autonomous Mode for a defined invoice subset — invoices from known vendors with established PO patterns, below a dollar threshold (e.g., <$10,000), with verified bank account history. Retain Assist Mode for new vendors, high-value invoices, and unusual formats. This is tiered autonomy, not binary.
Measure the autonomous tier's failure rate and failure type for 30 days before expanding the scope. Present the CFO with a specific expansion schedule tied to measured outcomes, not a binary pilot/full-deployment decision.
Self-assessment checklist:
- Did you distinguish between failure rate and failure impact before evaluating the CFO's proposal?
- Did you propose a tiered autonomy approach rather than a binary decision?
- Did you specify measurable criteria for expanding autonomy scope rather than just deferring the decision?
Trust Is Built Through Micro-Interactions
Plan Mode: Display intended multi-step plan upfront. Users review, edit, and approve the full strategy before execution begins — oversight without sacrificing workflow velocity. Pioneered by Anthropic's Claude Code; now a standard feature expectation in enterprise agentic products.
Explainability on demand: Users can request specific clarifications — why this recommendation, what data was retrieved, what steps are planned next. Not a default data dump — on-demand.
Anticipatory interfaces: Predict user routines, pre-generate drafts, surface relevant actions exactly when needed. The agent initiates based on workflow state, not waiting for user prompt.
Rollback capability: Total confidence that if the agent alters systems erroneously, operators can instantly revert. Not a nice-to-have for enterprise — a procurement requirement.
Governance — Not Compliance Overhead, but Moat Layer
Minimum Governance Controls for Production
| Control | Implementation |
|---|---|
| Least privilege access | Agent inherits user-level permissions — never system-level; each tool call is scoped to what the authenticated user can do |
| Step-up approvals | Require explicit authentication for high-impact actions (write to production systems, send external communications, process financial transactions) |
| Immutable audit trail | Prompt + context + tool calls + arguments + outputs + human decisions — stored immutably; never overwrite |
| Sandbox / simulation | Preview destructive or irreversible actions before production execution; users see exact changes before approval |
| Controlled rollout | Feature flags, canary cohorts, kill switch that takes effect within <60 seconds |
| Incident playbook | Defined escalation path, rollback procedure, postmortem → eval dataset update process |
Why governance is a moat layer: in a market where raw model performance keeps moving, the company that can demonstrate eval discipline, permission boundaries, policy alignment, traceability, and realized business value in regulated environments will consistently beat the company with the flashier demo. Governance is a sales asset, not just a risk management function.
The Governance Maturity Model
| Level | State | Characteristics | What's missing |
|---|---|---|---|
| Ad Hoc | No governance | Agents deployed without policy documentation; no audit trail; no monitoring beyond uptime | Everything |
| Defined | Policy exists | Decision boundaries documented; basic logging in place; human escalation path defined | Enforcement and automation |
| Managed | Policy enforced | Automated guardrails; regular audits; permission-gated actions; budget controls active | Drift detection; adaptive policy |
| Optimized | Policy adaptive | Behavioral drift detection; continuous compliance monitoring; governance as code; incident → eval loop | (This is the target state) |

Most enterprise buyers in regulated industries are at Ad Hoc or Defined when they start agentic AI pilots. They will not approve production scale-up until the vendor can demonstrate Managed. Products that ship with Managed-level controls on day one close enterprise deals faster.
Think about it: You're a product manager at a startup that just closed its first enterprise deal with a healthcare company. During the security review, the customer's CISO asked for: (1) an immutable audit trail for every agent action, (2) a documented rollback procedure for any automated action, (3) evidence of how the system prevents prompt injection from patient data. You're currently at the "Defined" level of governance maturity. What would it take to get to "Managed" by the time the contract starts in 6 weeks?
Expert thinking
Six weeks is tight but feasible for the three specific controls requested — because these are engineering changes, not process changes.
Control 1: Immutable audit trail The core requirement is: every agent action (prompt, context, tool calls, arguments, outputs, human decisions) is written to an append-only store that cannot be overwritten.
Implementation path: switch the logging sink from a mutable database table to an append-only log (e.g., AWS S3 with Object Lock enabled, or an immutable audit log service). Add structured log emission at every tool call and decision point. Estimated effort: 1–2 engineers, 2 weeks.
The healthcare-specific addition: logs must include patient record identifiers (de-identified if necessary) that link each agent action to the context in which it occurred. This requires adding context metadata to every log event — roughly 1 additional week.
Control 2: Documented rollback procedure This is mostly documentation and light engineering. For every action the agent can take on a healthcare system (read, write, update), define the reversal procedure. Some actions are directly reversible (update a field — document the undo step). Some require human manual intervention (send a referral — contact the receiving provider). Some are irreversible by definition (send an external notification — document this as requiring human pre-approval).
Engineering: add a "dry run" mode that simulates the action and shows the diff before executing. Documentation: write the playbook for each action type, including who is responsible for executing reversals in each scenario. Estimated effort: 1 engineer + 1 PM week for documentation.
Control 3: Prompt injection prevention Healthcare data contains freeform text — clinical notes, patient messages, intake forms — that could contain adversarial instructions. The prevention mechanism: treat all patient-sourced text as untrusted data, separated from the system prompt and tool instructions by a defined trust boundary. Never interpolate patient text directly into system prompt positions.
Implementation: content isolation (agent receives patient data in a designated "data" field, never in the instruction path) + input sanitization (strip or escape instruction-like patterns before they reach model context) + output monitoring (flag any agent action that deviates from the expected decision tree for the current workflow state). Estimated effort: 1–2 engineers, 2–3 weeks.
Timeline: Run controls 1 and 2 in parallel. Start control 3 immediately as it has the longest testing cycle. By week 6: immutable audit trail operational, rollback playbook documented and tested in staging, prompt injection prevention validated against a test corpus of adversarial patient note examples.
Self-assessment checklist:
- Did you treat each control as a distinct engineering task with a specific implementation path?
- Did you identify that immutable audit trail and rollback documentation can run in parallel?
- Did you recognize that prompt injection in healthcare requires content isolation, not just input filtering?
The Regulatory Baseline
Three frameworks are immediately relevant for every product leader shipping agentic AI into enterprise:
EU AI Act: Full enforcement began August 2, 2026. Penalties up to €35M or 7% of global annual revenue. Mandatory transparency, risk assessments, and 10-year documentation retention for high-risk AI systems. Any agentic AI touching HR, credit, legal, healthcare, or critical infrastructure is likely classified as high-risk.
NIST AI RMF + GenAI Profile: Cross-sector risk management framework increasingly referenced in US enterprise procurement as a baseline evaluation standard. Four functions: Govern, Map, Measure, Manage. Not legally required, but appearing in enterprise RFPs as a maturity benchmark.
ISO/IEC 42001: International AI management system standard. Beginning to appear in enterprise procurement requirements as a certification baseline — particularly for European customers and global regulated industries.
Governance, safety, and trust are increasingly competitive differentiators. Certifications shorten enterprise sales cycles and justify premium pricing. They are not overhead — they are product investments.
Production Failure Modes That Kill Products
The most common agentic AI production failures are highly repeatable:
| Failure Mode | Root Cause | Prevention |
|---|---|---|
| Wrong wedge | Workflow fails verifiability or reversibility at scale | Score 7 dimensions before building; see Ch. 05 |
| Context pollution | Entire CRM dumped into context; Lost-in-the-Middle degradation | Span-level context metrics; active trim/summarize |
| Hallucinated tool arguments | Agent invents API parameters rather than failing loudly | Strict tool output tracing; schema validation |
| Recursive polling loops | Agent checks API status in tight loop vs. waiting for webhook | Step counters, timeout params, circuit breakers |
| Instruction drift | Attention to system prompt decays over long context windows | Context pinning — re-inject critical constraints adjacent to newest input |
| Prompt injection | Malicious instructions in processed data trick the model | Deterministic safety layer; content isolation; input sanitization |
| Multi-agent coordination failure | Overlapping responsibilities, conflicting state writes | Default to single-agent; add coordination only when justified |
| Unconstrained learning loops | Continuous feedback updates enable adversarial data poisoning | Hard bounds on learning; human review before weight updates |
What's next: Advanced Practice
The exercises below put these GTM, UX, and governance concepts under real pressure — scenarios where the safe-seeming choice leads to the structural failure, and the structural fix requires rethinking the product motion itself.
Advanced Applied Exercise preview: You've just been hired as Head of Product at a well-funded enterprise AI startup. The company has $8M ARR but is burning $2.2M/month, with a 14-month runway. The sales team is closing deals at 14-month average cycles. The PLG motion is an afterthought — a free trial that converts at 2%. The board wants a plan to cut burn and extend runway without cutting product investment. Where do you start?
Real-World Implementation preview: Intercom's Fin AI agent went from a support chatbot to the product that handles 67% of Intercom's own customer support volume — eliminating the equivalent of 40 support agent positions in 18 months. The critical design decision was not the model or the pricing. It was a single architectural choice made six months before launch about where the agent would live in the ticket workflow...
Interview Reasoning preview: A VP of Enterprise Sales says: "Governance requirements are slowing our deals. Customers keep asking for things we don't have yet — audit trails, rollback procedures, EU AI Act compliance documentation. Can we just tell them we're working on it?" Walk through how you'd respond, and what the right governance investment timeline looks like.
Subscribe to unlock the full advanced practice section.
Advanced Applied Exercises
Exercise 1
You've just been hired as Head of Product at a well-funded enterprise AI startup. The company has $8M ARR but is burning $2.2M/month, with 14 months of runway. The sales team is closing deals at 14-month average cycles. The PLG motion is an afterthought — a free trial that converts at 2%. The board wants a plan to extend runway without cutting product investment. You have 90 days.
Walk through your diagnostic and your 90-day plan.
Expert thinking
The core issue is a structural mismatch: a sales-led motion at a cost basis that requires a much shorter sales cycle to be sustainable. At $2.2M/month burn, the company needs approximately $26M ARR to break even. Getting from $8M to $26M on a 14-month sales cycle requires either a massive sales team expansion (which accelerates burn) or a fundamental change to the motion (which is the right answer).
Diagnostic (first 30 days):
-
Why is the free trial converting at 2%? Three possible root causes: (a) wrong audience for self-serve — enterprise procurement can't start on a free trial anyway; (b) time-to-value is too long — users don't reach an "aha" moment before the trial ends; (c) the free tier has the wrong feature set — missing one critical capability that would make it worth adopting. Understanding which of these is true determines the fix. Run 10 user interviews with free trial abandoners in the first two weeks.
-
What is the sales cycle bottleneck? 14 months is not the sales process — it is the decision-making process. Map the procurement journey: who is involved at each stage, what evidence they need at each stage, and where deals slow down. Most 14-month cycles have 2–3 months of actual selling and 10–12 months of waiting for procurement, legal, and budget cycles.
-
Where does individual usage exist inside enterprise accounts? If any existing enterprise customers have employees who use the product individually, map how they use it and whether that usage generates the kind of ROI evidence that procurement teams need.
90-day plan:
Weeks 1–4: Redefine the PLG tier with a specific ICP. Not "anyone can sign up" — define the role and workflow that can reach value in a single session without IT approval. If the product requires data connections to work, build a sandbox with representative sample data that lets the target user complete a real workflow without their own systems. Target: user can reach a meaningful result within 30 minutes of signing up.
Weeks 5–8: Add usage analytics visible to users. For every session, show the user what they accomplished and how long it took. Compare this to a baseline (industry average for the same task done manually). This data becomes the user's business case when they bring the product to their manager. Target: 25% of active free users have a shareable ROI summary within 45 days.
Weeks 9–12: Build enterprise triggers from individual usage. When a free user reaches a usage threshold (3 sessions, meaningful output generated), trigger a lightweight enterprise conversation — not a full sales process, but: "three people at your company are already using this; want to see how it works at the team level?" The individual usage signal is the qualifier; the enterprise conversation is the natural next step. Target: 15 enterprise inbound leads from PLG in the first 90 days.
What not to do: Cut the sales team. The sales team closes the deals that PLG identifies. The problem is not sales — it is the absence of the demand signal that makes sales efficient.
Self-assessment checklist:
- Did you diagnose the 2% free trial conversion before prescribing a fix?
- Did you identify that the 14-month sales cycle is a symptom (procurement process) not a cause?
- Did you design the PLG tier with a specific ICP and time-to-value target, not a generic "free trial"?
Exercise 2
You're the CPO of an enterprise AI startup that just closed its largest deal — a 3-year, $4.2M contract with a global insurance company for an AI agent that handles claims processing. Three months into deployment, your enterprise account manager reports: the customer is asking to add a new feature — the agent should be able to directly update the claim status in the insurance core system without a human approval step. Currently, the agent generates a recommended status update that a claims processor approves before the system is updated.
The customer argues that removing the human approval step will reduce processing time from 48 hours to 4 hours. Your engineering team says it's a 2-week change. Should you build it?
Expert thinking
This is a governance design question disguised as a feature request. The answer is not "yes, build it" or "no, don't build it" — the answer is "not yet, and here's the specific evidence required to unlock it."
What the customer is asking for: Fully autonomous mode on a consequential, partially irreversible action (claim status update). Once a claim status changes in the core insurance system, it triggers downstream effects — adjuster notifications, payment processing eligibility, customer communications. Some are reversible (status can be updated again); some create commitments (payment processing initiation) that are harder to reverse.
The governance analysis:
-
What is the current error rate on the agent's recommendations? If the agent generates a recommended status update that the human approver changes 8% of the time, removing the human approval step means 8% of claim status updates in the core system are wrong. At scale, this is thousands of incorrect status changes per month — each requiring investigation and correction.
-
What is the cost of an incorrect status update? An incorrect "approved" status on a fraudulent claim releases payment. An incorrect "denied" status on a legitimate claim triggers a regulatory complaint in most insurance jurisdictions. These are not equivalent to a wrong item in a shopping cart — they have legal and regulatory consequences.
-
Is there a partial autonomy path that captures most of the latency improvement? The 48-hour-to-4-hour improvement is not purely the human approval step. Break down the current 48-hour cycle: how much is human approval time, how much is queue wait time, how much is processing time? If queue wait time is 36 hours and human approval is 2 hours, removing the approval step doesn't get to 4 hours — it gets to 38 hours. The correct solution may be straight-through processing for a defined subset (claims below a dollar threshold with high-confidence scores above 98%) while retaining approval for the rest.
The recommendation: Build tiered autonomy, not full autonomy. Propose: claims with confidence score above 98% AND claim value below $5,000 process automatically. All other claims retain the approval step. Start with a 60-day pilot on the autonomous tier, measuring error rate and downstream correction rate. After 60 days, evaluate expanding the autonomous scope based on measured outcomes.
Present this as the governance-responsible path that gets the customer to most of the latency improvement without the regulatory risk of full autonomy on the first deployment.
What not to do: Build full autonomy in 2 weeks because the customer asked for it. The first incorrect automatic status change on a fraudulent claim that results in a payment is a contract review event — potentially a contract cancellation. The customer is asking for speed; your job is to give them speed in a way that doesn't create a liability.
Self-assessment checklist:
- Did you analyze what percentage of recommendations currently differ from human approval decisions?
- Did you identify that incorrect status updates have regulatory consequences specific to insurance?
- Did you propose tiered autonomy as the responsible path to the latency improvement rather than binary yes/no?
Exercise 3
A startup building an AI agent for small business accounting has grown to $3.2M ARR with 800 customers. Average contract value is $4,000/year. The product handles bank reconciliation, invoice categorization, and preliminary tax preparation. They have no formal governance — logging is minimal, there is no audit trail, and the rollback procedure is "call the support team."
The founder wants to expand to mid-market customers (50–200 employees, $15,000–$40,000 ACV). The first mid-market prospect's IT team asked for SOC 2 Type II certification. The founder is considering whether to invest in SOC 2 or to continue focusing on small business customers.
Build the framework for the founder's decision, and make a recommendation.
Expert thinking
This is a market positioning decision with a governance investment threshold question embedded in it. The framework has three components:
Component 1: Is the mid-market expansion genuinely a different ICP, or an extension of the current one?
Small business accounting (1–20 employees) and mid-market accounting (50–200 employees) have fundamentally different requirements:
- Small business: owner-operator who trusts the tool, low compliance overhead, willing to accept some error rate in exchange for cost savings
- Mid-market: finance team, controller, external auditors who review the books, potentially investor oversight, state and federal regulatory filings that must be auditable
This is not a pricing change — it is a product architecture change. Mid-market customers need the agent's actions to be auditable because their books are reviewed by third parties (accountants, investors, tax authorities) who were not present when the agent made decisions. The audit trail is not a nice-to-have — it is what makes the product usable in a mid-market context.
Component 2: What does SOC 2 Type II actually require, and is it the right starting point?
SOC 2 Type II certifies that a company has maintained specific security, availability, and confidentiality controls for a 6–12 month period. It requires:
- Formal security policies (access control, incident response, change management)
- Logging and monitoring infrastructure
- Annual penetration testing
- Employee security training and background checks
The certification timeline is typically 12–18 months: 6 months to implement controls, 6–12 months of audit period. Cost: $30,000–80,000 for the first year including audit fees and tooling.
For accounting software, the more relevant certification path may actually start with SOC 1 Type II (controls over financial reporting) rather than SOC 2. Many accounting software buyers ask for SOC 2 because it is what they know to ask for, but what they actually need is evidence that the system's outputs are reliable and auditable for financial reporting purposes. SOC 1 + immutable audit trail + documented reconciliation procedures may address the underlying requirement faster than SOC 2.
Component 3: What is the revenue case for the investment?
At $3.2M ARR with an 800-customer SMB base, the current average ACV is $4,000. Moving to mid-market at $15,000–$40,000 ACV means each new mid-market customer is worth 4–10 SMB customers. The SOC 2 investment becomes rational when: (number of mid-market customers won per year × incremental ACV vs. SMB) > (SOC 2 annual cost + opportunity cost).
At 10 mid-market customers per year at $25,000 ACV: $250,000 incremental ARR per year. At 20% churn: $200,000 net. SOC 2 cost: $60,000/year ongoing after certification. Net return: $140,000/year. Break-even: within 18 months of first mid-market revenue.
Recommendation:
Start with the governance infrastructure that mid-market customers actually require (immutable audit trail, permission controls, documented rollback procedures), which is a 4–6 week engineering investment. This is valuable regardless of SOC 2 and addresses the underlying requirement.
Simultaneously, begin the SOC 2 preparation process — it is a 12-month minimum commitment, so starting now means certification in approximately 18 months. Don't delay it, but don't gate the mid-market motion on it. Close the first mid-market deals with the audit trail + documentation package while the SOC 2 audit period runs.
Self-assessment checklist:
- Did you distinguish between what SOC 2 certifies and what mid-market accounting customers actually need?
- Did you calculate the revenue return on the SOC 2 investment with specific numbers?
- Did you identify a path to the first mid-market deals that doesn't require waiting 18 months for certification?
Real-World Implementation: Intercom Fin — From Support Chatbot to Support OS
Background: Intercom launched Fin AI Agent in 2023 as a resolution-focused support agent with a specific and unusual pricing model: $0.99 per resolved ticket. By 2025, Fin was handling 67% of Intercom's own customer support volume, with a 67% resolution rate across the 7,000+ teams using it.
The result is notable not because of the resolution rate — but because of how the product architecture made that rate achievable and the pricing model defensible.
The Architectural Choice
Intercom had a choice in how to integrate Fin into the ticket workflow. Option A: a separate AI chat interface that customers could reach alongside the existing support channel — a bot lane alongside the human lane. Option B: Fin sits in the ticket queue as the first responder, handling every inbound ticket before it routes to a human agent, with routing logic that determines which tickets to resolve autonomously and which to escalate.
Option A is faster to build and carries lower risk — if Fin fails, customers can still reach humans easily. Option B is the System of Action architecture: Fin is not a parallel option, it is the default first handler. Every ticket goes through Fin. The human agent only sees tickets that Fin has explicitly escalated.
Intercom chose Option B.
The reason: Option A produces a product that helps support teams. Option B produces a product that IS the support motion — with humans in a quality control and escalation role rather than a primary response role. The pricing model ($0.99 per resolved ticket) only works if Fin handles the full workflow, not if it handles an opt-in subset.
The Pricing Decision
The $0.99 per resolution model requires three conditions to work:
- Outcome observability: A resolution is unambiguous — the ticket is closed, the customer did not reopen it within 24 hours, and there is no escalation record.
- Automation rate: At 67% resolution rate, the unit economics require that the cost-to-serve 100 tickets (67 resolved, 33 escalated) is less than $66.33. If human handling cost per escalated ticket is $12, and agent cost per resolved ticket is $0.09 (compute + infrastructure), the math works: (67 × $0.09) + (33 × $12 cost not charged) = $6.03 in direct costs against $66.33 in revenue.
- Customer trust in the outcome metric: Intercom built a resolution reporting view that shows the customer the exact criteria used for each resolution — customer confirmed satisfied, no reopen within 24 hours, ticket not escalated. This transparency is what makes outcome pricing defensible.
The UX Architecture
Fin's UX makes the human-agent collaboration explicit rather than hidden. When Fin escalates a ticket, the human agent sees: the full conversation history with Fin, the escalation reason (confidence below threshold, outside defined scope, customer explicitly requested human), and Fin's recommended next steps. The human agent is not starting from scratch — they are reviewing and continuing Fin's work.
This design choice has two effects:
- Human agent productivity is higher on escalated tickets because they have full context.
- Human agents' corrections of Fin's reasoning become labeled training data automatically — the agent sees which recommendations the human accepted, modified, or rejected.
The correction data is the flywheel input. Every human override is a labeled example of a case where the agent's recommendation was wrong — which is exactly the training signal needed to improve the agent's handling of similar cases in the future.
Questions for self-reflection:
- Why did the $0.99/resolution pricing model require Option B (default first handler) rather than Option A (parallel channel)? What breaks economically with Option A?
- Fin's 67% resolution rate means 33% of tickets are escalated. What would you monitor to determine whether 67% is good, bad, or expected — and what would cause you to invest in improving it vs. accepting it as the natural ceiling?
- The resolution criteria (no reopen within 24 hours, no escalation) are observable but imperfect. What categories of "false resolution" does this criterion miss, and how would you detect them?
Interview-Style Reasoning Questions
Question 1
A founder pitches you on their enterprise AI product for corporate legal teams. They have 8 customers, $480K ARR, and a 22-month average sales cycle. They want to accelerate growth and have $3M of runway. When you ask about PLG, they say: "Our buyers are GCs and CLOs — they don't self-serve, they buy from sales reps they trust." Evaluate this claim and what you would do differently.
Expert thinking
The founder is conflating who makes the final purchase decision (GCs and CLOs, who do buy through relationships) with who generates the proof of value that enables that decision (individual attorneys and paralegals, who do self-serve). This is a common enterprise GTM mistake: optimizing for the decision-maker instead of the value-generator.
Is the claim true? Partially. GCs and CLOs do not sign up for free trials and self-serve to an enterprise contract. But they also do not buy something their teams have not already tested and validated. The question is not "do GCs self-serve?" — it is "how do GCs get the evidence they need to approve a purchase?"
In legal technology specifically: individual associates evaluate tools on their actual work, share results with partners, and partners bring them to GC attention. Harvey's growth pattern — 42% of Am Law 100 firms — was not driven by GC cold outreach. It was driven by associates using the product, partners seeing the output quality, and management formalizing the adoption.
The PLG motion for legal AI:
- Free tier for individual attorneys: limited queries per month, no firm-wide data access, no SSO required
- Result sharing: attorneys can share research outputs (with citations) as links to colleagues — this is the natural share mechanism for legal work product
- Team analytics: once 3+ attorneys at the same firm are using the product, surface a "your firm" usage summary that the managing partner or GC can see
- Enterprise triggers: usage above a threshold (10 attorneys, 50 queries/month) triggers an automatic enterprise conversation with the firm's administrator
The 22-month sales cycle diagnosis: Almost certainly caused by: (1) no usage data from individual attorneys to accelerate procurement trust; (2) security and compliance review that takes 6+ months without SOC 2 / confidentiality documentation; (3) firm IT approval required before individuals can even test.
The fix: unblock individual evaluation first (sandbox with anonymized case examples), parallelize the compliance documentation (SOC 2 preparation, attorney-client privilege considerations for data handling), and let the individual usage pull the enterprise procurement forward.
What I would not do: Add more enterprise sales reps. At 22-month sales cycles and $3M runway, more sales reps extend runway by 0 months and add burn. The only lever that shortens the sales cycle structurally is individual usage data that the procurement team trusts more than vendor claims.
Question 2
Your VP of Engineering presents a new feature: "Agent-Led Outreach" — the AI agent proactively identifies high-value prospects from the CRM, drafts personalized outreach emails, and sends them automatically on behalf of the sales team. No human approval required before send. The VP argues this will 10x the outreach volume. What is your response?
Expert thinking
This is the "invisible autonomy on high-risk systems" anti-pattern from Chapter 05, applied to external communications. The answer is not "never build this" — the answer is "not without specific governance architecture that the VP hasn't described."
Why this is high risk: External communications are irreversible by definition — once an email is sent, the recipient has it. The failure modes are serious: (1) personalization errors that reference incorrect information about the prospect create reputational damage and signal poor data quality; (2) outreach to contacts who have unsubscribed or requested no contact creates legal liability (CAN-SPAM, GDPR); (3) outreach to contacts in active deals where the sales team has established a specific communication strategy can derail the deal.
The 10x volume claim is correct — but 10x outreach volume with a 5% error rate is also 5x the outreach errors. The question is not the volume multiplier; it is the error multiplier.
The governance architecture required:
-
Send queue with review window: Drafted emails are queued for 4 hours before sending. The sales rep receives a notification with a preview and a one-click approve/defer/cancel action. If no action is taken in 4 hours, the default behavior (send vs. hold) should be configurable — but for new contacts, the default should be hold.
-
Suppression list integration: Before any email is drafted, check the contact against: unsubscribed list, active deal contacts, blocked domains, and contacts flagged as "relationship managed directly." If any flag is set, skip the contact and log why.
-
Personalization validation: The personalization fields (company name, recent news, relevant products) should be validated against the CRM record before they appear in the draft. If the CRM record is incomplete or the personalization confidence is below threshold, the draft should flag the uncertain fields rather than hallucinating data.
-
Audit trail: Every AI-drafted email, with the CRM data that drove the personalization, should be logged immutably. If a recipient complains about incorrect information, the audit trail shows exactly what the agent knew when it drafted the email.
The version I would approve: Draft → review → send, with the 4-hour review window and suppression list integration. Not automatic send. The marginal productivity loss of a 4-hour window is small; the reputational and legal risk of fully autonomous external outreach is not.
Question 3
You're interviewing for a Chief Product Officer role at an enterprise AI company. The hiring CEO says: "We just lost two deals in the last quarter because competitors had SOC 2 Type II and we didn't. I want to make governance a competitive advantage. How would you approach this?" Give your answer.
Expert thinking
The CEO's framing is correct — governance as competitive advantage — but the two lost deals to SOC 2 are a symptom of a broader gap, not the root problem. SOC 2 is an output of a governance discipline; building the discipline produces SOC 2 as one artifact among several.
My answer to the CEO:
"Two deals lost to SOC 2 is telling us that our customers have reached the procurement stage where they're doing security reviews, and we can't pass them. That's actually a good sign about deal quality — these are serious buyers. SOC 2 is the right investment, but it's not the only investment we need to make in the next 12 months.
Here's how I'd think about it:
Immediate (0–3 months): Close the gaps that are losing deals today. SOC 2 takes 12–18 months to certify. But security questionnaires are won or lost on the underlying controls, not the certification alone. I'd audit which specific questions on those security questionnaires we're answering 'no' or 'in progress' to, and fix the ones that can be closed in 90 days: immutable audit logging, access control documentation, incident response playbook, penetration test completion. These are also the SOC 2 prerequisites — we're not wasting work.
Medium term (3–12 months): Build governance into the product, not just the company. Governance controls embedded in the product — audit trail, permission scoping, sandbox mode, rollback capability — are a sales asset we can demo. When a CISO asks 'what happens if your agent takes a wrong action,' we show them the rollback capability, not describe it. Product-level governance is more persuasive than policy documentation.
Ongoing: Use the governance maturity model as a sales qualification tool. Most of our enterprise prospects are at 'Ad Hoc' or 'Defined' governance maturity. We're building toward 'Managed.' The gap between where they are and where they need to be for their own compliance obligations is a reason to choose us — we're helping them advance their governance maturity, not just selling a product.
SOC 2 certification by month 15 is a clear milestone. But I'd measure success earlier by: number of security questionnaires where we're scoring 'yes' on all major controls, and deal velocity improvement in regulated verticals."
Question 4
A growth-stage AI company ($18M ARR, 60 enterprise customers) is debating whether to build a mobile app for their enterprise knowledge management agent. The argument for mobile: "Our users are checking Slack on their phones — why not Glean?" The argument against: "Mobile enterprise search is a graveyard — Google tried and failed, Microsoft tried and failed." Evaluate both arguments and make a recommendation.
Expert thinking
Both arguments are partially correct, and the decision turns on a question neither side has asked: where does the user need the agent's output, and is mobile the surface for that need or just a discovery channel?
Evaluating the "for mobile" argument: The argument is valid as a distribution claim — users are on Slack on mobile, so they're already in a workflow context where agent output would be useful. But Slack is the notification surface. The question is whether the action the user needs to take on finding a relevant document requires a mobile-optimized interface, or whether the notification ("here's the answer to your question") is sufficient to render in Slack without a native app.
For knowledge retrieval use cases (find the relevant document, surface the answer to a question), a mobile-optimized result view within the existing Slack integration may capture 80% of the mobile value without a native app.
Evaluating the "against mobile" argument: The Microsoft and Google failures in enterprise search are instructive: they failed because enterprise search on mobile produces the same results as enterprise search on desktop, but on a smaller screen with less precise interaction and less context about what the user is currently working on. Mobile enterprise search failed because it was desktop search on mobile — not because mobile is wrong for the product category.
A modern agent with awareness of the user's current context (active meetings, recent Slack activity, recent documents opened) can provide a meaningfully better mobile experience than desktop search simply ported to mobile. The failure mode is "desktop search on mobile." The success mode is "what do you need right now, given what I know you're doing."
The recommendation: Don't build a native mobile app as the first investment. Instead, optimize the Slack integration for mobile consumption (result formatting that renders well on mobile, answer-style responses that don't require clicking through to a full search interface, action buttons that work on mobile). Instrument mobile usage of the existing product to determine: how many users are accessing Glean on mobile, what queries they're running, and whether the current desktop-first interface creates frustration on mobile.
Build a native mobile app only after validating that mobile usage is a significant share of total usage AND that the Slack integration cannot adequately address the mobile use case. The development cost and maintenance overhead of a native mobile app is significant; validate the demand before committing to it.
Question 5
You are three months into a new CPO role at an enterprise AI company. Your sales team tells you that three major enterprise deals stalled because the product doesn't support "Agent-Led Growth" — the ability for the AI agent to identify internal champions and automatically nudge them toward expansion. Your CTO says this is a 4-week engineering build. Your VP of Legal says it raises GDPR issues. Your VP of Sales says it's table stakes. How do you make the decision?
Expert thinking
This is a governance, legal, and product strategy question that is being framed as a feature prioritization question. The VP of Sales characterization ("table stakes") and the CTO characterization ("4-week build") are both potentially correct and both miss the more important question: what exactly is the feature doing, and what are the legal implications of each version of it?
First, decompose what "Agent-Led Growth" means in practice: There are multiple versions of this feature with very different legal profiles:
Version A: The agent identifies users who are not currently using certain features and sends them an in-product nudge (a contextual suggestion: "You might also want to try X — here's how it works"). This is a personalization feature. It uses data generated within the product. It is not a communication to a third party. GDPR implications are minimal for users who have consented to the product's data processing.
Version B: The agent analyzes user behavior to identify potential internal champions and notifies the sales team (the vendor's sales team) that contact X at company Y should be targeted for expansion. This involves the vendor processing the customer's users' behavioral data to generate sales intelligence for the vendor. This is a more significant GDPR implication — the customer has to consent to their users' behavioral data being used by the vendor for the vendor's own commercial purposes.
Version C: The agent automatically sends expansion outreach to internal users at the enterprise customer, without those users having opted into receiving such communications. This is the highest legal risk version and almost certainly the one the VP of Legal flagged.
The decision:
Build Version A immediately — it is a personalization feature with minimal legal complexity and directly addresses the product usage gap. If users are not using features that would make them champions, in-product education and contextual nudges address that.
Build Version B with explicit customer consent in the contract. Some customers will grant the vendor permission to receive champion identification signals; many will not. Make it an opt-in analytics feature with a clear data usage description in the contract.
Do not build Version C without a legal review that confirms it is compliant with GDPR, CAN-SPAM, and the customer's employee communication policies.
How to respond to the VP of Sales: The three stalled deals need decomposition. What specifically did the prospects ask for — is it in-product expansion nudges (Version A) or vendor-side champion identification (Version B)? Build Version A, close those deals with it, and determine whether Version B with opt-in is needed.
How to work with the VP of Legal: Ask for a written opinion on each version with timeline. "Raises GDPR issues" is not an answer; "Version A is compliant, Version B requires customer consent in the DPA, Version C is not viable" is an answer. The legal team needs to give you a specific ruling, not a flag.
Unlock Premium Access to access this content.
This chapter has 4 premium workbook exercises. Unlock Premium Access to practice and compare with expert reasoning.
$49 one-time — lifetime access