Wedge, ICP, and Architecture
Score the workflow across 7 dimensions before writing a line of code — the wedge determines everything downstream.
The wedge is not a marketing decision. It is an architectural decision that locks in your ICP, your moat, and your cost structure before you write the first line of code.
Why the Wedge Comes First
Most agentic AI products fail before they ship. Not because the model is bad, not because the engineering is wrong, but because the team picked the wrong first workflow.
Here is how it plays out: a workflow is selected because it demos impressively. A few early customers sign on the strength of the demo. Engineering builds for three months. Then in production — with real users, real data, real edge cases — the cracks appear. "Success" turns out to mean different things to different stakeholders. The agent produces outputs that can't be easily verified. High-stakes actions can't be easily reversed. The ICP turns out to be two different types of companies that need completely different context to work well.
By that point, the wrong choice is embedded in the codebase, the customer contracts, and the product identity. Pivoting is expensive. Staying is worse.
The 7-dimension scoring rubric, ICP definition, and architecture layer model in this chapter exist to expose the wrong choice while it is still cheap to change, not after it has been built. Spending a week scoring candidate workflows before building is not excessive caution. It is the highest-leverage investment you will make in the product.

Sources: Anthropic 2026 Building Effective Agents, a16z enterprise AI playbook, enterprise production post-mortems
The Platform-First Trap
There is a recurring failure mode in agentic AI product strategy that is worth naming explicitly: platform-first thinking.
Platform-first assumes that if you build a broad, capable agent surface, the right wedge will emerge from usage data. Build broadly, learn from customers, then double down on what works.
The problem is that "broad" produces shallow. Usage spreads across many workflows but runs deep in none. No single workflow generates enough labeled data to build a good evaluation corpus. No single workflow generates enough failure examples to fine-tune a smaller, cheaper model. No single workflow becomes mission-critical enough to justify the switching cost that protects renewals.
The product that is "useful for many things" is in a weaker position than the product that is "essential for one thing." Customers don't renew tools they find useful. They renew infrastructure they depend on.
Wedge-first means: pick one workflow, own it completely, achieve demonstrable reliability there, and let the moat grow from that depth. Expansion comes from a position of established reliability in the first workflow — not from hedging across many simultaneously.
The System of Action test is the fastest way to evaluate whether a candidate wedge has the right shape. Does the product shorten or automate a real job path? Ticket → resolution. Issue → pull request. Contract → first draft. Fragmented knowledge → actionable answer. If the honest answer is "no — the agent produces helpful content but a human still completes the actual work," the product is a chatbox with better UI. That is a real product, but it is not a System of Action, and it will be priced and compete accordingly.
Strong PMF signals — customers scripting around your UI to build automations, usage that bypasses the interface and embeds in a workflow, work routed through the product by default — are all evidence that the product has crossed from "useful tool" to "infrastructure." That crossing is what the wedge selection is trying to accelerate.
Think about it: Pick an AI product you're currently building, evaluating, or familiar with. Apply the System of Action test: does it shorten or automate a real job path, or does a human still complete the actual work after interacting with the product? If it's currently a chatbox with better UI, what specific change (in scope, integration, or permissions) would move it across the line?
Expert thinking
The honest answer for most AI products in 2024–2026 is: they are still on the "useful tool" side of the line. The agent drafts; the human sends. The agent suggests the code change; the developer applies it. The agent generates the analysis; the analyst presents it.
This is not a failure — it is a product positioning choice with downstream consequences. A product on the tool side of the line is priced like software (seats, usage tiers) and competes on UX and model quality. A product on the System of Action side is priced like labor (per outcome) and competes on reliability and workflow depth.
The change required to cross the line is almost always one of three things: (a) adding write permissions so the agent can execute, not just recommend; (b) adding an explicit approval gate that the human controls, so execution is the default and human intervention is the exception rather than vice versa; or (c) embedding the agent directly in the workflow surface where the action already happens (ticket queue, IDE, CRM) so users don't need to redirect their work.
The trap to avoid: assuming the line can be crossed gradually without a deliberate product decision. Most products stay on the tool side indefinitely unless there is an explicit decision to build toward outcome ownership.
Self-assessment checklist:
- Did you identify which step in the workflow still requires human action before the outcome is delivered?
- Did you name the specific change (write permission, approval gate, or workflow embedding) that would move the product across the line?
- Did you note whether the pricing model currently reflects a "tool" or a "System of Action" — and whether those are consistent?
The 7-Dimension Scoring Rubric
Before writing a line of code, every candidate workflow should be scored across seven dimensions. This is not a bureaucratic exercise. It is the fastest way to surface the specific failing dimensions that will cause the demo to lie about production performance.
Frequency: Does the workflow occur daily or weekly? Low-frequency workflows accumulate data slowly. Without enough examples, you cannot build an evaluation corpus, cannot fine-tune a smaller model, and cannot detect behavioral drift. Monthly workflows are bad first wedges regardless of how much pain they cause.
Pain: Is the cost — in time or dollars — explicit and already budgeted? High-pain workflows have a natural budget line. The ROI conversation is easy. Low-pain workflows require the customer to create a budget where none existed, which is a sales motion that stalls at procurement.
Verifiability: Can success be scored quickly — by automated rules, tests, or a fast human reviewer within the same work cycle? This is the dimension that kills the most promising-looking demos. A workflow that requires a domain expert three days later to assess quality cannot be improved, cannot be priced by outcome, and cannot power a data flywheel.
Tool and data access: Does the agent have access to the context, APIs, and permissions it needs to act? An agent with incomplete visibility will hallucinate the missing context. This is not a model capability problem — it is a data access problem. Solving it means integration work before the first line of agent code.
Authority to act: Can the agent execute actions, or can it only recommend? An agent that produces recommendations for a human to carry out is an expensive search engine. The labor replacement value — which is the core value proposition — requires the agent to own the execution, not just the suggestion.
Latency tolerance: Is the result needed in seconds, minutes, or hours? Sub-two-second latency requirements constrain model choice to smaller, faster models or heavily cached architectures. A workflow that tolerates five-minute processing times has a fundamentally different architecture than one that requires real-time responses.
Reversibility: If the agent is wrong, can the action be undone easily? Irreversibility combined with low verifiability is the most dangerous combination in agentic AI. An agent that takes hard-to-reverse actions and whose correctness is hard to verify quickly is a trust destruction machine. If a workflow fails here, it is not a matter of adding more safeguards — it is the wrong first wedge.
Scoring: 5–7 of 7 means build with confidence. 3–4 means identify the failing dimensions and address them before starting. Below 3 means find a different workflow entirely. The two dimensions to score first are always verifiability and reversibility — they kill more production deployments than any other combination.
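The rubric and its thresholds can be sketched as a small function. This is a minimal illustration, not a prescribed tool: the dimension keys, the 1 / 0.5 / 0 scale for pass / partial / fail, the choice to treat a fractional total such as 4.5 as "fix first", and the outright veto when both verifiability and reversibility fail are all assumptions layered on top of the text above.

```python
DIMENSIONS = (
    "frequency", "pain", "verifiability", "tool_data_access",
    "authority_to_act", "latency_tolerance", "reversibility",
)

def score_workflow(scores: dict[str, float]) -> tuple[float, str, list[str]]:
    """Return (total, verdict, failing_dimensions) for one candidate workflow.

    Assumed scale: 1 = pass, 0.5 = partial, 0 = fail.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    total = sum(scores[d] for d in DIMENSIONS)
    failing = [d for d in DIMENSIONS if scores[d] < 1]
    # Assumption: failing BOTH verifiability and reversibility vetoes the
    # wedge regardless of the total, per the "most dangerous combination".
    if scores["verifiability"] == 0 and scores["reversibility"] == 0:
        return total, "find a different workflow", failing
    if total >= 5:
        verdict = "build"
    elif total >= 3:
        verdict = "fix failing dimensions first"
    else:
        verdict = "find a different workflow"
    return total, verdict, failing

# Hypothetical candidate: strong everywhere except two dimensions.
candidate = dict.fromkeys(DIMENSIONS, 1.0)
candidate.update(verifiability=0.5, tool_data_access=0.0)
print(score_workflow(candidate))
```

Returning the failing dimensions alongside the verdict keeps the output actionable: the total alone hides which gaps need work before building.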
Think about it: A healthcare startup wants to build an agent that automatically schedules follow-up appointments for patients after a physician visit, based on the visit notes and the patient's insurance coverage. Score this workflow against all 7 dimensions. Which dimensions pass cleanly? Which fail or need more information? What would you need to change or add before recommending they build it?
Expert thinking
This is a workflow with a genuinely mixed score — interesting because it has a compelling use case but several structural weaknesses that need to be addressed before building.
Frequency: Pass. Post-visit scheduling happens at high volume in any clinic or health system.
Pain: Pass. Scheduling staff time is expensive and patients frequently fall through the cracks. Budget line exists.
Verifiability: Partial. "Was the appointment scheduled correctly?" is verifiable. "Was the right type of appointment scheduled given the physician's intent?" is not automatically verifiable — it requires clinical review. Distinguish between scheduling correctness (machine-checkable) and clinical appropriateness (requires human or clinical-rule validation).
Tool/data access: Needs investigation. The agent needs access to the EHR (visit notes), the scheduling system, and insurance eligibility. These APIs exist but often require months of integration work and HIPAA-compliant data handling. Not a dealbreaker, but scope this integration cost before building.
Authority to act: Depends on product design. If the agent books and sends a calendar invite without confirmation — this passes but creates risk. If the agent books pending patient confirmation — this reduces authority but adds a necessary safety step. Recommend the latter for an initial wedge.
Latency tolerance: Pass. Post-visit scheduling can happen asynchronously, within hours.
Reversibility: Partially fail. A wrongly scheduled appointment can be cancelled, but in healthcare workflows wrong appointments cause patient confusion, wasted slot time, and in some cases delayed care. Not catastrophically irreversible, but higher stakes than typical scheduling software.
Recommendation: Score is 4.5–5/7. Proceed with two constraints: (a) implement a patient confirmation step before appointments are finalized — this improves reversibility and reduces trust risk; (b) build clinical rule validation for appointment type before touching insurance logic — start with a narrow, verifiable sub-workflow (e.g., standard follow-up at a fixed interval) before expanding to complex cases.
Self-assessment checklist:
- Did you score all 7 dimensions separately rather than giving a single pass/fail?
- Did you identify verifiability as a split dimension (scheduling correctness vs. clinical appropriateness)?
- Did you propose a narrower initial scope (confirmation step, fixed-interval appointments) to address the failing dimensions?
Your ICP Is Probably Too Broad
Classic SaaS ICP work targets an industry and a job title. "Mid-market financial services companies, VP of Operations." That works when the product is software that humans use. It breaks down when the product is an agent that acts.
For an agentic product, two companies in the same industry with different tool stacks are functionally different ICPs. A support agent built for Zendesk behaves differently from one built for ServiceNow — different data schemas, different escalation path APIs, different authentication models. The evaluation dataset built for one does not transfer to the other. The SOP knowledge captured for one does not apply to the other.
The right framing is role + workflow + system environment.
A strong early ICP looks like: support agents handling billing dispute tickets in a Zendesk + Stripe environment. Not "support teams at mid-market SaaS companies." The specificity of the system environment is what enables reliable performance and what generates the labeled data that compounds into a moat.
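One way to make the role + workflow + system environment framing concrete is to model the ICP as a value type, so that two customers in the same industry with different tool stacks compare as different ICPs. The class and field names below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentICP:
    role: str
    workflow: str
    systems: frozenset[str]  # the system environment the agent acts in

zendesk_icp = AgentICP(
    role="support agent",
    workflow="billing dispute tickets",
    systems=frozenset({"Zendesk", "Stripe"}),
)
servicenow_icp = AgentICP(
    role="support agent",
    workflow="billing dispute tickets",
    systems=frozenset({"ServiceNow", "Stripe"}),
)

def shared_systems(a: AgentICP, b: AgentICP) -> frozenset[str]:
    """Integrations that could be reused when moving from one ICP to the other."""
    return a.systems & b.systems

print(zendesk_icp == servicenow_icp)                # same role and workflow, different ICP
print(sorted(shared_systems(zendesk_icp, servicenow_icp)))
```

Treating the system environment as part of the identity, rather than as an attribute of a broader segment, is what forces the integration and evaluation-data questions to be asked per environment.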
Strong early ICPs from production: platform teams triaging infrastructure incidents in PagerDuty + Datadog + Confluence; legal ops analysts reviewing MSAs in DocuSign + SharePoint; enterprise knowledge workers searching across Confluence + Slack + Notion. In every case, the specificity is not a limitation — it is the foundation.
The ICP expansion pattern: dominate one role + workflow + system environment, build a reference customer with published metrics, then expand to adjacent environments. Harvey expanded from 6 to 60+ jurisdictions not by starting over but by reusing the same evaluation infrastructure and corpus architecture in each new jurisdiction. The depth of the first environment funded the expansion.
Think about it: Your company is building an AI agent for HR operations teams. Your current ICP is "HR teams at companies with 500–5,000 employees." Using the role + workflow + system environment framework, rewrite this ICP to be specific enough to be actionable for an agentic product. What information do you need that you probably don't have yet?
Expert thinking
"HR teams at companies with 500–5,000 employees" is a valid SaaS ICP. It is too broad for an agentic product because it says nothing about what the agent will actually do or what environment it will operate in.
A workable rewrite: HR operations specialists running new employee onboarding workflows in Workday + DocuSign + Okta environments, at companies with 500–2,000 employees where onboarding volume is 20+ new hires per month.
What changed and why:
- Role → HR operations specialist (not "HR team" — different roles have different workflows, different permissions, different tools)
- Workflow → new employee onboarding (not all HR workflows — one specific, high-frequency, verifiable workflow)
- System environment → Workday + DocuSign + Okta (the agent's performance depends on these integrations; building for Workday first and adding BambooHR later is a scope decision, not an expansion decision)
- Volume qualifier → 20+ hires/month (below this, frequency is too low to build a good improvement loop)
What you still need to find out: What does "completed onboarding" look like to this customer? Is it document signing completion (verifiable)? IT provisioning confirmation (verifiable)? Manager confirmation that the employee is fully set up (not automatically verifiable)? The success criterion determines whether outcome pricing is feasible.
Self-assessment checklist:
- Did your rewrite include all three components: role, workflow, and system environment?
- Did you add a volume qualifier that ensures sufficient data frequency?
- Did you identify what additional information is needed to define a verifiable success criterion?
Horizontal vs. Vertical — Where Moats Compound
The horizontal-vs-vertical decision is not a TAM decision. It is a moat decision.
Horizontal agents serve as broad utility layers across domains — enterprise search, general workflow orchestration. The TAM is large. But without domain-specific intelligence, horizontal products cannot execute mission-critical tasks reliably. "Can handle anything" becomes "not good enough at anything that matters." The foundational labs build horizontal capability by default. Competing there means competing against OpenAI, Anthropic, and Google — with their infrastructure and training budgets.
Vertical agents are purpose-built for specific functions or regulated industries. They embed domain-specific routing, regulatory logic, and specialized evaluation directly into the workflow. Vertical is where moats compound — because the domain corpus, the SOP library, and the eval dataset are hard to replicate without being in the workflow.
The current enterprise data is clear: the highest value is captured by vertical agents. Horizontal models handle routing and general assistance; vertical agents own execution for the workflows that matter most.
Takers, Shapers, and Makers is a useful frame for where your company sits relative to the AI stack:
A Taker consumes public APIs and SaaS tools as-is. Every competitor has access to the same capability. There is no differentiation, and the foundational labs can replicate the product in months. This is a valid starting point — not a viable end state.
A Shaper takes a foundation model and wraps it with proprietary workflow data, domain SOPs, and a labeled evaluation corpus. This is the right position for the vast majority of product companies. The moat lives not in the model but in the domain knowledge that makes the model perform reliably on specific, real-world tasks.
A Maker trains proprietary foundation models from scratch. This requires capital at the scale of OpenAI, Anthropic, or Google DeepMind. It is only viable if AI capability is literally the product — not the workflow built on top of it.
The practical implication for Shapers: stop evaluating AI vendors on benchmark performance. Evaluate them on how well their infrastructure surfaces your proprietary workflow data, domain knowledge, and SOPs to the model at inference time. That is where your differentiation lives.
The Six Architecture Layers
The demo works with a good prompt and a frontier model. Production reliability requires something else: a control surface that prevents the model from being wrong in expensive ways.
Six layers determine whether an agentic product is reliable in production:

Layer 1 — Tool layer: Reliable, schema-validated tool contracts. Typed APIs. Explicit error handling for malformed calls. Vague tool descriptions are one of the most common causes of production failure — the model invents arguments rather than failing loudly, and the corrupt call propagates into integrated systems.
Layer 2 — Context and memory: What the agent knows, how long it knows it, and when to compress. Context is finite. Long-running agents without active context management drift from their instructions — critical constraints get pushed out of the window, goals are forgotten, tool calls become increasingly off-target. Good products trim, summarize, compress, and pin critical instructions adjacent to current input.
Layer 3 — Model routing: Right-sizing models per task complexity. Not every step in an agent workflow requires a frontier model. Classification, extraction, and simple tool calls can run on fine-tuned 8B or 13B models at a fraction of the cost. Routing the right task to the right model is both a cost optimization and a reliability optimization — smaller, specialized models often outperform frontier models on narrow, well-defined sub-tasks.
Layer 4 — Evaluation: A golden test set, regression detection, and behavioral monitoring. This is the layer most teams skip entirely. Without it, you cannot detect when a model update changes agent behavior, cannot measure whether a prompt change improved or degraded performance, and cannot build the labeled dataset that enables fine-tuning.
Layer 5 — Human handoff: Approval gates, escalation paths, and exception routing for cases outside the defined scope. Human handoff is not a fallback for when the agent fails — it is a first-class product feature that calibrates trust and handles the long tail of cases that the agent is not designed for.
Layer 6 — Observability: Full trace logging, cost monitoring, and drift detection. For every agent action: which agent, what was in the context window, which tools were called, in what order, with what arguments, and what the outcome was. Without this, debugging a production failure is archaeology.
Teams that obsess over prompts but underinvest in layers 1, 2, and 4 ship products that are impressive in demos and fragile in production.
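As an illustration of the evaluation layer, the sketch below runs two versions of an agent against a small golden test set and flags a regression. Everything here is a stand-in: the golden examples, the keyword-based "agents", and the 2-point tolerance are assumptions made for the example, not values from any real deployment.

```python
from typing import Callable

GoldenSet = list[tuple[str, str]]  # (input text, expected label)

def pass_rate(agent: Callable[[str], str], golden: GoldenSet) -> float:
    """Fraction of golden examples the agent labels correctly."""
    if not golden:
        raise ValueError("golden set is empty")
    passed = sum(1 for text, expected in golden if agent(text) == expected)
    return passed / len(golden)

def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True when the candidate is worse than baseline by more than the tolerance."""
    return candidate < baseline - tolerance

GOLDEN: GoldenSet = [
    ("I was charged twice", "billing"),
    ("Refund my last invoice", "billing"),
    ("App crashes on login", "technical"),
    ("Reset my password", "technical"),
]

def agent_v1(text: str) -> str:
    # Keyword stand-in for the current production agent.
    words = ("charged", "refund", "invoice")
    return "billing" if any(w in text.lower() for w in words) else "technical"

def agent_v2(text: str) -> str:
    # Stand-in for a changed prompt or model that quietly narrowed its behavior.
    return "billing" if "invoice" in text.lower() else "technical"

baseline, candidate = pass_rate(agent_v1, GOLDEN), pass_rate(agent_v2, GOLDEN)
print(baseline, candidate, is_regression(baseline, candidate))
```

Run on every prompt or model change, a check of this shape turns "did behavior change?" from a guess into a gate.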
Architecture Patterns
Sequential refinement maps to predictable multistage work — draft → review → polish, extraction → enrichment → summary, intake → classify → respond. Each stage has defined inputs, outputs, and acceptance criteria. Use it when task dependencies are linear and known in advance.
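A minimal sketch of sequential refinement, assuming each stage is a plain function paired with an acceptance check. The `Stage` type, `StageRejected` exception, and the toy string transforms are hypothetical; in a real system each `run` would wrap a model or tool call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    run: Callable[[str], str]      # transforms the payload
    accept: Callable[[str], bool]  # acceptance criterion for this stage's output

class StageRejected(Exception):
    pass

def run_pipeline(stages: list[Stage], payload: str) -> str:
    """Run stages in order; stop loudly at the first output that fails its check."""
    for stage in stages:
        payload = stage.run(payload)
        if not stage.accept(payload):
            raise StageRejected(f"{stage.name} output failed its acceptance check")
    return payload

# Toy draft -> review -> polish stages.
STAGES = [
    Stage("draft", str.strip, lambda s: len(s) > 0),
    Stage("review", lambda s: s[0].upper() + s[1:], lambda s: s[0].isupper()),
    Stage("polish", lambda s: s if s.endswith(".") else s + ".", lambda s: s.endswith(".")),
]

print(run_pipeline(STAGES, "  refund issued "))
```

Checking acceptance after every stage, rather than only at the end, localizes a failure to the stage that produced it.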
Handoff and triage maps to support and service workflows where a specialist takes over based on task type. The handoff boundary must be explicit — ambiguous routing produces dropped tasks, not graceful degradation.
Maker-checker loops produce and then evaluate against explicit criteria. Strong for writing, analysis, and code refinement — but require a hard iteration cap (three rounds maximum before human escalation). Without a cap, the loop runs indefinitely and the cost spirals.
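The hard iteration cap can be sketched as a bounded loop that escalates instead of retrying forever. The `make` and `check` callables below are toy stand-ins for a drafting model and an evaluator, and `NeedsHumanReview` is a hypothetical exception name.

```python
from typing import Callable, Optional

MAX_ROUNDS = 3  # cap taken from the text: three rounds, then a human

class NeedsHumanReview(Exception):
    pass

def maker_checker(
    make: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[str, int]:
    """Return (accepted draft, rounds used), or escalate once the cap is hit.

    check returns None to accept, or a feedback string to request a revision.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        draft = make(feedback)
        feedback = check(draft)
        if feedback is None:
            return draft, round_no
    raise NeedsHumanReview(f"not accepted after {max_rounds} rounds: {feedback}")

def toy_make(feedback: Optional[str]) -> str:
    return "total: 42 USD" if feedback else "total: 42"

def toy_check(draft: str) -> Optional[str]:
    return None if draft.endswith("USD") else "add currency"

print(maker_checker(toy_make, toy_check))
```

Raising rather than returning the last draft makes the escalation path explicit: the caller cannot accidentally ship an unaccepted output.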
Planner-plus-ledger is the right pattern for open-ended operations where the agent needs to document its intended actions before touching external systems. The visible plan is simultaneously a trust feature (users review before approval), a debugging artifact, and an audit record.
Memory and compaction is not a pattern for a specific task type — it is a requirement for any long-running workflow. An agent that runs long enough without active context management will forget its instructions.
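A minimal sketch of compaction with pinned instructions, assuming the context is a list of strings and that older turns can be replaced by a summary. The `summarize` default is a placeholder; in practice it would be a model call, and budgets would be measured in tokens rather than turn counts.

```python
from typing import Callable

def build_context(
    pinned: str,
    history: list[str],
    current: str,
    keep_recent: int = 2,
    summarize: Callable[[list[str]], str] = lambda turns: f"[summary of {len(turns)} earlier turns]",
) -> list[str]:
    """Compress old turns, keep recent ones verbatim, pin rules next to the input."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    context: list[str] = []
    if older:
        context.append(summarize(older))  # compressed stand-in for old turns
    context.extend(recent)                # most recent turns kept as-is
    context.append(pinned)                # critical instructions, adjacent to input
    context.append(current)
    return context

print(build_context("RULE: never issue refunds over $500", ["t1", "t2", "t3", "t4"], "new request"))
```

Re-inserting the pinned instructions on every turn, immediately before the current input, is what keeps them from drifting out of the window as the history grows.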
Architecture Anti-Patterns
Multi-agent as default: The most common premature optimization. Every inter-agent handoff is a new failure mode, a new context boundary, and a harder debugging surface. Exhaust single-agent-with-tools first. Add coordination only when task structure genuinely requires it — not because the architecture feels more powerful.
Vague tool contracts: Tool descriptions that say "search the web for relevant information" without a typed schema produce agents that invent parameters under uncertainty. Every tool needs a strict contract: typed inputs, typed outputs, and explicit error handling.
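A strict contract for the search tool might look like the sketch below: typed inputs, and a loud failure for anything the schema does not allow. `SearchArgs`, `ToolCallError`, and the 1 to 20 bound on `max_results` are hypothetical choices for illustration.

```python
from dataclasses import dataclass

class ToolCallError(ValueError):
    pass

@dataclass(frozen=True)
class SearchArgs:
    query: str
    max_results: int

ALLOWED_KEYS = {"query", "max_results"}

def parse_search_args(raw: dict) -> SearchArgs:
    """Validate model-produced arguments; reject instead of guessing."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ToolCallError(f"unknown arguments: {sorted(unknown)}")
    missing = ALLOWED_KEYS - set(raw)
    if missing:
        raise ToolCallError(f"missing arguments: {sorted(missing)}")
    query, max_results = raw["query"], raw["max_results"]
    if not isinstance(query, str) or not query.strip():
        raise ToolCallError("query must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_results, bool) or not isinstance(max_results, int):
        raise ToolCallError("max_results must be an integer")
    if not 1 <= max_results <= 20:
        raise ToolCallError("max_results must be between 1 and 20")
    return SearchArgs(query=query, max_results=max_results)

print(parse_search_args({"query": "refund policy", "max_results": 5}))
```

Rejecting unknown keys matters as much as type-checking known ones: an invented parameter is the usual symptom of a model filling gaps in a vague description.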
Concurrent agents on shared mutable state: Without conflict resolution logic, the last write wins. Two agents writing to the same record produce silently inconsistent state. Use immutable intermediate state and explicit merge logic.
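One possible shape for "immutable intermediate state and explicit merge logic": each agent returns a patch instead of writing to the record, and a merge step refuses to pick a winner when two patches disagree. The function and exception names are hypothetical, and a real system would also need a policy for resolving the conflict once it is raised.

```python
class MergeConflict(Exception):
    pass

def merge_patches(base: dict, patches: dict[str, dict]) -> dict:
    """Apply per-agent patches to a copy of base; fail on conflicting writes."""
    merged = dict(base)          # copy, so the shared record is never mutated
    owner: dict[str, str] = {}   # which agent last wrote each key
    for agent, patch in patches.items():
        for key, value in patch.items():
            if key in owner and merged[key] != value:
                raise MergeConflict(f"{owner[key]} and {agent} disagree on '{key}'")
            merged[key] = value
            owner[key] = agent
    return merged

ticket = {"status": "open", "priority": None}
result = merge_patches(ticket, {
    "triage_agent": {"priority": "high"},
    "billing_agent": {"status": "refunded"},
})
print(result)
```

The conflict is surfaced as an exception rather than resolved silently, which is the opposite of last-write-wins.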
Invisible autonomy on high-risk systems: Any agent taking write actions on production systems without a human-visible plan is a trust destruction event waiting to happen. The right pattern is always: plan → preview → approve → execute → audit.
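The plan → preview → approve → execute → audit sequence can be sketched as a single function that refuses to execute without an approval decision and records every step. The `PlannedAction` type, the callback signatures, and the log record shape are assumptions for illustration; `approve` stands in for whatever human-facing approval surface the product provides.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class PlannedAction:
    tool: str
    args: dict

def run_with_approval(
    plan: list[PlannedAction],
    approve: Callable[[str], bool],
    execute: Callable[[PlannedAction], str],
    audit_log: list[dict],
) -> bool:
    """Show the plan, execute only if approved, and log both outcomes."""
    preview = "\n".join(f"{i}. {a.tool} {a.args}" for i, a in enumerate(plan, 1))
    approved = approve(preview)
    audit_log.append({"event": "approval", "approved": approved, "preview": preview})
    if not approved:
        return False  # nothing touches external systems
    for action in plan:
        result = execute(action)
        audit_log.append({"event": "executed", "tool": action.tool, "result": result})
    return True

plan = [
    PlannedAction("lookup_vendor", {"vendor_id": "V-1"}),
    PlannedAction("write_po", {"amount": 1200}),
]
log: list[dict] = []
ok = run_with_approval(plan, approve=lambda p: True, execute=lambda a: "ok", audit_log=log)
print(ok, len(log))
```

Logging the rejected plan as well as the executed one means the audit record covers what the agent intended, not only what it did.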
LLM as execution engine: Probabilistic models should route to deterministic execution functions, not execute writes directly. "Call the Stripe refund API" should trigger a rigid, typed function call whose arguments are validated before anything is written; it should not pass straight from the model's tool-calling output to the API.
What's next: Advanced Practice
Ready to stress-test what you've learned? The exercises below go into production territory — real constraints, ambiguous signals, no clean answers.
Advanced Applied Exercises
Exercise 1: The Verifiability Gap
You've built an AI agent for contract redlining at a corporate legal team. Your initial 7-dimension score was 5/7 — failing only on latency (attorneys sometimes need results in under 1 hour, which is tight for your architecture) and partially on verifiability (you assumed attorneys would check every output, but discovery interviews suggest senior partners "spot-check" rather than review thoroughly).
After 3 months in production:
- Resolution rate: 87% (agent's redline accepted without modification)
- Average review time per contract: 22 minutes (down from 90 minutes)
- Attorney satisfaction: 4.2/5
- Discovery: one senior partner used an agent-generated redline for a $40M deal without reviewing the indemnification clause; the clause had a significant error that was caught by opposing counsel
The error did not result in a loss, but it rattled the GC. She is asking you to implement a safeguard. Your product manager wants to require mandatory review of all agent outputs. Your engineering lead says a confidence-scoring system could flag only the clauses the model is uncertain about.
What do you implement, and how do you think about the tradeoff between protection and the product value you've delivered?
Expert thinking
The mandatory review approach (PM's recommendation) is the wrong answer. It eliminates the 22-minute vs. 90-minute time saving that is the product's core value proposition. If attorneys must review everything, the agent becomes a drafting assistant, not a reliable executor — and the pricing model, sales story, and ROI calculation all break.
The confidence-scoring approach (engineering lead's recommendation) is closer, but incomplete. Confidence scores on individual clauses are useful; the harder problem is that attorneys will stop reading the "low confidence" flags after two weeks if they are too frequent, and will miss them if they are too infrequent. This is a known failure mode of warning systems: they get ignored.
The right answer has three components:
- Clause-level mandatory review for high-risk clause types: Not all clauses carry equal risk. Indemnification, limitation of liability, IP ownership, and governing law are the clauses where a single error has material consequences. Flag these for mandatory review regardless of model confidence. This is a small, explicit set (probably 8–12 clause types) that can be maintained as a product policy.
- Session-level audit trail: Every contract review produces a record of which clauses were reviewed, which were accepted without modification, and which were changed. This gives the GC accountability visibility without requiring sequential review of every clause.
- Deal-size threshold for mandatory human review: For deals above a material threshold (e.g., $10M), require human review of the full redline before submission. This is not a product limitation; it is defensible product policy that enterprise legal teams expect.
The framing for the GC: "We're adding mandatory review for the 8 clause types that carry the highest legal risk, and requiring full human review on deals above $10M. For everything else, the product continues to deliver the time savings you're using. This is the right split — not 'review everything' and not 'trust everything.'"
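The clause-type and deal-size rules above amount to a small, explicit policy function. The sketch below uses only the four clause types named in the answer and the $10M example threshold; the identifiers and the returned dictionary shape are hypothetical, and a production policy would carry the full maintained list.

```python
HIGH_RISK_CLAUSES = frozenset({
    "indemnification", "limitation_of_liability", "ip_ownership", "governing_law",
})
FULL_REVIEW_THRESHOLD_USD = 10_000_000  # example threshold from the text

def review_requirements(deal_value_usd: float, clause_types: list[str]) -> dict:
    """Decide which clauses a human must review before the redline goes out."""
    full_review = deal_value_usd > FULL_REVIEW_THRESHOLD_USD
    if full_review:
        mandatory = sorted(set(clause_types))                      # everything
    else:
        mandatory = sorted(set(clause_types) & HIGH_RISK_CLAUSES)  # high-risk only
    return {"full_review": full_review, "mandatory_clauses": mandatory}

print(review_requirements(2_000_000, ["indemnification", "payment_terms"]))
print(review_requirements(40_000_000, ["indemnification", "payment_terms"]))
```

Note that model confidence does not appear as an input: the policy is keyed on clause type and deal size, which is what makes it explainable to the GC.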
Self-assessment checklist:
- Did you reject mandatory full review because of its impact on product value?
- Did you propose a solution scoped to the specific risk class (high-stakes clause types), not a blanket safeguard?
- Did you include a deal-size threshold as a clear, defensible policy?
Exercise 2: The Architecture Escalation
Your team is 6 weeks from launch. You're building a single orchestrating agent with tool calls for a procurement workflow: PO approval, vendor lookup, budget verification, and ERP write. Four tools, one agent, sequential refinement pattern.
Two days before the final architecture review, your most senior engineer proposes a refactor: separate the PO routing and approval logic into a dedicated "approval agent" with its own context window, arguing that the growing complexity of approval rules warrants an isolated, testable module.
You have 6 weeks of runway before launch. The current architecture is tested and working. What do you do?
Expert thinking
Do not refactor 6 weeks before launch. This is the right answer, and it needs to be communicated clearly.
The engineer's underlying concern is legitimate: if approval rules are complex and growing, they will be hard to maintain inside the orchestrating agent. That concern deserves a response — just not the response of a pre-launch architecture refactor.
Why to hold: The current architecture is tested and working. "Testability" of an isolated approval module is a valid long-term concern, not a launch blocker. A refactor of this scope, started 6 weeks before launch, introduces new failure modes and new integration surfaces, and it resets test coverage. The probability of shipping a regression is high. The benefit (better long-term maintainability) does not pay off before launch.
What to do instead:
- Document the approval rule complexity as technical debt, with a specific plan to refactor post-launch when you have production data on which rules are actually exercised.
- Add a test for every approval rule to the existing test suite now — this addresses the "testability" concern within the current architecture.
- Set a post-launch review date (6 weeks post-launch) to evaluate whether the approval logic has grown complex enough to justify the refactor. If it has — and you now have production data to inform the design — execute then.
The rule: architectural changes this close to launch require a defect with a specific, measurable impact, not a concern about future maintainability. Future maintainability is a valid reason to plan a refactor; it is not a reason to delay a launch.
Self-assessment checklist:
- Did you explicitly reject the pre-launch refactor rather than compromise (e.g., partial refactor)?
- Did you address the underlying concern (approval rule complexity) without the refactor?
- Did you set a specific post-launch review date with clear criteria?
Exercise 3: The ICP Expansion Decision
You've been in market for 9 months with a single ICP: support agents handling billing dispute tickets in Zendesk + Stripe environments. Resolution rate: 88%. Gross margin: 57%. Reference customers: 12.
Your sales team has three strong inbound leads from companies with significantly different environments:
- Company A: 800-person e-commerce company, Zendesk + Shopify (no Stripe)
- Company B: 500-person SaaS company, Freshdesk + Stripe (no Zendesk)
- Company C: 1,200-person fintech, Salesforce Service Cloud + Stripe (no Zendesk)
Each company has a different missing integration. Your CTO estimates each new integration takes 4–6 weeks. Your board wants you to show ICP expansion in the next quarter.
How do you prioritize these three opportunities, and what criteria do you use?
Expert thinking
The right prioritization framework: rank by (a) integration reuse vs. net-new integration work, (b) domain corpus transferability, (c) reference customer value.
Company A (Zendesk + Shopify): Zendesk integration already exists. The only new integration is Shopify — one new tool rather than a new workflow environment. The agent's Zendesk-specific context (ticket schemas, routing logic, escalation paths) transfers fully. The billing dispute workflow in e-commerce is structurally similar to SaaS billing disputes. Priority: highest. This is ICP expansion with minimal integration cost and maximum corpus reuse.
Company B (Freshdesk + Stripe): Stripe integration exists. The new integration is Freshdesk, a different ticketing system with different data schemas, different routing logic, and different API contracts. This is a new system environment, not an adjacent expansion. The eval dataset built for Zendesk does not transfer. Priority: second, but with a longer timeline. Assign 6–8 weeks to build and validate the Freshdesk integration properly before selling.
Company C (Salesforce Service Cloud + Stripe): Stripe exists, Salesforce Service Cloud does not. Salesforce Service Cloud is architecturally different from Zendesk — it is a CRM-native support environment, not a purpose-built helpdesk. The integration work is significantly more complex, and the workflow context (CRM-linked cases vs. ticketed support) is meaningfully different. Priority: third. This is a new ICP segment, not just a new integration. Treat it as a separate product expansion, not an ICP adjacent.
What to tell the board: "We're pursuing Company A immediately (6-week integration), Company B in Q3 (new helpdesk integration with full validation), and Company C as a 2026 product line expansion. This is orderly ICP expansion that protects our resolution rate rather than scattering engineering across three parallel integrations simultaneously."
Self-assessment checklist:
- Did you use integration reuse as the primary ranking criterion?
- Did you distinguish between "adjacent expansion" (new tool, same workflow) and "new ICP segment" (different workflow context)?
- Did you propose a timeline that sequences rather than parallelizes the integrations?
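The ranking logic in this exercise can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Lead` class, the 0–5 scales, and every score below are assumptions chosen to mirror the qualitative analysis above. Sorting on a tuple makes integration reuse the primary criterion, with corpus transferability and reference value as tie-breakers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lead:
    name: str
    integration_reuse: int  # 0-5: how much existing integration work carries over
    corpus_transfer: int    # 0-5: how much of the eval/domain corpus still applies
    reference_value: int    # 0-5: value as a reference customer


def rank_leads(leads):
    """Sort by integration reuse first, then corpus transfer, then reference value."""
    return sorted(
        leads,
        key=lambda lead: (lead.integration_reuse, lead.corpus_transfer, lead.reference_value),
        reverse=True,
    )


# Illustrative scores only; a real team would score these from its own data.
leads = [
    Lead("Company C (Salesforce Service Cloud + Stripe)", 1, 1, 4),
    Lead("Company A (Zendesk + Shopify)", 4, 4, 3),
    Lead("Company B (Freshdesk + Stripe)", 2, 2, 3),
]

for position, lead in enumerate(rank_leads(leads), start=1):
    print(position, lead.name)
```

Because the sort is lexicographic, a high reference value (Company C in this sketch) cannot outrank a lead with better integration reuse, which matches the first item in the checklist.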
Real-World Implementations
Implementation 1: Cursor — Single-Agent Architecture at Scale
Cursor built one of the most widely adopted AI coding tools by deliberately avoiding multi-agent complexity. Their core architecture: one agent with access to a rich set of code-specific tools (file read, file write, terminal execution, test runner, linter, diff generation) rather than separate agents for each tool type.
Architecture decision: Rather than a "planning agent + execution agent + review agent" decomposition, Cursor keeps all context in a single agent that can invoke any tool in sequence. The agent sees the full conversation context, the full file structure, and the output of every tool call it has made.
Expert commentary: This is the architecture anti-pattern reversal in practice. Keeping a single context window means that every tool call result is available to the agent's next decision without a handoff boundary. When the test runner fails, the agent can see the test output, the code it just wrote, and the intent expressed in the original user message, all in one context. A "test agent" receiving only the test output would need the code and intent passed explicitly, creating a new failure surface at every handoff. The simpler architecture won at scale.
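To make the "no handoff boundary" point concrete, here is a minimal sketch of a single-agent loop with one shared context. It is not Cursor's implementation: the tool functions are stubs, and `plan` is a hypothetical stand-in for the model's tool choices. The only point it illustrates is that every tool result lands in the same list that the next decision reads.

```python
from typing import Callable


# Hypothetical stub tools; real ones would touch the file system and a test runner.
def write_file(argument: str) -> str:
    return f"wrote {argument}"


def run_tests(argument: str) -> str:
    return "1 failed: test_parse_empty_input"


TOOLS: dict[str, Callable[[str], str]] = {
    "write_file": write_file,
    "run_tests": run_tests,
}


def run_single_agent(user_message, plan):
    """Run every tool call against one shared context list.

    `plan` stands in for the model's decisions: a list of (tool_name, argument).
    """
    context = [("user", user_message)]
    for tool_name, argument in plan:
        result = TOOLS[tool_name](argument)
        # The result joins the same context the next decision reads, so the
        # user's intent, the written code, and the test output stay together.
        context.append((tool_name, result))
    return context


context = run_single_agent(
    "Fix the parser crash on empty input",
    [("write_file", "parser.py"), ("run_tests", "tests/")],
)
```

In a multi-agent decomposition, each arrow between agents would need its own serialization of this list, and anything left out of that serialization is invisible to the receiving agent.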
Implementation 2: Glean — Context Graph as Architecture Foundation
Glean's product architecture is built around a permission-aware organizational graph rather than a generalized retrieval system. Every connector (Confluence, Slack, Jira, Salesforce, Drive) feeds into a graph that tracks not just document content but who has access, who collaborates with whom, what team they're on, and what projects they're working on.
Architecture decision: The context layer (Layer 2) is Glean's primary engineering investment. Model quality is a secondary concern. The product improves when the context graph improves — not when the underlying language model is upgraded.
Expert commentary: This illustrates the architecture layer priority hierarchy correctly. In most agentic products, Layer 2 (context and memory) has more leverage on quality than model selection. Glean's architectural choice (build the best possible context layer, then plug in whatever model performs well) means that model improvements from providers are additive rather than foundational. Teams that invert this (best model, minimal context engineering) find that model improvements produce diminishing returns once context quality is the bottleneck.
Production Challenges
Challenge 1: The Tool Schema Regression
At 10:30am on a Tuesday, your monitoring alerts fire: agent error rate has jumped from 2.3% to 18.7% over the past 2 hours. The errors are concentrated in a specific tool call — your get_customer_account tool — which is returning a ValidationError on the account_status field.
You check the API documentation for your CRM provider. At 8:00am this morning, the CRM provider released an update that changed the account_status field from a string enum ("active", "suspended", "churned") to a nested object with status and last_modified subfields.
The agent has been treating the field as a string for all calls since 8:00am. Your tool contract did not include strict output schema validation: it accepted the field as Any.
Walk through immediate remediation and the architectural change that prevents this class of failure.
Expert analysis
Immediate remediation:
- Update the tool schema to handle both the old string format and the new object format with a compatibility shim, returning just status from the nested object while the fix is in progress.
- Deploy within 30 minutes. Every minute at 18.7% error rate is customer-visible.
- Alert the CRM provider's developer relations team — schema changes without versioning or deprecation notice are a partner SLA violation.
Root cause: The Any type on the account_status field meant that schema changes from the CRM provider propagated silently into agent behavior rather than failing loudly at the tool boundary. The agent received a dict where it expected a string, passed the dict to its reasoning logic, and produced incorrect outputs or hard errors.
Architectural fix: Every external API field in every tool contract needs a typed schema. No Any types on fields that influence agent reasoning. The tool layer's job is to be the last line of defense against external API changes — it should fail loudly with a clear error message when it detects a schema mismatch, rather than passing unexpected data upstream.
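A minimal sketch of what a typed contract with a compatibility shim could look like, using only the Python standard library. The names `ToolSchemaError`, `CustomerAccount`, `parse_account_status`, and the `account_id` field are assumptions for illustration; only `account_status`, its three enum values, and the nested `status` subfield come from the scenario above.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {"active", "suspended", "churned"}


class ToolSchemaError(ValueError):
    """Raised when an external API response does not match the tool contract."""


@dataclass(frozen=True)
class CustomerAccount:
    account_id: str      # hypothetical field, included to show a second typed check
    account_status: str  # always the plain enum value, never a raw payload


def parse_account_status(raw):
    """Accept the legacy string or the new nested object; reject anything else."""
    if isinstance(raw, str):
        status = raw
    elif isinstance(raw, dict) and isinstance(raw.get("status"), str):
        status = raw["status"]  # compatibility shim for the new object format
    else:
        raise ToolSchemaError(
            f"account_status has unexpected shape: {type(raw).__name__}"
        )
    if status not in ALLOWED_STATUSES:
        raise ToolSchemaError(f"account_status has unknown value: {status!r}")
    return status


def parse_customer_account(payload: dict) -> CustomerAccount:
    """Validate the external payload at the tool boundary before the agent sees it."""
    account_id = payload.get("account_id")
    if not isinstance(account_id, str):
        raise ToolSchemaError("account_id must be a string")
    return CustomerAccount(account_id, parse_account_status(payload.get("account_status")))
```

With a contract like this, a third, unannounced format raises `ToolSchemaError` at the tool boundary with a message naming the field, instead of reaching the agent's reasoning as an unexpected dict.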
Process fix: Add provider API changelog monitoring as a standard operational task. Subscribe to changelog feeds for every external API. For any schema change affecting tool contracts, require a tool schema update and regression test run before the agent sees production traffic.
The broader lesson: tool contracts are not documentation. They are runtime assertions. Treating them as such — with strict validation and explicit failure modes — is what separates Layer 1 done well from Layer 1 done poorly.
Challenge 2: Context Drift in a Long-Running Research Agent
You've deployed an agent that conducts multi-step market research for analysts — it searches, reads sources, synthesizes findings, and builds a structured report. Average session: 45 minutes, 120,000 tokens of context.
You're seeing a consistent failure pattern: in sessions longer than 30 minutes, the agent begins producing summaries that contradict facts it cited in earlier sections of the same report. Sections written in minute 35+ are qualitatively worse than sections written in minutes 1–15. Analysts are catching these contradictions in review and manually correcting them.
Diagnose the problem and design a fix.
Expert analysis
This is a classic context drift failure. As the session extends to 120,000 tokens, the original instructions, established facts, and critical constraints sit farther and farther from the position where the model is generating. Models use information buried deep in a long context less reliably than information near its start or end (the "Lost in the Middle" phenomenon), so the agent produces outputs inconsistent with facts it "knows" but cannot effectively attend to.
Diagnosis confirmation: Check whether the contradictions are concentrated in specific content types — facts from early sources vs. facts from recent sources. If early-source facts are contradicted by late-session outputs (but not vice versa), this confirms context drift rather than a model reasoning failure.
Fix — three components:
- Context compaction checkpoint at 60,000 tokens: At the halfway point, the agent pauses to produce a structured "facts established so far" summary: key findings, sourced claims, and constraints on the remainder of the report. This summary is pinned to the front of the context window. The full detail of earlier sections is summarized rather than preserved in full.
- Fact assertion index: Rather than relying on the model to recall established facts from earlier in the session, maintain an explicit key-value store of important facts (company: revenue, source: URL, citation number: N) that is injected into the context window at each new section's start. The agent references the index rather than relying on attending to content 80,000 tokens earlier.
- Section-level consistency check: After each section is written, run a lightweight check (separate inference call, small model) that compares claims in the new section against claims in the fact assertion index. Flag contradictions before the analyst sees the output.
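The three components above can be sketched as follows. This is an illustration under stated assumptions, not a reference design: the `FactIndex` class and its method names are hypothetical, the example fact and URL are placeholders, and the consistency check here is a plain key-by-key comparison standing in for the small-model inference call described above (which would first need to extract claims from the new section).

```python
COMPACTION_THRESHOLD_TOKENS = 60_000


def needs_compaction(tokens_used, already_compacted):
    """True once the session crosses the checkpoint and has not compacted yet."""
    return tokens_used >= COMPACTION_THRESHOLD_TOKENS and not already_compacted


class FactIndex:
    """Explicit store of established facts, re-injected at each section start."""

    def __init__(self):
        self._facts = {}  # key -> (value, source)

    def assert_fact(self, key, value, source):
        existing = self._facts.get(key)
        if existing is not None and existing[0] != value:
            raise ValueError(
                f"conflicting value for {key!r}: {existing[0]!r} vs {value!r}"
            )
        self._facts[key] = (value, source)

    def render(self):
        """Text block to pin at the front of the context for the next section."""
        lines = ["FACTS ESTABLISHED SO FAR:"]
        for key, (value, source) in sorted(self._facts.items()):
            lines.append(f"- {key}: {value} (source: {source})")
        return "\n".join(lines)

    def contradictions(self, claims):
        """Return (key, claimed, established) for claims that disagree with the index."""
        found = []
        for key, claimed in claims.items():
            established = self._facts.get(key)
            if established is not None and established[0] != claimed:
                found.append((key, claimed, established[0]))
        return found


# Placeholder data for illustration only.
index = FactIndex()
index.assert_fact("acme.revenue_2023", "$40M", "https://example.com/acme-report")
flags = index.contradictions({"acme.revenue_2023": "$55M", "acme.hq": "Austin"})
```

Claims about keys the index has never seen (`acme.hq` here) are not flagged; they are candidates for new entries, not contradictions.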
The broader principle: for any agent session longer than 30 minutes or 50,000 tokens, assume that context drift is an active risk and design explicit management for it — not as a future enhancement, but as a Day 1 architectural requirement.
Interview-Style Reasoning Questions
Question 1
A senior engineer on your team argues: "The 7-dimension scoring rubric is just a checkbox exercise. Experienced product people already know intuitively which workflows make good agent wedges. Why are we formalizing it?" How do you respond?
Expert thinking
The engineer's intuition is partially right — experienced product people do develop a sense for good wedges. The rubric's value is not replacing that intuition; it is three other things.
First, it surfaces the specific failing dimension. Intuition produces "this feels wrong" without a diagnosis. The rubric produces "this fails on verifiability because our success criterion is ambiguous to multiple stakeholders." The latter is actionable; the former leads to a stalled decision or a gut-feel override that ignores the real risk.
Second, it creates a shared language for the team. When a PM, an engineer, and a sales lead are all looking at the same 7-dimension score, they are debating the same dimensions. Without the rubric, each person is evaluating the wedge against different implicit criteria. The rubric makes the disagreement legible.
Third, it catches the experienced person's blind spots. Most production failures that a rubric would have caught are in verifiability and reversibility — not in the dimensions experienced product people think about first (pain, frequency). The rubric forces evaluation of the dimensions that are genuinely hardest to predict from demos and early interviews.
The right framing: "The rubric is a forcing function for articulating the intuition we already have, not a replacement for it. If your intuition says 'good wedge,' the rubric should mostly confirm it — and the exceptions are exactly the ones we need to catch before we've built for three months."
Question 2
Your CTO says: "Layer 4 (evaluation) is important long-term, but we can add it after launch. We need to ship in six weeks." How do you respond, and what's the minimum viable eval infrastructure that should ship on day one?
Expert thinking
The "add evaluation after launch" argument sounds reasonable but is structurally wrong. Here's why, and what minimum viable eval actually looks like.
Why you can't add eval after launch: Without an eval baseline established before launch, you have no measurement of whether your first production prompt and model combination is performing correctly. When the model provider releases an update in month 2, you have no way to detect whether agent behavior changed. When you make your first prompt change, you cannot tell if it improved or degraded performance. Eval is not just a quality tool — it is the measurement infrastructure for all subsequent improvement work.
What minimum viable eval looks like in 6 weeks:
- 50–100 golden examples: real inputs with expected outputs, labeled by a domain expert. This takes 2–3 days for a domain expert and 1–2 days of engineering to automate the comparison.
- A weekly regression run: run the golden examples against the current model and prompt configuration every week. Flag any degradation vs. the launch baseline.
- Error rate tracking in production: log every session that ends in a human escalation or an explicit failure. This is the data source for the next round of golden examples.
That is the minimum viable eval infrastructure. It costs about 1 week of engineering time and 2–3 days of domain expert time. The cost of not having it: you ship without a baseline, make prompt changes you can't measure, miss model drift for months, and discover your quality has degraded when customers start complaining, not when your monitoring detects it.
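As a rough sketch of the weekly regression run, assuming the agent can be called as a plain function and that exact string match is an acceptable comparison (many products will need a fuzzier or rubric-based comparison instead). The function names, the 0.02 tolerance, and the toy golden examples are all illustrative assumptions.

```python
def run_golden_set(agent, golden_examples):
    """Return the fraction of golden examples the agent answers as expected."""
    if not golden_examples:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for example in golden_examples
        if agent(example["input"]) == example["expected"]
    )
    return passed / len(golden_examples)


def check_regression(current_score, baseline_score, tolerance=0.02):
    """True when the current score has dropped below baseline by more than tolerance."""
    return (baseline_score - current_score) > tolerance


# Toy stand-ins: a stub agent and four golden examples, one of which it gets wrong.
def stub_agent(text):
    return text.upper()


golden = [
    {"input": "a", "expected": "A"},
    {"input": "b", "expected": "B"},
    {"input": "c", "expected": "x"},
    {"input": "d", "expected": "D"},
]

score = run_golden_set(stub_agent, golden)
regressed = check_regression(score, baseline_score=0.90)
```

The launch baseline is simply the score recorded on the day the first production prompt and model configuration ships; every later run is compared against that number.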
The right answer to the CTO: "Eval on day one is 1 week of work. Retrofitting eval after 3 months of production changes and model updates is 4–8 weeks of work to reconstruct what baseline looked like. We're not trading speed for quality — we're avoiding a harder problem later."
Question 3
You're designing an agentic product for accounts payable automation — the agent processes vendor invoices, extracts line items, matches them to purchase orders, flags discrepancies, and triggers payment approval workflows. An investor asks: "Where does the moat come from? Can't a well-funded competitor just build the same integrations?" How do you answer?
Expert thinking
The integrations themselves are not the moat. The investor is right that integrations can be replicated. The moat is what builds on top of the integrations once they are live.
The actual moat layers, in order of defensibility:
- Workflow position: The agent sits in the AP workflow before the human reviewer. Every invoice passes through the product. This is the highest-value position in the workflow, and the hardest to displace once a customer has reorganized their AP process around it.
- Vendor-specific matching intelligence: After 12 months in production at a single customer, the agent has seen thousands of invoices from that customer's specific vendor base. It has learned vendor-specific quirks: invoice numbering conventions, line item naming variations, common discrepancy patterns. This is not general ML; it is customer-specific knowledge that a new competitor cannot have on day one.
- Evaluation corpus: Every flagged discrepancy and every human override is a labeled training example. After 12 months, the agent has a customer-specific eval corpus that enables fine-tuning for that customer's vendor base. A competitor starting from scratch has the same base model capability but zero of this labeled data.
- Approval workflow integration depth: The deeper the integration into the ERP's approval workflow (GL coding, multi-level approval routing, exception escalation), the more the product becomes load-bearing infrastructure. Replacing it requires re-integrating the approval workflow, not just swapping the extraction model.
The investor framing: "Yes, a competitor can build the same integrations. They will be 12–18 months behind us in vendor-specific intelligence and approval workflow depth for our existing customers. And our existing customers have reorganized their AP process around our product — switching has a real operational cost, not just a contract cost."
Question 4
Design an evaluation framework for an agentic coding assistant that helps engineers triage and close GitHub issues by automatically generating pull requests. What does "good" look like, and how do you measure it?
Expert thinking
An eval framework for a coding agent needs to measure at three levels: execution correctness, workflow reliability, and production quality.
Level 1 — Execution correctness (automated):
- Test pass rate: does the generated code pass the existing test suite? This is the primary automated signal.
- Lint and type check pass rate: does the code conform to the repository's style and type constraints?
- Build success rate: does the code compile/build without errors?
These three are fast, objective, and automated. They form the minimum bar: code that fails tests, lint, or build is categorically unacceptable regardless of other quality signals.
Level 2 — Workflow reliability (semi-automated):
- PR acceptance rate: what percentage of generated PRs are merged without modification? Requires a 2-week observation window per issue.
- Modification rate: for PRs that are merged, what percentage required significant changes before merge?
- Abandonment rate: what percentage of generated PRs are closed without merge after human review?
These signals require production deployment and time to measure, but they are the ground truth for "did the agent do the job?"
Level 3 — Production quality (human eval + golden set):
- 50-issue golden set: a set of previously-solved issues with known correct solutions, used to measure consistency against a baseline.
- Weekly regression run: run the golden set against the current model/prompt configuration; flag deviations from the baseline.
- Issue complexity stratification: separate metrics for "simple fix" issues vs. "complex refactor" issues — the agent's performance profile is likely different, and conflating them masks both strengths and weaknesses.
The metric to report to stakeholders: PR acceptance rate (% of generated PRs merged without modification) is the most meaningful single number. It combines execution correctness, workflow reliability, and code quality into one auditable metric that engineers understand and trust.
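The Level 2 metrics, stratified by issue complexity, could be computed along these lines. This is a sketch under assumptions: the record shape, the outcome labels (`merged_clean`, `merged_modified`, `closed_unmerged`), and the two complexity tiers are hypothetical, and PRs still open inside the observation window are assumed to be excluded before this function is called.

```python
from collections import defaultdict


def pr_metrics(prs):
    """Compute acceptance, modification, and abandonment rates per complexity tier.

    Each PR is a dict with 'complexity' ('simple' or 'complex') and
    'outcome' ('merged_clean', 'merged_modified', or 'closed_unmerged').
    """
    by_tier = defaultdict(list)
    for pr in prs:
        by_tier[pr["complexity"]].append(pr["outcome"])

    report = {}
    for tier, outcomes in by_tier.items():
        total = len(outcomes)
        merged = [outcome for outcome in outcomes if outcome != "closed_unmerged"]
        report[tier] = {
            # Share of all generated PRs merged without modification.
            "acceptance_rate": outcomes.count("merged_clean") / total,
            # Measured over merged PRs only, as defined in Level 2.
            "modification_rate": (
                outcomes.count("merged_modified") / len(merged) if merged else 0.0
            ),
            "abandonment_rate": outcomes.count("closed_unmerged") / total,
        }
    return report


# Toy records for illustration only.
sample_prs = [
    {"complexity": "simple", "outcome": "merged_clean"},
    {"complexity": "simple", "outcome": "merged_clean"},
    {"complexity": "simple", "outcome": "merged_modified"},
    {"complexity": "simple", "outcome": "closed_unmerged"},
    {"complexity": "complex", "outcome": "merged_modified"},
    {"complexity": "complex", "outcome": "closed_unmerged"},
]
report = pr_metrics(sample_prs)
```

Reporting the tiers separately keeps a strong "simple fix" acceptance rate from hiding a weak "complex refactor" one.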
Question 5
Six months after launch, your agent's performance has plateaued at 74% resolution rate despite two rounds of prompt optimization. A consultant recommends switching to a different foundation model. Your CTO is skeptical. Who is right, and how do you diagnose whether this is a model problem or something else?
Expert thinking
The CTO's skepticism is probably right. A resolution rate plateau after two rounds of prompt optimization is almost never primarily a model problem. Here is the diagnostic.
Step 1 — Segment the 26% non-resolutions by failure category. Do not guess — pull the logs. Typical failure categories: (a) the agent attempted and produced an incorrect answer the user rejected; (b) the agent correctly identified it couldn't help and escalated; (c) the agent produced no response or errored. Each category has a different root cause.
Step 2 — For category (a), drill into the failure type. Is the agent failing on factual questions it should know (documentation coverage gap), or on reasoning questions that require multi-step judgment (model capability ceiling), or on context-dependent questions that require data it cannot access (tool/data access gap)? These are three different problems with three different solutions.
Step 3 — Apply the appropriate fix:
- Documentation coverage gap → fix the documentation, not the model. Resolution rate will improve without changing anything about the agent.
- Tool/data access gap → add the missing integration. The model cannot compensate for context it cannot see.
- Model capability ceiling → this is the one case where a model switch is warranted. But it should be demonstrated on a specific, isolated failure category, not assumed from aggregate plateau behavior.
The rule: Never change the model before exhausting the documentation and integration explanations. Model switches are expensive (re-validation, potential behavioral drift in previously working cases, integration testing) and they address the narrowest class of resolution rate plateaus. The consultant's recommendation to switch models without a failure category analysis is working backwards from a hypothesis rather than from evidence.
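A small sketch of the segmentation in Steps 1 through 3, assuming each non-resolved session has already been labeled from the logs. The label strings, the record shape, and the function names are hypothetical; the mapping from cause to fix follows the three bullets above.

```python
from collections import Counter

# Mapping from diagnosed cause to the fix recommended in Step 3.
FIX_FOR_CAUSE = {
    "documentation_gap": "fix the documentation",
    "tool_data_access_gap": "add the missing integration",
    "model_capability_ceiling": "evaluate a model switch on this category",
}


def segment_failures(failures):
    """Count non-resolved sessions by category, and incorrect answers by cause."""
    by_category = Counter(failure["category"] for failure in failures)
    by_cause = Counter(
        failure["cause"]
        for failure in failures
        if failure["category"] == "incorrect_answer"
    )
    return by_category, by_cause


def recommend(by_cause):
    """Return (cause, count, fix) ordered by how many failures each cause explains."""
    return [
        (cause, count, FIX_FOR_CAUSE[cause])
        for cause, count in by_cause.most_common()
    ]


# Toy labeled sessions for illustration only.
failures = [
    {"category": "incorrect_answer", "cause": "documentation_gap"},
    {"category": "incorrect_answer", "cause": "documentation_gap"},
    {"category": "incorrect_answer", "cause": "documentation_gap"},
    {"category": "incorrect_answer", "cause": "tool_data_access_gap"},
    {"category": "escalated"},
    {"category": "errored"},
]
by_category, by_cause = segment_failures(failures)
plan = recommend(by_cause)
```

If `model_capability_ceiling` does not appear near the top of a list like `plan`, the aggregate plateau is not evidence for a model switch.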