Chapter 1: ML Problem Framing

Start with the business outcome, then work top-down: business objective → product outcome → model goal → decision policy. If those four layers do not line up, no model improvement will save the system.

Framing determines whether any later modeling work matters. Most production ML failures happen because teams define the wrong problem, optimize the wrong proxy, or choose a decision policy that cannot move the business outcome.

Here is something counterintuitive: most ML projects do not fail because the model is wrong. They fail because the team aimed the model at the wrong thing.

The model metric looks fine, offline evaluation passes, and the engineers are proud of what they built. But the product does not improve. The business KPI does not move. Six months of work delivers nothing observable.

This is a framing failure, and it is the most common way ML projects die.

What framing actually means

Framing is the work you do before you touch any data. It is the process of defining what success looks like at every layer of the system: the business, the product, the model, and the decision.

Most teams skip this, or do it too quickly. They get a request ("build a churn model," "add a recommender," "detect fraud") and jump straight to feature engineering. The problem is that those phrases do not define a system. They define a modeling idea. There is a large gap between the two.

Think of it this way: imagine a team proudly building the best turn-by-turn navigation engine in the world before anyone agrees on where the car should go. The routing can be fast. The maps can be detailed. The voice assistant can be polished. None of it matters if the destination is wrong or undefined.

ML problem framing is the work of agreeing on the destination.

The framing stack

The most useful tool for framing is a four-layer stack, numbered Level 0 through Level 3. Each layer constrains the next. Work top-down. Do not move to a lower layer until the one above it is solid.

[Figure: The ML Framing Stack]

Level 0 — Business objective

State the business problem in one sentence and one metric with a timeframe.

Reduce monthly churn by 8% in two quarters
Increase first-purchase conversion by 5%
Cut fraud loss by $2M this year

If you cannot name a measurable KPI and timeframe, you do not have an ML project. You have a wish.

Level 1 — Product outcome

Describe what the product or workflow should do differently once ML exists. This is where the system becomes concrete.

Rank items so relevant ones appear earlier in the feed
Route support tickets to the right team faster
Flag risky transactions before approval

Notice what changed: it is no longer "use AI." It is "change this product behavior in this observable way." That is a thing you can test.

Level 2 — Model goal

Define what the model should estimate so the product can create that outcome.

Probability a user buys this item (not "build a recommender")
Probability a support ticket belongs to billing (not "classify tickets")
Probability a transaction is fraudulent (not "detect fraud")

This is the prediction layer. It is not the final business goal. It is one step in the chain.

Level 3 — Decision policy

Define how the prediction turns into a real action. This is the layer most teams forget entirely.

Show top 10 ranked items
Send high-confidence tickets directly to a queue; route uncertain ones to triage
Block transactions above a fraud threshold; send borderline cases to human review

This layer determines whether the prediction is actually useful in practice. A great model that produces predictions nobody acts on is not an ML system. It is a science project.
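To make the decision layer concrete, here is a minimal sketch of the fraud policy from the list above: block above one threshold, send borderline cases to human review, approve the rest. The function name, the `Action` enum, and both threshold values are assumptions chosen for illustration, not recommended settings.

```python
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    REVIEW = "send_to_human_review"
    BLOCK = "block"


def fraud_decision(p_fraud: float,
                   block_threshold: float = 0.90,
                   review_threshold: float = 0.60) -> Action:
    """Turn a fraud probability into an action.

    Both thresholds are illustrative placeholders. In practice they
    would come from the error costs and the review team's capacity.
    """
    if not 0.0 <= p_fraud <= 1.0:
        raise ValueError("p_fraud must be a probability in [0, 1]")
    if p_fraud >= block_threshold:
        return Action.BLOCK
    if p_fraud >= review_threshold:
        return Action.REVIEW
    return Action.APPROVE


for score in (0.05, 0.72, 0.97):
    print(score, fraud_decision(score).value)
```

The model only produces `p_fraud`. Everything that happens to the transaction is decided by the policy around it, which is why the policy deserves its own layer in the stack.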

Running a bad idea through the stack

A vague proposal like "build a churn model" becomes specific and challengeable when you run it through the stack:

Business objective: reduce churn in the first 90 days by 10%
Product outcome: identify at-risk users early enough to trigger retention offers
Model goal: predict probability that a new user churns within 30 days
Decision policy: if risk score is above threshold, trigger a retention email and show an in-app setup prompt

Now you can ask real questions: Is 30-day churn the right target? Are the retention actions likely to change behavior? Is the prediction early enough to matter? Is the threshold cost-effective?

The framing stack does not answer these questions. It makes them visible so the team can answer them before building anything.
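One lightweight way to keep the stack honest is to write it down as a structured record and check which layers are still empty. The sketch below assumes a hypothetical `FramingStack` class; the field names simply mirror the four layers, and the churn values are the ones from the example above.

```python
from dataclasses import dataclass, fields


@dataclass
class FramingStack:
    business_objective: str
    product_outcome: str
    model_goal: str
    decision_policy: str

    def missing_layers(self) -> list[str]:
        """Return the names of layers that were left blank."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


churn = FramingStack(
    business_objective="Reduce churn in the first 90 days by 10%",
    product_outcome="Identify at-risk users early enough to trigger retention offers",
    model_goal="Predict probability that a new user churns within 30 days",
    decision_policy="If risk score is above threshold, send a retention email "
                    "and show an in-app setup prompt",
)

# The vague proposal only names a modeling idea.
vague = FramingStack("", "", "Build a churn model", "")

print(churn.missing_layers())  # []
print(vague.missing_layers())  # ['business_objective', 'product_outcome', 'decision_policy']
```

A non-empty field is not proof of a good answer. The check only surfaces layers nobody has written down yet, which is usually where the conversation needs to start.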

Think about it: Pick an ML feature from your company or a product you use often. Rewrite it as a framing stack: business objective, product outcome, model goal, and decision policy. Do not skip the decision policy; that is where vague ideas usually break.

Expert thinking

A senior engineer would not start by debating architecture. They would first test whether the system is legible across all four layers.

Example: "recommend creators to follow"

Business objective: improve 30-day user retention for new users
Product outcome: show each new user a more relevant feed earlier in their lifecycle
Model goal: predict probability that a user follows and continues engaging with a creator after exposure
Decision policy: rank candidate creators, show top results during onboarding, and cap repetition to avoid feed collapse

Why this is stronger than "build a recommender": it names the KPI instead of assuming engagement is enough, makes the workflow change explicit, defines a prediction that can be measured, and forces an action policy including constraints.

Self-assessment checklist:

Did you name one business KPI and a timeframe or context?
Did you separate the product behavior from the model prediction?
Did you describe what action happens after the prediction?
Did you avoid using "accuracy" as the business objective?

Why good offline metrics can still mean nothing

Offline model metrics tell you whether the model predicts its chosen target well on historical data. They do not prove the target was useful, the action policy works, or the user experience improved.

You can have:

A high-AUC fraud model that sends so many legitimate users into manual review that the economics are worse than before
A strong recommender that optimizes clicks but degrades long-term trust and subscription retention
A precise support classifier that routes tickets correctly, but by the time it fires, the customer has already churned from waiting

The pattern is consistent: the model wins at the narrow task, and the system loses at the broader goal. AUC improved and conversion dropped. That is a framing failure, not a modeling one.

This is why the framing stack matters. Before you tune a model, you need to prove that the prediction, the policy, and the KPI are actually connected.
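The fraud case can be checked with simple arithmetic. The sketch below prices a flag-and-review policy against doing nothing. All counts and dollar amounts are invented for illustration, and it assumes every flagged transaction is reviewed and that review stops all flagged fraud.

```python
def no_model_cost(tp: int, fn: int, avg_fraud_loss: int) -> int:
    """Without the model, every fraudulent transaction becomes a loss."""
    return (tp + fn) * avg_fraud_loss


def review_policy_cost(tp: int, fp: int, fn: int,
                       avg_fraud_loss: int, review_cost: int) -> int:
    """Missed fraud is still lost, and every flagged case costs a review."""
    return fn * avg_fraud_loss + (tp + fp) * review_cost


AVG_FRAUD_LOSS = 150  # hypothetical dollars lost per fraudulent transaction
REVIEW_COST = 5       # hypothetical dollars per manual review

# Loose threshold: catches 180 of 200 frauds but flags 10,000 transactions.
loose = review_policy_cost(tp=180, fp=9_820, fn=20,
                           avg_fraud_loss=AVG_FRAUD_LOSS, review_cost=REVIEW_COST)
# Tighter threshold: catches 150 of 200 frauds and flags 1,000 transactions.
tight = review_policy_cost(tp=150, fp=850, fn=50,
                           avg_fraud_loss=AVG_FRAUD_LOSS, review_cost=REVIEW_COST)
baseline = no_model_cost(tp=180, fn=20, avg_fraud_loss=AVG_FRAUD_LOSS)

print("no model:       ", baseline)  # 30000
print("loose threshold:", loose)     # 53000
print("tight threshold:", tight)     # 12500
```

Under these assumed costs, the same model is either worse than no model or clearly better, depending only on the policy. The offline metric cannot tell you which.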

[Figure: Offline Metrics vs Real Business Impact]

Should you use ML at all?

Before committing to a build, the first honest question is whether ML is the right tool. The answer is often no, and realizing that early is not a failure. It is good engineering.

Use ML when all of these are true:

The pattern is too complex or too brittle for explicit rules
The task repeats at enough scale to justify the system overhead
Data exists, or there is a credible path to create it
Errors can be tolerated, contained, or reviewed
Patterns evolve at a rate that makes rule maintenance impractical

Do not use ML when rules or simple heuristics already work well, when data is weak or blocked, when errors are catastrophic and cannot be contained, or when the maintenance economics do not justify the complexity.

A fast decision sequence: first, name the non-ML baseline (rules, search, manual review, simple statistics). Then identify whether the decision repeats often enough for ML gains to compound. Then price the error cost: what does a false positive or false negative actually cost? Then confirm that the features you need exist at serving time and that the labels you need can be collected for training and evaluation.
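The decision sequence above can be run as a quick pre-build check. This is a sketch only: the `MlProposal` fields, the `open_questions` helper, and the volume cutoff are hypothetical, and a real review would weigh these points rather than just list them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MlProposal:
    baseline: str                          # the non-ML alternative
    decisions_per_month: int
    false_positive_cost: Optional[float]   # None means nobody has priced it
    false_negative_cost: Optional[float]
    features_available_at_serving: bool


def open_questions(p: MlProposal,
                   min_decisions_per_month: int = 10_000) -> list[str]:
    """Walk the sequence and collect what is still unanswered.

    The volume cutoff is an arbitrary placeholder, not a rule.
    """
    issues = []
    if not p.baseline.strip():
        issues.append("name a non-ML baseline")
    if p.decisions_per_month < min_decisions_per_month:
        issues.append("decision volume may be too low for ML gains to compound")
    if p.false_positive_cost is None or p.false_negative_cost is None:
        issues.append("price the false positive and false negative cost")
    if not p.features_available_at_serving:
        issues.append("confirm the features exist at serving time")
    return issues


ticket_routing = MlProposal(
    baseline="forms plus routing rules plus manual triage",
    decisions_per_month=4_000,
    false_positive_cost=None,
    false_negative_cost=None,
    features_available_at_serving=True,
)

for issue in open_questions(ticket_routing):
    print("-", issue)
```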

Teams often justify ML because the problem feels sophisticated. "Interesting" is not a product requirement.

Think about it: A support team wants to route incoming tickets faster. Today they use forms, routing rules, and a small manual triage group. Should they use ML now, later, or not at all? Write your answer in three parts: (1) what baseline you would compare against, (2) one reason ML might help, (3) one reason rules might still be the better decision.

Expert thinking

A strong answer starts with the baseline, not the model.

Baseline: current routing accuracy, median time to assignment, and percent of tickets needing manual reroute
Reason ML may help: free-text tickets often contain messy signals that rigid rules miss, and volume may justify automated classification
Reason rules may still win: if ticket categories are stable, forms are structured, and volume is moderate, better forms plus a few routing rules may deliver most of the value at lower operational cost

The real decision is not "can a classifier work?" It is "is the classifier the cheapest reliable way to move the KPI?"

Self-assessment checklist:

Did you compare ML to a real baseline?
Did you mention data shape or decision volume?
Did you include operational cost, not just model quality?
Did you avoid assuming ML is automatically the advanced option?

The three ML product archetypes

Once you decide ML belongs, you need to decide how autonomous the system should be. This is not a modeling question. It is a risk and product design question.

Automation augmentation: the model replaces or augments deterministic logic. Use this when the system can outperform rules on a repetitive decision and still fall back safely. Success is measurable uplift against the baseline. Guardrails usually include confidence thresholds and hard business rules.

Human-in-the-loop: the model assists humans rather than acting alone. Use this when judgment matters, errors are costly, or you need to build trust before handing off decisions. Success is measured in time saved, queue quality, and lower human error rates.

Autonomous: the model acts without a human in the loop. Use only when latency, scale, or economics require direct action and failure can be tightly contained. Success here means a low failure rate, rollback readiness, and auditability, not accuracy alone.

The rule of thumb: the more autonomous the system, the more the engineering roadmap becomes risk management, monitoring, rollback, and policy design, not model tuning.
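For the automation augmentation archetype, the guardrails can be as simple as a wrapper around the model. The sketch below assumes a hypothetical ticket router: a hard business rule always wins, the model is trusted only above a confidence threshold, and everything else falls back to the existing rules. The keywords, queue names, and the 0.85 cutoff are all made up for illustration.

```python
def rule_route(ticket_text: str) -> str:
    """The pre-existing deterministic routing, kept as a safe fallback."""
    text = ticket_text.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    return "triage"


def route(ticket_text: str, model_label: str, model_confidence: float,
          min_confidence: float = 0.85) -> tuple[str, str]:
    """Return (queue, reason) so every decision can be audited later."""
    # Hard business rule: never let the model override this case.
    if "chargeback" in ticket_text.lower():
        return "billing", "hard_rule"
    # Trust the model only when it is confident enough.
    if model_confidence >= min_confidence:
        return model_label, "model"
    # Otherwise behave exactly as the system did before ML existed.
    return rule_route(ticket_text), "rules_fallback"


print(route("App crashes on login", "technical", 0.93))
print(route("Question about my invoice", "technical", 0.40))
print(route("Chargeback dispute on my card", "technical", 0.99))
```

Returning the reason alongside the queue is a small design choice that pays off later: it lets you measure how often the model actually acts, which is the uplift you are trying to prove.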

Think about it: You are designing an invoice anomaly system for finance operations. Should it be automation augmentation, human-in-the-loop, or autonomous? Explain: (1) the likely error cost, (2) which archetype you would start with, (3) what would have to be true before you moved to a more autonomous mode.

Expert thinking

A prudent starting point is human-in-the-loop.

Invoice anomalies often involve money movement, vendor trust, and compliance risk. False positives create review overhead, but false negatives can create financial exposure. A model can rank or flag suspicious invoices well before it should auto-block or auto-approve them.

To move toward autonomy, you would want: stable historical performance across segments, clear policy thresholds, proven reviewer agreement on high-confidence cases, and rollback and audit mechanisms.

This is a framing question because the archetype determines what "good enough" means. A model that is accurate enough for human-assisted review may not be safe enough for autonomous action.

Self-assessment checklist:

Did you tie the archetype to error cost?
Did you mention human workflow, not just the model?
Did you specify what evidence would justify more autonomy?

Choosing proxy labels — where good systems go wrong

The true outcome you care about is almost never directly observable. The system cannot directly observe "useful content," "high purchase intent," or "trustworthy answer." So it reaches for a proxy.

This is where many good-looking systems go quietly wrong.

Rate each candidate proxy label on five dimensions, each scored so that a higher number is better:

Alignment: how close is it to the real outcome?
Gaming resistance: does optimizing it encourage bad behavior? A higher score means safer.
Coverage: do you get enough labels across important segments?
Latency: how quickly does the label arrive? A higher score means faster.
Causal usefulness: if this proxy improves, is the KPI likely to move?

Ideal outcome	Proxy you might reach for	What goes wrong
Useful content	Click	Sensational content wins attention, degrades trust
Purchase intent	Add-to-cart	Many carts never convert
Quality answer	Thumbs-up	Only a small, biased group of users rates

The heuristic: sparse-but-aligned is almost always better than abundant-but-misleading. The easiest label to collect often has the most dangerous incentives baked in. Once the system is live, those incentives become product behavior.
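The five-dimension rating is easy to capture as a small scorecard. In the sketch below, the `ProxyScore` class is hypothetical and the 1 to 5 scores are invented to contrast an abundant proxy with a sparse one; higher is better on every dimension.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProxyScore:
    alignment: int
    gaming_resistance: int
    coverage: int
    latency: int
    causal_usefulness: int

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension (first one wins a tie)."""
        scores = asdict(self)
        return min(scores, key=scores.get)


# Illustrative scores only, on a 1-5 scale.
click = ProxyScore(alignment=2, gaming_resistance=1, coverage=5,
                   latency=5, causal_usefulness=2)
purchase = ProxyScore(alignment=5, gaming_resistance=4, coverage=2,
                      latency=3, causal_usefulness=4)

print("click weakest on:   ", click.weakest())     # gaming_resistance
print("purchase weakest on:", purchase.weakest())  # coverage
```

The scorecard deliberately does not sum the dimensions into one number. A proxy that is weak on gaming resistance is a different kind of problem from one that is weak on coverage, and a total would hide that.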

[Figure: Proxy Labels: Where Good Models Go Wrong]

Think about it: Pick a product you use regularly that has a recommendation, ranking, or content feed. Identify the proxy label the system is most likely using. Score it on the five dimensions: alignment, gaming resistance, coverage, latency, and causal usefulness. Which dimension is its biggest weakness, and what would a better proxy look like?

Expert thinking

A strong answer does not just name a proxy — it reveals the tradeoff structure.

Example: an e-commerce product feed most likely optimizes for click-through rate.

Alignment (2/5): click measures attention, not intent. Many clicks end in zero-second bounces.
Gaming resistance (1/5 — low resistance, high gaming risk): the system quickly learns that thumbnails, titles, and price anchoring drive clicks more than product relevance. Clickbait wins.
Coverage (5/5): dense signal available for all items and users.
Latency (5/5): immediate signal, no lag.
Causal usefulness (2/5): click uplift does not reliably predict purchase uplift, especially when users are browsing rather than buying.

Better proxy: a weighted combination of dwell-qualified clicks (bounces filtered out), add-to-cart, and purchase, with explicit downweighting of sessions where the three signals diverge from each other, for example many clicks with no cart or purchase activity.
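A weighted label of this kind could be sketched as follows. The weights, the 10-second dwell cutoff, and the function name are assumptions for illustration; the session-level downweighting is left out to keep the example short.

```python
def blended_label(clicked: bool, dwell_seconds: float,
                  added_to_cart: bool, purchased: bool,
                  min_dwell_seconds: float = 10.0) -> float:
    """Blend three signals into one training label in [0, 1].

    Weights are expressed in tenths (2 + 3 + 5 = 10) and are
    placeholders, not tuned values.
    """
    # A click only counts if the user stayed long enough to not be a bounce.
    qualified_click = clicked and dwell_seconds >= min_dwell_seconds
    points = 2 * qualified_click + 3 * added_to_cart + 5 * purchased
    return points / 10


print(blended_label(True, 1.5, False, False))   # bounce     -> 0.0
print(blended_label(True, 45.0, False, False))  # real click -> 0.2
print(blended_label(True, 45.0, True, True))    # purchase   -> 1.0
```

The bounce scores the same as no interaction at all, which removes most of the incentive for the model to learn clickbait.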

Self-assessment checklist:

Did you score all five dimensions, not just name the proxy?
Did you identify the single biggest weakness?
Did you propose a more aligned alternative, not just "better data"?

What can kill a production ML system before it ships

Many projects that look promising in a notebook die in production. The notebook hides the operating environment.

Eight common failure points:

Label scarcity or severe class imbalance
Features not available at serving time (training-serving skew)
Latency or throughput requirements the architecture cannot meet
Error cost underestimated — false positives or negatives are more expensive than expected
Distribution shift from seasonality, new users, new products, or behavior changes
Adversarial behavior — fraud patterns or spam adapt to the model
Compliance, privacy, or legal constraints that block the feature set
No real owner, monitoring budget, or operating model post-launch

The quietest killer is training-serving skew. It usually shows up as a data pipeline bug: a feature you trained on exists only in the offline dataset, and at serving time it is unavailable, delayed, or computed differently. The model degrades and nobody knows why. It is also a framing oversight, because the system was designed around features that only existed offline.
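A basic skew check compares the features the model was trained on with what the serving path actually produced for the same request. The sketch below assumes you can log one serving-time feature row and recompute the same row offline; the function and feature names are hypothetical.

```python
def skew_report(offline_row: dict[str, float],
                serving_row: dict[str, float],
                tolerance: float = 1e-6) -> dict[str, list[str]]:
    """Compare one example's features as seen offline and at serving time."""
    # Features the model expects but the serving path never supplied.
    missing = sorted(set(offline_row) - set(serving_row))
    # Features present in both places whose values disagree.
    mismatched = sorted(
        name for name in offline_row.keys() & serving_row.keys()
        if abs(offline_row[name] - serving_row[name]) > tolerance
    )
    return {"missing_at_serving": missing, "value_mismatch": mismatched}


offline = {"days_since_signup": 12.0,
           "orders_last_30d": 3.0,
           "refund_rate_lifetime": 0.1}
serving = {"days_since_signup": 12.0,
           "orders_last_30d": 1.0}  # e.g. the online aggregate lags behind

print(skew_report(offline, serving))
```

Running a check like this on a sample of logged requests before launch turns a silent degradation into a visible list of feature names.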

[Table: Feasibility Audit Table]

How to measure the right things

Keep three layers of measurement separate and never collapse them into one:

Business KPI — what leadership or the product owner cares about. Retention. Revenue. Error rate. Something that moves money or users.

Model metric — how well the model performs on its prediction task. Precision, recall, AUC, NDCG. This is an internal signal, not a business result.

Satisficing constraints — conditions the system must not violate. Latency under 200ms. Precision above 0.9 on the high-stakes segment. Fairness check passing. Review-load staying below what the ops team can handle.

Pick one primary model metric. If you have conflicting objectives (quality, engagement, safety), score them separately and combine them in the decision policy. Do not try to encode every tradeoff inside one opaque objective function.
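One way to keep the primary metric and the satisficing constraints separate is to filter first and optimize second. In the sketch below, the 200 ms and 0.9 limits come from the examples above; the review capacity, the candidate numbers, and all names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    auc: float                     # the one primary model metric
    p99_latency_ms: float
    high_stakes_precision: float
    daily_reviews: int


def satisfies(c: Candidate, max_latency_ms: float = 200.0,
              min_precision: float = 0.9,
              max_daily_reviews: int = 500) -> bool:
    """Constraints are pass/fail. They are never traded against the metric."""
    return (c.p99_latency_ms < max_latency_ms
            and c.high_stakes_precision > min_precision
            and c.daily_reviews <= max_daily_reviews)


def pick(candidates: list[Candidate]) -> Optional[Candidate]:
    """Best primary metric among the candidates that pass every constraint."""
    feasible = [c for c in candidates if satisfies(c)]
    return max(feasible, key=lambda c: c.auc) if feasible else None


candidates = [
    Candidate("large", auc=0.91, p99_latency_ms=350, high_stakes_precision=0.94, daily_reviews=300),
    Candidate("medium", auc=0.88, p99_latency_ms=120, high_stakes_precision=0.93, daily_reviews=400),
    Candidate("small", auc=0.86, p99_latency_ms=40, high_stakes_precision=0.91, daily_reviews=450),
]

best = pick(candidates)
print(best.name if best else "no candidate satisfies the constraints")  # medium
```

The highest-AUC candidate loses here because it breaks the latency limit. Whether any of the three moves the business KPI is a separate question that only a live experiment can answer.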

