Why Your Propensity Model Targets the Wrong People

Rank by purchase likelihood and most of the budget lands on people who were buying anyway. Uplift modeling targets the people whose outcome the spend actually changes.

You have a budget. It pays for 50,000 discount codes, or 50,000 retention calls, or 50,000 pieces of direct mail, and someone has to decide which 50,000 customers receive them. The data team builds a propensity model, scores every customer on how likely they are to buy or to churn, and hands you a ranked list. You target from the top. The campaign reports a healthy conversion rate. Everyone signs off.

That campaign just wasted a large part of its budget, and the report it produced will never tell you so. The waste is invisible because it looks exactly like success. This is the single most common targeting mistake in marketing, and it is not a flaw in the propensity model. The model did its job. It answered the question it was asked. The mistake is that the question was wrong.

This post is about that mistake and the model that fixes it. It does not redefine propensity, lookalike and uplift modeling from scratch. A sibling post on the three model types does that. This one stays on the targeting decision itself: why ranking by propensity sends your money to the wrong people, and what to do instead.

Origin: a list that was always answering a different question

Propensity scoring as marketers use it grew out of database marketing and credit scoring. The machinery that estimated whether a borrower would default got pointed at a friendlier question: will this customer buy, churn, open, upgrade. The output is a probability for each person, and a probability sorts cleanly into a ranked list. That list feels like the natural input to a campaign. It is the wrong input, and the reason was spelled out decades ago.

In the late 1990s, analysts running direct mail noticed that a campaign with a strong response rate was not necessarily producing extra sales. Some of the responders were always going to buy. The idea, worked out by Nicholas Radcliffe and Patrick Surry in 1999 and by Victor Lo in 2002, was that a response is not the same thing as an effect. A 2007 paper by Stochastic Solutions on retention activity pushed the point further still: the act of trying to retain certain customers can be what provokes them to leave. The propensity list cannot see any of this. It was built to rank likelihood, and likelihood is not what a campaign changes.

Present: the four groups your propensity list hides

Start with what a high-propensity list actually contains. Sort customers by purchase propensity and the top fills with people scoring 0.9 and above. They are very likely to buy. You send them the discount. Many buy. The conversion rate looks excellent. Now ask the question the report does not: how many of those buyers would have bought without the code? A customer the model put at 0.92 was, by the model's own estimate, almost certain to purchase already. The discount did not persuade that person. It cut the price on a sale you already had and handed away the margin.

Uplift modeling makes this visible by splitting every customer into four groups, defined not by how likely they are to act but by how the treatment changes them.

Sure things act whether or not you treat them. They buy, they renew, they convert, and the offer changes nothing except your margin. The scikit-uplift documentation is blunt about them: there is no motivation to spend budget here because it has no effect. These people dominate the top of a propensity list almost by definition, because being highly likely to act is exactly what puts them there.

Lost causes do not act either way. The treatment is wasted on them too, just less expensively, since at least you are not discounting a sale.

Persuadables act only if treated. This is the one group where the spend creates a result that would not otherwise exist. Faculty.ai calls the alternative revenue cannibalisation: sending discounts to customers who would have bought the same products at full price. Persuadables are the opposite of that. They are the only group worth targeting.

Sleeping dogs, also called do-not-disturbs, are the dangerous group. The treatment makes their behavior worse. A retention call reaches a quiet customer who had not been thinking about leaving, reminds them their contract is ending, and prompts them to shop around. A discount email makes a happy buyer suspicious of the price they paid last time. Statistics.com describes the disgruntled customer spurred into action by a marketing message, or the customer reminded by your outreach to chase a competitor's offer. Contacting a sleeping dog does not waste money. It spends money to destroy value.
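The four groups reduce to a lookup on two yes/no outcomes: what a customer does if treated, and what they do if left alone. The sketch below is only an illustration, and the function name `segment` and its arguments are assumptions made for the example. In practice nobody observes both outcomes for the same person, which is why the groups have to be estimated rather than read off a table.

```python
def segment(acts_if_treated: bool, acts_if_untreated: bool) -> str:
    """Name the uplift group for one customer's pair of potential outcomes.

    "Acts" means the desired action (buys, renews). Hypothetical helper.
    """
    if acts_if_treated and acts_if_untreated:
        return "sure thing"      # acts either way; the offer only costs margin
    if not acts_if_treated and not acts_if_untreated:
        return "lost cause"      # does not act either way
    if acts_if_treated and not acts_if_untreated:
        return "persuadable"     # acts only because of the treatment
    return "sleeping dog"        # would have acted, but the treatment stops them


for treated in (True, False):
    for untreated in (True, False):
        print(treated, untreated, segment(treated, untreated))
```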

Here is the core problem. A propensity model cannot tell these four groups apart, because likelihood does not separate them. A sure thing and a persuadable can both score 0.8. One is 0.8 with the offer and 0.8 without it. The other is 0.8 with the offer and 0.3 without it. Same score, opposite value to the campaign, and the model sees identical twins. As Applied AI frames it, a customer at 80 percent with the offer and 10 percent without is worth 70 points of uplift, a customer at 80 percent both ways is worth zero, and a response model scores them the same. Worse, sleeping dogs often score high too, so a propensity-ranked campaign quietly pays to push some of its best customers toward the exit.
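The identical-twins problem is easy to see with the two customers from the paragraph above. This is a minimal sketch: the `Customer` class and its field names are hypothetical, and the two probabilities per person are assumed to be known, which in real data they never are.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    p_if_treated: float    # probability of buying with the offer
    p_if_untreated: float  # probability of buying without it

    @property
    def uplift(self) -> float:
        # The change in probability that the offer causes.
        return self.p_if_treated - self.p_if_untreated


# Hypothetical customers using the figures from the text.
customers = [
    Customer("sure thing", 0.8, 0.8),
    Customer("persuadable", 0.8, 0.3),
]

for c in customers:
    # A likelihood ranking sorts on a single probability; both score 0.8 here.
    print(f"{c.name}: score={c.p_if_treated:.2f} uplift={c.uplift:+.2f}")
```

Sorting on the score column ties the two customers. Sorting on the uplift column puts the persuadable first and shows the sure thing contributing nothing.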

This is why a propensity-ranked campaign concentrates spend on sure things. They sit at the top of the list. The budget flows to the people who needed the nudge least, while the persuadables, who may sit in the unremarkable middle of the propensity distribution, get skipped. As Customer Science puts it, targeting high propensity can look successful while adding little net revenue once you account for what the control group did anyway.

The evidence that the right list is a different list

The clearest proof comes from retention, where sleeping dogs are common and the cost of waking them is real. The Stochastic Solutions paper reports two anonymized mobile operators. Operator 1 ran a campaign that cut churn across the target group from 30 percent to 25 percent, a clear win. Operator 2 ran a campaign that pushed churn up, from 9 percent to 10 percent. It was actively losing customers. Then both were re-targeted by uplift.

For Operator 2, uplift modeling found that the campaign worked for roughly 30 percent of the file, cutting churn by a percentage point for that group, while the negative effect on the rest more than wiped out the gain. Targeting only that savable 30 percent turned a loss-making campaign into a useful one. For Operator 1, already successful, uplift found a second kind of win: target the right 78 percent or so and overall churn drops by 6 percentage points instead of 5, more retention from less contact. The paper estimates the annual incremental impact of adopting uplift at roughly 4 million euros per million customers for Operator 1 and 8 million for Operator 2, at an ARPU of 400 euros. The Operator 2 number is larger because the campaign it replaced was destroying value.
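The Operator 1 figure can be sanity-checked with back-of-envelope arithmetic. This is one reading of the numbers, not the paper's stated method. It assumes the churn reduction applies across a base of one million customers and that each extra retained customer is worth one year of ARPU. Only the Operator 1 figure is worked here.

```python
CUSTOMERS = 1_000_000
ARPU_EUR = 400  # annual revenue per user, from the text

# Operator 1: churn reduction in percentage points, before and after
# re-targeting by uplift (figures from the text).
reduction_before_pp = 5
reduction_after_pp = 6

# Assumption: the extra percentage point applies to the whole base.
extra_customers_kept = CUSTOMERS * (reduction_after_pp - reduction_before_pp) // 100
extra_revenue_eur = extra_customers_kept * ARPU_EUR

print(extra_customers_kept)  # 10000
print(extra_revenue_eur)     # 4000000
```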

The pattern shows up in other channels. Radcliffe's winning entry to Hillstrom's MineThatData email challenge, built on a clean test of two email campaigns against a control group, found that the women's mailing in particular appeared to reduce spending for some customer segments rather than lift it. A propensity ranking would have buried that signal completely. Practitioner reports from banking and telecom credit causal targeting with revenue gains in the 29 to 59 percent range over prior methods, and one top-20 US financial institution running a home-equity cross-sell cut mailing volume by nearly 40 percent across two campaigns while pushing incremental revenue well past triple what comparable past campaigns earned. The figures vary because the use cases vary. The direction does not.

The most useful number in this whole area is one that looks like a loss. The California Management Review reports a case where the plain treated group converted at 18.04 percent and the uplift-optimized group at a lower 17.32 percent, and yet the uplift group earned more profit per customer, 5.46 dollars against 5.18. A lower conversion rate was the better result, because the conversions it bought were incremental rather than free giveaways. The same article describes a campaign that cut the number of customers contacted by 80 percent, taking cost from 400,000 dollars to 80,000, with no loss of renewals, because most of the dropped customers were sure things who would have renewed regardless.
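The figures in this paragraph are worth laying side by side. The sketch below only restates the numbers quoted above in integer units to keep the arithmetic exact. The variable names are made up, and it assumes campaign cost scales linearly with the number of customers contacted.

```python
# Conversion rates in basis points and profit per customer in cents,
# so the differences are exact integers.
treated_conversion_bp, uplift_conversion_bp = 1804, 1732
treated_profit_cents, uplift_profit_cents = 518, 546

conversion_gap_bp = treated_conversion_bp - uplift_conversion_bp
profit_gain_cents = uplift_profit_cents - treated_profit_cents

# Second case: contact volume cut by 80 percent.
full_cost_usd = 400_000
contact_cut_pct = 80
# Assumption: cost is proportional to the number of customers contacted.
reduced_cost_usd = full_cost_usd * (100 - contact_cut_pct) // 100

print(conversion_gap_bp)   # 72, i.e. 0.72 points fewer conversions
print(profit_gain_cents)   # 28, i.e. 0.28 dollars more profit per customer
print(reduced_cost_usd)    # 80000
```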

How an uplift-targeted campaign is actually run

The change in practice is smaller than it sounds. Four steps.

First, model the uplift. This means estimating, for each customer, the probability they act if treated minus the probability they act if not treated. The methods (meta-learners, uplift trees, causal forests) differ in how they get there, but the output is one number per person: the causal effect of your treatment on that person, which can be positive, near zero, or negative.
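The crudest possible version of this estimate needs no model at all: within each customer segment, subtract the control group's response rate from the treated group's. The sketch below does that with the standard library. It is not a meta-learner, an uplift tree or a causal forest, only a stand-in that shows the shape of the output. The function name, the record layout and the toy segments are all assumptions.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment as treated rate minus control rate.

    `records` is an iterable of (segment, treated, acted) tuples from a
    randomized test. Segments missing either arm are skipped.
    """
    # counts[segment][treated] -> [number who acted, number in the arm]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for seg, treated, acted in records:
        cell = counts[seg][bool(treated)]
        cell[0] += int(acted)
        cell[1] += 1

    uplift = {}
    for seg, arms in counts.items():
        (t_acted, t_n), (c_acted, c_n) = arms[True], arms[False]
        if t_n == 0 or c_n == 0:
            continue  # no comparison possible without both arms
        uplift[seg] = t_acted / t_n - c_acted / c_n
    return uplift


def arm(seg, treated, n_acted, n_total):
    """Build toy records for one arm of one segment (made-up data)."""
    return [(seg, treated, True)] * n_acted + [(seg, treated, False)] * (n_total - n_acted)


records = (
    arm("new", True, 30, 100) + arm("new", False, 10, 100)        # positive uplift
    + arm("loyal", True, 80, 100) + arm("loyal", False, 80, 100)  # no uplift
    + arm("quiet", True, 85, 100) + arm("quiet", False, 95, 100)  # negative uplift
)

for seg, u in uplift_by_segment(records).items():
    print(f"{seg}: {u:+.2f}")
```

A fitted model replaces the hand-made segments with something learned from customer features, but the quantity being estimated is the same difference.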

Second, rank by that number instead of by propensity. A walkthrough on Towards Data Science frames the shift cleanly: stop preaching to the converted, and rank by whose probability your offer actually moves rather than whose probability is already high. This is the same instinct behind incrementality measurement, which separates the conversions a campaign caused from the ones it merely sat next to.

Third, target the top of the uplift ranking, sized to your budget. These are your persuadables.

Fourth, and this is the step propensity campaigns never have, exclude the negative-uplift customers outright. Customer Science is direct that do-not-disturbs should be explicitly excluded, not merely deprioritized. A churn intervention guide on the Data Science Collective puts it as a rule: negative uplift means the treatment hurts, so avoid. Skipping sure things saves margin. Skipping lost causes saves spend. Skipping sleeping dogs prevents harm.
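Steps two through four fit in a few lines once each customer has an uplift estimate. The following is a minimal sketch under that assumption. `select_targets` and the toy scores are hypothetical, and treating anything below zero as a do-not-disturb ignores estimation noise, which a real campaign would need to allow for.

```python
def select_targets(uplift_scores, budget):
    """Return (targets, excluded) given {customer_id: estimated uplift}.

    Targets are the highest positive-uplift customers, capped at `budget`.
    Excluded are customers with negative estimated uplift, who are never
    contacted even if budget is left over. Zero-uplift customers are
    simply skipped.
    """
    excluded = sorted(cid for cid, u in uplift_scores.items() if u < 0)
    positive = [(cid, u) for cid, u in uplift_scores.items() if u > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    targets = [cid for cid, _ in positive[:budget]]
    return targets, excluded


# Made-up scores: one strong persuadable, one sure thing or lost cause (0.0),
# one sleeping dog, and two weaker persuadables.
scores = {"a": 0.50, "b": 0.00, "c": -0.10, "d": 0.12, "e": 0.03}
targets, excluded = select_targets(scores, budget=2)
print(targets)   # ['a', 'd']
print(excluded)  # ['c']
```

Keeping the exclusion list separate from the ranking is deliberate: raising the budget adds more persuadables but never reaches into the negative-uplift group.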

Future and impact: the honest caveats

Uplift targeting is better, not free, and the costs are real enough that it is not always the right call.

It needs an experiment. Propensity learns from history, because the outcome it predicts was observed. Uplift predicts a difference between a world that happened and a world that did not, and no customer is ever seen in both, so the training data has to contain comparable customers on each side. That means a randomized treatment group and control group, designed in before the campaign runs, not bolted on afterward. The Stochastic Solutions paper is firm that properly randomised control groups are the only reliable way to measure the true impact of an intervention.
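Designing the control group in can be as simple as a seeded shuffle before the campaign goes out. The sketch below is one way to do it. The function name, the 20 percent control share and the fixed seed are assumptions chosen for illustration, not a recommendation from any of the sources cited here.

```python
import random


def assign_arms(customer_ids, control_share=0.2, seed=0):
    """Randomly split customers into treatment and control arms."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    ids = list(customer_ids)
    rng.shuffle(ids)            # random order removes any selection bias
    n_control = round(len(ids) * control_share)
    return {"control": ids[:n_control], "treatment": ids[n_control:]}


arms = assign_arms(range(1000))
print(len(arms["treatment"]), len(arms["control"]))  # 800 200
```

The control arm is held out of the campaign entirely, and its outcomes are recorded alongside the treated arm's so that both sides are available for training.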

It is noisier. The treatment effect is usually small next to the raw outcome rate, so the signal you want is easily drowned by the variation you do not. The same paper notes that for one operator the uplift was about a fifth of the churn rate, and for the other roughly a tenth. Uplift models are genuinely harder to fit and to validate than propensity models, and they need cleaner data to behave.
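The noise problem can be put in numbers with the standard error of a difference between two proportions. The sketch below uses the 30 and 25 percent churn rates from the Operator 1 example. The arm sizes are made up, and the function name is an assumption.

```python
from math import sqrt


def uplift_standard_error(p_treated, n_treated, p_control, n_control):
    """Standard error of (p_treated - p_control) for two independent arms."""
    return sqrt(
        p_treated * (1 - p_treated) / n_treated
        + p_control * (1 - p_control) / n_control
    )


# Churn rates from the Operator 1 example; arm sizes are hypothetical.
p_control, p_treated = 0.30, 0.25
effect = p_control - p_treated  # churn reduction of about 5 points

for n in (500, 5_000, 50_000):
    se = uplift_standard_error(p_treated, n, p_control, n)
    print(f"n per arm={n:>6}  effect={effect:.3f}  se={se:.4f}  ratio={effect / se:.1f}")
```

With a few hundred customers per arm the effect is less than two standard errors wide, and that is for the campaign as a whole. Splitting the file into segments to find who responds shrinks each arm further, which is why uplift models need more data than propensity models to say anything reliable.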

And sometimes propensity is the right tool anyway. If the treatment is nearly free, like an email, and sleeping dogs are rare, the waste from targeting sure things is small and a propensity ranking is the cheaper, simpler choice. The California Management Review piece is candid that uplift is not always worth it. Uplift earns its keep when the treatment is expensive, when capacity is limited, or when contact can backfire. A discount gives away margin. A retention call can wake a sleeping dog. Those are the campaigns where targeting the wrong list is most expensive, and where modeling the effect instead of the likelihood pays for itself.
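A rough break-even check makes the trade-off concrete: a contact pays for itself only when the uplift times the value of a conversion exceeds the cost of the contact. This is a simplified sketch with hypothetical figures. It ignores the margin given away on sure things and the cost of building the model, and `worth_contacting` is a made-up name.

```python
def worth_contacting(uplift, value_per_conversion, cost_per_contact):
    """True if the expected incremental value of a contact exceeds its cost."""
    return uplift * value_per_conversion > cost_per_contact


# Hypothetical figures: a conversion worth 50, and a 2-point uplift.
print(worth_contacting(0.02, 50.0, 0.01))   # near-free email: True
print(worth_contacting(0.02, 50.0, 5.00))   # expensive call, same uplift: False
print(worth_contacting(-0.03, 50.0, 0.01))  # sleeping dog, any cost: False
```

When the contact is nearly free, almost any positive uplift clears the bar, so knowing exactly who is persuadable matters less. As the cost per contact rises, or as negative uplift becomes common, fewer customers clear it and the ranking has to be right.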

The reframe is the whole point. Stop asking who is likely to act. Start asking whose behavior your money can actually change. That second list is shorter, it costs less to serve, and it is the only one that was ever worth buying.

Council summary

This post argues that ranking customers by propensity, the default move in most campaigns, systematically aims budget at the people least worth aiming it at. A propensity score measures who is likely to act, not whose action your spend causes, and those are different lists: high scorers are dominated by sure things who would convert anyway and salted with sleeping dogs your outreach actively pushes out the door. The fix is the four-quadrant uplift framework, which scores the causal effect of treatment per customer so you can target persuadables and explicitly exclude negative-uplift contacts. The evidence is concrete, from telecom retention campaigns turned profitable by targeting only the savable 30 percent to a documented case where a lower conversion rate produced higher profit per customer. The reader's takeaway is a single reframe with real money attached: stop asking who is likely to act, start asking whose behavior your money can change, and reserve uplift for the expensive, capacity-limited, or backfire-prone campaigns where that distinction pays for itself.
