Picture the segment a marketing deck still describes with a stock photo: a 38-year-old woman, university educated, lives in a mid-size city, household income in a comfortable band. Now ask the only question that matters for a campaign: will she buy?
The honest answer is that you have no idea. That description fits a person who orders from you every three weeks and a person who bought once two years ago and forgot you exist. It fits a heavy discount hunter and someone who never opens a promotion. Age, gender, postcode and income tell you who a customer is. They say almost nothing about what a customer does, which is the thing you are trying to predict.
This is the case for moving past demographics, and for the two methods most teams run instead. One is a scoring scheme simple enough to build in a spreadsheet. The other is a clustering algorithm that sits behind a large share of segmentation projects and breaks, quietly, on assumptions almost nobody checks. Both are worth knowing well, including the parts vendors skip.
Why demographics are a weak predictor
Demographic segmentation divides a market by traits like age, gender, income and location. It is easy to collect, easy to explain, and it has a real use in media planning, where you buy access to a broad population. The problem starts when you treat it as a behavior model.
The flaw is an assumption hiding in plain sight. Demographic segmentation assumes that people who share characteristics will act alike. They frequently do not. Two households at the same income and life stage can have opposite relationships with your brand, and a demographic model cannot see the difference because the difference is behavioral. As the analytics firm Circana frames the contrast, demographics describe who a person is while behavioral data captures what they actually do, and actual purchasing is one of the most reliable signals of future purchasing you can get.
Behavioral segmentation groups customers by what they have done: purchases, frequency, recency, channels, products, responses to past offers. It is harder to assemble because it needs real transaction and event history, but it is far more predictive. Past behavior is the closest thing you have to a preview of future behavior, and a demographic label is a guess about behavior dressed up as a fact. The rest of this piece is about the two behavioral methods you will meet first.
RFM: three numbers that refuse to die
RFM is the most durable customer segmentation method in marketing, and it is almost embarrassingly simple. It scores every customer on three behavioral questions.
Recency: how long since their last purchase. Frequency: how many times they have bought. Monetary value: how much they have spent in total. The idea is old. As TechTarget notes, the concept traces to a 1995 Marketing Science paper by Jan Roelf Bult and Tom Wansbeek on optimal selection for direct mail, born in the catalog era when every wasted mailing cost real postage. Catalog marketers needed to know who to send the expensive printed book to, and RFM was the answer.
The scoring is the part that makes it stick. You sort customers on each of the three measures and split them into five buckets, usually by quintile, scoring each from 1 to 5. A 5 on recency means a recent buyer, a 1 means a lapsed one. Do the same for frequency and monetary value and every customer carries a three-digit code. As CleverTap describes it, that produces up to 125 possible cells, from 111 at the bottom to 555 at the top. In practice teams collapse those into a handful of named groups: champions, loyal customers, at-risk, hibernating, lost. Braze and Klaviyo both ship this logic as a standard feature, which tells you how settled it is.
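The scoring described above fits in a few lines. The following is a minimal sketch, not any vendor's implementation: the customer table, the `quintile_scores` helper and the cutoffs in `name_segment` are all hypothetical and exist only to show how three quintile scores become a three-digit code and then a named group.

```python
# Hypothetical per-customer summary built from an order table:
# customer_id -> (days_since_last_order, order_count, total_spend)
customers = {
    "a": (4, 6, 300.0),
    "b": (31, 2, 40.0),
    "c": (204, 3, 950.0),
    "d": (12, 9, 250.0),
    "e": (548, 1, 20.0),
}


def quintile_scores(values, higher_is_better=True):
    """Rank customers on one measure and score them 1 (worst) to 5 (best)."""
    # Worst customer first: ascending when high values are good,
    # descending when low values are good (recency measured in days).
    ranked = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ranked)
    # Rank position i falls into quintile i * 5 // n; ties are split by sort order.
    return {cust: i * 5 // n + 1 for i, cust in enumerate(ranked)}


recency = quintile_scores({c: v[0] for c, v in customers.items()}, higher_is_better=False)
frequency = quintile_scores({c: v[1] for c, v in customers.items()})
monetary = quintile_scores({c: v[2] for c, v in customers.items()})

# Concatenate the three scores into the familiar three-digit RFM code.
codes = {c: f"{recency[c]}{frequency[c]}{monetary[c]}" for c in customers}


def name_segment(r, f, m):
    """Illustrative collapse of the 125 cells into a few named groups."""
    if r >= 4 and f >= 4:
        return "champion"
    if r <= 2 and m >= 4:
        return "at-risk high spender"
    if r == 1:
        return "lost"
    return "other"


segments = {c: name_segment(recency[c], frequency[c], monetary[c]) for c in customers}

for c in customers:
    print(c, codes[c], segments[c])
```

With these five invented customers, "a" scores 544 and lands in the champion group, while "c" scores 235: a large historical spender who has not bought in months. Real cutoffs for the named groups are a business decision, which is part of why the method is easy to explain.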
RFM is durable for reasons worth naming. It needs no machine learning and no data science team. It runs on data every business already has, the order table, so it survives the loss of third-party cookies untouched because it never needed them. The segments are self-explaining: anyone in the room understands what an at-risk high spender is and what to do about one. And it works. Each measure is a genuine signal, and recency in particular is a strong predictor of whether someone buys again.
Now the limits, because they are real and they are the reason the next method exists. RFM is descriptive and backward-looking. It is a clean summary of what a customer has already done, and it does not forecast anything. It will not tell you which new customer is likely to become a champion, only which existing one already is. It is also, by construction, a three-variable model. Recency, frequency and monetary value are useful, but they ignore everything else you might know: product categories, browsing depth, support tickets, channel, time between orders. A vendor write-up on the limits of RFM puts it plainly: RFM is descriptive analytics, a read of the past, while prediction is a separate job. RFM is an excellent dashboard. It is not a crystal ball, and it cannot see past three columns.
K-means: grouping by closeness, in many dimensions
K-means is the answer to the second limit. It is the clustering algorithm most teams reach for when three variables are not enough, and the idea behind it is simple once stated plainly.
You decide up front how many groups you want and call that number k. Suppose k is four. The algorithm drops four points into your data, the cluster centers, and assigns every customer to whichever center is nearest. Then it moves each center to the average position of its assigned customers and reassigns everyone again. Centers shift, memberships shift, repeat. After enough rounds the centers stop moving and you have your clusters. Formally, as Wikipedia describes it, the method partitions observations into k clusters where each observation belongs to the cluster with the nearest mean. It has a long history: Stuart Lloyd developed the standard procedure at Bell Labs in 1957 but did not publish it until 1982, Hugo Steinhaus described a related idea in 1956, Edward Forgy published essentially the same method in 1965, and James MacQueen gave it the name k-means in 1967.
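The assign-then-move loop just described can be sketched in plain Python. This is a teaching sketch under stated assumptions, not a production implementation: the two-feature customer points are invented, k is set to 2 to keep the data tiny, and the starting centers are simply k customers picked at random with a fixed seed.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Average position of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen customers
    clusters = []
    for _ in range(max_rounds):
        # Assignment step: each customer joins the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members
        # (an empty cluster keeps its old center).
        updated = [centroid(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
        if updated == centers:  # centers stopped moving
            break
        centers = updated
    return centers, clusters


# Hypothetical customers described by two already-comparable features.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On data this cleanly separated, the loop settles on the two obvious groups regardless of which customers it starts from. Real data offers no such guarantee, which is why library implementations typically run several restarts and keep the best result.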
The reason to use it over RFM is dimensionality. RFM works with exactly three variables and a fixed rulebook of cutoffs. K-means works with as many variables as you give it: recency, frequency and monetary value plus product mix, session counts, discount sensitivity, days since first order, and more. It needs no predefined thresholds. Instead of you declaring that a frequency above some number counts as loyal, the algorithm finds the groupings that minimize distance within each cluster. RFM imposes a grid you designed. K-means proposes a grouping it discovered. That is the real difference: rule-based versus pattern-found.
That sounds like a clear upgrade. It often is. But the discovery comes with assumptions, and the assumptions are where careful teams get caught.
The assumptions that quietly break K-means
K-means almost never throws an error. It runs, it returns clusters, the clusters get names and a slide. The failure is silent, and it has four sources worth knowing before you trust the output.
First, K-means assumes clusters are roughly round and roughly equal in size. The algorithm minimizes squared distance to a center, which pulls it toward compact, spherical blobs of similar scale. The scikit-learn documentation is blunt about this: its measure of cluster quality, inertia, assumes clusters are convex and isotropic and responds poorly to elongated clusters or irregular shapes. Real customer behavior does not arrange itself into neat spheres. If your true segments are stretched out or wildly different in size, K-means will carve the space the wrong way and hand you confident, tidy, incorrect groups.
Second, K-means is highly sensitive to feature scaling. It measures distance, and distance has units. If spend runs into the thousands and order count runs from one to twenty, the spend variable dominates every distance calculation simply because its numbers are bigger, and order count is almost ignored. A 2024 study on neglecting feature scaling in K-means found that on data with mixed units, high-magnitude variables took over cluster assignment, and standardizing the features first produced more accurate and interpretable clusters. Skip the scaling step and you have not segmented your customers, you have sorted them by whichever column has the largest numbers.
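A small sketch makes the scaling problem concrete. The five customers below are invented, and z-score standardization is only one reasonable choice of scaling. The code asks who customer "ann" is closest to, first in raw units and then after each column has been standardized.

```python
import math
from statistics import mean, pstdev

# Hypothetical customers: name -> (total_spend, order_count)
customers = {
    "ann": (1000.0, 2),
    "bob": (1040.0, 20),
    "cat": (1200.0, 3),
    "dan": (3000.0, 12),
    "eve": (200.0, 10),
}


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows.values()))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) for col in cols]  # assumes no column is constant
    return {
        name: tuple((v - mu) / sd for v, mu, sd in zip(vals, mus, sds))
        for name, vals in rows.items()
    }


def nearest(name, rows):
    """Name of the customer with the smallest Euclidean distance to `name`."""
    others = (other for other in rows if other != name)
    return min(others, key=lambda other: math.dist(rows[name], rows[other]))


print(nearest("ann", customers))               # raw units
print(nearest("ann", standardize(customers)))  # standardized units
```

In raw units the nearest neighbour is "bob", because a 40-unit gap in spend looks tiny even though bob orders ten times as often. After standardizing, the nearest neighbour is "cat", whose spend and order count are both close to ann's. K-means makes this comparison for every customer on every round, so the choice of units decides the clusters.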
Third, K-means forces every customer into exactly one cluster. There is no maybe, no leftover bin, no noise category. A genuine outlier, a customer whose behavior resembles nobody else, still gets assigned to a cluster and still drags that cluster's center toward itself. The algorithm is known to be sensitive to outliers because it has no built-in mechanism to detect them, a gap that has produced a research literature on outlier-aware variants of K-means. Plain K-means lets a handful of extreme customers quietly distort a segment for everyone in it.
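The pull an outlier exerts on a center is plain arithmetic, since a center is just a mean. The sketch below uses an invented segment of four ordinary customers plus one extreme buyer and compares the center with and without that buyer.

```python
import math

# Hypothetical segment: (orders_per_year, average_order_value)
typical = [(4.0, 50.0), (5.0, 55.0), (6.0, 45.0), (5.0, 50.0)]
outlier = (60.0, 900.0)


def centroid(points):
    """Average position of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


clean_center = centroid(typical)
pulled_center = centroid(typical + [outlier])

print("center without outlier:", clean_center)
print("center with outlier:   ", pulled_center)

# How far each ordinary customer sits from the two candidate centers.
for p in typical:
    print(p, round(math.dist(p, clean_center), 1), round(math.dist(p, pulled_center), 1))
```

Without the extreme buyer the center is (5.0, 50.0), which looks like every member of the segment. With that buyer included it moves to (16.0, 220.0), a profile that matches none of the five customers. Any description of the segment written from that center would be wrong about all of them.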
Fourth, you have to pick k yourself, and the tools for picking it are imperfect. The two standard aids are the elbow method, which plots within-cluster error against k and looks for the bend where adding clusters stops helping much, and the silhouette score, which rates how cleanly separated the clusters are. Both are useful and neither is decisive. As Built In notes, real data often shows no clear elbow at all, and reading the curve is subjective. A 2022 paper by Erich Schubert went further, arguing in its title that the elbow method should be dropped entirely because it lacks theoretical support and leads to poor conclusions. The number of segments is a judgment call wearing the costume of a calculation.
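To show what the elbow method actually computes, here is a minimal sketch that runs a small K-means for several values of k and records the within-cluster squared error. Everything in it is an assumption made for illustration: the nine points are synthetic, and the deterministic farthest-point seeding is chosen only to make the sketch reproducible, whereas libraries typically use randomized seeding with restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Average position of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def farthest_point_seeds(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return seeds


def inertia_for_k(points, k, max_rounds=100):
    """Run the assign-and-update loop, then return within-cluster squared error."""
    centers = farthest_point_seeds(points, k)
    clusters = []
    for _ in range(max_rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        updated = [centroid(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sum(sq_dist(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)


# Synthetic data with three obvious groups.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 1.0), (9.0, 1.0), (8.0, 2.0),
    (5.0, 8.0), (5.0, 9.0), (6.0, 8.0),
]

inertias = {k: inertia_for_k(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the error falls steeply up to k = 3 and barely moves after that, so the bend is easy to see. That is because the data was built with three groups. Customer data is rarely this clean, and the curve often declines smoothly with no bend to point at.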
Alternatives, and the point that matters more
K-means is not the only clustering method, and its weak spots map almost exactly onto what the alternatives fix. Hierarchical clustering builds a tree of nested groupings and does not need k fixed in advance, so you can read the structure and cut it where it makes sense. DBSCAN finds clusters by density, so it discovers irregular shapes and labels genuine outliers as noise instead of forcing them into a group. Gaussian mixture models treat each cluster as a probability distribution, allow clusters of different shapes and sizes, and give every customer a soft membership rather than a hard one. A 2024 arXiv study of clustering algorithms on a UK online retail dataset of about 541,000 records, by Jeen Mary John, Olamilekan Shobayo and Bayode Ogunleye, compared K-means, a Gaussian mixture model, DBSCAN, agglomerative clustering and BIRCH on RFM features, and found the Gaussian mixture model scored highest on cluster quality, with a silhouette score of 0.80.
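Because comparisons like this one are reported as silhouette scores, it helps to see what the number measures. The sketch below computes the standard silhouette definition by hand on four invented points. The labelings are hypothetical and have nothing to do with the study's data.

```python
import math


def silhouette(points, labels):
    """Mean silhouette over all points.

    For each point: a = mean distance to the other members of its own cluster,
    b = lowest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). A point alone in its cluster scores 0.
    """
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other in set(labels) - {labels[i]}
            for members in [[q for j, q in enumerate(points) if labels[j] == other]]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight pairs of hypothetical customers, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good = silhouette(points, [0, 0, 1, 1])  # each pair is its own cluster
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the pairs

print(round(good, 2), round(bad, 2))
```

The sensible labeling scores close to 1 and the labeling that cuts across the pairs scores below zero. The score rewards clusters that are tight and far apart. It says nothing about whether anyone can name them or act on them.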
Here is where it would be easy to draw the wrong lesson. The wrong lesson is that you should always pick the algorithm with the best statistical score. The right lesson is quieter. A segmentation exists to be acted on. If a clustering is statistically excellent but you cannot describe a segment in one sentence, cannot say why it is different, and cannot design a distinct campaign for it, it has failed at its job no matter what the silhouette score says. Guidance on interpreting segmentation results makes the practical version of the point: segments have to map to real decisions, and a segmentation disconnected from business need produces insights that are meaningless or misleading. Four messy, slightly suboptimal clusters that the marketing team understands and uses will beat eight mathematically pristine ones that nobody can name. Interpretability is not a consolation prize. For a working team it is often the actual objective.
What to take away
Demographics describe customers. Behavior predicts them, and prediction is the job. That is the first move, and it retires the stock-photo segment for good.
RFM is the right place to start almost every time. It is simple, runs on data you already own, survives privacy changes, and its segments explain themselves. Respect what it is, a descriptive read of three columns of the past, and do not ask it to forecast. K-means is the natural next step when three variables stop being enough, because it works in many dimensions and finds structure instead of imposing a grid. Just go in with the four assumptions in hand: scale your features or the largest numbers win, expect round and similar-sized groups your customers may not form, know that outliers have nowhere to go, and treat the choice of k as a judgment call rather than a number the elbow plot hands you.
This pairs with the rest of the toolkit. Behavioral segments become far more useful when defined by predicted future value rather than past spend alone, and a segment is a starting point, not a targeting verdict: the difference between propensity, lookalike and uplift models determines whether a campaign reaches the customers it can actually move. The method is never the point. A segmentation you can act on is.
Council summary
The post argues that demographic labels describe a customer but do not predict one, and that the fix is behavioral segmentation: RFM as the right first step, K-means as the next step when three variables stop being enough. Its real contribution is the unvarnished account of K-means failure modes, the four assumptions that break silently because the algorithm returns tidy clusters either way: it wants round, similar-sized groups, it is dominated by the largest-magnitude feature unless you scale, it forces outliers into a cluster, and it makes you guess k. The factual spine holds up, from the 1995 Bult and Wansbeek paper and the K-means lineage to the scikit-learn convexity caveat and Schubert's 2022 case against the elbow method. The takeaway for a decision-maker is concrete: start with RFM, move to K-means deliberately rather than by default, scale features first, and judge any segmentation by whether the team can name a segment and act on it.