
How to Size an Incrementality Test Before You Run It

An underpowered incrementality test fails quietly. A shrug, a wide error bar, and a bill. Here is the math to know if your test can answer the question.

A failed incrementality test rarely looks like failure. It does not crash. It runs for the full month, spends the full budget, and hands you a number that sounds almost useful: a 4 percent lift, give or take 6 points. The campaign might be doing nothing. It might be your best channel. The test cannot tell you which, and you paid full price to learn that.

This is the most common and most expensive mistake in marketing experimentation, and it is avoidable. The test did not fail because the campaign was bad. It failed because it was too small to ever see the answer. Nobody checked, before launch, whether the experiment had the muscle to detect the effect at all. That check is called a power analysis, and it takes an afternoon. Skipping it turns a measurement exercise into a donation.

This piece walks through the statistics behind sizing an incrementality test or a geo experiment, in plain language, for someone who runs a marketing budget and has no intention of becoming a statistician. The math is less frightening than it sounds. The mistake it prevents is not.

Origin: a 1920s convention you are still living inside

The numbers you will see in any sample size calculator, 0.05 and 80 percent, are not laws of nature. They are habits, and old ones.

The first comes from Ronald Fisher, the statistician who built much of modern experimental design at an English agricultural research station in the 1920s. In his 1925 book Statistical Methods for Research Workers, Fisher needed a cutoff for when a result was surprising enough to take seriously. He picked the point where a result sits roughly two standard deviations from chance, which works out to a 1 in 20 probability, and wrote that it was "convenient to take this point as a limit," per a history of significance testing. Convenient. Not sacred. Fisher himself walked it back three decades later, writing in 1956 that no scientific worker should keep a fixed significance level "in all circumstances." The cutoff stuck anyway, because Fisher's statistical tables made it easy to use and because Fisher was the dominant figure in the field for decades.

Statistical power came a little later. Jerzy Neyman and Egon Pearson, working in the early 1930s, reframed testing as a choice between two competing claims rather than a single yes or no. That gave us two distinct ways to be wrong. A Type I error is a false alarm: you conclude the campaign worked when it did not. A Type II error is a miss: the campaign genuinely worked and your test failed to catch it. Power is the probability of avoiding that second mistake. The 80 percent target, popularised by the psychologist Jacob Cohen, means you accept a 1 in 5 chance of missing a real effect. Also a convention. Also negotiable.

Two things matter here. These thresholds were designed for crop trials and lab studies, where a wrong answer wastes a season, not a quarter of marketing spend. And the framework assumed honest, deliberate test design, the opposite of running a campaign for a fortnight and squinting at the dashboard.

Present: what actually decides whether a test can answer the question

Marketers have, sensibly, moved toward incrementality testing as the way to prove a channel works. Per eMarketer, 52 percent of US brand and agency marketers already run incrementality experiments, and 36.2 percent plan to invest more. But the same research lists the things holding people back, and they are revealing: 44 percent worry about the accuracy of results, 43 percent struggle to apply testing across ad types, and 41 percent say they lack adequate tools. Underneath all three sits one unglamorous skill. Knowing how big a test needs to be.

Four numbers decide that, and they pull against each other.

The first is your baseline: how many conversions you normally get, and how much that number jumps around week to week. A business with steady, predictable sales is easy to read. A business with spiky, seasonal, low-volume sales is noisy, and noise hides effects. Per the methodology of GeoLift, Meta's open-source geo-testing library, power analysis matters most precisely because marketing effect sizes are usually small and there is a real chance of missing the effect entirely. Variance is the fog the test has to see through.
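To make the baseline concrete, here is a minimal sketch that summarises weekly conversions as an average and a coefficient of variation (standard deviation divided by the mean). The function name and both weekly series are invented for illustration; they are not data from any real business.

```python
from statistics import mean, stdev

def baseline_summary(weekly_conversions):
    """Return (average weekly conversions, coefficient of variation)."""
    avg = mean(weekly_conversions)
    # Sample standard deviation relative to the mean: higher means noisier.
    cv = stdev(weekly_conversions) / avg
    return avg, cv

# Hypothetical data: two businesses with roughly the same average volume.
steady = [980, 1010, 995, 1020, 1005, 990, 1000, 1015]
spiky = [400, 1900, 650, 1500, 300, 1700, 800, 750]

for name, series in (("steady", steady), ("spiky", spiky)):
    avg, cv = baseline_summary(series)
    print(f"{name}: average {avg:.0f} per week, variation {cv:.0%}")
```

Both series average about 1,000 conversions a week, but the second swings far more, and that swing is the noise a lift has to stand out against.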

The second is the minimum detectable effect, or MDE. This is the smallest lift your test is built to reliably catch. It is not a prediction of how the campaign will perform. It is a statement about the sensitivity of your instrument. A test with a 10 percent MDE is a kitchen scale: fine for flour, useless for weighing a letter. If the true lift is smaller than your MDE, the test will most likely come back inconclusive, and you will not know whether the campaign is weak or your test was blind.

The third is power, the 80 percent we met earlier. The fourth is the significance level, the 0.05. You can make the test stricter by raising power or by lowering the significance level (say, from 0.05 to 0.01), but both raise the price.

Here is the relationship that does the most damage when people do not know it. Required sample size scales with the inverse square of the effect you want to detect. Halve the MDE and you do not double the test, you quadruple it. One worked A/B example shows it cleanly: detecting a 20 percent relative lift needs roughly 6,500 visitors per variation, a 10 percent lift needs about 26,000, and a 5 percent lift needs around 104,000. Same test, same business, three wildly different scales, and the only thing that changed was how small a lift you insisted on seeing. Small effects are expensive. There is no trick that makes them cheap.
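The inverse-square relationship can be checked with a rough sketch of the standard normal-approximation formula for comparing two conversion rates. The 5 percent baseline rate is an assumption chosen for illustration, so the outputs will not match the worked example's figures exactly; what carries over is the roughly fourfold jump each time the lift is halved.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variation(baseline_rate, relative_lift, power=0.80, alpha=0.05):
    """Approximate sample size per arm for a two-sided test of two conversion rates."""
    treated_rate = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80 percent power
    variance = (baseline_rate * (1 - baseline_rate)
                + treated_rate * (1 - treated_rate))
    effect = treated_rate - baseline_rate
    # Sample size grows with 1 / effect**2: this is the inverse-square relationship.
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Assumed 5 percent baseline conversion rate (hypothetical).
for lift in (0.20, 0.10, 0.05):
    n = visitors_per_variation(0.05, lift)
    print(f"{lift:.0%} relative lift: {n:,} visitors per variation")
```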

Significance and the p-value cause their own confusion, so here is the plain version. When a test reports significance, it is answering one narrow question: if this campaign actually did nothing, how likely is it that I would see a result at least this strong by pure chance? A p-value of 0.05 means that fluke would happen about 1 time in 20. It does not mean there is a 95 percent chance the campaign works. It means the result is hard to dismiss as luck. A confidence interval, the "give or take" range, is the more honest companion. A lift of 4 percent plus or minus 1 is a finding you can plan around. A lift of 4 percent plus or minus 6, which spans everything from a loss to a strong win, is not a finding at all. It is the test telling you it could not see.
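As a rough illustration of the difference between a usable interval and a shrug, the sketch below computes a 95 percent confidence interval and a two-sided p-value for the gap between two conversion rates, using the normal approximation. The function name and the conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lift_interval(conv_test, n_test, conv_control, n_control, alpha=0.05):
    """Return (difference, lower bound, upper bound, p-value) for two conversion rates."""
    rate_test = conv_test / n_test
    rate_control = conv_control / n_control
    diff = rate_test - rate_control
    # Standard error of the difference, used for the confidence interval.
    se = sqrt(rate_test * (1 - rate_test) / n_test
              + rate_control * (1 - rate_control) / n_control)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Pooled standard error under "the campaign did nothing", used for the p-value.
    pooled = (conv_test + conv_control) / (n_test + n_control)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_control))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_null)))
    return diff, diff - z * se, diff + z * se, p_value

# Same observed rates (5.2 percent vs 5.0 percent), very different sample sizes.
small = lift_interval(52, 1_000, 50, 1_000)
large = lift_interval(52_000, 1_000_000, 50_000, 1_000_000)
for label, (diff, low, high, p) in (("small test", small), ("large test", large)):
    print(f"{label}: {diff:+.4f} (95% CI {low:+.4f} to {high:+.4f}), p = {p:.3f}")
```

With the small sample the interval straddles zero, so the result cannot separate a loss from a win. With the large sample the same observed gap produces an interval that sits clearly above zero.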

Geo tests add one more input. A user-level holdout can split a customer base into thousands of people. A geo test splits a map, and there are only so many cities or DMAs to work with. Per SegmentStream's guide for non-technical leaders, geo tests run on just a handful of regions, which makes both the number of available markets and how well they match each other a hard constraint on what the test can detect. Fewer markets, or markets that behave nothing alike, means a blunter instrument. GeoLift's market-selection tools exist for this: its multi-cell workflow runs a power analysis to pick which markets, how many, and for how long, before any money moves.
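One way to see why the number of markets is a hard limit is a sketch based on a randomization (permutation) test, which is an assumption made here for illustration and not a description of how GeoLift works. If the only randomness is which markets get treated, the smallest one-sided p-value the test can ever produce is one divided by the number of possible assignments.

```python
from math import comb

def smallest_possible_p_value(total_markets, treated_markets):
    """Lower bound on a one-sided permutation p-value for a geo test."""
    # Number of distinct ways to choose which markets receive the campaign.
    assignments = comb(total_markets, treated_markets)
    return 1 / assignments

# Hypothetical designs: (markets available, markets treated).
for total, treated in ((4, 2), (6, 3), (20, 10)):
    floor = smallest_possible_p_value(total, treated)
    print(f"{total} markets, {treated} treated: p-value cannot go below {floor:.6f}")
```

Under this assumption, a four-market test cannot reach the 0.05 threshold no matter how large the true lift is, which is the blunt-instrument problem in its starkest form.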

The sequence that keeps a test honest

Most people run the design backwards. They decide the budget and the duration first, launch, and let the statistics fall where they may. Done in that order, the MDE is whatever the leftover sample happens to allow, and nobody finds out it was a meaningless 12 percent until the test is already over.

Run it forwards instead.

Start with a business question, not a statistical one. What is the smallest lift that would actually change a decision? If a channel needs to clear, say, a 7 percent incremental lift to justify its spend, then 7 percent is the number that matters. A 3 percent lift would not change your mind, so the test does not need to resolve it. That threshold, the smallest lift worth acting on, becomes your MDE. This is the core move, and the same MDE guidance makes it explicit: the right question is what is the minimum lift I would actually act on, which makes the MDE a business decision rather than a statistical guess.

Then, and only then, run the power analysis. Feed in the baseline, the variance, your chosen MDE, 80 percent power, and 0.05 significance. The calculator returns the required sample size, which for a conversion campaign translates into spend and duration, and for a geo test into the number and choice of markets. GeoLift and the major incrementality platforms all ship this calculation; the work is gathering honest inputs, not running the tool.
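The forward calculation can be sketched in a few lines. This is a simplified normal-approximation version for a user-level test on a continuous metric such as revenue per customer, not the calculation any particular platform ships. The coefficient of variation of 1.5 and the 3,000 eligible customers per week are hypothetical inputs.

```python
from math import ceil
from statistics import NormalDist

def required_units_per_arm(cv, mde, power=0.80, alpha=0.05):
    """Customers needed per arm to detect a relative lift of `mde`.

    cv is the per-customer standard deviation divided by the per-customer mean.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * (cv / mde) ** 2)

def weeks_to_run(units_per_arm, eligible_units_per_week):
    """Translate the sample size for both arms into a test duration."""
    return ceil(2 * units_per_arm / eligible_units_per_week)

# Hypothetical inputs: noisy revenue per customer, 7 percent MDE from the business question.
per_arm = required_units_per_arm(cv=1.5, mde=0.07)
print(f"{per_arm:,} customers per arm, "
      f"about {weeks_to_run(per_arm, 3_000)} weeks at 3,000 eligible customers a week")
```

The duration falls out of the sample size, not the other way around, which is the whole point of running the design forwards.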

Now you face the decision the whole exercise was built to surface. If the test the math demands is affordable, run it. If it is not, you have a genuine choice rather than a nasty surprise. You can accept a larger MDE, meaning the test only catches a bigger effect. You can extend the test to gather more data. You can pick a higher-volume market. What you must not do is run the small, cheap test anyway and hope. That is the version that bills you and teaches you nothing.
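When the required test is out of reach, the same approximation can be turned around to show what the affordable test can actually see. The sketch below assumes a user-level test on a continuous metric, and the coefficient of variation and customer counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def achievable_mde(cv, units_per_arm, power=0.80, alpha=0.05):
    """Smallest relative lift a test of this size is built to catch."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * cv * sqrt(2 / units_per_arm)

# Hypothetical: the decision threshold is a 7 percent lift.
target_mde = 0.07
for units in (2_000, 4_000, 8_000):
    mde = achievable_mde(cv=1.5, units_per_arm=units)
    verdict = "can answer the question" if mde <= target_mde else "too blunt"
    print(f"{units:,} customers per arm: MDE {mde:.1%} ({verdict})")
```

Seeing the achievable MDE before launch is what turns "run the cheap test and hope" into an explicit choice between a larger MDE, a longer test, or a bigger market.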

Beware setting the MDE too low. It feels rigorous to demand the test catch a 2 percent lift, but per Convert's explainer, a lower MDE requires dramatically more traffic (quadratically more, by the inverse-square rule: halving the MDE quadruples the sample), and a tiny MDE can quietly demand a sample no real budget will supply. Rigour you cannot afford is not rigour. It is a test that never starts.

Why "run it for two weeks and see" fails

It is the most natural plan in the room, and it is the trap. Two weeks is a duration, not a design. It says nothing about what the test can detect.

Two weeks of data might be plenty to catch a 15 percent lift and hopelessly short for a 4 percent one. Without a power analysis you have no idea which test you are in, so an inconclusive result is unreadable. Did the campaign do nothing, or was your test simply too short to see what it did? Both produce the identical flat line. You cannot tell them apart after the fact.
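To see how the same two weeks can be two very different tests, here is a sketch that estimates power for a fixed sample under the normal approximation. The 5 percent baseline conversion rate and the 20,000 visitors per arm are assumptions standing in for "two weeks of traffic".

```python
from math import sqrt
from statistics import NormalDist

def power_for_lift(baseline_rate, relative_lift, visitors_per_arm, alpha=0.05):
    """Approximate chance that a two-sided test detects a true lift of this size."""
    treated_rate = baseline_rate * (1 + relative_lift)
    se = sqrt((baseline_rate * (1 - baseline_rate)
               + treated_rate * (1 - treated_rate)) / visitors_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return NormalDist().cdf((treated_rate - baseline_rate) / se - z_alpha)

# Hypothetical: two weeks delivers 20,000 visitors per arm at a 5 percent baseline.
for lift in (0.15, 0.04):
    power = power_for_lift(0.05, lift, 20_000)
    print(f"true lift {lift:.0%}: power {power:.0%}")
```

Under these assumptions the test catches a 15 percent lift more than nine times in ten, and a 4 percent lift only about one time in seven. Same fortnight, same traffic, two entirely different instruments.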

The "and see" half is worse, because it invites peeking. You watch the dashboard, a number looks good on day six, and you call it. Per Statsig's review of common experimentation mistakes, peeking at data mid-test inflates the false positive rate: check often enough and random noise will eventually cross your line on its own. A test sized in advance has a finish line set before the gun. A "run it and see" test moves the line to wherever the early numbers look best, which is not measurement. It is a story you tell yourself with a chart.
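The effect of peeking can be demonstrated with a small simulation, sketched here under simplifying assumptions: the campaign truly does nothing, each day contributes one normally distributed unit of noise, and the "peeker" stops the first time the running result crosses the usual 1.96 threshold. The function name, the number of simulated tests, and the seed are arbitrary choices.

```python
import random
from math import sqrt

def false_positive_rates(n_tests=2000, n_days=14, z_crit=1.96, seed=0):
    """Compare false alarm rates for daily peeking vs one look at the end."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        running_sum = 0.0
        crossed_any_day = False
        z = 0.0
        for day in range(1, n_days + 1):
            # No true effect: each day's treatment-minus-control gap is pure noise.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(day)
            if abs(z) > z_crit:
                crossed_any_day = True
        peeking_hits += crossed_any_day   # would have "called it" on some day
        fixed_hits += abs(z) > z_crit     # looks only once, on the final day
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = false_positive_rates()
print(f"false alarms with daily peeking: {peeking_rate:.1%}")
print(f"false alarms with one final look: {fixed_rate:.1%}")
```

The single final look stays close to the nominal 5 percent, while daily peeking produces false alarms several times as often, even though nothing about the campaign changed.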

And the false negative is the costlier error here. Per work on underpowered tests, a too-small test carries an unacceptably high miss rate. A false positive wastes the next budget cycle. A false negative kills a channel that was genuinely working, and you never learn you left the money on the table.

Future and impact: the honest tradeoff nobody escapes

No power analysis hands you a free lunch. It hands you an honest one, by laying four costs on the table at once.

Sensitivity is the first. A test that can see a small lift is a better instrument, and it costs more, because of that inverse-square law. Duration is the second: more data means more time, and a measurement that lands two months late may miss the decision it was meant to inform. Cost is the third, and for a holdout it is not just media spend. It is foregone revenue. You are deliberately withholding advertising from a control group, and that group buys less. SegmentStream's guide puts a real figure on it: a 21-day test that withheld ads across half of a set of US states cost roughly 289,000 dollars in incremental revenue. The test was the cheap part.
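The foregone-revenue cost of a holdout can be estimated with simple arithmetic before launch. The sketch below is a back-of-envelope version with invented inputs; it does not reproduce the SegmentStream figure, and the incremental share is exactly the quantity the test is meant to measure, so any value plugged in is a planning guess.

```python
def foregone_revenue(daily_revenue_with_ads, incremental_share, days):
    """Rough revenue given up by switching ads off in the held-out markets.

    incremental_share is the assumed fraction of those markets' normal
    revenue that exists only because of the advertising.
    """
    return daily_revenue_with_ads * incremental_share * days

# Hypothetical: held-out markets earn 100,000 dollars a day with ads running,
# and 10 percent of that is assumed to be driven by the ads.
cost = foregone_revenue(100_000, 0.10, 21)
print(f"Estimated foregone revenue over 21 days: {cost:,.0f} dollars")
```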

Reach is the fourth and most overlooked. Every market or audience you freeze into a control group is a market you are not selling to at full strength. A bigger, sharper test sacrifices more reach. A test designed for a tiny MDE can quietly hold back a meaningful slice of your addressable market for a month or more.

These four trade against each other, and that is the point. You cannot have a cheap, fast, sensitive test that surrenders no reach. The value of the power analysis is that it forces the trade into daylight before launch, while you can still choose, instead of after, when all that is left is how to phrase the disappointment. As triangulated measurement, blending mix modeling, geo tests, and attribution, becomes the default approach for 2026, the discipline of sizing each test properly only matters more. A blurry incrementality result does not just waste its own budget. It poisons every model downstream that trusts it.

The discipline is simple to state and easy to skip. Decide the smallest lift worth acting on. Size the test to detect it. If you cannot afford that test, change the question on purpose rather than running a blind one by accident. A well-sized test can return bad news, and bad news you can trust is worth paying for. An underpowered test cannot even do that. It returns a shrug, and a shrug is the one result no budget should ever buy.

Council summary

This post argues that an incrementality test is sized, not scheduled: the question is never how many weeks to run but how small a lift you need to detect, and the only honest way to answer it is a power analysis done before launch. It teaches the four levers that decide a test's reach, baseline variance, minimum detectable effect, power, and significance, and it makes the one relationship that trips up most marketers concrete: required sample size scales with the inverse square of the effect, so halving the MDE quadruples the test. The reader's takeaway is a sequence they can run themselves. Start from the smallest lift that would change a decision, size the test to catch exactly that, and if the math demands a test you cannot afford, change the question on purpose rather than run a blind one by accident. A well-sized test can deliver bad news you can trust. An underpowered one delivers only a shrug, and a shrug is the one result no budget should pay for.
