Paid social returns 4.10 on every dollar, says the platform. It returns 1.20, says the marketing mix model. It returns 0.85, says the geo test you ran in March. Same channel, same quarter, three numbers that do not even round to each other. Now allocate next quarter's budget.
Part 1 of this series made the case that marketing mix modeling, multi-touch attribution, and incrementality testing are not three answers to one question. They are three questions wearing similar clothes, each correct about something different: MMM about strategy, MTA about tactics, incrementality about cause. That reframe is freeing right up until the budget meeting, when someone has to write one number into the plan. Three honest estimates do not become a decision on their own. This post is about the part nobody enjoys: the reconciliation work that turns disagreement into a single estimate you can stand behind.
Origin: why the numbers were never going to match
Start by killing the assumption that causes the most wasted hours. The three methods are not failing to agree. They were built so they cannot.
A platform dashboard counts conversions that happened after an ad and within an attribution window the platform itself chose. It does not ask whether those conversions needed the ad. A marketing mix model relates aggregate weekly sales to aggregate weekly spend across every channel, including the offline spend no pixel sees, then tries to separate marketing from price, season, and demand. Incrementality runs an actual experiment: it withholds a channel from comparable markets and measures the gap. One method observes co-occurrence at the user level, one infers contribution at the market level, one measures cause directly. They are reading different instruments. Expecting one number is like expecting a thermometer, a barometer, and a hygrometer to display the same reading.
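To make the experimental read concrete, here is a minimal sketch of the arithmetic behind a geo holdout. It assumes the holdout markets are a fair stand-in for what the exposed markets would have sold without ads. The function name, the `size_ratio` parameter, and the revenue and spend figures are all invented for illustration; real geo tests use matched-market or synthetic-control designs and report an interval, not just a point.

```python
def geo_incremental_return(exposed_revenue, holdout_revenue, spend, size_ratio=1.0):
    """Incremental revenue per dollar from a simple geo holdout.

    size_ratio (hypothetical) rescales holdout revenue to the size of the
    exposed markets when the two groups are not the same size.
    """
    if spend <= 0:
        raise ValueError("spend must be positive")
    # Counterfactual: what the exposed markets would have sold with no ads.
    expected_without_ads = holdout_revenue * size_ratio
    return (exposed_revenue - expected_without_ads) / spend


# Invented figures: 1,085,000 in exposed markets, 1,000,000 in holdout,
# 100,000 of spend withheld from the holdout group.
result = geo_incremental_return(1_085_000, 1_000_000, 100_000)
print(round(result, 2))
```

The platform dashboard never computes the subtraction in the middle of that function, which is the whole difference between the two instruments.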
There is also a direction to the disagreement, and it is predictable. Platform and click-based attribution runs high, because it credits a channel for demand that channel merely intercepted. Retargeting is the clean example: it shows almost entirely to people already heading toward a purchase, so it sits in front of conversions it did not cause. Measured's benchmark guide puts retargeting at 40 to 70 percent non-incremental and branded search at only 20 to 40 percent incremental, meaning a large share of those conversions, and for branded search the majority, would have arrived without the ad. Incrementality runs low and conservative, because it strips out exactly that baseline. A well-built MMM tends to land between the two. Google's research on paid search bias names the mechanism on the modeling side: ad targeting creates a selection bias that, left uncorrected, inflates the measured return on search. The 2018 paper derives a correction and validates it against randomized experiments. The lesson underneath the math: a model alone, with no experiment to check it, can be confidently wrong in a consistent direction.
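A rough arithmetic sketch shows how much those benchmark ranges can deflate a dashboard number. It assumes, as a simplification, that incremental return is just the reported return scaled by the incremental share of conversions. The function name and the reported return of 3.00 are hypothetical; only the 40 to 70 percent range comes from the benchmark cited above.

```python
def incremental_return(reported_return, incremental_share):
    """Scale a platform-reported return by the share of conversions that are incremental."""
    if not 0.0 <= incremental_share <= 1.0:
        raise ValueError("incremental_share must be between 0 and 1")
    return reported_return * incremental_share


# Hypothetical retargeting campaign the platform reports at 3.00.
reported = 3.00
# 40 to 70 percent non-incremental leaves 30 to 60 percent incremental.
low = incremental_return(reported, 1 - 0.70)
high = incremental_return(reported, 1 - 0.40)
print(f"{low:.2f} to {high:.2f}")
```

Under those assumptions a reported 3.00 is really somewhere between 0.90 and 1.80, which is the kind of gap the opening numbers show.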
So the spread is not noise. It is signal about each method's known bias. The job is not to make the numbers converge. It is to combine them knowing which way each one leans.
Present: the wrong reflex, and the anchor that replaces it
Two instincts show up in that budget meeting, and both are wrong.
The first is to average the three. It feels balanced and defensible. It is neither. Averaging a biased-high number, a biased-low number, and a roughly-right number does not cancel the errors; it manufactures a fourth number that no method actually produced and that nobody can defend when finance asks how it was derived. The second reflex is worse: quietly pick the highest number, because it makes the channel, and the person who runs it, look good. Attribution becomes a political object. Whoever owns paid social cites the 4.10. Whoever owns the model cites the 1.20. The meeting turns into advocacy.
The fix is to stop treating the three as equals to be blended and start treating them as a hierarchy with one anchor. A useful shorthand: experiment to validate, model to scale, attribution to operate.
Incrementality is the anchor, the closest thing to ground truth, because it is the only method that ran a real counterfactual. Measured's guide to triangulated measurement states it plainly: experiment findings serve as the factual foundation that calibrates the other two pillars. That word, calibrate, is the whole game. An incrementality test gives you one trustworthy point: in this window, for this channel, the real return was 0.85. The MMM gives you something the test cannot, breadth across every channel including the untrackable ones, plus response curves that say what happens when you move a million dollars. The platform and MTA data give you speed and granularity, the daily read a buyer needs to swap a creative.
Calibration is how the anchor corrects the model. In a Bayesian MMM, the experiment result enters as an informative prior on that channel's coefficient. Recast describes it precisely: well-run experiments are treated as given, and the model finds the best fit for the remaining parameters around them. As their documentation puts it, these priors are really more like data than guesses, because the model treats them as truth, while still carrying the experiment's own uncertainty into the result. A worked walkthrough of Bayesian MMM calibration shows the mechanics: a geo test produces not a single figure but a distribution, say a lift of 1.2 plus or minus 0.2, and that distribution becomes the prior. The model is nudged toward the measured value without being forced to match it exactly, which tightens the credible interval for the tested channel and reshapes its saturation curve. LiftLab gives a concrete before-and-after: an MMM read paid social at 1.20 revenue per dollar, an experiment came back at 0.85, and the model's range was corrected toward the experiment. The model still does the heavy lifting of allocation. The experiment keeps it honest.
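The mechanics can be sketched in one dimension. This is a deliberately simplified illustration, not Recast's or LiftLab's method: it assumes the experiment result and the model's uncalibrated read for one channel are both roughly normal, and combines them with the standard normal-normal conjugate update. The standard deviations (0.10 for the experiment, 0.30 for the model) are invented; only the 0.85 and the 1.20 come from the example above. A real Bayesian MMM applies the prior inside the full model fit, alongside adstock and saturation parameters.

```python
import math


def calibrate(prior_mean, prior_sd, model_mean, model_sd):
    """Normal-normal update: experiment as the prior, the model's read as the likelihood."""
    prior_precision = 1.0 / prior_sd ** 2
    model_precision = 1.0 / model_sd ** 2
    posterior_precision = prior_precision + model_precision
    # Each source is weighted by its precision, so the tighter one dominates.
    posterior_mean = (
        prior_precision * prior_mean + model_precision * model_mean
    ) / posterior_precision
    posterior_sd = math.sqrt(1.0 / posterior_precision)
    return posterior_mean, posterior_sd


# Experiment: 0.85 with an assumed sd of 0.10. Model: 1.20 with an assumed sd of 0.30.
mean, sd = calibrate(0.85, 0.10, 1.20, 0.30)
print(f"calibrated return {mean:.3f}, sd {sd:.3f}")
```

With these assumed uncertainties the calibrated estimate lands near 0.885, much closer to the experiment than to the model, and its spread is narrower than either input. That is the difference between calibration and a flat average: the weights come from how sure each source is, not from a vote.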
Attribution and platform data sit at the bottom of this hierarchy on purpose. Not because they are worthless, but because their job is in-flight signal, not budget truth. Funnel's framing of measurement as a system is the right mental model: platform data, attribution outputs, and model results are reference points that feed the system, not absolute truths. Use the platform number to catch a creative going stale on Tuesday. Do not use it to set the quarter.
Present: a reconciliation workflow
Calibration as a one-time event is not reconciliation. Reconciliation is a loop. Here is a workable version.
First, gather the three reads for the channel and write down each one's known bias next to it. The platform number, marked likely high. The MMM number, marked roughly central but only as good as its specification. The incrementality result, marked conservative and valid for that window. Seeing the biases on the page kills the averaging reflex before it starts.
Second, decide whether you even have a usable anchor. An incrementality result is only ground truth if the test was sound. Check the confidence interval. Measured's lift analysis guide makes the point that a result is incomplete without its uncertainty, and that "not statistically significant" does not mean "no effect." A 5.4x return with a range from 0.2x to 9.7x is not an anchor; it is a question. A 0.85 with a tight band is. If the test was underpowered, the MMM stays your best estimate and you flag the channel for a better test.
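One way to make that check mechanical is to compare the interval's half-width to the point estimate. This is a sketch under assumptions: the 25 percent threshold is invented, not a published standard, and the 0.75 to 0.95 band for the 0.85 result is a stand-in for "tight," since no band is given above.

```python
def is_usable_anchor(point, low, high, max_relative_half_width=0.25):
    """Return True when the interval is tight relative to the point estimate.

    The threshold is a hypothetical default, not an industry rule.
    """
    if not low <= point <= high:
        raise ValueError("point estimate must lie inside its interval")
    if point == 0:
        # Relative width is undefined at zero; a null result needs an absolute threshold.
        raise ValueError("use an absolute width check for results at zero")
    relative_half_width = (high - low) / (2 * abs(point))
    return relative_half_width <= max_relative_half_width


print(is_usable_anchor(5.4, 0.2, 9.7))     # the wide result from the text
print(is_usable_anchor(0.85, 0.75, 0.95))  # 0.85 with an assumed tight band
```

The first call fails the check and the second passes it, matching the judgment in the paragraph above. A team would tune the threshold to its own tolerance for risk.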
Third, calibrate the model against the trustworthy anchors and let it allocate. The MMM, corrected, produces the budget estimate, expressed with marginal returns rather than a single average ROI. Mass Analytics describes a maturity ladder for this step: the weakest version is eyeballing results side by side, the strongest is full integration where experiment results enter as priors or coefficient constraints. Aim for the strong version.
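The difference between marginal and average return is easy to see on a saturating response curve. The sketch below assumes a simple exponential saturation shape, which is one common choice among several. The `ceiling` and `scale` parameters and every dollar figure are invented for illustration.

```python
import math


def revenue(spend, ceiling, scale):
    """Saturating response curve: revenue approaches `ceiling` as spend grows."""
    return ceiling * (1.0 - math.exp(-spend / scale))


def average_roi(spend, ceiling, scale):
    """Total revenue divided by total spend."""
    return revenue(spend, ceiling, scale) / spend


def marginal_roi(spend, ceiling, scale):
    """Revenue from the next dollar: the derivative of the curve at `spend`."""
    return (ceiling / scale) * math.exp(-spend / scale)


# Hypothetical channel: 2,000,000 revenue ceiling, 1,000,000 saturation scale.
spend = 1_000_000
print(f"average  {average_roi(spend, 2_000_000, 1_000_000):.2f}")
print(f"marginal {marginal_roi(spend, 2_000_000, 1_000_000):.2f}")
```

With these made-up parameters the channel's average return sits above 1.0 while the next dollar returns less than 1.0. An allocation built on the average would keep adding budget to a channel that is already past breakeven at the margin.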
Fourth, use the model to decide what to test next. Where its credible interval is widest, the model is least sure, and that is where the next experiment buys the most. The model prioritizes the experiments; the experiments constrain the model. That is the closed loop, recalibrating monthly instead of once a quarter.
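As a minimal sketch of that prioritization, assuming the model exposes a credible interval per channel, the next test goes to the channel with the widest one. The channel names and intervals are hypothetical, and a fuller version would likely weight the width by spend so that an uncertain but tiny channel does not jump the queue.

```python
def next_channel_to_test(credible_intervals):
    """Pick the channel whose credible interval is widest.

    credible_intervals maps channel name -> (low, high).
    """
    if not credible_intervals:
        raise ValueError("no channels supplied")
    return max(
        credible_intervals,
        key=lambda channel: credible_intervals[channel][1] - credible_intervals[channel][0],
    )


# Hypothetical model output.
intervals = {
    "paid_social": (0.80, 1.10),
    "paid_search": (1.30, 1.50),
    "connected_tv": (0.40, 2.10),
}
print(next_channel_to_test(intervals))
```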
Now the hard case: a genuine conflict, where a clean experiment and a well-built model still disagree by more than their uncertainty allows. Do not split the difference. Splitting the difference is the averaging mistake wearing a serious face. Investigate. Measured's MMM QA guidance is direct here: when the model and a sound experiment disagree, trust the experiment, then find out why the model is off. The usual culprits are a missing control variable, a promotion the model never saw, a misspecified adstock, or a data quality break. Their acceptance bands give you a working scale: within 10 percent is excellent, 20 to 30 percent is concerning, and a gap above 30 percent means do not act on the model's recommendation until the cause is found. A conflict is not a tie to be averaged away. It is a defect report telling you one of your instruments is miscalibrated.
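Those bands translate directly into a small check. Two assumptions are worth flagging: the gap is measured here relative to the experiment's value, which the guidance quoted above does not spell out, and the stretch between 10 and 20 percent is left unlabeled because the bands as quoted do not name it.

```python
def model_gap(model_value, experiment_value):
    """Absolute gap between model and experiment, relative to the experiment (an assumption)."""
    if experiment_value == 0:
        raise ValueError("experiment value must be non-zero")
    return abs(model_value - experiment_value) / abs(experiment_value)


def classify_gap(gap):
    """Map a relative gap onto the acceptance bands quoted in the text."""
    if gap <= 0.10:
        return "excellent"
    if gap > 0.30:
        return "do not act"
    if gap >= 0.20:
        return "concerning"
    # Between 10 and 20 percent: not labeled in the quoted bands.
    return "unspecified"


# The paid social example: model at 1.20, experiment at 0.85.
gap = model_gap(1.20, 0.85)
print(f"{gap:.0%} -> {classify_gap(gap)}")
```

On the paid social numbers the gap is about 41 percent of the experiment's value, which falls in the band where the model's recommendation should wait until the cause is found.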
For the deeper logic of why an experiment beats a model for causal questions, "Incrementality testing without the jargon" covers the counterfactual that gives the method its authority.
Future and impact: a range beats a number
The last reflex to drop is the demand for a single clean figure. False precision is its own failure mode.
The honest output of reconciliation is a range with a confidence attached. Not "paid social returns 0.93," but "paid social returns somewhere between 0.8 and 1.1, and we are reasonably confident of that." This is not hedging. A budget decision made on a return of 5.0 is a different decision than one made on 3.0, so a range that spans that gap is telling you the channel is not ready to scale yet. A range that sits tight and clearly above breakeven is a green light. Funnel's read on this is that the goal is not perfect precision but enough confidence to allocate. The market increasingly agrees: in Haus's 2026 Marketing Decision Confidence Index, reported by EMARKETER, 78 percent of decision-makers believe at least 10 percent of spend is lost to insufficient measurement, and 33 percent named conflicting data as a top concern. A single false-precise number does not fix that. An honest range does, because it tells you when not to bet.
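A range can drive a decision rule as directly as a point estimate can. The sketch below assumes breakeven is 1.0 in revenue per dollar, which ignores margin, and uses an invented width limit of 0.5 to stand in for "tight." The "below breakeven" label is an addition for completeness, not something the argument above names.

```python
def budget_signal(low, high, breakeven=1.0, max_width=0.5):
    """Turn a return range into a coarse budget signal.

    breakeven and max_width are hypothetical defaults.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    width = high - low
    if low > breakeven and width <= max_width:
        return "green light"
    if high < breakeven:
        return "below breakeven"
    # Either the range straddles breakeven or it is too wide to size a bet.
    return "not ready to scale"


print(budget_signal(0.8, 1.1))  # the paid social range from the text
print(budget_signal(3.0, 5.0))  # above breakeven, but too wide to size the bet
print(budget_signal(1.4, 1.7))  # hypothetical tight range above breakeven
```

The first two calls return "not ready to scale" and the third returns "green light," which is the distinction the range exists to make.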
Expect the directions to keep diverging, and read the divergence rather than fight it. Funnel's triangulation argument makes a point worth keeping: the rule that you only act when all three methods agree is unworkable, because methods built on different data will rarely line up exactly. Map the trend, not the exact match. If all three say a channel is weakening, that direction is solid even when the magnitudes differ. If one says up and two say down, you have found the thing worth investigating this week.
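Reading direction instead of magnitude can be sketched in a few lines. The inputs are period-over-period changes in each method's estimate; the method names and values are hypothetical, and a real version would want a noise threshold before calling a small change a direction.

```python
def trend_read(changes):
    """Compare the direction of change across methods, ignoring magnitude.

    changes maps method name -> change in estimated return since last period.
    Changes of exactly zero are ignored.
    """
    directions = set()
    for delta in changes.values():
        if delta > 0:
            directions.add("up")
        elif delta < 0:
            directions.add("down")
    if not directions:
        return "flat"
    if directions == {"down"}:
        return "weakening"
    if directions == {"up"}:
        return "strengthening"
    return "investigate"


print(trend_read({"platform": -0.40, "mmm": -0.10, "incrementality": -0.05}))
print(trend_read({"platform": 0.30, "mmm": -0.10, "incrementality": -0.05}))
```

The first call reports a weakening channel even though the magnitudes differ by a factor of eight. The second flags the split as the thing to investigate.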
Google's modern measurement playbook frames incrementality as the method that improves MMM and attribution rather than replacing them, and that is the durable shape of this. The agentic tools now entering measurement push the cadence faster, with agents that design tests and refresh models continuously, which makes the discipline of calibration matter more, not less. An agent that recalibrates a model weekly is only as trustworthy as the experiments anchoring it. The teams getting this right, including the way Perform Digital builds measurement into agent workflows, treat the reconciliation loop itself as the product: not the model, not the dashboard, but the running process that keeps the model tied to measured cause.
Part 3 takes the last step, from doing this reconciliation once to running it as an always-on system. The reframe to carry forward is small and load-bearing. You are not looking for the one true number. You are running a loop where the experiment corrects the model, the model directs the next experiment, and the output you take to the budget meeting is a range you can defend.
Council summary
This post argues that the three measurement numbers a marketer faces are not a contradiction to average away but a hierarchy to be reconciled, with incrementality as the causal anchor that calibrates the model and platform data relegated to in-flight signal. It teaches the mechanics clearly: an experiment result enters a Bayesian MMM as an informative prior, the corrected model handles allocation, and the model in turn points to where the next experiment is worth running. The strongest move is reframing a genuine model-versus-experiment conflict not as a tie to split but as a defect report, with concrete acceptance bands for when to stop trusting the model. The reader's takeaway is practical and disciplined: stop chasing one true number, run the calibration loop, and carry a defensible range, not false precision, into the budget meeting.