In August 2025, a research group inside MIT published a report with an unglamorous title, "The GenAI Divide: State of AI in Business 2025," and one sentence inside it traveled further than the report ever did. Ninety-five percent of enterprise generative AI pilots, it said, deliver no measurable return on the profit and loss statement. Within days the figure was on financial news tickers, in CFO briefings, and in the mouth of every skeptic who had been waiting for the AI bubble to show a crack. Nvidia and other AI-exposed stocks slipped. The number became shorthand for a whole mood: maybe this does not work.
Read the report properly and a different, more useful story appears. The 95 percent is real. It is also widely misunderstood, and the way it is misunderstood is itself the lesson. The pilots are not failing because the models are weak. They are failing because of how companies choose, scope, and run them. That is fixable, and the 5 percent who got it right show exactly how.
Where the number came from
The report was produced by MIT's NANDA initiative, a research effort housed in the university's Media Lab and led by Professor Ramesh Raskar. Its evidence base was modest and the authors were honest about that: roughly 150 leader interviews, a survey of around 350 employees, and an analysis of about 300 publicly disclosed AI deployments. The headline finding: of the enterprise generative AI initiatives studied, only about 5 percent had crossed into production and produced a measurable acceleration in revenue or cut in cost. The other 95 percent had not. Companies had collectively poured an estimated 30 to 40 billion dollars into generative AI, and most of it had bought no movement on the financial statements at all. One caveat to carry forward: this is a single study with a modest sample, so treat 95 as a strong signal, not a precise constant.
That is the number. Now the part the headlines dropped.
The report did not measure whether generative AI works. It measured a specific, demanding outcome: a pilot promoted to full production, tied to a business KPI, with impact visible on the profit and loss statement. That is a high bar by design. A pilot that made a support team noticeably faster but was never instrumented to prove it counts as a failure under that definition. So does a pilot that delivered real value in a function the report did not classify as revenue acceleration or hard cost reduction. The 95 percent counts pilots that never became audited, production-grade, financially proven systems. It does not count pilots that were useless.
There is a second distortion worth naming. The same report describes a "shadow AI economy" running alongside the official one. While corporate pilots stalled, employees were quietly using ChatGPT, Claude, and Gemini through personal accounts to get real work done every day. Surveys through 2026 put unsanctioned AI use at roughly half of all workers, and CIOs who deploy monitoring routinely find the real tool count is three to five times what they estimated (see CIO's reporting). So the honest reading of the report is not "AI does not deliver." It is closer to "individuals are extracting value from AI faster than enterprises are, and the official pilot is where that value goes to die." The gap between those two facts is the GenAI Divide the report is actually named after.
None of this makes the 95 percent comforting. It makes it precise. And other data lands in the same place. S&P Global Market Intelligence found the share of companies abandoning most of their AI initiatives jumped from 17 percent to 42 percent in a single year, with about 46 percent of proof-of-concept projects scrapped before production (S&P Global). Gartner expects more than 40 percent of agentic AI projects specifically to be canceled by the end of 2027, citing unclear business value, escalating cost, and inadequate risk controls (Gartner). Three different research groups, three methods, one direction. The pilot failure rate is not a statistical artifact. The question is why.
Why pilots actually fail
Strip away the noise and the causes are consistent across the MIT report, the analyst data, and the practitioners who do this work for a living. They are not technology problems. They are decisions.
The pilot was chosen for visibility, not value. The most common first move is to pick a use case that looks impressive in a boardroom: a slick marketing copilot, a customer-facing chatbot, a sales assistant. The MIT report found more than half of generative AI budgets aimed at sales and marketing, while the clearest returns sat in back-office work: cutting outsourced processing, reducing agency spend, and compressing operations. Flashy front-office demos win the funding meeting and lose the production fight. Quiet back-office automation does the opposite.
No success metric was set before the work started. Reporting on why enterprise pilots stall keeps surfacing the same pattern: teams cannot say what number the pilot was supposed to move (The Register). Without a numeric bar agreed up front, a pilot cannot pass or fail. It can only be liked or not liked. And when budgets tighten, anything without a hard ROI figure is first to be cut. A pilot with no metric has already lost. It just does not know yet.
The learning gap. This is the term the MIT report leans on hardest. Generic chat tools are flexible, which is exactly why they work for an individual and stall inside an enterprise. They do not learn the company's workflow, do not retain its context, do not adapt to how a specific team actually operates. Every session starts cold. A tool that cannot absorb the process it is dropped into stays a clever demo and never becomes infrastructure.
Build when you should have bought. This is the most clearly quantified finding in the report. AI tools bought from specialist vendors and deployed through partnership reached production successfully around 67 percent of the time. Tools built in-house succeeded roughly a third as often. Enterprises consistently overrate their ability to build this themselves, burn two or three quarters of senior engineering time on custom connectors and orchestration, and ship something a vendor already does better. The instinct to build is strong. The data on it is brutal.
The demo-to-production reliability gap. A demo runs on clean inputs and a happy path. Production means messy data, edge cases, and long task chains. Multi-step agents fail here for a reason that is arithmetic, not bad luck: errors compound. If each step succeeds 95 percent of the time and steps fail independently, a 20-step task succeeds only about 36 percent of the time, because 0.95 to the twentieth power is roughly 0.36. Even a short chain of five agents, each 95 percent reliable, succeeds only about 77 percent of the time (MindStudio). A pilot that looked solid on ten curated runs falls apart on a thousand real ones. We cover this math in why your AI agent works in the demo and dies in production.
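The compounding arithmetic above is easy to check for yourself. The following is a minimal Python sketch, not a reliability model: it assumes every step fails independently and with the same probability, which real agents rarely do exactly. The function names are illustrative and do not come from the report or any library.

```python
def chain_success_rate(per_step_reliability: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumption: steps fail independently and share one reliability figure.
    """
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be zero or more")
    return per_step_reliability ** steps


def max_steps_for_target(per_step_reliability: float, target: float) -> int:
    """Longest chain that still meets a target end-to-end success rate."""
    if not 0.0 <= per_step_reliability < 1.0:
        raise ValueError("per_step_reliability must be at least 0 and below 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be above 0 and at most 1")
    steps = 0
    # Add steps one at a time until the next one would drop below the target.
    while chain_success_rate(per_step_reliability, steps + 1) >= target:
        steps += 1
    return steps
```

Under those assumptions, `chain_success_rate(0.95, 20)` comes to about 0.36 and `chain_success_rate(0.95, 5)` to about 0.77, matching the figures in the text. Turning the question around, `max_steps_for_target(0.95, 0.90)` returns 2: at 95 percent per step, a chain longer than two steps already misses a 90 percent end-to-end bar.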
Weak data foundations. The model is only as good as what it can reach. Enterprise data is scattered across systems, duplicated, stale, and inconsistently labeled. Pilots break not because the model cannot reason but because it is reasoning over a poor, half-connected picture of the business. Integration into real APIs, with rate limits, legacy quirks, and undocumented fields, is where many pilots quietly die.
It was run as a technology project, not a process change. This may be the deepest cause. Prosci research traced 63 percent of AI transformation failures to human factors rather than technical ones (agility at scale). BCG frames the same point with its 10-20-70 rule: only 10 percent of the effort in an AI transformation is the algorithms and 20 percent the tools and data, while 70 percent is people and process: the roles, workflows, training, and governance (BCG). The typical failure is familiar: a motivated small team gets a strong pilot result, the rollout widens, and adoption collapses because nobody retrained the broader staff, redesigned the workflow, or gave managers a reason to enforce it. The tool worked. The change around it never happened.
What the 5 percent did differently
The same body of research describes the minority that crossed over, and the pattern is unglamorous on purpose.
They scoped narrowly. Not a dozen pilots; two or three, each a single well-defined workflow with a named business owner. One pain point, executed well, beats a broad ambition every time.
They set a number first. Cost per successful outcome, error rate on an eval set, hours removed, tickets deflected. A real bar, agreed before the build, so the pilot could be judged on evidence rather than vibes.
They bought and partnered more than they built. They treated the model and much of the orchestration as something to procure, and spent their own engineering effort on the integration into their specific systems, the part a vendor cannot do for them.
They aimed at the back office. They went after process work with a clear cost line: outsourced operations, manual processing, agency fees. Less visible than a customer-facing demo, far easier to prove.
They integrated into the real workflow. The tool was wired into the systems where work happens, given the context to act, and surrounded by the redesigned process, training, and oversight that makes adoption stick. The pilot was treated as the first step of an operating-model change, not as a gadget to switch on.
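To make "set a number first" concrete, here is a small Python sketch of two of the metrics named above, cost per successful outcome and error rate, checked against a bar agreed before the build. Everything in it is an assumption for illustration: the class, the field names, and the thresholds are hypothetical, and what counts as a "success" is something each team has to define for its own workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotResults:
    total_cost: float  # all-in spend for the period (model, tooling, people)
    attempts: int      # tasks the pilot tried to handle
    successes: int     # tasks that met the agreed definition of done


def cost_per_successful_outcome(results: PilotResults) -> float:
    """Total spend divided by successful tasks; infinite if nothing succeeded."""
    if results.successes == 0:
        return float("inf")
    return results.total_cost / results.successes


def error_rate(results: PilotResults) -> float:
    """Share of attempted tasks that did not succeed."""
    if results.attempts == 0:
        raise ValueError("no attempts recorded, so there is nothing to measure")
    return (results.attempts - results.successes) / results.attempts


def meets_bar(results: PilotResults, max_cost_per_success: float,
              max_error_rate: float) -> bool:
    """True only if the pilot clears both thresholds agreed up front."""
    return (cost_per_successful_outcome(results) <= max_cost_per_success
            and error_rate(results) <= max_error_rate)


# Hypothetical month: 12,000 spent, 1,000 tasks attempted, 800 done correctly.
month = PilotResults(total_cost=12_000.0, attempts=1_000, successes=800)
```

With these made-up figures the cost per successful outcome is 15.0 and the error rate is 0.2, so `meets_bar(month, 20.0, 0.25)` passes while `meets_bar(month, 10.0, 0.25)` fails. The point is only that the verdict is a calculation against numbers written down in advance, not a matter of taste.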
A pilot that survives: the practical framework
If you are about to run one, here is the discipline the evidence supports.
Pick the use case by value, not by how it demos. Favor a bounded back-office process with a measurable cost today.
Write the success metric before you write any code. One number, a target, a date. If you cannot name it, you are not ready to start.
Decide build versus buy honestly, and bias toward buy. Build only the integration that is genuinely specific to your business.
Plan for compounding error from day one. Keep the agent's autonomous chain short, wrap deterministic checks around it, put a human in the loop at the high-stakes step, and instrument every step so you can see where it breaks.
Fix the data path before you scale. A pilot on clean sample data proves nothing about production.
Resource the process change as seriously as the technology. Redesign the workflow, retrain the people, give managers a reason to adopt it. Most of the value lives here.
Set a kill date. Decide in advance when the pilot must hit its number or be stopped. A pilot that cannot fail cannot teach you anything.
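The "plan for compounding error" item in the list above can be sketched in code. This is one possible shape in Python, under stated assumptions, not a reference implementation: each step pairs an agent call with a deterministic check, a flagged step waits for human approval, and every step is timed and logged so you can see where the chain breaks. The `Step`, `StepLog`, and `run_chain` names, the toy invoice steps, and the 500 refund limit are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]     # the agent or tool call (stand-in functions here)
    check: Callable[[Any], bool]  # deterministic validation of the step's output
    needs_approval: bool = False  # True for the high-stakes step


@dataclass
class StepLog:
    name: str
    ok: bool
    seconds: float
    reason: str = ""


def run_chain(steps: List[Step], payload: Any,
              approve: Callable[[str, Any], bool]
              ) -> Tuple[Optional[Any], List[StepLog]]:
    """Run steps in order, stopping at the first failure, and log every step."""
    logs: List[StepLog] = []
    for step in steps:
        started = time.perf_counter()
        reason = ""
        try:
            result = step.run(payload)
            if not step.check(result):
                reason = "failed deterministic check"
            elif step.needs_approval and not approve(step.name, result):
                reason = "rejected by human reviewer"
        except Exception as exc:  # record the failure instead of hiding it
            reason = f"raised {type(exc).__name__}: {exc}"
        elapsed = time.perf_counter() - started
        logs.append(StepLog(step.name, reason == "", elapsed, reason))
        if reason:
            return None, logs  # a bad output never reaches the next step
        payload = result
    return payload, logs


# Toy two-step chain: parse an invoice total, then propose a refund.
steps = [
    Step("extract_total",
         run=lambda text: float(text.split("total:")[1]),
         check=lambda value: value >= 0),
    Step("apply_refund",
         run=lambda value: {"refund": value},
         check=lambda out: out["refund"] <= 500,  # hypothetical policy limit
         needs_approval=True),
]
result, logs = run_chain(steps, "invoice total: 120.50",
                         approve=lambda name, value: True)
```

Here `approve` is a placeholder that always says yes; in practice it would route to a real reviewer. The design choice worth noting is that the chain halts at the first failed check rather than passing a doubtful output downstream, which is what keeps one step's error from compounding into the next.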
For a longer, gated walkthrough of moving from a working pilot to a production system, see our pilot-to-production checklist.
Where Perform Digital fits
The 95 percent figure is, in the end, an implementation statistic. The model is rarely the thing that failed. The scoping, the metric, the build-versus-buy call, the integration, the reliability engineering, and the process change are. That is the work Perform Digital does: choosing the use case for value, setting the numeric bar, wiring the agent into real systems with guardrails and human checkpoints, and treating the deployment as a change to how a team works. The honest framing is that this is a delivery discipline, not a model purchase.
Future and impact
The 95 percent number will not stay at 95 percent, and the reason is not better models. It is the slow professionalization of how enterprises run this work. CIO coverage through 2026 describes the same shift: from pilot theater toward unit economics, from "look what it can do" toward "here is the cost per successful outcome and the incident runbook" (CIO). The companies still treating AI as a technology demo will keep feeding the failure rate. The ones treating it as a measured process change will keep climbing out of it.
Two risks remain worth watching. The first is agent washing: of the thousands of vendors now marketing agentic products, Gartner reckons only about 130 are doing anything genuinely agentic, the rest being chatbots and RPA repackaged. That makes the buy-over-build advice harder to act on and raises the value of a partner who can tell the difference. The second is that as agents move from single steps to long autonomous chains, the compounding-error math gets less forgiving, not more, and reliability engineering becomes the gating skill for the whole field.
The most important correction is the simplest. The MIT report was read as a verdict on a technology. It is better read as a mirror. It shows enterprises a clear picture of how they run pilots, and the picture is unflattering. The 5 percent did not have a better model than the 95 percent. They had a narrower scope, a real number, an honest build-versus-buy call, and the patience to treat a pilot as the start of a change rather than the end of a demo. That is not a hard formula. It is just an uncommon discipline, and it is the entire divide.
Council summary
This post argues that the MIT NANDA "95 percent" figure is real but routinely misread: it counts pilots that never became audited, profit-and-loss-proven production systems, not pilots that delivered no value, so it indicts how enterprises run pilots rather than the technology itself. It then names the recurring failure causes (picking for visibility over value, no metric agreed up front, the learning gap, building what should have been bought, compounding reliability error, weak data foundations, and treating a process change as a technology project) and shows the unglamorous discipline the 5 percent used instead. The reader's takeaway is concrete: scope narrow, set one number before any code, bias toward buy, plan for compounding error, fix the data path, resource the change as hard as the tool, and set a kill date. The honest framing the post insists on is that the divide is a delivery discipline, not a model purchase, and the gap is closing as enterprises professionalize, not as models improve.