Why AI Projects Fail
What the evidence says when you actually trace the citations
I'm sitting in my backyard, phone in hand, dictating instructions to an AI running on a server in the next room. The sky is the kind of clear you get in the desert when the wind dies down. My dog is asleep next to me, twitching at something in a dream.
Over the past couple of hours, a structured research process has produced higher-quality data than I could have gotten any other way — even a few months ago. Fifty-plus sources, structured evidence tiers, convergence analysis across independent studies. The kind of work that would have taken a research team weeks.
And the first thing that research told me was that the statistics I'd been drafting for my own website wouldn't survive scrutiny.
If you lead an AI initiative, evaluate one for investment, or just need to know whether the claims being made to you are real — this is what the evidence says when you actually trace the citations.
The Numbers Everyone Cites
If you've read anything about AI adoption in the past two years, you've seen the numbers. More than 80% of AI projects fail. That one's attributed to RAND. 55% of employers regret their AI-related layoffs. That's Forrester.
I had both in my draft. They supported my argument. They felt authoritative. RAND is RAND. Forrester is Forrester.
When I built the site, I used AI to research those statistics. The citations looked solid. I moved on — there was always something more pressing to build. Later, before publishing, I ran an adversarial review: a different AI model doing a deliberately oppositional assessment of every claim on the site. It flagged the stats. Are these properly sourced? Can you trace them to methodology?
The process worked — it caught the problem. But my first instinct wasn't to question the numbers. It was to tighten the citations and move on. The numbers felt so well-grounded that resisting the feedback felt rational. It's always easier to move forward than to stop and verify.
It wasn't until I decided to run a formal assessment — the same structured pipeline I built for client work — that the citation chain actually unraveled.
The 80% doesn't come from RAND's research. RAND's 2024 report uses the phrase "by some estimates" and cites the number without generating it. The trail leads back to a Gartner forecast from 2018 — and the Gartner language was not "fail." It was that 85% of AI projects would "deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them." A prediction about bias-induced error, not a measurement of project failure. From there it spread: a VentureBeat article reported "87%" based on a conference panel — referring to data science projects specifically, not AI projects. Other publications rounded back to 85%. (The 85% figure also has a second lineage: Gartner issued a November 2017 correction revising an earlier 60% figure up to 85%, which some citations trace rather than the 2018 forecast; the two lineages are often conflated.) By 2024 it had become received wisdom — sourced to institutions that never produced it, restated in language the original sources never used, and generalized to a category broader than any of them measured. Gartner has since issued a 2025 successor forecast that 40% of agentic AI projects will be cancelled by the end of 2027[15] — more narrowly scoped, better-sourced, and still a prediction rather than a measurement.
The 55% traces to a blog post summarizing a Forrester report that sits behind a paywall with undisclosed methodology. The actual sample size for that specific figure has never been publicly confirmed.
A scoping review published on SSRN in August 2025 examined the major failure-rate studies and concluded: "None of the sources employs probability sampling or standardized outcome definitions suitable for population-level prevalence claims."[1] The review remains a working paper — not yet peer-reviewed — and its author, Joe Vallone, serves as Chief Information and AI Officer at Outcomes360, a firm offering AI-in-enterprise assessment services. The affiliation is publicly listed; the methodological claim is independently verifiable: none of the major failure-rate studies discloses probability sampling.
The adversarial review caught the problem before anything went live. The formal pipeline produced better numbers. The process worked. But the resistance I felt — the pull to trust numbers that felt authoritative — that's not a personal failing. It's a calibration problem that everyone using these tools faces. I'm not immune. Nobody is.
What the Evidence Actually Says
When you stop trusting the headline numbers and look at what the studies actually measured, a different picture emerges. It's less dramatic than "80% fail" and more troubling.
The concentration finding. Only 5–12% of organizations achieve significant enterprise-level financial impact from AI. "Significant financial impact" is itself a compressed label: McKinsey measures EBIT impact from gen AI, BCG measures "value realization," PwC measures combined revenue gains and cost benefits, and IBM measures whether AI initiatives have delivered expected ROI. The categories are not identical, but the convergence across them is the robust signal. McKinsey's 2025 survey of 1,993 respondents across 105 countries found 6%.[2] BCG found 5% in a separate survey of 1,250 executives.[3] PwC's CEO survey of 4,454 leaders found 12% reporting both revenue gains and cost benefits.[4] IBM's May 2025 CEO study of 2,000 leaders adds a convergent data point: only 25% of AI initiatives have delivered expected ROI, and only 16% have been scaled enterprise-wide.[16] Meanwhile, a survey of nearly 6,000 executives across four countries found that 90% reported no measurable impact from AI on their productivity or employment over the past three years.[5]
This convergence across independent methodologies is the most consistently replicated finding in the surveys reviewed here. Not "80% fail" — rather, "5–12% succeed at scale, and the rest are in a messy middle of stalled pilots and incremental gains." AI tools get adopted for email drafting and meeting summaries, but not for anything that changes how the business actually works.
It's not the technology. BCG's analysis of hundreds of AI implementations identified a consistent pattern: roughly 70% of AI implementation challenges stem from people and processes. Twenty percent from technology infrastructure. Ten percent from the algorithms themselves.[6] A subsequent BCG survey of over 10,000 employees is consistent with that ratio.[7] RAND's interviews with 65 data scientists identified the top cause of failure as "misunderstanding the problem AI needs to solve" — not model limitations or data gaps, but picking the wrong problem in the first place.[8] McKinsey found that whether the organization had fundamentally redesigned its workflows was one of the strongest predictors of enterprise AI impact — more than what model it used or how much it spent. (Strongest among weak predictors: McKinsey's regression explains only about 20% of the variance in outcomes, so workflow redesign is directional rather than deterministic.)[2]
The models work. The organizations don't adapt.
Companies acted on potential, not evidence. This is the finding that stopped me. The same four-country survey of nearly 6,000 executives found that more than 90% reported no measurable impact from AI on their employment or productivity over the past three years.[5] When New York's WARN Act began requiring companies to disclose AI as a factor in layoffs, not one of the first 160-plus filers checked the box — including companies that had publicly cited AI in workforce announcements.[9] Challenger, Gray & Christmas tracked roughly 55,000 AI-cited layoffs in 2025, but Oxford Economics warned that companies "dress up layoffs as a good news story rather than bad news," noting that productivity has decelerated, not accelerated — a pattern Fortune characterized as convenient corporate fiction.[10] The pattern is consistent: companies announce AI-driven efficiency gains for shareholders and walk them back in operational reality.
Klarna is the canonical case. The company announced its AI chatbot was doing the work of 700 customer service agents, handling 2.3 million conversations in its first month, and projected $40 million in annual savings. Within a year, customer satisfaction on complex interactions had declined and repeat contact rates increased. The CEO told Bloomberg that "investing in the quality of human support is the way of the future."[12] Months later, the company went public at a $19.6 billion debut-day close on the NYSE (IPO priced at roughly $14 billion in September 2025), citing AI-driven efficiency gains. The CEO simultaneously told Fortune that other tech leaders were "not to the point" about AI's impact on jobs — framing the industry's mainstream posture as one that downplays the consequences.[14]
The collaboration surprise. A preregistered meta-analysis of 106 experimental studies — published in Nature Human Behaviour — found that on average, and with substantial between-study variation, human-AI teams perform worse than the best individual performer. But the direction depends entirely on the task. When AI is already more accurate than the human, adding human oversight reduces accuracy. When the human is the stronger performer, combining them helps. In creative work, humans add value. In routine classification, they subtract it.[11] A December 2025 re-analysis of 74 of those studies confirms the aggregate effect but attributes it partly to experimental-design conventions in the underlying HCI research — which was largely conducted on pre-LLM systems between 2020 and mid-2023.[17] The finding upends both sides of the debate: human oversight doesn't always help, but removing humans doesn't always help either. Whether the pattern transfers cleanly to 2026 agentic LLM systems remains an open question — but the integration-design principle holds. What matters is whether the workflow puts the right performer in the lead for each task. The best sequencing isn't human-plus-AI or AI-minus-human. It's knowing which one should go first, and what the other one's job is.
The Practitioner's Dilemma
There are three layers of verification in any AI initiative, and most organizations skip all of them.
Layer one: Are the goals right? Is this the right problem for AI to solve? RAND calls this the most common failure — bias toward the latest technology rather than solving a real problem. Organizations start with "we need an AI strategy" instead of "we have a problem that might benefit from AI."
Layer two: Are the requirements aligned? Do the success metrics, timelines, and resource plans actually serve the goals? This is where the process saved me with my own website. The statistics I'd drafted served my argument. They felt authoritative. I wouldn't have questioned them on my own — because verification felt like a detour from the work. The adversarial review forced the question. The formal pipeline answered it.
Layer three: Is the implementation proven? Does the system actually do what it claims? Not in a demo. Not in a pilot. In production, under real conditions, with real consequences.
The organizations that succeed — that 5–12% — show the patterns you'd expect from applying this discipline. BCG found that the organizations generating the most value from AI focus on fewer initiatives — an average of 3.5 use cases versus 6.1 for those that struggle — and generate 2.1 times more ROI by going deeper on each one.[13] They emphasize workflow redesign over layering AI onto existing processes. And they are willing to stop when the evidence says stop. The rest of the field mostly lacks the instrumentation to do that: 60% of organizations have no defined financial KPIs for their AI initiatives, per BCG's 2025 AI Radar.
A caveat the evidence demands: most of the survey evidence cited here reflects respondents' experience with AI deployments across 2023–2025. The Vaccaro meta-analysis aggregates studies back to 2020 — largely pre-LLM, on classical ML systems. The origin-chain statistics (Gartner 2018, VentureBeat 2019) are older still. The capability curve has steepened across this entire window, and findings that held for 2023-vintage deployments — let alone 2020-vintage ML pilots — may not transfer cleanly to 2026 agentic LLM systems. For straightforward applications, better models will likely close much of the gap on their own — and for many organizations, deploying a good-enough tool cheaply will beat waiting for a perfect one. But better models also encourage larger commitments. The distance between a compelling demo and a working deployment doesn't shrink just because the demo got better.
None of this is unique to AI. It's the same discipline that separated successful software projects from failed ones in the 1990s, successful ERP implementations from disastrous ones in the 2000s, and successful cloud migrations from expensive detours in the 2010s. Define the problem. Verify the claims. Prove it works before you commit.
The tools just made it easier to forget.
The Quiet Hours
The formal assessment that surfaced all of this ran in my backyard over the course of a quiet evening. Eight research agents working in parallel, each blind to the others' findings, each searching for every credible study published in 2025 and 2026. A convergence analysis that identified which findings appeared across multiple independent sources — and which were single-source claims dressed up as consensus.
Without the structured pipeline, the same work would have taken hours of manual prompting and reprompting and reprompting — the AI equivalent of asking the same question slightly differently until you get an answer that looks right. With the pipeline, the process forced citation tracing, evidence comparison, methodology assessment, and explicit uncertainty flagging. The AI couldn't take shortcuts because the pipeline didn't offer any.
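For readers who want the mechanics, here is a minimal sketch of the kind of convergence tally such a pipeline performs. It is illustrative only, not the actual assessment code; the `Finding` structure, the `lineage` field, and the three-source threshold are assumptions made for this example. The point it encodes is the one that matters above: a claim repeated by many outlets but rooted in a single forecast is still a single-source claim.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    claim: str    # normalized claim text, e.g. "5-12% achieve enterprise-level impact"
    outlet: str   # who published the figure
    lineage: str  # root of the citation chain, so re-reports of one study don't double-count

def tally_convergence(findings: list[Finding], min_independent: int = 3) -> dict[str, dict]:
    """Group findings by claim and count how many independent lineages stand behind each."""
    lineages_by_claim: dict[str, set[str]] = defaultdict(set)
    for f in findings:
        lineages_by_claim[f.claim].add(f.lineage)

    report = {}
    for claim, lineages in lineages_by_claim.items():
        n = len(lineages)
        if n >= min_independent:
            verdict = "convergent"
        elif n == 1:
            verdict = "single-source"
        else:
            verdict = "weakly corroborated"
        report[claim] = {"independent_sources": n, "verdict": verdict}
    return report

# Three surveys with no shared citation ancestry converge; a figure echoed by
# many outlets but rooted in one 2018 forecast stays single-source.
findings = [
    Finding("5-12% achieve enterprise-level impact", "McKinsey", "McKinsey 2025 survey"),
    Finding("5-12% achieve enterprise-level impact", "BCG", "BCG 2025 survey"),
    Finding("5-12% achieve enterprise-level impact", "PwC", "PwC 2025 CEO survey"),
    Finding("80% of AI projects fail", "RAND 2024", "Gartner 2018 forecast"),
    Finding("80% of AI projects fail", "VentureBeat 2019", "Gartner 2018 forecast"),
]
print(tally_convergence(findings))
```

The real pipeline adds the steps code can't fake: reading each source's methodology, checking whether the restated figure matches the original language, and rating confidence per claim.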
The irony isn't lost on me. The same technology that produced the unreliable statistics produced the better ones. The difference wasn't the model. It was the process — the structured requirement that every claim be traced to a primary source, every source be evaluated for methodology, and every conclusion be rated for confidence.
The models are getting better. That's real — anyone using these tools daily can feel the distance between what they could do eighteen months ago and what they can do now. Better models raise the ceiling. But the organizations that hit it are the ones willing to stop and check — even when the story looks good enough, even when the sky is beautiful and the dog is sleeping and there's always something more pressing to build.
That willingness is the whole game. The tools just raised the stakes.
If your organization is navigating stalled pilots, unverified vendor claims, or AI investments that aren't delivering — a Ground Truth Assessment is built for exactly this. Email me a short note about the situation, and I'll tell you whether I think the assessment will help.
Claims Register
The load-bearing factual claims in this dispatch, traced to their primary sources and assessed for methodology, sample size, and potential conflicts of interest. Scenic, authorial, and scene-setting passages (the backyard, the dog, the draft history) are not enumerated — they are unfalsifiable self-report, not evidence. A Ground Truth Assessment applies the same discipline to client work, but against a richer evidence base: your requirements, your codebase, your team, and the specific claims your vendors are making.
Verdicts used below:

- Confirmed — the figure byte-matches the primary source and is corroborated by at least one independent primary, or triangulates across three or more independent primary sources.
- Directionally confirmed — multiple sources agree in direction but differ in method or scope; convergence is real but not tight, or the finding is single-primary with convergent circumstantial support.
- Noted — the figure or caveat is recorded for context but is single-sourced qualitative evidence, a prediction rather than a measurement, or qualified by methodology limits.
- Debunked — the figure has been traced back to a source that does not support it as restated, or has been retracted.
| Claim | Primary Source | Method | Verdict |
|---|---|---|---|
| "80% of AI projects fail" | Gartner 2018 forecast via RAND 2024 | Prediction, not measurement; original language was "deliver erroneous outcomes due to bias," not "fail" | Debunked — citation chain traced in dispatch; semantic shift documented |
| "55% regret AI layoffs" | Forrester Predictions 2026 | Analyst prediction, no disclosed sample | Debunked — widely misreported as survey finding |
| 5–12% achieve significant financial impact | McKinsey n=1,993; BCG n=1,250; PwC n=4,454 | Three independent executive surveys, 2025–2026 | Confirmed — convergent across methodologies |
| 90% report no measurable AI impact | NBER WP 34836, n≈6,000, 4 countries | Academic survey (Bloom, Davis et al.) | Directionally confirmed — single-primary with convergent circumstantial support (NYC WARN, Challenger, Oxford Economics) |
| 70% of challenges from people/process | BCG consulting case work + BCG AI at Work 2025 (same-source) | Practice-derived 70/20/10 heuristic; BCG's 2025 employee survey is consistent with the ratio but is not an independent corroborator | Directionally confirmed — single lineage (BCG case work + BCG survey); body text softened to "consistent with" |
| Top failure cause: wrong problem | RAND, 65 practitioner interviews | Qualitative research | Noted — small qualitative n, consistent with other sources but not independently corroborated at scale |
| Companies cut on AI potential, not evidence | NBER, NYC WARN filings, Challenger, Oxford Economics | Convergent circumstantial evidence | Directionally confirmed — 5+ independent sources agree |
| Human-AI teams underperform best individual | Vaccaro et al., Nature Human Behaviour, 106 studies; Berger et al. Dec 2025 re-analysis of 74 of those studies | Preregistered meta-analysis + re-analysis confirming aggregate but flagging design bias | Confirmed — effect depends on task and baseline; literature base is pre-LLM (2020–mid-2023), 2026 transfer is an open question |
| Klarna: AI savings claim, then reversal | CEO statements, Bloomberg, SEC filings | Primary source documentation | Noted — well-documented but N=1; case study, not prevalence evidence |
| Focused AI (3.5 use cases) yields 2.1x ROI | BCG, n=1,803 executives | Executive survey | Directionally confirmed — single BCG survey, not independently corroborated |
| Evidence reflects 2020–2025 AI deployments, not uniformly 2023–2024 | All cited sources, per-source field-work review | Temporal scope check (corrected 2026-04-19): NBER fielded Nov 2025–Jan 2026; PwC Sep–Nov 2025; Vaccaro lit window Jan 2020–Jun 2023; Gartner 2018 / VentureBeat 2019 older still | Noted (corrected) — caveat reframed to match actual source windows |
| Gartner 2025 successor: 40% of agentic AI projects canceled by 2027 | Gartner press release, June 25, 2025 | Analyst forecast (prediction, narrower scope) | Noted — successor to the 2018 85% line; still a prediction |
| IBM CEO Study: 25% of AI initiatives deliver expected ROI; 16% scaled | IBM Institute for Business Value, May 2025, n=2,000 CEOs | Executive survey | Directionally confirmed — single IBM survey; functions as convergent corroborator for the 5–12% finding above |
| Vaccaro aggregate re-analyzed, design bias flagged | Berger et al., arXiv:2512.13253, December 2025 | Independent re-analysis of 74 of 106 Vaccaro studies | Noted — aggregate finding holds; experimental-design caveat added |
How this dispatch was made
Every Shifting Ground dispatch goes through a 5-stage editorial pipeline: Sift (find the threads), Forge (expand into a draft), Thrash (adversarial review for voice and honesty), Rattle (final coherence check), and Lint (publish-readiness). The AI assists with structure and pacing. The voice is human. Learn more about how this site is built.
Raw Seed
Six threads from an evening research session. ~400 words.
Core Tension. Everyone cites the failure rates. 80%. 85%. 95%. The numbers vary but the narrative is consistent: AI projects fail. A lot. But when you actually trace the citations, a different picture emerges — one that's both less dramatic and more troubling than the headlines suggest.
1. The numbers are unreliable. The 80% traces to a 2018 Gartner forecast. The 87% traces to a conference talk. The 95% uses a definition of success so narrow that most successful transformations wouldn't qualify. No study uses probability sampling. A 2025 scoping review concluded these figures should be treated as "hypothesis-generating rather than inferential."
2. What IS defensible: Only 5-12% of organizations achieve significant enterprise financial impact from AI. The concentration effect — very few succeed, most are in a messy middle — is the most replicated finding.
3. The failure isn't the technology. BCG found 70% of the challenge is people and processes. RAND's #1 cause: "misunderstanding the problem AI needs to solve." The models work. The organizations don't adapt.
4. Companies fired people for what AI MIGHT do. Only 2% of AI-driven layoffs were based on actual AI implementation. 60% were "anticipatory." 73% of companies that replaced workers with AI broke even or lost money. [Retracted — the HBR and CareerMinds sources for these figures had undisclosed conflicts of interest and selection bias. The revised finding — that companies cut on AI potential rather than measured impact — is supported by the NBER, WARN, and Oxford Economics evidence in the published dispatch above.]
5. The collaboration surprise. Multiple studies show that adding humans to AI decision-making actually REDUCES accuracy. The best pattern isn't human+AI. It's AI first, human review.
6. What comes next. When agents can build AND assess their own output, the need for structured verification doesn't decrease — it increases.
Note: Threads 4 and 5 were revised after the claims register identified methodology problems in the originally cited sources. Thread 4's HBR and CareerMinds sources had undisclosed conflicts of interest and selection bias. Thread 5's demand forecasting claim was not in the cited source. See the claims register above.