D is for Data: The Real Reason AI Projects Fail

Between 70 and 85% of AI project failures stem from data problems, not algorithms. Here's what every business owner needs to know.

14 min read
James Anderson

Watch the Episode

AI in Business on YouTube

Prefer video? Watch this week's full breakdown on the AI in Business YouTube channel.


Introduction

Today we’re talking about something that does not get enough attention in the AI conversation. We are talking about data. Not AI data, not machine learning data. Your data. The data sitting in your spreadsheets, your CRM, your customer service emails, and your old databases.

I am going to show you why data is the real reason AI projects fail, what it costs businesses, and exactly what you need to do about it. This episode builds on the foundation from my previous video on chatbots. If you have not seen that one, go back and watch it first because we will be building on those concepts here.

The reason this matters so much is simple. You could have the most sophisticated AI model in the world, but if your data is a mess, you are going to get terrible results. The algorithm is the recipe, but your data is the ingredients. And most businesses are cooking with spoiled ingredients.

Why AI Projects Fail

Let me start with a statistic that should concern any business owner watching this. Between 70 and 85% of all AI project failures are due to data problems, not problems with the algorithms themselves. That comes from research by Gartner, Deloitte, and McKinsey spanning the last several years.

When AI projects fail, most people assume the algorithm was not smart enough. They think the model was built wrong. But the research tells a different story. The models are often brilliant. The problem is what you feed them.

Think about it this way. You could have the most sophisticated recipe in the world, but if your ingredients are spoiled or poor quality, you are going to get a terrible meal. AI is exactly the same. The algorithm is the recipe. The data is the ingredients. And most businesses are cooking with spoiled ingredients.

The financial proof is staggering. In enterprises, the average failed AI project costs around two and a half million dollars. A study from MIT in 2025 found that 95% of generative AI deployments showed zero revenue impact. Not because the AI was not powerful enough. Because the data was not ready.

And it is not just about messy data causing poor performance. There are real legal risks. State Farm faced a lawsuit in 2022 over a machine learning algorithm for detecting fraudulent claims. The plaintiffs alleged the algorithm used data about housing and behaviour as proxies for race, subjecting Black homeowners to extra scrutiny. The case survived a motion to dismiss. That is a real legal risk from data that nobody checked for bias before feeding it into an AI.

Here is what I want you to take away from this. The algorithms are not the bottleneck. Your data is. And that is actually good news because you can do something about your data. You cannot build a better algorithm than Google or OpenAI, but you can clean up your own data.

The True Cost of Messy Data

Now let me talk about something that most AI vendors will not tell you upfront: the hidden cost of data cleanup. Gartner research has found that organisations lose an average of $12.9 million annually due to data quality issues.

Here is the $1-$10-$100 rule that every business owner should know. If you catch a data error at the point of entry, it costs $1 to fix. If you catch it after it is already in your system, it costs $10. If you catch it after a customer sees it, it costs $100. And the cost escalates dramatically the longer bad data sits in your system.
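
To make the rule concrete, here is a minimal sketch of the arithmetic in Python. The unit costs come from the rule itself. The stage names and the error counts are made up for illustration, so treat the totals as an example rather than a benchmark.

```python
# Illustrative unit costs in dollars, taken from the $1-$10-$100 rule.
COST_PER_ERROR = {"entry": 1, "in_system": 10, "customer_facing": 100}


def cleanup_cost(errors_by_stage: dict[str, int]) -> int:
    """Total cost in dollars of fixing errors, given the stage where each was caught."""
    return sum(COST_PER_ERROR[stage] * count for stage, count in errors_by_stage.items())


# Hypothetical scenario: the same 1,000 errors, caught at different stages.
caught_early = cleanup_cost({"entry": 900, "in_system": 90, "customer_facing": 10})
caught_late = cleanup_cost({"entry": 100, "in_system": 300, "customer_facing": 600})

print(f"Mostly caught at entry: ${caught_early:,}")
print(f"Mostly caught late:     ${caught_late:,}")
```

With these invented counts, the same thousand errors cost $2,800 when most are caught at entry and $63,100 when most reach a customer.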

And here is the part that surprises most people. Data scientists, the people actually building the AI models, spend 60% of their time cleaning databases. They are not building models. They are cleaning up mess.

I have seen this firsthand in my own business. We spent months building what we thought was a straightforward AI system for customer service. Then we realised our historical data was a complete mess: inconsistent formats, duplicate customer records, and customer entries that had been manually edited by different people over the years with no standard format. We had to go back, clean house, and get all of that data into an orderly state before the AI could even begin to learn.

The consultant industry has noticed this too. There are firms that make a very good living convincing companies they need months or years of data cleaning before any AI can work. And there is no clear endpoint. Clean enough is undefined. The goalposts keep moving and the invoices keep coming.

I am not saying data cleaning is unimportant. It is important. But you need to be realistic about what it takes, and you should not let anyone convince you that you need perfect data before you can get started. Perfect data does not exist and it is not necessary.

Let me give you a real example of what happens when a business gets this wrong. A UK online retailer invested £45,000 in a customer recommendation engine in 2024. They wanted to show customers products that they were more likely to buy, based on browsing and purchase history. The technology was solid and the algorithm was sophisticated, but within six months 25% of the suggestions were irrelevant. Customers were getting recommendations that made no sense, and the conversion rate dropped by 12%.

The problem was not the algorithm. It was their data. Duplicate customer records, with some customers appearing three or four times under slightly different email addresses. Purchase history that still included items returned years ago. Product catalogs with inconsistent category codes from a system migration two years earlier. They scrapped the entire project after six months: £45,000 spent on AI that could not work because the ingredients were so bad, then another £20,000 to clean the data and understand what went wrong. Total loss: at least £65,000.
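
Duplicate records of the kind described here can often be surfaced with a simple check before any AI work begins. The sketch below is one possible approach, not the retailer's actual process. It assumes each record has `customer_id` and `email` fields (hypothetical names) and treats addresses as the same when they differ only by case, surrounding spaces, or a `+tag` suffix.

```python
from collections import defaultdict


def normalise_email(email: str) -> str:
    """Lowercase, trim whitespace, and drop any '+tag' suffix from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"


def find_duplicates(records: list[dict]) -> dict[str, list[str]]:
    """Group customer IDs by normalised email; keep only groups with more than one ID."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise_email(record["email"])].append(record["customer_id"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}


# Hypothetical records for illustration only.
customers = [
    {"customer_id": "C001", "email": "Jane.Doe@example.com"},
    {"customer_id": "C002", "email": " jane.doe@example.com "},
    {"customer_id": "C003", "email": "jane.doe+shop@example.com"},
    {"customer_id": "C004", "email": "sam@example.com"},
]

print(find_duplicates(customers))
```

This only catches near-identical addresses. Customers who registered with genuinely different emails would need matching on other fields, such as name and postcode, and a human review before merging.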

Notice something about that story. The technology worked perfectly. If they had cleaned their data first, the project would have succeeded. But they did not, and they lost both the money and the time. And this is a pattern that I have seen over and over again. Businesses spend tens of thousands on AI without checking whether their data is ready, and then they blame the technology when it does not work.

The 80-20 Rule for Data

Now here is where I want to change your perspective. That messy data sitting in your business is not just a problem. It is also an asset. It is gold if you know how to use it.

Your customer service emails contain real-world patterns of how your customers ask questions, what their complaints are, how they describe their problems, and what language they use. That is infinitely more valuable for training an AI model than generic data sets from the internet.

And here is how to turn those emails into AI training data.

First, collect your historical data. Export six to twelve months of past customer emails and aim for at least a thousand conversation threads.

Second, anonymise the data. Remove personal details like names, email addresses, and phone numbers. This is essential for GDPR compliance. You can use tools like Google Cloud DLP to do this automatically.

Third, label and categorise the emails. Tag them as refund request, technical support, billing question, or positive feedback. This teaches the AI what different types of inquiries look like.

Fourth, connect your labelled data to an AI platform and feed it in for training.
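
The anonymise and label steps can be sketched in a few lines of Python. This is a deliberately simple stand-in, not how Google Cloud DLP or any other dedicated tool works. The regular expressions catch only email addresses and phone-like numbers, not names, and the keyword lists are hypothetical examples that you would replace with your own categories or with hand labelling.

```python
import re

# Simple patterns for illustration; they will miss some formats and do not detect names.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical keyword lists; a real project would refine these or label by hand.
LABEL_KEYWORDS = {
    "refund request": ("refund", "money back"),
    "billing question": ("invoice", "charged", "billing"),
    "technical support": ("error", "not working", "crash"),
    "positive feedback": ("thank you", "great service"),
}


def anonymise(text: str) -> str:
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def label(text: str) -> str:
    """Return the first category whose keywords appear in the text."""
    lowered = text.lower()
    for category, keywords in LABEL_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unlabelled"


message = "Hi, I was charged twice. Call me on 020 7946 0958 or email jo@example.co.uk."
clean = anonymise(message)
print(clean)
print(label(clean))
```

Anything that comes back as "unlabelled" is worth reviewing by hand, since those threads often reveal categories nobody thought to list.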

The results can be significant. Businesses that have done this report automating 40% to 60% of routine inquiries and reducing staff workload by 20% to 30%, without losing the personal touch on complex issues. One company saw response times drop from hours to minutes.

Your data is not just a mess to clean up. It is the raw material for building AI that understands your business specifically.

Now let me share something that will change how you think about data. The 80-20 rule. 80% of the value in your data set comes from 20% of the records. Not all data is created equal. Your most recent, most complete, most relevant records are worth far more than the thousands of old entries sitting in your archives.

A thousand high-quality, well-labelled interactions from the last year will train a better AI than 10,000 randomly selected records from the last decade. And this is liberating because it means you do not need to clean everything. You need to just identify the 20% that matters and focus there.

For most businesses, that means your last 12 to 18 months of customer interactions, your current product catalog, and your active customer records.
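
One way to pick out that 20% is to filter for records that are both recent and complete. The sketch below assumes each record is a dictionary with a `created` date plus the three fields named in `REQUIRED_FIELDS`. Those field names and the 548-day cut-off (roughly 18 months) are assumptions for illustration.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("customer_id", "message", "category")  # hypothetical field names


def is_complete(record: dict) -> bool:
    """True when every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def select_training_records(records: list[dict], today: date, max_age_days: int = 548) -> list[dict]:
    """Keep records that are complete and no older than roughly 18 months."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if is_complete(r) and r["created"] >= cutoff]


# Hypothetical records for illustration only.
records = [
    {"customer_id": "C1", "message": "Where is my order?", "category": "delivery", "created": date(2025, 3, 1)},
    {"customer_id": "C2", "message": "", "category": "billing", "created": date(2025, 4, 1)},  # incomplete
    {"customer_id": "C3", "message": "Old complaint", "category": "billing", "created": date(2019, 6, 1)},  # too old
]

selected = select_training_records(records, today=date(2025, 6, 1))
print([r["customer_id"] for r in selected])
```

Passing `today` in explicitly keeps the filter repeatable, so the same export gives the same result whenever you rerun it.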

And here is the related insight. More data is not always better. After a certain point, adding more data gives you less improvement while adding more mess: more duplicates, more inconsistencies, more outdated information. This ties into how context works in large language models. Past a certain point, the more context you put into them, the lower the quality of the output tends to be.

We made this mistake early on. We kept adding data to our training sets, thinking more would be better, and it was not. We had to go back, reduce the data sets to the highest quality records, and only then did our system start performing well. Do not fall into the trap of thinking you need years of data before you can use AI. You probably have enough already. What you need is clean data.

Spreadsheets: Your Biggest Asset and Problem

Let me talk about your spreadsheets. Because for most small and medium businesses, the spreadsheet is both your biggest AI asset and your biggest problem.

It is an asset because your spreadsheet contains the operational data that AI needs. Sales figures, customer lists, inventory, products, project timelines. This is exactly the type of structured data that AI can analyse and learn from.

It is a problem because spreadsheets are notoriously messy, even if you are a bit of an Excel ninja like myself. Multiple people edit the same file with no version control. Formatting is inconsistent, with a mix of date formats across columns. Cells are left blank because someone did not know what to put there. Then there are typos that break formulas, references to cells that do not exist, and so on.

I see this in every business I work with. The data is there. The value is there. But it is buried under years of ad hoc organisation.

The fix is not to stop using spreadsheets. It is to implement some basic discipline. Use consistent date formats. Do not leave cells blank. Use a placeholder like “unknown” instead. When you rename a product category, update every reference. These small habits make a massive difference when you come to train AI.
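
Two of these habits, consistent dates and no blank cells, are easy to enforce on an exported sheet. The sketch below is a minimal example that assumes rows have been loaded as dictionaries of strings, for instance with Python's `csv.DictReader`. The list of date formats is an assumption and reads dates day-first, as is usual in the UK, so adjust it to match your own files.

```python
from datetime import datetime

# Date formats assumed to appear in the sheet (day-first); extend to match your own files.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def normalise_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or 'unknown' if it cannot be parsed."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return "unknown"


def clean_row(row: dict[str, str], date_columns: tuple[str, ...]) -> dict[str, str]:
    """Standardise date columns and replace blank cells with 'unknown'."""
    cleaned = {}
    for column, value in row.items():
        value = (value or "").strip()
        if column in date_columns:
            cleaned[column] = normalise_date(value)
        else:
            cleaned[column] = value or "unknown"
    return cleaned


# Hypothetical row for illustration only.
row = {"customer": "Acme Ltd", "order_date": "03/02/2024", "region": ""}
print(clean_row(row, date_columns=("order_date",)))
```

Dates that do not match any known format come back as "unknown" rather than being guessed, which makes them easy to find and fix by hand.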

One of the things we did in our business was convert our spreadsheets into a custom tool. You can do this with software like Softr or with a bit of coding in Claude Code, and it is surprisingly straightforward. You end up with your own local software, which you as the business owner can use instead of those old clunky Excel spreadsheets. Spreadsheets are great and comfortable, and we all like them. But when you start scaling the business, they fall apart very quickly.

So here is the broader point about data readiness. Off-the-shelf AI tools are built to handle messy data better than custom builds. They have been trained on massive data sets and can tolerate inconsistencies that would break a custom system. If you are an SME, start with off-the-shelf tools. Only invest in custom builds when you have proven the concept and cleaned your data.

We started with a frontier model. We cleaned our data, anonymised all of it, fed it into a frontier model, and tidied it up. What we ended up doing, though, was building a whole custom white-label ERP system, which is bespoke to our business and handles all of its functions.

GDPR and AI: What Happens When Data Gets Deleted

Let me address something that causes a lot of anxiety for business owners. What happens to AI when you delete customer data under GDPR?

Here is a straightforward answer. If you delete customer data that was used to train an AI, the AI does not forget what it learned. AI models are like humans. You cannot make them forget selectively. The patterns they learned are baked into the model.

So this creates a little bit of legal tension. GDPR gives customers the right to be forgotten. But if their data was used to train AI, you cannot easily remove its influence.

One practical solution is anonymisation before training, and this is what we absolutely recommend for any data that you put into a frontier model. If you are using a local open-source model, this is a different kettle of fish because it is yours, it is local, and it is not hosted anywhere else. Personally, I believe this is where the industry is going, and where the security of your sensitive business data should sit in the future.

So when a customer asks for deletion, if you trained on anonymised data, you can honestly say that their personal data is not in the system. The AI learned from anonymised patterns, not from their personal information. This is why the anonymisation step is absolutely critical if you are going to be using a frontier model such as one of OpenAI's models or Claude, or whatever that might be. This section is critical. Do not skip it.

You also need a clear retention policy. Define how long you keep each type of data. Customer transactions might stay for seven years for tax purposes. Service interactions might stay for two years. Marketing data might stay for one. Automate the deletion so you stay compliant without thinking about it.
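
An automated retention check can be as small as the sketch below. The retention periods are the examples from the paragraph above, not legal advice. The record structure (`id`, `type`, `created`) and the type names are hypothetical, and a real system would run something like this on a schedule and log what it deletes.

```python
from datetime import date

# Retention periods in years, taken from the examples in the text above.
RETENTION_YEARS = {"transaction": 7, "service_interaction": 2, "marketing": 1}


def is_expired(record_type: str, created: date, today: date) -> bool:
    """True when a record has outlived the retention period for its type."""
    years = RETENTION_YEARS[record_type]
    try:
        expiry = created.replace(year=created.year + years)
    except ValueError:  # created on 29 February and the expiry year is not a leap year
        expiry = created.replace(year=created.year + years, day=28)
    return today >= expiry


def records_to_delete(records: list[dict], today: date) -> list[str]:
    """Return the IDs of records that are due for deletion."""
    return [r["id"] for r in records if is_expired(r["type"], r["created"], today)]


# Hypothetical records for illustration only.
records = [
    {"id": "T1", "type": "transaction", "created": date(2020, 1, 15)},
    {"id": "S1", "type": "service_interaction", "created": date(2022, 5, 1)},
    {"id": "M1", "type": "marketing", "created": date(2024, 9, 1)},
]

print(records_to_delete(records, today=date(2025, 6, 1)))
```

Keeping the periods in one dictionary means the policy lives in a single place, so a change to the rules is a one-line edit rather than a hunt through the code.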

If you are using a vendor’s AI tool, ask them what happens to training data when customers request deletion. If they cannot give you a clear answer, that should concern you.

Key Takeaways

  • Reality check. Between 70 and 85% of AI failures are data-related problems, not algorithm problems. Your data is the bottleneck. The good news is you can actually fix your data. You cannot build a better algorithm than OpenAI, but you can clean up your own spreadsheets.

  • The 80-20 principle. You do not need to clean everything. Focus on the 20% of your data that delivers 80% of value. That means your last 12 to 18 months of customer interactions, your current product catalog, and your active records. A thousand clean examples beat 10,000 messy ones.

  • The action. Do a 30-minute data inventory this week. List every source of customer and business data you have. For each one, ask: how old is it? How complete is it? How consistent is the format? Who has access to it? That exercise alone will show you where to focus before you spend a penny on AI.
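
For the "how complete is it?" question in the inventory, a quick profile of each data source can help. The sketch below is one simple way to do it, assuming the source has been exported to rows of dictionaries. The column names and sample rows are invented for illustration.

```python
def is_filled(value) -> bool:
    """True when a cell holds something other than None or whitespace."""
    return value is not None and str(value).strip() != ""


def completeness(rows: list[dict]) -> dict[str, float]:
    """Percentage of rows with a non-empty value, for each column seen in the data."""
    if not rows:
        return {}
    columns = sorted({column for row in rows for column in row})
    return {
        column: round(100 * sum(is_filled(row.get(column)) for row in rows) / len(rows), 1)
        for column in columns
    }


# Hypothetical customer export for illustration only.
rows = [
    {"name": "Jane", "email": "jane@example.com", "phone": ""},
    {"name": "Sam", "email": "", "phone": None},
    {"name": "Ali", "email": "ali@example.com", "phone": "020 7946 0000"},
    {"name": "Kim", "email": "kim@example.com"},
]

for column, percent in completeness(rows).items():
    print(f"{column}: {percent}% filled")
```

Columns with a low fill rate are candidates to either fix first or leave out of any AI project altogether.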

Over to You

Your data is the real bottleneck in AI success, and cleaning it is far cheaper than a failed AI project. Before you spend any money on AI tools or implementations, do yourself a favour and take a hard look at what you are feeding them.

If you want to go deeper on this, I have plenty more videos on the channel about implementing AI in your business without the hype. Subscribe if you have not already, and if you have any questions about where to start with your data, drop them in the comments. I read every one.

In the next lesson, we are moving on to E, and we will be talking about ethics: the questions AI raises for business owners, from customer privacy to automated decision-making, and how to stay on the right side of them. See you there.


Written by James Anderson

Royal Navy veteran, electrical engineer, and AI consultant helping SME owners understand and implement AI. Host of AI in Business on YouTube.
