Your AI Project Is Only as Good as the Data Behind It

Monika Kotus
4 days ago

For most of the AI projects I run now, the hardest part is the data – whether it's accurate, current, and something a team can actually trust. Models, integrations, and the tools themselves have become routine. The data is where a project works or falls apart.

Gartner has predicted that through 2026, organizations will abandon 60% of AI projects whose data isn't ready to support them.* That matches what I see in my own work.

So I want to share how I work with data – collecting it, checking it, deciding what's good enough – because this is where most of the real effort goes.

What I actually do in projects

A large part of my project work now comes down to one thing: gathering data and preparing it so it can feed a client's AI project.

Some of that data comes from the client. Some comes from public sources. A lot of it already exists inside the organization and has simply never been used – old project archives, reports, analyses, and notes sitting in folders nobody opens. We pull the useful parts into one place and build the project's knowledge base: a setup that reflects that specific organization or research team, so the AI works from their reality.

In practice, I often build this as a Project in Claude. When the knowledge base is small, the model works with all of it directly. When it grows large, Claude switches to retrieval-augmented generation (RAG), pulling the most relevant pieces for each question instead of holding everything at once.

For some projects, I also use NotebookLM, another grounded retrieval tool. It works only from the sources you upload and answers with citations pointing back to them, and it copes with a genuinely large set of documents: recent versions accept up to 300 sources in a single notebook. Its real strength is speed: searching a big body of material and quickly pulling together what it says. I build a full Claude Project when the work calls for deeper reasoning over the material, and I reach for NotebookLM when a project needs fast, wide search across a lot of sources.

Getting data ready for that knowledge base is a large part of the work in itself – cleaning it, reshaping it, checking it. Much of that processing I do in Claude Code, which is well suited to working through data files directly.

This is where the data question becomes unavoidable. RAG retrieves from what you give it. If the underlying data is wrong, outdated, or contradictory, the AI passes that straight through, written fluently and with full confidence. The quality of the writing tells you nothing about the quality of the source.
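
To make that concrete, here is a minimal sketch of a retrieval step. It is an illustration only: real RAG systems usually rank passages with embeddings, and this toy version uses plain word overlap as a stand-in. The function names and the three sample passages are hypothetical. The point it shows is that ranking is driven by relevance to the question, and nothing in the step checks whether the passage is still true.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(question: str, passage: str) -> int:
    """Count how many question words also appear in the passage."""
    question_words = Counter(tokenize(question))
    passage_words = set(tokenize(passage))
    return sum(n for word, n in question_words.items() if word in passage_words)


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k passages with the highest word overlap."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    return ranked[:top_k]


# Hypothetical knowledge base. Imagine the first passage is ten years old.
KNOWLEDGE_BASE = [
    "The licence fee for new market entrants is 5000 EUR.",
    "Office plants are watered on Fridays.",
    "New entrants must register before applying for a licence.",
]

top = retrieve("What is the licence fee for new entrants?", KNOWLEDGE_BASE, top_k=1)
print(top[0])  # the fee passage wins on relevance, whether or not it is current
```

The outdated passage is returned first because it matches the question best. Any check on accuracy or age has to happen before the data goes into the knowledge base.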

Two times the input data nearly derailed a project

Here are two examples from real projects.

A literature review. I was working with a research team, building on a reference database of papers, authors, and links. I made an assumption I no longer make: that because the database had been handed over, it was correct. Several entries had the wrong authors, or links that led nowhere. The model, working from those references, started producing citations that didn't hold up. The references it was given were wrong, so its output was wrong.

When the errors surfaced, we went back and rebuilt the base properly. I used Claude Code for this. I had it search and verify online whether each paper actually existed, confirm the real authors, and correct the citations against reliable sources. We didn't have the PDFs ourselves, so Claude Code also retrieved the underlying information from the appropriate places. It was slow, careful work, and it turned an unreliable base into one we could trust.

What I took from it is simple: verify the base before you build on it. The cross-check is the work, and skipping it just moves the problem downstream.
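
A cross-check of this kind can be sketched in a few lines. This is not the process used in the project, which relied on Claude Code searching online. It is a minimal example that assumes the trusted record has already been looked up somewhere reliable and is passed in as a dictionary. The field names, the "First Last" author format, and the sample entry are all hypothetical.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def surnames(authors: list[str]) -> set[str]:
    """Take the last word of each name (assumes 'First Last' ordering)."""
    return {normalize(a).split()[-1] for a in authors if normalize(a)}


def check_reference(entry: dict, trusted: Optional[dict]) -> list[str]:
    """Compare a database entry with a trusted record and list the problems."""
    if trusted is None:
        return ["no trusted record found: paper may not exist"]
    problems = []
    if normalize(entry["title"]) != normalize(trusted["title"]):
        problems.append("title differs from trusted record")
    if surnames(entry["authors"]) != surnames(trusted["authors"]):
        problems.append("author list differs from trusted record")
    return problems


# Hypothetical entry from the handed-over database and its trusted counterpart.
entry = {"title": "Data Quality in Practice", "authors": ["Ann Lee", "Bob Cruz"]}
trusted = {"title": "Data quality in practice.", "authors": ["Ann Lee", "Carla Diaz"]}

print(check_reference(entry, trusted))  # ['author list differs from trusted record']
print(check_reference(entry, None))     # ['no trusted record found: paper may not exist']
```

An empty list means the entry agrees with the trusted record on title and authors. Anything else goes to a person for review.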

A business case. For a startup preparing to enter a market, I was building a business case from a mix of data, some of it gathered through deep research tools. One source distorted everything: a regulation that was about ten years old, with nothing to mark it as outdated. The deep research had picked it up and treated it as current, and the model surfaced it as one of the central insights of the analysis.

That single stale source sat right at the foundation of the work. The budget, the resourcing, the core assumptions of the project were all being shaped on top of it. I caught it because it didn't reconcile with anything else I had, and I excluded it. If it had stayed in, the whole business case would have rested on a fact that stopped being true a decade ago.

Two different projects, one root cause: the input data was not what it appeared to be.

Why data quality is the job

I spent several months on a full-time data science program, and the most useful thing I took from it was a way of working with data before trusting it.

Much of that work is unglamorous, and most of a project's time goes into it. Here is what I pay attention to.

Exploring the data before committing to it. Before a project really starts, I go through the data myself to see what is genuinely in it. The real picture is often different from what people assume. This step alone can end a project in two days instead of six months, and that is a good outcome when the data won't support the goal.

Looking for the gaps. Data that looks complete often isn't. I work a lot with researchers, universities, and grant teams, and I regularly find that something a project needs simply isn't in the material I was given. When I notice a gap, I go back to the team and ask about it directly, and that is usually when people start to remember. A lot of what a project depends on was never written down anywhere. It lives in people's heads, and it only becomes usable data once someone asks the right question.
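
Some gaps are visible in the data itself. As a minimal sketch, assuming the material has already been loaded as a list of records with named fields (the field names and sample records here are hypothetical), a quick count of missing values shows where to start asking questions:

```python
from collections import Counter


def missing_field_counts(records: list[dict], required: list[str]) -> Counter:
    """Count, per required field, how many records lack a usable value."""
    missing = Counter()
    for record in records:
        for field in required:
            value = record.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return missing


# Hypothetical project records.
records = [
    {"title": "A", "year": 2021, "owner": "R&D"},
    {"title": "B", "year": None, "owner": ""},
    {"title": "C", "owner": "  "},
]

counts = missing_field_counts(records, ["title", "year", "owner"])
print(counts["year"], counts["owner"], counts["title"])  # 2 2 0
```

A count like this only finds gaps in fields someone thought to define. Knowledge that was never written down still has to be recovered by asking the team.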

Finding the duplicates. The same thing shows up across systems under different IDs. A document saved as "report", "report final", and "report final v2" gets read as three separate sources. Cleaning this up changes what the AI thinks it knows.
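
One way to surface such duplicates, sketched under simple assumptions: strip version words from file names to find likely duplicates, and hash whitespace-normalized content to find exact ones. The list of version words, the function names, and the sample files are hypothetical, and near-duplicates with edited text would need a fuzzier comparison than this.

```python
import hashlib
import re
from collections import defaultdict

# Words that usually mark a version rather than a different document (assumed list).
VERSION_WORDS = re.compile(r"\b(final|draft|copy|v\d+)\b")


def name_key(filename: str) -> str:
    """Reduce a file name to its core: no extension, separators, or version words."""
    stem = filename.lower().rsplit(".", 1)[0]
    stem = re.sub(r"[_\-()]+", " ", stem)
    stem = VERSION_WORDS.sub(" ", stem)
    return " ".join(stem.split())


def content_key(text: str) -> str:
    """Hash the text after collapsing case and whitespace differences."""
    collapsed = " ".join(text.lower().split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def likely_duplicates(docs: dict[str, str]) -> dict[str, list[str]]:
    """Group file names that share the same core name (groups of 2 or more)."""
    groups = defaultdict(list)
    for filename in docs:
        groups[name_key(filename)].append(filename)
    return {key: names for key, names in groups.items() if len(names) > 1}


def exact_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group file names whose normalized content is identical."""
    groups = defaultdict(list)
    for filename, text in docs.items():
        groups[content_key(text)].append(filename)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical folder contents: file name -> extracted text.
docs = {
    "report.docx": "Q3 results",
    "report final.docx": "Q3 results",
    "report final v2.docx": "Q3  results\n",
    "budget.xlsx": "numbers",
}

print(likely_duplicates(docs))  # {'report': ['report.docx', 'report final.docx', 'report final v2.docx']}
print(exact_duplicates(docs))   # [['report.docx', 'report final.docx', 'report final v2.docx']]
```

Name matches are only candidates, since "report final v2" may contain real changes. Deciding which version to keep remains a human call.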

Checking freshness and origin. Old data needs to be labeled as old. When it isn't, it behaves like current data and does real damage, as it did in that business case. Knowing when a dataset was created, and where it came from, matters as much as what it contains.
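
Labeling age can be as simple as the sketch below. It assumes each source carries a publication date in its metadata, and it uses an arbitrary three-year threshold. Both the threshold and the field names are assumptions for illustration, and the right cutoff depends on the domain. Sources with no date get their own label instead of being treated as current.

```python
from datetime import date


def label_freshness(sources: list[dict], today: date, max_age_years: int = 3) -> list[dict]:
    """Return copies of the sources with a 'status' of current, stale, or unknown date."""
    labeled = []
    for source in sources:
        published = source.get("published")
        if published is None:
            status = "unknown date"
        elif (today - published).days > max_age_years * 365:
            status = "stale"
        else:
            status = "current"
        labeled.append({**source, "status": status})
    return labeled


# Hypothetical sources for a business case.
sources = [
    {"name": "regulation", "published": date(2015, 3, 1)},
    {"name": "market report", "published": date(2024, 6, 1)},
    {"name": "internal memo", "published": None},
]

for item in label_freshness(sources, today=date(2025, 1, 1)):
    print(item["name"], "->", item["status"])
# regulation -> stale
# market report -> current
# internal memo -> unknown date
```

A stale label does not mean the source is wrong. It means someone should confirm it still applies before the AI is allowed to treat it as current.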

Watching for bias. Data carries the patterns of how it was collected. A model learns all of those patterns, including the ones nobody intended. If the data leans a certain way, the output leans with it.

Checking sources against each other. For a knowledge base feeding RAG, the question is whether the documents are consistent with each other and still current. When two sources say different things, the answer depends on which one retrieval happens to surface, so the same question can get different answers. There is no single version of the truth for the AI to land on.
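
A consistency check along these lines can be sketched as follows. It assumes the key facts have already been pulled out of each document as (source, topic, value) entries, by hand or by an earlier step, which is the hard part and is not shown here. The function name and the sample claims are hypothetical.

```python
from collections import defaultdict


def find_conflicts(claims: list[tuple[str, str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return topics for which the sources give more than one distinct value."""
    by_topic: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for source, topic, value in claims:
        # Compare values case-insensitively and ignore surrounding whitespace.
        by_topic[topic][value.strip().lower()].append(source)
    return {
        topic: dict(values)
        for topic, values in by_topic.items()
        if len(values) > 1
    }


# Hypothetical facts extracted from three documents.
claims = [
    ("handbook.pdf", "notice period", "30 days"),
    ("policy_2019.pdf", "notice period", "14 days"),
    ("faq.md", "notice period", "30 Days"),
    ("handbook.pdf", "office", "Warsaw"),
]

print(find_conflicts(claims))
# {'notice period': {'30 days': ['handbook.pdf', 'faq.md'], '14 days': ['policy_2019.pdf']}}
```

Each conflict lists which sources back which value, so the team can decide which version is authoritative and remove or relabel the rest.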

Working with a smaller, higher-quality sample. Work on the data never really ends, but collecting more of it does not always mean a better result. RAG has its limits too, so it is often worth working with a smaller, higher-quality sample of data.

Running through all of it is a willingness to say the data isn't usable. Excluding a weak source, or pausing a project because the foundation isn't there, is part of doing this well.

Precision matters more now

There is a second reason data quality has moved to the center of my work.

The newest models, Claude Opus 4.7 in my case, follow detailed instructions far more closely than models did even a year ago. With a large, well-built knowledge base, the AI now stays with the logic of a project in a way it couldn't before.

That puts new weight on precision. When a model follows instructions closely, an error in the instructions or in the data shows up quickly and clearly in the output. The instructions I write need to be exact, and the data underneath them needs to be clean. There is less room for "roughly right" than there used to be.

Testing is where the real problems show up

Even with careful preparation, you don't catch everything upfront.

It is usually in testing that the real issues appear. Something was added that shouldn't be there. Something is missing. Sometimes a bias has crept into the knowledge base and is shifting how the model responds. So there is always a review loop: a person looks at the output and asks whether it's right, or whether something is off. That loop is how the project gets better.

Read the key documents together

One more practice I always recommend, for companies and researchers alike.

A knowledge base can be too large to read in full. The key documents, though, should be read carefully, with the client or the researcher present. An AI can misread or slightly reinterpret a central document, and that small shift can change the meaning of an entire project. Catching it at the foundation saves a lot of rework later.

The same principle, everywhere

For a company, the knowledge base might be structured internal knowledge for one department, used to generate new materials reliably. For a researcher, it might be the body of evidence behind a publication or the knowledge base for a grant application.

The principle holds in every case: the AI on top is only ever as good as the data underneath it.

Collecting and cleaning data is ongoing work. For any organization or researcher serious about working well with AI, that steady effort on the data is where results come from.

If you're building something with AI and want the data foundation to hold, I'd be glad to talk.

hello@monikakotus.com · monikakotus.com

* Source: Gartner, "Lack of AI-Ready Data Puts AI Projects at Risk," press release, 26 February 2025: https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk