Three years ago, reviewing thousands of documents in a single day was simply impossible. Today, with AI, it’s not only possible, it’s something we do routinely. But “being able to do it” and “doing it well” are two very different things, and the distance between them is exactly where the real work lives.
A box of surprises in every run
Not long ago we worked on a project that involved analyzing large volumes of documents to extract very specific information. The challenge wasn’t the volume itself, but the nature of the input: we received nested folders with documents of different types, in different formats, and with no guaranteed structure. Every run was a box of surprises, because we never knew what we’d find inside; the only thing we were sure of was what we were looking for.
The temptation, when you face a problem like this, is the obvious one: dump everything into a model and ask it to solve it. And it doesn’t work.
Capability is not the solution
The fact that a model can read one document doesn’t mean it can process ten thousand reliably. Doing it for real is engineering: you have to design how to parallelize the work, choose which model to use at each step (because they don’t all serve the same purpose or cost the same), build the prompts carefully, and measure the results. And above all, you have to audit, which is the point almost no one talks about and at the same time the most important one.
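To make the parallelizing and model-selection steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the pipeline from the project described here: the model names, the routing rule in `choose_model`, and the `call_model` stub are all hypothetical placeholders you would replace with your own provider and your own criteria.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model tiers; placeholder names, not real model IDs.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"


def choose_model(doc: dict) -> str:
    """Assumed routing rule: short plain-text documents go to the cheap
    tier, everything else (scans, long files) goes to the strong tier."""
    if doc["kind"] == "text" and len(doc["content"]) < 2000:
        return CHEAP_MODEL
    return STRONG_MODEL


def call_model(model: str, content: str) -> str:
    """Stand-in for a real model call; swap in your provider's client."""
    return f"[{model}] extracted from {len(content)} chars"


def process(doc: dict) -> dict:
    """Process one document and record which model handled it."""
    model = choose_model(doc)
    try:
        result = call_model(model, doc["content"])
        return {"id": doc["id"], "model": model, "result": result, "error": None}
    except Exception as exc:
        # Keep failures in the output so they can be counted and retried.
        return {"id": doc["id"], "model": model, "result": None, "error": str(exc)}


def run_batch(docs: list[dict], workers: int = 8) -> list[dict]:
    """Run documents through a thread pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, docs))
```

Recording the model and any error next to each result is a deliberate choice in this sketch: it is what later lets you measure quality and cost per step instead of per run.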
The problem no one sees: visual inspection doesn’t scale
When you review ten documents, you read them, and if the model got something wrong you catch it by eye. When it’s ten thousand, that’s over. You’re not going to read ten thousand documents to confirm the model didn’t make a mistake, and if you can’t verify the result, you don’t really have a result: you have an expensive hunch. We’ve lived through this before, in another technological wave.
The big data lesson
When data grew, exactly the same thing happened. For years, analyzing data meant opening a spreadsheet and looking at it row by row, sorting, filtering, and reviewing by eye; it worked because the scale allowed it. But when volumes exploded, that artisanal intuition stopped working, because no one can eyeball a million rows. So the discipline shifted toward sampling, statistics, validation, and anomaly detection: we stopped trusting direct inspection and started trusting method.
AI at scale is at exactly that same breaking point. The mistake is treating ten thousand documents with the same intuition we used to review ten, when the scale changed and the tools have to change with it.
Why auditing AI is harder than auditing data
With big data, the data was the data: once validated, it didn’t lie. With generative AI, by contrast, the system that produces the results can also be the source of the error, and in ways that aren’t obvious at all. There are three concrete traps we learned to watch for.
The first is that hallucinations become invisible at scale. You catch a hallucination in a single document by reading it, but one among ten thousand slips through without anyone seeing it, because it doesn’t shout: it disguises itself as a correct answer. At scale, the risk isn’t that the model fails loudly and obviously, but that it fails quietly, in that one percent of cases you’ll never read.
The second is that models make things up when you ask them for a quantitative judgment. They’re already reasonably good at arithmetic, but if you ask a model for a “score,” a rating, or a grade from one to ten, it invents one with total confidence and hands you a number that looks objective but actually came from nowhere. The problem is that a number looks serious and measurable, and that’s precisely where you can trust it least.
The third is that parallelizing breaks the holistic view. To process at scale you have to split the problem, and each agent ends up seeing its piece in isolation; that gives you speed, but it creates a silent risk, because if the analysis you need is holistic, the sum of correct partial answers can produce a wrong global conclusion. Each part is right and the whole is wrong, and that isn’t a problem of the model but of design.
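A toy example shows how every part can be right while the whole is wrong. Everything in it is invented for illustration: the `LIMIT` threshold, the vendors, and the amounts are hypothetical, and the rule being checked (total spend per vendor must stay under a limit) is just one stand-in for a question that only makes sense across documents.

```python
from collections import defaultdict

LIMIT = 10_000  # hypothetical threshold, for illustration only

# Toy per-document extractions, as if each came from a separate worker.
extractions = [
    {"doc": "a.pdf", "vendor": "Acme", "amount": 6_000},
    {"doc": "b.pdf", "vendor": "Acme", "amount": 7_000},
    {"doc": "c.pdf", "vendor": "Birch", "amount": 3_000},
]


def local_check(item: dict) -> bool:
    """What an isolated worker can verify: its own document is under the limit."""
    return item["amount"] <= LIMIT


def global_check(items: list[dict]) -> dict:
    """What only an aggregation step can verify: totals per vendor.
    Returns the vendors whose combined amount exceeds the limit."""
    totals = defaultdict(int)
    for item in items:
        totals[item["vendor"]] += item["amount"]
    return {vendor: total for vendor, total in totals.items() if total > LIMIT}


every_part_passes = all(local_check(item) for item in extractions)
violations = global_check(extractions)
```

Here `every_part_passes` is `True` and `violations` still flags Acme at 13,000. The fix lives in the design: an explicit aggregation step that sees all the partial results, not a smarter worker.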
What auditing really means
Auditing at this scale isn’t reviewing faster, it’s reviewing differently, and in practice it means several things that work in layers.
Cross-verification means not trusting a single pass of a single model, but contrasting results across different approaches and looking closely where they disagree, because that’s where the error usually hides.
Specialized agents that audit other agents let us separate the one that produces from the one that reviews, since an agent designed specifically to look for failures finds things the one that generated the result will never see.
Statistical sampling consists of taking representative samples and reviewing them thoroughly to estimate where and how much quality degrades, just like in industrial quality control.
And measurable, repeatable orchestration is what keeps the process from being a black box that produced a result once: it’s about building a system with traceability, where you can reconstruct how each conclusion was reached.
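Two of these layers, cross-verification and statistical sampling, can be sketched in a few lines of Python. This is one possible shape, under assumptions: results are assumed to be stored as dictionaries keyed by document ID, the function names are hypothetical, and the Wilson score interval is our choice of estimator here, one common option among several.

```python
import math
import random


def disagreements(run_a: dict, run_b: dict) -> list:
    """Cross-verification: IDs where two independent passes differ.
    A document missing from run_b also counts as a disagreement."""
    return sorted(doc_id for doc_id in run_a if run_a[doc_id] != run_b.get(doc_id))


def draw_sample(doc_ids: list, n: int, seed: int = 0) -> list:
    """Draw a reproducible simple random sample of IDs for manual review."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(n, len(doc_ids)))


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for the true error rate, given the number of
    errors found in n reviewed documents (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

The interval is the useful part. Reviewing 100 sampled documents and finding zero errors does not prove the run is clean: with these settings the upper bound is still roughly 3.7 percent, which on ten thousand documents is a few hundred possible failures. The sample size, not the feeling of having checked, determines how much you can claim.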
None of this is magic. It’s systems engineering applied to a component that, unlike traditional software, is probabilistic.
The model will improve. That won’t save you.
It’s true that models will keep getting better and that every year they’re more capable; that’s a given. But betting everything on the next model solving your problems is a strategy, not a solution, and it’s a bad strategy at that. Solving real problems today isn’t waiting for the next model, but knowing what you’re doing, understanding the limitations of the tool in your hand, and having the method to verify that what it produced actually works.
The capability is available to everyone, and that’s why the difference between a demo that impresses and a system you can trust isn’t in the model, but in knowing what you’re doing.
That’s what we do at Redstone Labs. If you’re wrestling with a similar problem, let’s talk.