The Stack Overflow Podcast

How do you fact-check an AI?

Episode Summary

Ryan chats with Amr Awadallah, founder and CEO of GenAI platform Vectara. They cover how retrieval-augmented generation (RAG) has advanced, why fact-checking and accurate data are essential in building AI applications, and how Vectara’s Mockingbird model seeks to minimize hallucinations.

Episode Notes

Vectara is a platform-as-a-service that allows users to build AI assistants and agents. Get started with their docs.

This interview was recorded at HumanX last month. They are gearing up for next year’s conference on April 6-9, 2026. 

Follow Amr on LinkedIn.

Episode Transcription

[intro music plays]

Ryan Donovan Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm here broadcasting from the HumanX AI Conference, and today we're going to be talking all things RAG– retrieval-augmented generation. My guest is Amr Awadallah, co-founder and CEO of Vectara. Welcome to the show. 

Amr Awadallah It’s great to be here. Thank you. 

RD Obviously RAG has been the paradigm of choice for a lot of AI for a while now. I think once people saw AI hallucinate, they were like, “We need a way to back this up with sources, with facts.” What's changed about RAG since it was first developed? 

AA So it's the realization that RAG is first retrieval and then generation. The generation is becoming better and better, to be honest, in the sense that if you give it the right things in the context window, it will try to stick to what's in the context window as much as it can, and that's measured by the grounded hallucination rate. The grounded hallucination rate is a measure of: if I give the large language model the truth, meaning the facts, the needles in the haystack, and then I give it a question that is about those facts, will it produce a response that sticks to those facts? At Vectara, we actually have a leaderboard that we published. It's called the Hallucination Leaderboard. If you just search on Google, it's the first result you're going to get. And that measures all of the large language models out there and how likely they are to hallucinate within the grounded context, which, by the way, is a lower bound on hallucination, meaning if you remove the facts from the context window, then they actually hallucinate more. So if you ask them to make conclusions based on what we call their ‘parametric knowledge,’ meaning the things stored inside of the large language model’s ‘brain,’ then they're more likely to hallucinate.

So even with grounding, they still hallucinate. The best models last year were at about a 5% hallucination rate with grounding. So even when you ground them, 1 out of every 20 tokens coming out might be completely wrong, completely off topic, or not true. This year, o3 and Gemini 2.0 from Google broke new benchmarks, and they're both around 0.8%, 0.9%, which is amazing. That said, I think we're going to saturate around 0.5%. I don't think we'll be able to beat 0.5%, and the reason why is because, at the end of the day, large language models are intrinsically probabilistic. They're applying probability functions to the outputs of the neurons and sampling outputs based on that. So yes, you can align them to try to stick to the facts as much as you can, but every now and then there will still be something off.

0.5% is amazing for consumer applications. So if I'm doing a consumer app, or even marketing or writing a novel, writing a book, writing an article, a blog, we're more tolerant of errors there. But if you're doing a medical diagnosis, no, no, no. If you're doing a legal brief, no, no, no. If you're doing manufacturing maintenance of equipment, no, no, no. If you're doing a government investigation to arrest somebody, no, no, no, et cetera, et cetera. There are many, many fields where that 0.5% is not acceptable, and that led to us having to do a lot more than just RAG. We need to do RAG with fact-checking, the same way that when you write an article yourself, or maybe for you, you're a smaller shop, but I don’t know, maybe you do that, you'll have a fact-checker that will read what you wrote. And you don't want to read it yourself, because if you read it yourself, you're not going to catch the mistakes, because they're your own hallucinations.
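
[Editor's note: To make the “grounded hallucination rate” concrete, here is a rough sketch of how such a benchmark could be scored: each response is generated from a fixed set of source facts and then judged for consistency against those facts. The judge_consistency function is a stand-in for an entailment or judge model; it and the example data are illustrative assumptions, not Vectara's actual evaluation code.]

```python
from typing import Callable, List, Tuple

def grounded_hallucination_rate(
    examples: List[Tuple[str, str]],                  # (source_facts, model_response) pairs
    judge_consistency: Callable[[str, str], float],   # hypothetical judge: 0.0 (contradicts) to 1.0 (fully supported)
    threshold: float = 0.5,
) -> float:
    """Fraction of grounded responses judged inconsistent with their source facts."""
    if not examples:
        return 0.0
    hallucinated = sum(
        1 for facts, response in examples
        if judge_consistency(facts, response) < threshold  # below threshold counts as a hallucination
    )
    return hallucinated / len(examples)

# Toy usage with a naive stand-in judge (a real leaderboard would use a trained model):
naive_judge = lambda facts, response: 1.0 if response in facts else 0.0
rate = grounded_hallucination_rate(
    [("The plant runs 12 bottling lines.", "The plant runs 12 bottling lines."),
     ("The plant runs 12 bottling lines.", "The plant runs 20 bottling lines.")],
    naive_judge,
)
print(f"Grounded hallucination rate: {rate:.0%}")  # 50%
```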

RD Believe me. 

AA Exactly. So you want to have an independent fact-checker that goes through what was written, and that's exactly what we do at Vectara. At Vectara, we plug into any large language model, by the way. We have our own, called Mockingbird, which is a model that we fine-tuned specifically to minimize hallucinations, and we can get back to that later, but we can plug into any model– Gemini, OpenAI, Anthropic, DeepSeek, Meta, you name it. We can plug into any generative model. What we are about is the scaffolding around that model: first, on the input side, we extract and retrieve the most relevant needles in the haystack for the task or question, the prompt that you're issuing, and then, on the output side, we correlate the response back with those needles in the haystack to make sure the response did not deviate. Then we issue what's called a ‘factual consistency score’ on a scale of 0 to 100%, where 100% means perfect, it did not deviate at all, not a single word was outside of the words of the facts, versus 0% meaning it added a lot of things, and we do this on a per-sentence basis. So that’s the main change. 
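
[Editor's note: Here is a minimal sketch of the scaffolding described above: retrieve the relevant passages, generate an answer grounded on them, then score each output sentence against those passages. The retrieve, generate, and consistency_model callables are placeholders for whatever vector store, LLM, and judge model you use; this illustrates the idea, not Vectara's API.]

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VerifiedAnswer:
    text: str
    sentence_scores: List[float]   # one factual-consistency score per sentence
    overall_score: float           # 0.0 (unsupported) to 1.0 (fully grounded)

def answer_with_fact_check(
    question: str,
    retrieve: Callable[[str], List[str]],                   # finds the relevant passages ("needles")
    generate: Callable[[str, List[str]], str],              # LLM call grounded on those passages
    consistency_model: Callable[[List[str], str], float],   # scores one sentence against the passages
) -> VerifiedAnswer:
    passages = retrieve(question)
    answer = generate(question, passages)
    # Naive sentence split; a production system would use a proper segmenter.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    scores = [consistency_model(passages, s) for s in sentences]
    # One possible aggregation: a single unsupported sentence drags the whole answer down.
    overall = min(scores) if scores else 0.0
    return VerifiedAnswer(answer, scores, overall)
```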

RD So it’s interesting to have RAG plus fact-checking. Weirdly, fact-checking has been controversial these days, and I imagine with AI it's even more so. There's a trust gap. How do you fact-check with an AI?

AA Yes, that's an excellent question. Now there are two forms of fact-checking. There is open-ended internet fact-checking, meaning the internet knowledge at large, as in, what did Trump say today? That's a very open-ended question, and it's open to interpretation in many cases, and that's where fact-checking becomes ambiguous. We're not building our systems for that use case. That's what Perplexity is trying to do, or Google now with the new AI mode they just launched. It's amazing, by the way. The speed of AI mode at Google is mind-boggling; you have to try it. So what they're trying to do is fact-checking across the internet at large. That is a really hard, unsolved problem, to be honest. The problem that we're after is different: we're going after a limited-domain dataset. It's not the open internet, it’s the data that you have in your organization. Either you're building an app with that data, or you're building something internal to your company with that data.

So I'll give you a couple of examples. The internal example first: this is a manufacturing customer that we have. The manufacturing customer has hundreds of thousands of workers in their factories. Their factories make bottles– bottles of water, bottles of drinks, any kind of bottle– and sometimes the machines in the factory will fail. The workers are not technicians. They don't know how to fix the machines, so they have to call up the technicians. The technicians can take two days to show up, which is downtime. So that's the problem, the headache that they had. So what did they do with us? They took the manuals of the machines as is, they took the troubleshooting and maintenance tickets of how the technicians repaired the machines in the past, and they loaded all of that, with images, with diagrams, with text, into our system, and now the workers in the factory have an app just like ChatGPT, where they take a picture, describe the problem, and they get back the perfect fix for that machine, if the AI knows how to fix it. If it thinks it's going to hallucinate, it says, “I don’t know how to fix this. Go call the technician.” And that’s how we balance these two things. So the benefit of that is we took the average worker from being a worker to being a technician. This is really what we have done here. And the benefit for the business is threefold: less downtime, because now they can fix the machines faster; they don't have to pay the high cost of technicians coming over; and the workers get upskilled, because as they fix the machines, they gain the skills, they learn how to fix things themselves. So that's an example of an internal use case. That's inside of your company, using it to make your workers in the factory better, your HR team better, the knowledge shared between your engineers better, et cetera, et cetera, research, and so on. An external use case is a company called SonoSim. They're based here in the US, and what they do is they help radiologists use ultrasound machines. So when you're using an ultrasound machine, you have to calibrate it first. Calibrating an ultrasound machine is an art. It's a science, actually, not an art, but it's really hard. It's really hard because you have to take into account the model and the manufacturer of the machine; the age, race, gender, demographics of the patient; and are you scanning for pregnancy, the heart, the lungs? What's the modality that you're after in terms of what you're diagnosing? 
So the expert radiologists, they know that stuff solid. They go in, bim, bim, bim, machine is configured. The average radiologist, they just pull their hair out. They don’t know what to do. So same thing, SonoSim, what they did is they built an app for radiologists. They took all of the manuals of these machines, all of the best practices from the experts, from the super, super ninjas that know how to configure these machines, loaded all of that up, and again, now they sell an app to their customers that allows the average radiologist to be an expert radiologist overnight. They just plug the app in and they're up and running with that. So these are the examples. 

RD It's interesting that AI and machine learning are very good at that sort of gathering of expertise. 

AA Exactly. Yes. 

RD The expert radiologist has seen everything. But my understanding of most RAG systems is that they basically show their work, their sources. 

AA Yes. 

RD In the ultrasound one, does it show sources? 

AA Yes, absolutely. You're absolutely right. Showing the sources of the data is essential because, even though we're doing the fact-checking, we're fact-checking against the sources. What if the sources are wrong? So you always want to show where it came from, so once you find the mistake, you can fix it. And that's exactly what we do. In our outputs, we link every sentence back to its sources: for this conclusion, this is which paper, which document, which paragraph, which sentence we depended on to make that reasoning. And one of the funny examples I always love to share that stresses the importance of this was actually from Google last year. Google launched AI within Search, and somebody was asking it, “I'm trying to cook a pizza, and every time I cook my pizza, the cheese falls off the pizza, off the bread, off the crust. How can I make it stick?” And the AI replied back and said, “Put some super glue between the cheese and the bread.” We all initially thought that was a hallucination. It was not a hallucination, it was a data lineage issue. What happened there is that somebody on Reddit had posted that question, and then an evil human being, in a very sarcastic way, replied back saying to put super glue. But the thing is, all the other humans found that answer so funny that they gave it thumbs up, so it got like 30,000 thumbs up. So the AI saw that and said, “Okay, that must be the right answer.” 
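
[Editor's note: The per-sentence lineage described above can be sketched roughly as follows: for every sentence in the answer, record which retrieved chunk it leans on most, so a bad source (like a sarcastic Reddit reply) can be traced and fixed. The word-overlap similarity here is a deliberately naive stand-in; a real system would reuse its embedding or judge model.]

```python
from typing import Dict, List

def cite_sentences(answer: str, chunks: Dict[str, str]) -> List[dict]:
    """Attach the best-matching source chunk to every sentence of the answer.

    Word overlap is a deliberately naive similarity measure; a real system would
    reuse the embeddings or judge model it already uses for retrieval.
    """
    if not chunks:
        return []
    citations = []
    for sentence in (s.strip() for s in answer.split(".") if s.strip()):
        words = set(sentence.lower().split())
        # Pick the chunk (id -> text) sharing the most words with this sentence.
        best_id = max(chunks, key=lambda cid: len(words & set(chunks[cid].lower().split())))
        citations.append({"sentence": sentence, "source": best_id})
    return citations
```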

RD I've read plenty of examples where AIs do not understand sarcasm on Reddit. 

AA They're getting better now, but you're right. Human sarcasm is very evil. 

RD How many rocks is it okay to eat a day? So for fact-checking the sources, it all comes down to showing where they got it from, a human, right?

AA Yes, the provenance and lineage.

RD The attribution. 

AA Yes, exactly. 

RD And finding that human at the end of it is still important for AI. Do you think there will come a time when we'll trust AI enough to not need the attribution? 

AA Yes. That time will come. It's a matter of when, not if. The problem we all have is that our data is a mess. This is really the problem. One of the taglines I love, which is actually from Informatica, a very traditional ETL company, is “Everybody is ready for AI, except your data,” because data is a mess everywhere. And by definition, garbage in, garbage out. If you load in crappy data, like “just put super glue,” then you're going to get a super glue response. That said, as we start using these systems and we start to have proper controls like we provide from our platform, we are fixing all of the source data over time. So over time, the data will get to a very, very clean, pristine state. Plus, we're adding a lot of capabilities in the core platform itself to help re-sort and reorder the data on the fly, such that the most important needles in the haystack are higher up versus lower down, and then we tell the LLM to give more preference to the higher-up needles.

So how are we going to clean up our data? Longer term, we'll make sure all the data is in a clean form, but that will take a very long time because data really is a mess everywhere. All companies have data with many duplicates, wrong versions, blah, blah, blah. It's a very bad state that we're in collectively as an industry. One mechanism that we have as an intermediate solution is ranking the results when we retrieve them. Vector databases are very good at finding information, but not necessarily at ranking the information. So after we retrieve the results, we have very special re-rankers that re-rank the data in the right order. And customers using our APIs can define extra rankers on top; we call them ‘chain re-rankers.’ They can give us UDFs, user-defined functions, to change the ranking formulas and say, for example, “I trust Ryan a lot more than I trust Chris,” so then the Ryan results will be higher up. 

RD Take that, Chris! 

AA Or if I'm in HR compliance, my HR compliance document from last month should supersede my HR document from last year. Or if I'm in eCommerce, a review from a customer that received 30,000 thumbs up should be ranked higher than a review that only received two thumbs up, unless it's a sarcastic review. So you want to be able to define all of these things, and that will help you, at least during the retrieval stage, to do some cleanup dynamically as data is coming in, and that helps a lot. Does that make sense? That's really the impediment in front of everybody: how do we get our data into a clean form?
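
[Editor's note: A rough sketch of the “chain re-ranker” idea: the caller layers user-defined scoring rules on top of the base relevance score from the vector search. The field names (author, year, thumbs_up) and the multiplicative scoring are illustrative assumptions, not Vectara's actual API.]

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List

@dataclass
class Result:
    text: str
    base_score: float                        # relevance score from the vector search
    metadata: Dict[str, object] = field(default_factory=dict)

# A re-ranking rule maps a result to a score multiplier (>1 boosts, <1 demotes).
Rule = Callable[[Result], float]

def chain_rerank(results: List[Result], rules: List[Rule]) -> List[Result]:
    """Apply each rule in turn, then sort by the adjusted score, highest first."""
    def adjusted(r: Result) -> float:
        score = r.base_score
        for rule in rules:
            score *= rule(r)
        return score
    return sorted(results, key=adjusted, reverse=True)

# Illustrative rules mirroring the examples in the conversation:
trust_ryan = lambda r: 1.5 if r.metadata.get("author") == "Ryan" else 1.0
prefer_recent = lambda r: 1.2 if r.metadata.get("year") == datetime.now().year else 1.0
boost_thumbs = lambda r: 1.0 + min(int(r.metadata.get("thumbs_up", 0)), 1000) / 1000.0
```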

RD I mean, that's something we are very much concerned about, the data behind the AI. Obviously Stack Overflow provides a lot of the data for the AIs. 

AA And I'm sure you have lots of clean data, but you might also have some bad data as well, where somebody uploads something that is nefarious, or maybe has a back door that they want somebody to copy into their code. 

RD We have moderators that are pretty good at that. 

AA Exactly. If everybody had active moderation for every single piece of content being generated in their enterprise, we wouldn't have this problem, but unfortunately only you guys have that. 

RD That's right. That's the future. That's the work we're all going to be doing: moderation. I want to go back to something you said. You fine-tuned a model to remove hallucinations?

AA Yes. 

RD How? 

AA Yes. So that model is called Mockingbird. But first, before I get to that model, which is the generative model, there are four core models that make up any RAG pipeline– or any decent RAG pipeline. There is the retrieval model, the embedding model– we call it the embedding model– that generates the vectors and allows you to retrieve the needles in the haystack. Our model is called Boomerang, and we built that from scratch. Then we have another model that ranks the needles in the haystack. Vector databases are very good at recall, very bad at precision. So after they return their embeddings, we have to sort them, so we have another model that does the ranking. It's called Slingshot. It's based on BERT, actually, which is a technique from Google. Then after that comes the generative model– we'll get back to that in a second– and at the end you have the hallucination detection model that can detect which parts of the response stuck to the facts and which parts deviated from the facts. So those are the three models that we built from scratch at Vectara.

For the generative model, we initially were thinking we were going to build one from scratch, but first, it's very expensive to build one from scratch, as you know; it costs at least $50 million to build one. And there are now so many good open source ones out there. So we took a model called Qwen from Alibaba, Ali Cloud. It's, in my opinion, the best open source model there is right now, even better than DeepSeek. DeepSeek was very good with marketing, by the way, but Qwen is really, really good. And then we fine-tuned it specifically for the task of answering a question as a function of a given set of facts without hallucinating. What I mean by that is, most of these models, when they're built, are built primarily with the consumer in mind. They're incentivized to be a know-it-all. They're incentivized to answer as many questions as they can and to rarely say, “I don't know,” because that makes the model look bad. It's very similar to a high school multiple-choice exam where there's no penalty for picking the wrong answer: you're going to pick random letters for the questions you don't know, just to maximize your grade. That's exactly how the cost function of these models was built. So we take that model and retrain it, fine-tune it, with another cost function that penalizes wrong answers. When it gets the answer right, we give it a cookie, and when it gets the answer wrong, we slap it on its hand. That makes it avoid answering when it doesn't know: “I'm afraid of being slapped on my hand right now, so I'm not going to give this answer.” That’s exactly how we built Mockingbird, and that further reduces the hallucinations. So when you couple that, meaning a model that does not hallucinate as much, with excellent retrieval to find the needles in the haystack, and then excellent sorting to put them in the proper order, you'll get responses with a very, very low hallucination rate, around 0.8%. And then you couple that with hallucination detection, and you can catch that remaining 0.8%. So now you're left with very good answers all the time, and for some of the questions we’ll say, “Sorry, you need to go figure out the answer yourself.”
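
[Editor's note: A toy version of the cost function described above: reward correct answers, penalize wrong ones, and leave a safe, mildly positive option for “I don't know.” The specific reward values and the abstention phrase are made up for illustration; Mockingbird's real fine-tuning objective isn't spelled out in this conversation.]

```python
def abstention_aware_reward(prediction: str, gold_answer: str) -> float:
    """Toy reward: a cookie for a correct answer, a slap for a wrong one,
    and a small, safe reward for admitting it does not know."""
    if prediction.strip().lower() in {"i don't know", "i don’t know"}:
        return 0.1    # abstaining is mildly rewarded, never punished
    if prediction.strip().lower() == gold_answer.strip().lower():
        return 1.0    # correct answer: full reward ("give it a cookie")
    return -1.0       # confident wrong answer: penalized ("slap on the hand")

# Under this objective, guessing has negative expected value whenever the model's
# confidence is low, so the optimal policy learns to say "I don't know" instead.
```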

RD So do you have zero hallucination? 

AA It's not zero hallucination. What we're doing is we're suppressing the hallucinations. The hallucinations are happening, it's just that we're catching them. What we're trying to do as an industry, if you look back at these two examples that I gave, of the workers in the factory and the radiologists using an ultrasound machine, they are this. So what is this? The movie The Matrix. Do you like The Matrix? 

RD Sure. 

AA It's my favorite movie of all time– the first one. The Matrix 2 and 3 and 4 is like, “What the heck is that?” But 1 was amazing, truly amazing and legendary for its time it came out and everything. So there's a very key scene in that movie, and a key sentence, actually a top quote from the first movie where Keanu Reeves– Neo– he plugs into the AI and he spends eight hours plugged into the AI and then he comes out of these eight hours and says–

RD “I know Kung Fu.”

AA “I know Kung Fu.” Exactly. And before he went in, he didn't know Kung Fu. That's the most popular line from the movie, and that's exactly what we're all trying to do. What we're trying to do is take the average person who's not good at Kung Fu, fixing a machine, setting up an ultrasound machine, creating a legal draft, and make them do that at the expert level or very close to the expert level. This is really what we're doing here.

RD Give them enough information to really perform, even if it's sort of a Chinese room where they're not exactly able to perform at the expert level themselves, but they can with the instructions. 

AA Exactly. And what's different about this wave we're in right now compared to the previous waves is, one of the examples I always like to give is this: sometimes I give these talks and people will tell me, “Oh, we would never listen to AI. We'd never let AI tell us how to do our jobs. Oh, we’ll ignore it.” And I understand, I empathize with that fear of AI because of Terminator and The Matrix and all these stupid movies that make us afraid of it, but then I ask them this question. I say, “How many of you used Google Maps at least once in the last month?” Everybody raises their hand, everybody. Literally every hand goes up. I'm like, “Okay. Who do you think at Google is sitting down and creating the path for you that is optimized for minimizing time, taking into account construction, maintenance of roads, traffic conditions, weather conditions, and gets you that perfect route?” Do you think it's a human being sitting at Google doing that? No, it's the AI that's doing that. And when Google Maps came out, it took us from being average. All of us were horrible navigators.

RD The folding and unfolding map. 

AA And having quarrels with our significant other in the car and asking people in the streets for directions. So thanks to Google Maps for saving humanity. But essentially, there were some of us back then who were experts, like the cab drivers in London. As part of their certification, they actually had to remember 80,000 street names and shortcuts by name, otherwise they didn't pass the certification exam. And overnight we became as good as them. Literally as good as them. We can go from the east coast to the west coast without looking at a map once, and we can do it today. 

RD I mean, we're outsourcing expertise. 

AA Exactly, but that is coming back to us. It's imbuing us with “I know Kung Fu” skills. I did not know how to navigate, and now I know how to navigate. I did not know how to fix a machine in a factory, and now I know how to fix the machine in the factory. So that's exactly where we're all going. Now, the difference between the previous wave and this wave is that the Google Maps wave was purpose-built AI, just for that task. The genius of this new wave, the reason why we're all super flabbergasted by this new wave we're in right now, is that the transformer model behind LLMs can do anything, literally anything– move robots, move their hands around, jump around and do somersaults, fix a machine, diagnose a patient, write a legal contract, create new types of materials, create a song, create an image.

RD So on the “I know Kung Fu,” one of the things I think I'm afraid of for the future is that we'll know Kung Fu, but we won't understand it. There was a short blog post talking about how new software engineers who are using LLMs don't really understand the code they're pushing. 

AA You are right. That is a concern, but my answer to that is that during education we have to be careful. We have to educate people on the fundamentals first, before we release and let them use this technology, in the same way that when we are learning arithmetic in first, second, and third grade, we don't use calculators. We're not allowed to use calculators until we understand the first principles, and then once we have understood them, okay, now we use the calculators to do much harder problems. And that's how it should be with software engineers as well. They should first understand how to code really well and, more importantly, how to architect, which is the harder task. But our job in the future is not going to be how to code. Our job in the future is going to be how to architect and how to prompt correctly to get what we want. And I see that as a natural progression. In the same way that when computers came out, and I saw that briefly, I don't want to age myself, we had to program them with punch cards. Punch cards, for God’s sake. You had to hold a card and a pen and punch holes for every bit. 

RD I think my dad has a few of those at home. 

AA Oh, really? You should keep them a few more years; you'll sell them for a lot of money on eBay. And then we migrated from punch cards to Assembly language, which was very, very hard to program in. Then we moved on from Assembly language to C and C++, and then we moved on to Python, which is amazing. And now we’re going to be able to program using the English language. But we still have to architect properly. If you look at what's happening to us, we're pushing the mundane down to the machine and saving ourselves for the more complex work. Or we're managing the machine, I can look at it that way. We're managing the machine to do our job. I'm with you 100%. 

RD Higher level of abstraction, right?

AA Yes. But we have to be careful about learning. When we learn computer engineering and computer science at school, they still teach us Assembly language first so we understand the fundamentals of how the first-level, second-level, and third-level caches work. Because if you don't have those fundamentals, you're not going to be able to architect well and create a high-performance artifact later on. So I agree with you that we should be very careful to do the education first, before we let people use the technology, at the right level. Otherwise it can backfire in a very bad way. What’s the movie– Idiocracy? That movie is hilarious and makes exactly this point. If you continue to depend on the technology to do everything and you don't teach yourself properly, then we’ll essentially become idiots at the end of the day. 

RD That's right. What are you most excited about for the future, whether for retrieval-augmented generation or AI in general?

AA So for AI in general, it's this “I know Kung Fu.” I really think the “I know Kung Fu” thing is going to unlock so much human potential in a way that we have not seen before, because there are so many domains where people would love to be more creative and contribute, but can't, because they don't have the fundamental training to be able to do that. For example, 20 years ago, you couldn't have done what you're doing with me right now, making a podcast. You couldn't have done that. Only CNN and NBC, with big budgets and massive studios, could have done that. Right now you can. Okay. Same thing today. You cannot create a movie like Avengers or The Matrix. In five years you will be able to. Today you cannot create an amazing game like Cyberpunk. In a few years, you'll be able to, just by yourself. So that unlocks a new level of creativity and participation from any human, but with the caveat that it's only for the humans who are willing to take advantage of it. So my advice to people is always: embrace this wave, learn these tools. In the same way you learned how to do podcasting and it's working for you, or somebody like MrBeast on YouTube learned how to do it and now he's a multi-billionaire because of that, catch that wave, catch that wave, catch that wave. This wave is going to enable you to do new things in new domains that you never thought possible before. So that's what excites me about the AI movement overall.

For my business and my company, building AI agents and AI assistants that are reliable, accurate, secure, and extendable is at our core. Last year, we spent a lot of time building these amazing systems, but last year most of the customers we engaged with– we focused on the large enterprise– were just doing trials. They were trying things, a small POC here, a small experiment there. They were not serious, serious. And then towards the end of the year, October, things just flipped. They're now starting to see other people doing things, and they need to be ahead of the curve now. So what excites me about this year is that we have now finally moved along the adoption curve. Do you know the technology adoption curve from marketing? We are now crossing the chasm. I can tell you that for sure. So we are crossing the chasm, meaning the gap between the very, very early innovators and adopters, who were willing to try anything, and the early pragmatists, who are willing to try out something for real in their business. But that's still the beginning. We're still at the beginning of the wave, but once you see that happening, you know you are in a good wave that will keep going. So that's what excites me about what we're doing.

[music plays]

RD All right, everyone. It's that time of the show. I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you liked what you heard, or if you didn't like what you heard, email us at podcast@stackoverflow.com. And if you want to reach out to me directly, you can find me on LinkedIn. 

AA And I'm Amr Awadallah. I'm the CEO and founder of Vectara. We build AI agents and assistants that are reliable, accurate, secure, and extendable. You can find me on any social media by my last name– Awadallah– and you can also find Vectara on most social media with the spelling ‘Vectara.’ 

RD All right, I'm going to look you up on Friendster. Thanks for listening everybody. We’ll talk to you next time.

[outro music plays]