The Stack Overflow Podcast

How AI can help your business, without the hallucinations

Episode Summary

Large language models are all the rage. But they have a bad habit of injecting factual errors, known as hallucinations, into their responses. We sit down with Sascha Heyer, Senior ML Specialist at DoiT, to learn how organizations can leverage the power of GenAI while avoiding the downsides.

Episode Notes

DoiT’s sales pitch is simple: they provide technology and expertise to clients who want to use the cloud, free of charge, with the big cloud providers paying the bills.

You can check out Sascha’s writing on machine learning on his Medium blog.  

Connect with him on LinkedIn or subscribe to his YouTube channel.


Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content, joined as I often am by my colleague and collaborator, Ryan Donovan. Hey, Ryan. 

Ryan Donovan Hey, Ben. How are you doing today? 

BP I'm pretty good. So we have a sponsored episode today from the fine folks at DoiT, and we're going to be talking about one of my favorite topics: hallucinations. Not the kind of hallucinations I had in college, the kind of hallucinations that are top of mind these days for folks who are working in technology, and specifically trying to take advantage of some of the cutting edge work that's happening in AI. So I would love to first off, welcome our guest to the program, Sascha Heyer. Sascha, welcome to the Stack Overflow Podcast. 

Sascha Heyer Hi, Ben. Hi, Ryan. Pleasure to be on the podcast today. 

BP So for folks who are listening, just tell them a little bit about yourself. How did you get into the world of software and technology? 

SH I started my career working for the agency part of IBM quite a long time ago. And approximately six years ago, my wife and I decided to make a bold move: we quit our jobs and moved to Berlin. And the decision was not just a geographic relocation, it’s where I made the decision to fully go into machine learning. And I remember around that time TensorFlow was making its first waves in the industry, with version one of TensorFlow releasing approximately six years ago. And since then I've worked with over 240 companies solving the most challenging and sometimes also boring machine learning topics. And for the past three years I've been working at DoiT. 

BP So for folks who are listening, explain DoiT, because it has an interesting sort of business model. 

SH DoiT is a technology and engineering organization. We help companies on the cloud with all topics around infrastructure, data, and machine learning. The company itself is over a decade old already, so quite some time. We have over 3,000 customers, and I see us as an extension for our customers. We are an extension for their teams, and we offer training, we offer support, and we offer all-around technical guidance on the cloud, both at a high level but also with really deep dives into the topics. And in addition to our people and that great engineering expertise, we also have a DoiT console where you get cloud analytics for all the major clouds and anomaly detection. That's the benefit the customers get. 

BP And the interesting part that you explained was that the customer doesn't pay you. The cloud providers pay you to help the customer kind of get more out of their cloud? 

SH Exactly. We like to say we do that at no cost for our customers, which is true. Our customers don't have to pay anything. And we have a really short three-page contract and you can leave us at any time. So it's really a no-brainer to join us.

RD So the hot topic in machine learning and AI today is the large language models, and obviously we’re going to talk about the hallucinations. So how do the language models work, in brief, and why do hallucinations happen? 

SH Yeah, so hallucinations are when the model makes up stuff that either doesn't make sense or doesn't match the information it was given. In such cases the answers sound plausible, but they're simply incorrect. And I don't like the term ‘hallucinations’ because I think we can put it in a more straightforward way and say that the model is just wrong. That's how it is. There is no such thing as hallucinations for large language models; they are simply wrong. Before we can discuss how hallucinations appear, I think we need to differentiate between two different types of hallucinations. The first is actually made-up answers, so answering with really wrong information. And the second is answering based on outdated data. The outdated data one is easy to fix: you can just train on more up-to-date data or integrate more up-to-date data. But the made-up answers are where the issue really starts. 

BP So my sort of impression of how the large language models work is they've been trained to predict the next word in a sentence, the next sentence in a paragraph. They predict the next token and they get better and better at that. And over time, as they've worked with bigger and bigger datasets of language and been fine-tuned, they come to feel like they have some aptitude for reasoning and that they're sort of able to use language the way a person would to work through ideas and give you interesting answers. But to your point, at a certain level, all they're doing is sort of statistically predicting what will come next. So if you say, “Hey, write me a lawyer's brief about the first time we landed on Mars and the seven laws that were established,” it will write that for you. It's not going to say that never happened. What are some ways that you could mitigate that? One, as you said, is to obviously keep the context that it's working from up-to-date. The other would be to give it some rules through prompts that sort of instruct it: “If you don't see the answer within the context I've given you, let me know.” But those sort of rely on two different things. There's the inference it's making off the data, and then there's the structure of the prompting. So maybe let's start with the inference part. Explain to folks a little bit from your perspective how it works and how you might set a model up for success.

SH Yeah, as I said, those large language models usually don't say, “I don't know,” when they're unsure. They give you the most likely answer. As I said, it's because they're operating on tokens and influenced by the sequence of tokens that came before. They give you the next most probable tokens, and that's the crux of those large language models, because they're purely relying on patterns. And there's one easy way to solve this at inference time. It's called ‘document retrieval’. And as the name suggests, it involves retrieving relevant documents or information from a database before the model actually generates the response. This strategy helps in what's called ‘grounding the model's response’ in actual factual information, and this can significantly reduce or even completely avoid hallucinations. 

RD So how do you get that into a large language model? I mean, the large language models have just been like Ben said, statistically predicting the next word, sentence, et cetera. How do you get them to pull a full document instead? 

SH So there are two ways. As I said, the first one is document retrieval, and the second one is fine-tuning, but I would like to focus on document retrieval. When the large language model receives a prompt, we take the document retrieval system and search the database for only the relevant documents which are useful for answering our question. We get those documents back from a vector database and provide our model with this context. So the content of the documents is the large language model's input. And combined with intelligent, good prompt engineering, we can instruct the model to only use the documents we provided, so it's not relying on the information it learned. Instead we say, “Only use the documents which we give to you, and if you don't have the answer, then please tell us you don't have the answer.” So we combine actual factual information with prompt engineering. 
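The flow Sascha describes — retrieve relevant documents, then constrain the model to them through the prompt — can be sketched roughly like this. The prompt wording and the idea of passing retrieved documents in as plain strings are illustrative assumptions, not any vendor's official template:

```python
# Sketch of assembling a "grounded" prompt from retrieved documents.
# Assumption: the retrieval step has already returned documents as strings.

def build_grounded_prompt(question, documents):
    """Assemble a prompt that restricts the model to the given documents."""
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using ONLY the documents below. "
        "If the answer is not in the documents, reply exactly: "
        '"I don\'t have the answer."\n\n'
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = ["DoiT supports customers on AWS, Google Cloud, and Azure."]
prompt = build_grounded_prompt("Which clouds does DoiT support?", docs)
print(prompt)
```

The resulting string is what actually gets sent to the model's completion endpoint, so the "only use these documents" instruction and the factual context travel together in a single request.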

RD Are you embedding the semantic vectors in those documents as a whole, or is it some other process?

SH Exactly. You're taking the documents, for example, some knowledge base from your company. You create an embedding out of it. Then you take the same embedding model and you also create an embedding of your question. And because it's the same embedding model, the questions which correlate to your documents are close in this multidimensional embedding space, and that's how we get the relevant documents for our question back. 

BP So Sascha, just for folks who are listening who may not be quite as conversant, what's really interesting and what we're seeing a lot of is that folks are moving from this sort of keyword lexical search to semantic search in a vector database. And as you point out, if you did that with some of the documents in the knowledge base of a company, you can put those things in sort of approximate space. But can you explain to folks a little bit about how a vector database works and how you might set up an embedding through, for example, DoiT and a cloud vendor?

SH Yeah, so you take a text and you take the embedding model and you get a great representation of what the model learned based on what it was trained on. So you get a number, a vector representation of the information, represented in this large embedding space. And it's stored in a vector database, and you get some nice algorithms to properly get the data out of it. So you're taking two embeddings– the document embedding and the question embedding– and do some similarity calculation on top of that. It could be cosine similarity or dot-product similarity; it depends a little bit on the model you're using. 
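The two similarity measures Sascha mentions, cosine similarity and dot-product similarity, are each only a few lines in plain Python. The vectors below are toy values standing in for real embedding-model output:

```python
import math

def dot(a, b):
    """Dot-product similarity of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by the vectors' lengths, so scale doesn't matter."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

question = [0.9, 0.1, 0.0]  # embedding of the question (toy values)
doc_a = [1.0, 0.0, 0.0]     # embedding of a relevant document
doc_b = [0.0, 0.2, 1.0]     # embedding of an unrelated document

# The relevant document scores higher on both measures.
assert cosine_similarity(question, doc_a) > cosine_similarity(question, doc_b)
assert dot(question, doc_a) > dot(question, doc_b)
```

Which measure a system uses generally depends on how the embedding model was trained; some models produce normalized vectors, in which case the two measures rank documents identically.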

BP And you are from the camp that feels that these models are working purely statistically. I'm a big fan of the Microsoft paper “Sparks of AGI,” and I would say Ryan is more in your camp, but that somehow in the process of understanding the world through bigger and bigger sets of language and more and more fine tuning, they've come away with the ability to have some sort of reasoning, a little bit of theory of mind, a few other little tricks that make them seem more human that gives them sort of a psychological aspect. Now, when you say through clever prompt engineering you can say, “Hey, only pay attention to this, and if you don't know the answer tell us that,” how does a model know if it doesn't know the answer, for example? That's not something where it's predicting the next token. It has to reason about whether or not it has the proper knowledge to provide an answer to you.

SH I also read the paper and it's quite interesting, and it also showcases some of the issues where the authors still think we are not there yet. Because even if a research paper highlights this, we are still not fully there yet. And I agree with you, it feels like magic. If you instruct something and you get the right answers back, or sometimes the wrong answers, but in the end, it's all token-based. And if you already provide the context, it's all based on probability, so if the probability is too low, there are no good next tokens. That's how it works. And you can also explicitly instruct the model to answer, “I don't know.” And this is also just the next token. 
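Sascha's "it's all token-based" point — the model always emits the most probable continuation given the context, and an "I don't know" response is itself just a probable token sequence — can be illustrated with a toy next-token picker. The probability tables here are invented; a real model computes them from the whole preceding sequence:

```python
# Toy illustration: a language model reduced to a next-token probability lookup.
# All probabilities are made up for demonstration purposes.

toy_model = {
    ("the", "capital", "of", "france", "is"):
        {"paris": 0.92, "lyon": 0.03, "unknown": 0.05},
    ("the", "capital", "of", "mars", "is"):
        {"unknown": 0.60, "olympus": 0.25, "paris": 0.15},
}

def next_token(context):
    """Greedy decoding: return the single most probable next token."""
    probs = toy_model[context]
    return max(probs, key=probs.get)

print(next_token(("the", "capital", "of", "france", "is")))  # → paris
print(next_token(("the", "capital", "of", "mars", "is")))    # → unknown
```

The second case is the mechanism behind "tell us you don't know": if the context makes a refusal the most probable continuation, that is simply what gets emitted, with no separate reasoning step involved.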

RD So when it's hallucinating, it's saying, “Here's the best sort of guess at what a good token is.” Is that correct? 

SH Yes.

BP And you can give it sort of an instruction that says, “If you don't feel that the probability of guessing the next token is above 95%, then just tell us and we'll walk away from this answer.” 

SH Yeah, you can try different ways of telling the model how it should operate. You can tell it to work on probabilities. You can also take it easy and just say, “If you are unsure of your answer based on the context I provided you, reply that you don't have the answer.” It might work well, but there's no guarantee. Prompt engineering is not hard science. Some people like it and some people hate it. I just call it powerful. 

BP See, if the model feels unsure, then obviously it's just a stochastic parrot, right?

RD Ah, now we’ve got to get therapists for models. 

BP Okay, okay. I'll stop. 

RD So when you're adding context, can you add context from multiple documents at once? 

SH Yeah, so there is no limitation on the number of documents. Of course, there is a limitation in your token limits. All the large language models have a token limit, and that could be 8,000 tokens, for example. So you need to stay inside of these token limits, but because we are retrieving only the relevant documents, this context usually fits. And if not, you can still do a token count and then you need to do some filtering on the documents. Maybe the documents with the lowest likelihood of being a good document with a similar context you just put away. 
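The trimming step Sascha describes — count tokens, then drop the lowest-scoring documents first — might look like this. The 4-characters-per-token heuristic and the token limit are assumptions for illustration; a production system would count with the model's own tokenizer:

```python
def trim_to_budget(scored_docs, token_limit, chars_per_token=4):
    """Keep the highest-similarity documents that fit inside the token limit.

    scored_docs: list of (similarity_score, text) tuples.
    chars_per_token is a rough heuristic, not an exact token count.
    """
    kept, used = [], 0
    # Consider the best-scoring documents first, dropping the weakest matches.
    for score, text in sorted(scored_docs, key=lambda d: d[0], reverse=True):
        cost = len(text) // chars_per_token + 1
        if used + cost <= token_limit:
            kept.append(text)
            used += cost
    return kept

docs = [(0.9, "highly relevant " * 50), (0.2, "barely related " * 50)]
# With a tight budget, only the most relevant document survives.
kept = trim_to_budget(docs, token_limit=250)
print(len(kept))
```

A greedy cut like this keeps the prompt inside the limit while sacrificing the documents least likely to help, which matches the "put away the less likely documents" approach described above.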

BP So let's walk through some of the practical aspects of what a tech stack might look like here. For a client who's working with DoiT who's on one of the big three cloud providers from Google or Microsoft or Amazon, are they using tools from those folks to set all this up? Do they have to go get a separate vector database, like a Pinecone? Do they need to work with tools like LangChain to set up the prompting? Do they need to go on to Hugging Face and find some open source tooling? On a more practical level, how would you go about accomplishing the system we just discussed?

SH Let us take Google Cloud as an example. A lot of companies are now moving from OpenAI to Google Cloud because of data and privacy decisions. And Google has now released their Bard and also PaLM models– PaLM is just a more business-focused API for accessing large language models. And you can already use that on Google Cloud. So you have the API where you can send your prompt and you get an answer back. This is the first part you need: your large language model. With Google it's like OpenAI; they basically have the same kind of API endpoints. That's the first part we need. 

BP Got it. So there's an API endpoint to a well-trained large language model that could be GPT-4 or it could be one of Google's PaLM models as an example.

SH Exactly. And then you need the second part: you need an embedding model. OpenAI provides an embedding model, and Google also provides a PaLM embedding model. That's the second part. We then take our documents and put them into embeddings, and take our question and put it into an embedding. And those embeddings, together with the ID which refers to the document, are stored in a vector database. On Google Cloud there's a product called Matching Engine, but you also mentioned Pinecone. Those are all great products to build out the vector database. 
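Conceptually, what a vector database stores is what Sascha just described: each embedding alongside the ID of the document it came from. Here is a minimal in-memory sketch that ignores the approximate-nearest-neighbor indexing that products like Matching Engine or Pinecone use to make lookups fast at scale:

```python
import math

class TinyVectorStore:
    """Toy vector store: (doc_id, embedding) pairs with brute-force search.

    Real vector databases replace the linear scan with an
    approximate-nearest-neighbor index, but store the same kind of data.
    """

    def __init__(self):
        self.entries = []  # list of (doc_id, embedding) tuples

    def add(self, doc_id, embedding):
        self.entries.append((doc_id, embedding))

    def nearest(self, query):
        """Return the ID of the stored embedding closest to the query."""
        def cos(a, b):
            d = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return d / (na * nb)
        return max(self.entries, key=lambda e: cos(query, e[1]))[0]

store = TinyVectorStore()
store.add("faq-42", [0.9, 0.1])
store.add("faq-99", [0.1, 0.9])
print(store.nearest([1.0, 0.0]))  # returns the ID, not the raw text
```

Returning the ID rather than the text is the point: the retrieval layer hands back references, and the application then fetches the actual documents to build the model's context.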

RD You mentioned some of the security of the endpoints. I know a lot of companies are sort of nervous about sending their important documents over endpoints to these models. How do you deal with those concerns and work in security? 

SH The customers we are supporting are already on the cloud, so they're on AWS, on Google, or on Azure, and they have their data already on the cloud. What they don't like is if it moves outside of the cloud, because they usually already have some kind of privacy and security approval internally for running on Google Cloud. Google Cloud holds a lot of different security certifications, so they feel comfortable with having the data already inside of Google Cloud and they just want to keep it there. If you're capable of using only Google-related products then you are already a great step ahead.

BP And so, like you mentioned before, one of the things that DoiT does for a customer is say, “Let's evaluate how you're using the cloud and see if there's ways we can save you money or ways we can optimize.” What are some of the considerations a client might have here around the cost and the time of this? Is there a cost for every API call you make? Is there cost associated with building out, let's say, a large vector database? What are some of the constraints that, if you wanted to do this within a hundred-person startup, you might want to think about so that you can use it efficiently and it'll add to your ability to be productive, but it won't end up being a big cost center, a big time suck? 

SH I think you’ll need to decide if you want to use an already existing API like the PaLM API or if you want to train your model yourself. This is the very first decision. But if you want to go maybe the open source way and use the Falcon-7B model, it will cost you a lot of money to train it, or sometimes less depending on how you train it, but you need to invest more time. You need to build the model, you need to take care of the training infrastructure. And I would always recommend going the API way, because Google invested a lot into this PaLM model, and why not use it? You can also fine-tune it to your specific needs without taking care of any infrastructure. You just provide your data, click on fine-tune, wait a couple of hours, and you get your model back. So I would always recommend going this way. And the cost for such infrastructure heavily depends on how many documents you have and how many queries you need to serve per second. Also, the PaLM model is billed based on the character input and the character output. So you pay for the length of the characters you have in your prompt and also for the length of the characters you get back. And that's where there's a big difference, because with OpenAI you are billed on the number of tokens, and with Google you only get billed by the number of characters. And this is a great benefit, because English is an easy language for tokenizers, so you get fewer tokens. But if you use a language like Italian or French, you have a lot more tokens. So I find the pricing based on characters more fair.
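The billing difference Sascha describes, characters versus tokens, shows up in a rough back-of-the-envelope comparison. The prices and the tokens-per-character ratios below are made-up placeholders, not real vendor list prices:

```python
# Hypothetical comparison of character-based vs token-based billing.
# All rates and ratios are illustrative assumptions, not actual prices.

PRICE_PER_1K_CHARS = 0.0005   # character-billed model (assumed rate)
PRICE_PER_1K_TOKENS = 0.0020  # token-billed model (assumed rate)

def char_cost(text):
    """Cost under character-based billing: same for every language."""
    return len(text) / 1000 * PRICE_PER_1K_CHARS

def token_cost(text, chars_per_token):
    """Cost under token-based billing: depends on how the language tokenizes."""
    tokens = len(text) / chars_per_token
    return tokens / 1000 * PRICE_PER_1K_TOKENS

prompt = "x" * 4000  # stand-in for a 4,000-character prompt

# English tokenizes efficiently (~4 chars/token is a common rule of thumb);
# many other languages produce more tokens for the same character count.
english_tokens_cost = token_cost(prompt, chars_per_token=4)
italian_tokens_cost = token_cost(prompt, chars_per_token=2.5)

assert italian_tokens_cost > english_tokens_cost
```

The character-billed cost is identical in both cases, which is the fairness argument above: the bill doesn't penalize languages that happen to tokenize into more pieces.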

BP What about in German? I know you love to combine four ideas into a really big word. Where would you say German sits in terms of cost per language? 

SH Yeah, there's actually a research paper where they did a lot of effort in comparing the token lengths of OpenAI for different languages. I'm not sure where German was. I need to look this up again.

BP That's an interesting thought that maybe different languages have different costs and maybe some that are character-based or something. 

RD What's the most expensive language? I was going to ask, is there a particular project or use case that sort of surprised you, or was there one that sort of tested the abilities of what you had built? 

SH Yeah. Just a week ago we had a company that did a hackathon, and we did a workshop with them, a two and a half hour workshop, to provide them a quick start into the hackathon, and then we guided them through the hackathon. We made it open so they could join us and ask us any questions. And for me, the most interesting part was what the teams actually were able to build, and in what kind of different departments: HR, the support cases, the developers themselves, all of them use large language models for very different use cases. In the end, a lot of them decided to go into production and take this to the next step, because it's a big win for the business if you can automate some of these steps. 

BP Yeah, I mean it's really interesting to have you on this podcast. It's kind of serendipitous for us to get the chance to talk with DoiT about this because this is kind of in a way a lot of what Stack Overflow does. We have this big public forum where folks come together, contribute knowledge, try to rate it and sort of in some ways annotate it– is this good knowledge, is this up-to-date knowledge? And then we have this product, Stack Overflow for Teams, which is exactly like you said, how do you get knowledge to move around within an organization between different departments? And it seems like LLMs maybe are providing sort of a new paradigm for how folks might do that, which is really interesting. Your documentation used to be in a wiki here and a Confluence there and a Google Doc here and Stack Overflow for Teams. Now you can feed it all into this one place and just talk to the AI and say, “Do you know this? Where can I find this? Who wrote this part?” stuff like that. 

SH That's what we also do at DoiT internally. We also have Stack Overflow for Teams and we have Confluence, and we are integrating all those different document databases and have one chatbot in Slack where we can ask questions about the documents. You get the source back so you can directly make sure the information is actually correct. It's really useful. You don't have to search anymore, you just get your answers back. It's a completely different way of approaching it. 

BP Right. 

RD Oh, I read somewhere that someone said this is the third UI paradigm. 

SH Yeah, I kind of agree with that. Yeah. 

BP And so that's cool. You've been dogfooding this internally you're saying. You have your own version of this that you've been using as well as helping customers build it.

SH Exactly. 

BP All right, so we're getting towards the end of the podcast. Let's have a little fun. You've been working in machine learning, like you said, for six years. Very smart of you, I would have to say, to transition over to this field. I'm sure you're very much in demand. Where are things headed? What are you excited about? What are the next kind of turns of the wheel for what folks are doing? And if you were talking to a customer who was saying, like you said, “All right, we had a hackathon this week. We worked with DoiT, we figured out this is going to be useful to our business. But now we want to get serious. What's a one, two year roadmap look like that we can invest in that's going to take us in the right direction?” from your perspective as someone who's been in the world of machine learning and also someone who's hands-on with customers, helping them build LLMs into an internal knowledge base, what's coming down the pipe? What should they be thinking about? What are you excited about? 

SH I'm really excited about the latest trend into multimodal large language models where you combine different models, like a large image model with a text model, maybe with audio together. And this is, I think, where it's really interesting because this is also more like how we humans think. We can talk, we can hear, we can speak, we can imagine something visual, and I think that's the big next step and I'm really looking forward to what we are getting out there out of this technology. And the rest, who knows? 

BP Who knows? Well, I have read that when you make a model multimodal, for example, if it's not just text, but now it also is image, it seems to gain a new level of reasoning capability in certain areas as if that additional perspective allows it to have sort of a greater, more human-like, let's call it, form of intelligence. So yeah, I think that is very exciting. It hasn't really been released yet, but OpenAI and Google have both talked about having multimodal models and even that video may be the next frontier. So that could be really exciting to see what it does, as you described it, that sort of magic. Let's say it's just all math and tokens in the background, sure, but the experience we have might add a little bit of additional magic to that experience. 

SH And think about it. Just a couple of months ago OpenAI introduced their models. And there was a time before ChatGPT, and now you can hardly work without ChatGPT. So we will see what's coming next.

[music plays]

BP All right, everybody. It is that time of the show. I want to shout out someone who came on Stack Overflow and helped share a little knowledge with the community. Awarded just two hours ago to Victor Moroz, a Lifeboat Badge for saving a question with a great answer: “How can I check whether a string is an integer in Ruby?” If you've ever wondered, Victor has an answer for you and has helped over 17,000 people. So thanks, Victor, and congrats on your Lifeboat Badge. I am Ben Popper. I'm the Director of Content here at Stack Overflow and an AGI superfan. You can always find me on Twitter @BenPopper. You can always email us with questions or suggestions for the podcast. Just hit us up. And if you like the show, do me a favor, leave us a rating and a review. It really helps. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow– I'm a little bit of an AGI skeptic. You can find me at

BP Boooo!

SH Thanks, Ben. Thanks, Ryan, for having me today. You can always check out my articles on Medium, or if you are new in generative large language models, check it out. And also have a look into our DoiT engineering blog where we also cover a lot of different topics, not only machine learning but everything on cloud and core infrastructure and data as well.

BP Very cool. All right, we'll be sure to link to Sascha's blog as well as the DoiT blog so you can check out some of their work in the show notes. Thanks for listening, everybody, and we will talk to you soon.

[outro music plays]