On this episode: Stack Overflow senior data scientist Michael Geden tells Ryan and Ben how data scientists evaluate large language models (LLMs) and their output. They cover the challenges involved in evaluating LLMs, how LLMs are being used to evaluate other LLMs, the importance of data validation, the need for human raters, and the tradeoffs involved in selecting and fine-tuning LLMs.
Connect with Michael on LinkedIn.
Shoutout to user1083266, who earned a Stellar Question badge with How to store image in SQLite database.
[intro music plays]
Ben Popper Don’t start building your AI app from scratch. Save time and effort by visiting intel.com/edgeAI. Get open source code snippets and sample apps for a head start on development so you can reach seamless deployment faster. Go to intel.com/edgeai.
Ryan Donovan Hello, everybody, and welcome to the Stack Overflow Podcast, a place to talk about all things software and technology.
BP Ryan, thank you for queuing this up. You connected with our guest, so why don't you introduce and start the conversation and then I will join in where I can.
RD Today, I am joined by our Senior Data Scientist for the data platform, Michael Geden. Hi, Michael. Today, we're going to be talking about how to evaluate large language models and their output. Welcome to the show.
Michael Geden Hi, Ryan. Well, thanks for inviting me. Happy to be here.
RD So at the beginning of these episodes, we like to sort of talk to our guests and ask how did you get into technology and how did you get where you are today?
MG Sure. So I had a pretty circuitous path into data science. I started getting interested in stats during my PhD in Psychology and realized that the analytical side of things interested me more, so I ended up pursuing a master's in statistics and really had a great time with it. Over time, I ended up doing a postdoc in an AI and education lab, IntelliMedia, that worked on AI-based solutions for supporting educational outcomes. That's how I progressed, and throughout, I realized I really enjoyed the analytical side of things, but I also kept a bit of the data skepticism that comes with any social science field.
RD Sure. Well, there is a PhD to data science pipeline that I've seen. That's interesting you're part of that. So here at Stack Overflow, you've been looking at how we're using large language models and how we can evaluate the quality of the output. So how exactly do you evaluate LLM output?
MG So evaluation is at the core of any model that we want to be either using or building, and LLMs introduce some additional complexity because they belong to the larger class of generative models. They're generating some form of new media, and that, I think, is more challenging to evaluate than what we might think of as classical ML producing a regression or classification with a more singular output. So when we're building things out, evaluation is key, but these LLMs have new challenges around how to control for that. We've been using a method that has been rising in popularity called LLM as a judge, where we use another LLM to evaluate the generations of the first LLM. That can be a single evaluation where you just score the generation on its own, a reference-guided evaluation where you have a reference answer that can ground it, or a pairwise comparison where you have two generations and ask which is preferred. But that has its own challenges too, which I'm sure we'll get into.
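To make the pairwise flavor concrete, here is a minimal sketch of an LLM-as-a-judge comparison in Python. The prompt wording and the `call_llm` wrapper are assumptions for illustration, not the prompts or tooling used at Stack Overflow.

```python
# Minimal sketch of pairwise LLM-as-a-judge (assumed prompt and wrapper).

JUDGE_PROMPT = """You are comparing two answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is better? Reply with exactly "A", "B", or "TIE"."""


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever judge model API you use."""
    raise NotImplementedError("wire this up to your model provider")


def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of two candidate generations it prefers."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "INVALID"
```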
RD Well, the reason we have to analyze the LLM output, I'm assuming, is that it's because it's sort of nondeterministic, this statistical prediction.
MG So the nondeterminism is definitely one challenge, but another challenge is the breadth of the content that can be produced. We have various aspects we might be interested in; question answering is a classic example. We might be interested in making sure that the generations don't have any toxicity, that they're well formatted and easy to read, that the content is accurate. Or if you're summarizing, you might want to consider attributes like whether it maintained the key points, or whether it's hallucinating. So there's a number of attributes we might be interested in considering, each of which has its own properties.
RD And so if the LLM has a lot of stuff it's drawing from, why is another LLM able to judge? Aren't you just using the broken thing to judge the broken thing?
MG So that is definitely at the heart of a lot of concerns about it, but I think it's a different task, and that's worth keeping in mind: one model is having to generate a response and one is attempting to discriminate. You might think of them as different skill sets, just as the skill set to write a book is different from the skill set to judge which book you prefer. The requirements of both can look rather different. So it's certainly not outlandish to say that something may generate something poor but still be reliable in evaluating those generations. But to that point, we want to avoid a turtles-all-the-way-down situation where we're using a black box to solve a black box. If you have synthetic data generation, then you have an LLM generate off of that, and then you have evaluation, you can easily lose sight of what your North Star is. So when using an LLM as a judge, we want to validate it against some source of truth, like we would for any metric. For example, if we were to consider the vague concept of accuracy of the content, we might have human labelers go in and label some content and then use that to validate the LLM as a judge, to make sure its performance is similar. But that often isn't as clean as you might think. For example, we have different raters and those raters have different skill sets. Within the context of Stack data, we might think of a task like summarization, something that was explored as part of the search experience. You would need someone with enough subject matter expertise to label that data for it to be a reliable label. And you'll likely want multiple raters so you can calculate inter-rater reliability, seeing how reliable the LLM as a judge is against a human rater relative to how reliable the human raters are to each other.
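A hedged sketch of what that validation step might look like: compare the judge's labels to two human raters using Cohen's kappa from scikit-learn. The label values below are invented for illustration.

```python
# Compare LLM-as-a-judge labels against two human raters.
from sklearn.metrics import cohen_kappa_score

human_a = ["accurate", "inaccurate", "accurate", "accurate", "inaccurate"]
human_b = ["accurate", "inaccurate", "accurate", "inaccurate", "inaccurate"]
llm_judge = ["accurate", "inaccurate", "accurate", "accurate", "accurate"]

# How reliable are the humans relative to each other? This is the ceiling
# we compare the judge against.
human_agreement = cohen_kappa_score(human_a, human_b)

# How reliable is the LLM judge relative to each human rater?
judge_vs_a = cohen_kappa_score(llm_judge, human_a)
judge_vs_b = cohen_kappa_score(llm_judge, human_b)

print(f"human vs human kappa:   {human_agreement:.2f}")
print(f"judge vs human A kappa: {judge_vs_a:.2f}")
print(f"judge vs human B kappa: {judge_vs_b:.2f}")
```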
RD Well, the human raters are nondeterministic too, right?
MG Yes.
RD So you mentioned accuracy, toxicity. What are the other things that you can use LLMs to evaluate other LLMs on?
MG So it really just depends on the problem you're working on. Generally it's something that isn't too surface level; it would be silly to use an LLM for something like a readability score because we have very cheap and fast algorithms for that. So often it's a more complex, content-centric metric where simpler methods won't really suffice. That's why, for summarization, things like a key point analysis of how complete the answer is relative to the original, or how accurate the summary is relative to the sources, are things you're not going to be able to approximate as simply.
RD Whenever I look at this, I hear HumanEval is the benchmark of choice for code evaluation. Are there other standardized benchmarks that you use and that other people use?
MG So there are a lot of standardized benchmarks, but here within the context of a Stack application, we're interested in those benchmarks as a way to compare models when we're selecting which ones to start with. At the end of the day, our goal isn't to run HumanEval, it's to perform the task where we're interested in applying this model, and that's going to be the primary thing. So we might use HumanEval to select, say, three candidate models that we're going to start with, and then from there, have a benchmark that we've made internally based off our own data to ensure the performance is similar. For example, with HumanEval, you might see that most of the code is in Python or something like that. And we have on our site a large number of different languages and packages, some of which are high resource and commonly seen, like Python and Java, and others that are seen fairly rarely, where there aren't a lot of resources or they have very recent information that might not be captured in the model. So we're going to be interested in evaluations across different aspects, because we want to think about the total user experience. For example, if we had an accuracy of 80% across some high resource and low resource things, but most of that accuracy is within the common things and the performance is very poor in the less common things, that leads to a pretty poor user experience. So those are some of the aspects we might be interested in.
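As a rough illustration of that last point, here is a small sketch of slicing benchmark accuracy by language, so a strong overall number can't hide weak performance on low-resource slices. The records below are invented examples, not Stack Overflow benchmark data.

```python
# Break down benchmark accuracy per language/tag as well as overall.
from collections import defaultdict

results = [
    {"language": "python", "correct": True},
    {"language": "python", "correct": True},
    {"language": "java", "correct": True},
    {"language": "cobol", "correct": False},
    {"language": "cobol", "correct": False},
]

by_language = defaultdict(lambda: {"correct": 0, "total": 0})
for r in results:
    by_language[r["language"]]["total"] += 1
    by_language[r["language"]]["correct"] += int(r["correct"])

overall = sum(r["correct"] for r in results) / len(results)
print(f"overall accuracy: {overall:.0%}")
for lang, counts in by_language.items():
    print(f"{lang}: {counts['correct'] / counts['total']:.0%} (n={counts['total']})")
```

A decent overall number here masks 0% accuracy on the low-resource slice, which is exactly the user-experience gap described above.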
RD Besides feeding it more data, are there ways to get it better at the low resource things?
MG For low resource, there's the question of the generations and the question of the evaluations. For the evaluations, I think we have to keep in mind that, while discrimination, as opposed to generation, is what we're interested in, and that can be an easier task, if we don't have enough information we still can't do it. You wouldn't ask a human labeler to evaluate something they've never seen before, because they probably wouldn't provide a very reliable evaluation. It's the same when we're thinking about the judge: we want to look at the reliability across different facets.
RD Right, you have to have somebody trained in it.
MG Right, because to your question about how well can we have the generations work for low resource languages, we can't answer that within the context of our stuff until we get the evaluation piece. So usually where we start is, can we evaluate it reliably to make the choices that we want to, and then once we get through there, then we get to the generations of how to improve performance and tune it for the use cases we're interested in.
BP So do the evaluating LLMs need to be custom trained, or can any general purpose one work? The machines that are looking at the machines, custom or off the shelf?
MG Well, that's where your validation comes in first: you need a way to choose whether or not your LLM as a judge is providing the expected performance. Then it comes down to cost, latency, and performance trade-offs. If you're seeing that Model A has high reliability and high validity but it's costing a lot of money, then you might be interested in something with a moderate reduction in reliability but a much faster throughput and lower cost. That's where a teacher/student setup, or fine-tuning a model specifically for evaluation, could come into play. Prometheus is an example of making a smaller model that can approximate a larger one specifically for the purpose of evaluation. So it depends on the trade-offs you're interested in.
RD You mentioned we had a custom benchmark. Did we actually train up an LLM for that or do any fine-tuning or anything?
MG So for the moment, no. Right now, it's focused on the evaluation. Part of that is that we're being very intentional on where we want to apply it and putting a lot of safety guards on it, so making sure that any LLM is being applied in a way that the community would positively receive it. So the majority of the focus right now is on evaluations.
RD That brings me to a gap in my own knowledge. When we talk about these benchmarks, how do you evaluate against that data? Is it in a custom database somewhere? Is it fine-tune trained? How does that data apply to the evaluating LLM?
MG So the process looks something like this: at the very beginning you define a suite of models you might be interested in. We're exploring a mixture of open source models and closed source ones. Then you run each of those through whatever your generation prompt is going to be on the specified task. For the example of summarization, say you have three models: A, B, and C. You've got three generations per sample, one from each, and then your LLM as a judge goes through and produces a score for them, and you use that to produce a ranking across the candidate models for that task. And if it's generalizable, you should also get a rough sense of how it would perform when deployed. Of course, offline eval is always an incomplete stand-in for online testing, but at this stage, that's what it would be used for. So a lot of that ends up being custom.
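A minimal sketch of that offline ranking loop, assuming hypothetical `generate` and `judge_score` wrappers around real model calls; this is not the actual internal pipeline.

```python
# Rank candidate models by mean judge score over an offline sample set.
from statistics import mean
from typing import Callable

def rank_candidates(
    samples: list[dict],                          # e.g. {"source": ..., "prompt": ...}
    candidates: dict[str, Callable[[str], str]],  # model name -> generation function
    judge_score: Callable[[str, str], float],     # (source, generation) -> score
) -> list[tuple[str, float]]:
    scores: dict[str, list[float]] = {name: [] for name in candidates}
    for sample in samples:
        for name, generate in candidates.items():
            generation = generate(sample["prompt"])
            scores[name].append(judge_score(sample["source"], generation))
    ranking = [(name, mean(vals)) for name, vals in scores.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```

In practice you would also want to keep the per-sample scores so you can slice them by language, tag, or recency, as discussed earlier, rather than only looking at the overall mean.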
BP Are there best practices and open source options when it comes to the evaluation side of things that you can draw on? Which is to say, all right, we know Stack Overflow wants to do some work in the realm of code gen. As we do that, we know that on Hugging Face we can find XYZ model which is good at evaluating against code gen. Or, all right, we know Stack Overflow for Teams wants to do, like you said, knowledge retrieval, synthesis, and summary. Out there, there's a couple of open source models that have been shown to be good at this, and of course you could do some in-house or supplement that with RLHF and stuff like that. But do you find that there are tools or best practices emerging that you can lean on, or is it kind of all greenfield and experimentation and you're mostly left to your own devices?
MG I'd say the main thing is that these things are changing all the time. The benchmarks that are used are changing, the models' performance and the trade-offs they have are constantly changing, so rather than naming a specific model, I'd say the key is having a process in mind: defining the places you're going to be looking for your model, ideally a benchmark that's related to the task you're doing, such as HumanEval for code generation, which may or may not generalize to the case you're interested in. For example, if you're generalizing to low resource languages, you're going to see some weaknesses there in how that applies to your use case, or a recency effect might also come into play, so just keep that in mind. And then there are publications; things are coming out constantly. So for choosing your model, it's really just making sure you have your key application in mind and you're staying up to date on the new content that's coming out, because it's changing all the time.
RD So what are some of the hot new eval techniques that are coming out? What are the changes that are of the moment?
MG I think they're getting to be a little more nuanced. There were some early papers in the last year that were quite exciting, sharing that LLM as a judge had high reliability, such as 80% agreement with humans, roughly the same as humans' agreement with each other. That said, they noticed a couple of weaknesses, such as position bias, where certain positions would be preferred by the LLMs; verbosity bias, where longer responses were preferred; or self-enhancement bias, where the model would prefer its own generations. And that was really on two different tasks, and since then there have been a couple more papers coming out exploring it, and a little more realization of the nuance to this. It sort of depends on the task you're doing. If you have more of a needle-in-a-haystack situation where a long context window can distract from the task, few-shot learning might be challenging, so you might have to do more of a reference-guided approach where you just have a simple description of it. And small changes in wording can matter. You might make a small change in wording to your evaluation prompt that changes how reliable it is, and that can be an issue when you're using the same human validation set over and over again. If you get 500 human ratings, depending on how you break that down, that might not be very much for some of the categories you're interested in looking at, and if you keep looking at that over and over again as you iterate on your prompt, you might accidentally end up over-tuning the prompt for those specific cases, and then you get another batch of labels and your performance degrades. While it's not exactly overfitting in the sense that we're not training a model, you are over-tailoring that prompt to that problem.
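One common mitigation for position bias is to ask for the pairwise preference twice with the answer order swapped and only keep verdicts that agree across both orderings. A hedged sketch, reusing the hypothetical `pairwise_judge` from the earlier example:

```python
# Swap answer order and only accept verdicts that are consistent.

def debiased_preference(pairwise_judge, question, answer_1, answer_2):
    first = pairwise_judge(question, answer_1, answer_2)   # answer_1 shown as A
    second = pairwise_judge(question, answer_2, answer_1)  # answer_1 shown as B

    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    # Disagreement across orderings suggests the verdict was position-driven.
    return "tie_or_inconsistent"
```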
BP I feel like me and LLMs are the same at heart. I prefer my own outputs, especially when they're very verbose. That's usually what I would tend to think is the best, but not everybody agrees.
RD It's interesting that LLMs have their own biases too.
BP We were discussing LLMs as evaluators and what they're able to do. Are there LLMs that you can use to help guide other parts of the process? For example, one of the things that I've read can create a much more positive user experience, in terms of the final output you get for a query, is having a lot of prompt engineering behind the scenes that takes the user's prompt, structures it, passes it off, gets a response, structures the response, and brings that back. Are there automated ways to do that, or is that all something that's being done manually at the moment?
MG There are certainly automated techniques you can use. Again, using LLMs at every stage is the easy answer, so you can use an LLM to try to generate the prompt itself.
BP LLMs all the way down.
MG But there's no way to know whether your prompt has improved without an evaluation, so going back to that, evaluation remains probably one of the first steps and one of the ones that can take a fair amount of time, depending on the complexity of your scenario.
RD So you can't just say, “ChatGPT, make this evaluation better, make this prompt better.”
MG The temptation can be to get a couple of labels, see that it works pretty well, and move on. Then you get to preproduction a couple of months later, a lot more people are poking at it, and you realize, “Oh, well, there are all these weaknesses that we hadn't considered,” and they're going to be harder to address at that point.
RD I know there's been some talk about training LLMs on LLM-generated data, not deliberately poisoning the model, but still making it perform worse. Is there any risk of that with LLMs just evaluating themselves forever? Is there a pit they're going to fall into?
MG I think there's definitely a risk of lowering the diversity of information. When we think about a lot of this, we want to think about the data generating mechanism we're really interested in. We're often interested in handling noisy, user-generated data, so we might think of the example of search. Say you don't have a lot of search queries: you can use an LLM to generate some search queries in relation to a document and train on that, and it may or may not work, but that's where it's important to make sure the synthetic data is as close as possible to the data generating mechanism you're interested in, such as the users. For example, when we apply that to Stack data, we end up missing a lot of the clever ways our users use filters and apply various tags and other types of information they're looking for. And the keyword-based search queries the LLM generates don't approximate that data generating mechanism very well. So if we were to fit toward that synthetic data, it would make a model that performs poorly on the user content while performing relatively well against the LLM. And we see the same thing for LLM evaluations: if you have human-labeled data that follows a different data generating mechanism than the one you're going to be evaluating in, it might not be as useful. You might be interested in validating question answering and wondering, “Well, is it going to be worth all the effort of producing these human labels?” and be tempted to use the accepted answer: you could find questions with multiple answers, pair the accepted answer with the least common answer, have the LLM as a judge state which one is preferred, and then use the wisdom of the masses, what the community is saying, as the thing we want to approximate. That could be a way to do a study around the consistency of your LLM as a judge. But the problem with that is that the content the users are producing is likely following a different data generating mechanism than the model. For example, code from an LLM that just doesn't run, or a hallucination, is a type of error we wouldn't expect to occur as commonly in answers on the site that match our filtering conditions. So that type of error isn't going to be captured or included in those reliability estimates.
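To illustrate the consistency study described above, here is a hedged sketch: pair the accepted answer with another answer on the same question and measure how often the judge agrees with the community's choice. The field names and the reuse of the hypothetical `pairwise_judge` are assumptions for illustration.

```python
# How often does the LLM judge agree with the community's accepted answer?
# Each record pairs a question's accepted answer with another answer on it.

def judge_vs_community(records: list[dict], pairwise_judge) -> float:
    """Fraction of pairs where the judge prefers the accepted answer."""
    agreements = 0
    for r in records:
        verdict = pairwise_judge(
            r["question"], r["accepted_answer"], r["other_answer"]
        )
        if verdict == "A":  # the accepted answer was shown in position A
            agreements += 1
    return agreements / len(records)
```

In practice you would also want to randomize or swap answer positions, as in the earlier debiasing sketch, so position bias doesn't inflate or deflate the agreement rate.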
BP But how many of the LLMs ideas will be closed as duplicate? I'm just kidding.
MG Only 80% of them will be flagged.
BP Yeah, or downvoted to obscurity. I think about this a lot, and not to be dystopian, but sometimes it feels a bit like the scene in The Matrix where it's like, “Okay, the AIs will continue to advance, but they do need human beings, as you said, to bring in fresh perspectives or a certain kind of knowledge, or maybe to solve novel problems.” And then I just imagine Keanu Reeves and all those people as the little batteries in their cells, and we're just the brains needed to produce some form of energy for future LLMs.
MG Who knows, maybe our emotions are just the random noise needed to do random number generations eventually just like the lava lamps.
BP Ideas are just a random storm of numbers passing through different synapses. I tend to feel that way more and more every day. But to Ryan's point, there was that famous paper, ‘Textbooks Are All You Need,’ and they looked at Stack Overflow and this other data source called ‘The Stack,’ and then they were like, “All right, well, once we're trained on that, we can generate 10,000 synthetic Q&A pairs based on that.” And then if you train a child LLM on that data, it performs as well as a much bigger model when it comes to these coding tasks. So in that case, synthetic data proved to be maybe equally useful. Now, as you said, that misses a lot of things, like novelty, for example. It's not keeping up with new languages, solving novel problems, et cetera, et cetera. And now we're going to go off course for a second, but one of the most interesting things to me recently is generative AI's ability to propose novel solutions. They solved a math proof that had gone unsolved for a couple hundred years, and they came up with a better algorithm for the traveling salesman problem. In those instances, however, I think there's a bit of spray and pray: the Gen AI comes up with thousands or millions of possible solutions, winnows it down to maybe the top 100, and then humans go in and try to help guide it to the final answer or whatever it may be. But I do think Ryan's point about synthetic data and LLMs’ ability to generate novel outcomes makes me think that in the future they'll be used to help generate a lot of the training data themselves.
MG Absolutely, and we're already seeing that and that is a highly valuable technique. Just as in the examples that you provided, a key when doing synthetic data generation is making sure it's capturing the problem you want to. So if you can have a way to discriminate the output and limit it to the things that are relevant or functional, then you can do a spray and pray of having it produce a million solutions and then whittle those down with whatever method. But for that to have value, you either need a way to make sure that those million are all valid, or you need an external way to validate those to go down to something useful. So it certainly has its applications.
BP It's a useful technique when the checksum at the end is something that can be done repeatedly and at low cost, right?
MG Right.
BP One question I have is this: folks are listening to this, and maybe within their own organizations they're being asked to consider Gen AI or to work on it, and they're being pressured to figure out whether this is a technology their organization should be leveraging or can get value out of. Michael, we're talking about evaluation, but what's a good framework in your mind to say, “My organization might benefit from generative AI, and here's how I would go about doing the discovery, the experimentation, the initial tests, and, if it all works, the deployment”?
MG I'd say the first question is pretty key, which is: does it actually need an LLM? Does it actually need something like that? Because for many problems, while an LLM can produce a good answer, you can have a lower cost solution for pennies on the dollar that gets similar performance. So the first thing is to see whether this is something where you need to generate new unstructured content in some form, and whether that unstructured content is going to provide direct value. The next is: is this going to provide enough of a business impact to merit the cost? There's going to be a fair amount of cost, even if you use something off the shelf, in deploying it, maintaining it, evaluating it, and updating it. Once you feel pretty good about those, you have your business impact and you know this is the right tool for the job, the next part is developing an evaluation framework. That can look like many different things; LLM as a judge and validating it is certainly one. Also make sure your scope is tight. It might be tempting to have ten different applications of an LLM and different versions of each, but every one of them is going to require its own care to make sure it's performing its purpose adequately, so start small and iterate quickly. Similarly, once you have that, there's a choice to make: there are going to be a lot of new, attractive models that are quite expensive and large, and is that actually what you need? What's your risk tolerance relative to your latency, your costs, and the performance you actually require at the scale you're going to be deploying this?
BP I've heard from a couple of people, Ryan and I were chatting with some folks from MongoDB and Google Cloud, that there is ongoing work here, as there is with any large service, like Stack Overflow has with our site reliability engineers and so on. When we're talking about a Gen AI application in production, what are the things that you need to work on? The maintenance, the cost, the tech debt that might accumulate, the errors that might pop up. Can you talk me through a little bit of what that overhead is like?
MG Sure. So you have to have the API that you're going to deploy, and you need to have the metrics around it. You're going to have a mixture of offline evaluation metrics to compare different models, to see, if new candidate models come up, whether or not you're going to switch to them; A/B testing across various KPIs related more toward user metrics, plus a means to deploy and compare them; data logging, so you can make sure you're maintaining the performance you'd like; and an evaluation framework around how well these things are actually performing. If you don't get the performance you want, you're also fine-tuning, and depending on your sensitivity to recent data or the specificity of your situation, you might need to do ongoing fine-tuning. Depending on the format in which you're using an LLM, whether it's more open-ended like a chatbot or more closed like a specific task, there are also security risks to consider. The content guardrails these models are built around can erode when you do further fine-tuning on your own data, since you're not going to be mimicking that safety process, so the more you fine-tune, the riskier that might be. Again, depending on your application, that can change as well. And then similarly, the code bases: a lot of these things are changing, so you might find that the tooling you originally had now has a more mature alternative that would be better to switch to. So there are a number of considerations that come into play.
BP That's a good summary. As an organization is evaluating all of this stuff, and like you said, weighing the cost, the ROI and all that kind of stuff, what do you think is the calculation you would make? Let's say you were a CTO at an organization doing it yourself, in-house, on-prem to whatever degree, versus an API with just some RAG setup or going to a MosaicML and saying, “Here's my data. Can you help me sort of build a model, fine-tune it, and then you operationalize it and I just pay you sort of as a cloud provider.” How would you think through that calculus?
MG I would definitely think through your throughput and your capacity, because if you're going to be the one maintaining that model, that's going to be ongoing effort as models change and you have to switch to the new one and fine-tune it. And it's certainly worth always starting with something pre-canned. The answer should always be that you start with something that's out there, and then if you're not getting what you need, you can open the door to: is it worth fine-tuning something, either to get better performance or to reduce costs by using a smaller model? That's where the throughput comes in. If you see a lot of throughput, costs can skyrocket quite quickly when using an API you're paying for, and will that pay for itself, and in how long, based on the expected maintenance hours you're going to see?
BP How do you think about hosting or cloud costs if you're not using an API? You have your own model, but maybe you're hosting it somewhere and you're managing the throughput and worrying about latency, the time from query to response. What are the mechanisms or the costs involved there? The general hosting and then the throughput and inference costs are what I'm referring to.
MG Once you get into the instantiation details of those sorts of things, that's usually where our team passes it off to machine learning engineers, so I can't speak as in-depth to that side of the house, other than that those are critical decisions, and they impact the data scientists' understanding of whether you should do it at all. So those are questions you'll want to answer ahead of time.
BP Okay, that makes sense.
[music plays]
BP All right, everybody. It is that time of the show. We’ve got to shout out a Stack Overflow user who came on and shared a little curiosity or knowledge that benefits all of us in the community. So congrats to user1083266 for asking a stellar question and getting that badge: “How to store image in SQLite database,” asked 12 years ago. 400,000 people have benefited from your curiosity, so we appreciate you coming by and asking the question. It seems like there are some good answers on here. We’ll put it in the show notes. As always, I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can find me on X @BenPopper. Hit us up at podcast@stackoverflow.com. We're bringing on developers who have hit us up recently or covering topics and questions that the audience suggests. And if you enjoyed the program, leave us a rating and a review, because it really helps.
RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And you can send me some DMs on X @RThorDonovan.
MG My name is Michael Geden, and my title is Senior Data Scientist at Stack Overflow. I don't use social media very much, so LinkedIn is probably your best bet.
BP Nice. All right, everybody. Thanks for listening, and we will talk to you soon.
[outro music plays]