The Stack Overflow Podcast

How to train your dream machine

Episode Summary

Ben and Ryan talk with Vikram Chatterji, founder and CEO of Galileo, a company focused on building and evaluating generative AI apps. They discuss the challenges of benchmarking and evaluating GenAI models, the importance of data quality in AI systems, and the trade-offs between using pre-trained models and fine-tuning models with custom data.

Episode Notes

Galileo is an end-to-end platform for GenAI evaluation, experimentation, and observability. Learn more by exploring their docs.

Galileo’s Hallucination Index is a ranking and evaluation framework for LLM hallucinations (it includes a blooper reel).

Connect with Vikram on LinkedIn.

Stack Overflow user Petr Janeček won a Lifeboat badge for answering Null array to empty list, a question that’s helped more than 47,000 other curious folks.

Are you a software developer? Take Stack Overflow’s annual survey about how you learn and level up, which tools you’re using, and which ones you want most. You can check out the results of previous surveys here.

Episode Transcription

[intro music plays]

Ben Popper Better, faster, stronger AI development with Intel’s Edge AI. Visit intel.com/EdgeAI to accelerate your AI app development with Intel’s Edge AI resources. Access open source code snippets and guides for YOLOv8 and PaDiM models. To deploy seamlessly, visit intel.com/EdgeAI now. 

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague, Ryan Donovan. 

Ryan Donovan Hello.

BP Ryan, you and I have chatted a bit with folks internally and externally about the difficulty of doing benchmarking and evaluations in the Gen AI era because this stuff is all so new and there aren't a ton of best practices, and most difficult of all, I think the person mentioned last time, this is nondeterministic software. Ask it the same prompt seven times, you might get seven different answers. My goodness, that's confusing. 

RD Hallucinations, yeah. 

BP So today we are lucky enough to have Vikram Chatterji as our guest from Galileo, which is a company focused on building, and most importantly, evaluating generative AI apps. Vikram, welcome to the Stack Overflow Podcast. 

Vikram Chatterji Thank you so much. It's super great to be here. Thanks, Ben. Thanks, Ryan. 

BP So for folks who are listening, give them a little bit of your background. How'd you get into the world of computer science? What led you to move to this more specialization in AI, deep learning, machine learning, and what did you do before coming to found your own company? Are you a founder and CEO at this company? 

VC I am. I'm the founder and CEO of the company. Honestly, funny enough, during undergrad, I grew up in India and my undergrad was in a university in India, and I wrote this note to a professor at Carnegie Mellon talking about how she was doing some incredible work in education and she said, “Come over for an internship,” and I was like, “Okay.” Never been to the US before, came over for an internship, and that was the Language Technologies Institute at Carnegie Mellon in 2008. And she was like, “Here we can help these students with a chatbot to teach them about how to learn mathematics with these things.” They weren't even called language models back then. It was super primitive but it was fascinating to me back then to be in the world of NLP and language models and the stuff that it could do. But fast forward to my time before Galileo, I was at Google for a while. I was heading up product management there, and this is like 2017/2018 at this point. And Google was known for AI, but a lot of the customers, especially in the Google Cloud side that were coming in, were these large enterprises who were talking about how they had a ton of unstructured data lying around and no AI to actually service that. Most AI back then was MLOps, so to speak, which meant structured data. That's all you could do. And so my fascination basically came in when I saw this whole new avenue for enterprises to work with AI but on unstructured data using what Google already had, which were these language models called BERT and BERT type models, which were super impressive for the time. They were in the order of millions of parameters, but huge for the time. So my team was one of the fringe teams that was building out applications for enterprises using language models back then, and that's kind of what got me super excited about it and I started the company with two other friends along the same lines.

BP Nice. BERT, that's ‘something something transformers.’ That was one of the first models that Google put out publicly that used the breakthrough transformer architecture. 

VC That's correct. That's correct, exactly. And it was based on transformer architecture, and it was the ‘Attention is All You Need’ paper. And it just led to this whole flood of NLP-based use cases– natural language processing-based use cases– which I was always excited about, but right now it felt like it’d gone into the mainstream, but now I feel like it's even more in the mainstream than it was back then. So when Galileo started out, we weren't the most popular kids in the Silicon Valley garden necessarily working on language model stuff, but me and my co-founders had worked on this stuff for over a decade. 

BP Well, it's good to be early. When was the company founded? How many years have you been chugging away at this? 

VC I left Google in Feb 2021, and that's when we started thinking about mostly research, honestly. To your point before, it is nondeterministic and it always has been with language models, and generative tasks even more. But even before that, it was mostly like, “Does this label make sense or not?” And at Google in AI, you spent so much money on labeling that it was the bane of our existence and we wanted people to just be able to figure out and evaluate the outputs of these models using some kind of metric, so we spent a lot of time on that in 2021 starting February. 

BP I'm going to pass the mic to Ryan, but before I do, I'll just say, starting a company focused on this kind of technology prior to November 2022 means you were ahead of the game. You got in early. Maybe your valuation wasn't as lofty, but sometimes that's a good thing. 

RD It's interesting that you started early and you say that it was all nondeterministic before this, you had markup chains. Even neural nets use kind of a statistical biasing. What do you think is the benefits and downside of that sort of statistical nondeterministic-ness of the LLMs?

VC I mean, there are huge benefits to this, and I'm glad that folks like Andrej Karpathy have been talking about this because his tweets go viral very fast and he talks about the right stuff, I think. And one thing which I really liked that he said was that these models are reasoning engines, but they're also dream machines. And I like that analogy because it's cooking up stuff based on what it's learned before, and that's what leads it to potentially give very smart solutions, and it's as far away from if-else statements as you can go. So I think for super creative tasks, it's great. It's exactly what you want. It's going to come up with something which is better than a blank sheet of paper that you can start to work on top of, which is why people really admire this technology. I think that's the pro of it. The con, obviously, is that for enterprise use cases, and when you actually want to go ahead and not just have a cool prototype that you can show off but actually have an application at scale, you want to reign that in as much as possible, make it follow a certain syllabus. And the way I think of this is it's similar to a child that the world is its oyster, it's learning everything, but then you've got to teach the child, “Don't play with fire,” and, “Don't jump off a cliff.” And you have to have those guardrails in place and eventually it becomes a responsible citizen that's adding value. So I think of it in the same way, but as you're trying to do that, it's basically data that you're feeding in, whether it's fine-tuning that child or it's actually providing context in the form of RAG for that person. 

RD As a father of a three year old, I appreciate the idea of putting guardrails around. If you have any tips on getting him to stop putting the word ‘poopy’ in every sentence, I’d love good prompt engineering for that. So we talked about how it's a dream machine and limiting the sort of scope of its dreams. We talked about hallucinations and confabulations. What are the ways to sort of nip those in the bud?

VC So the way I think of this is, and Ryan, you and I were talking about this before, but increasingly software engineers are becoming data scientists almost because the way you can operate with these systems is through API calls and a vector database, which is less of a database, it's more of just another API call you can make with some context data. And so what you're seeing is that it's very important for people to understand now that it's not just the model. The model is actually this super commoditized, small part of the entire system. It's actually everything else. You have to have the right data, the right prompt, the right embedding model, the right vector store. There's a whole host of things and parameters along the way that you need to keep tweaking to get this just right. So I always urge people whenever I talk to users or builders to think of this as the entire system. What's the whole system that you want to bring together? That's step number one. But step number two is, given that there's a lot of engineers who are coming into the mix and software engineers who are building these quick AI systems, they have to remember that this is still at its heart data science. And so when you're going from zero to one of trying to build a system that's battle-hardened and that can actually give you a good response most of the time, you have to remember the science in data science, which is iteration and trying out different things and throw the kitchen sink in terms of ten different models, five different prompts, et cetera, et cetera, and see which one's working out better. It's a bit painstaking, but you need to go through that in order to get to the other side where you feel like, “Ah, now I know it's this prompt at this model for this use case that's working out really well, and my chunking methodology is just so. My retriever kind of sucks here and so I'm going to try this different embedding model.” It's interesting, but you have to iterate constantly to get to that point. 

RD When you're iterating, how do you determine what's good? I think that's the sort of open problem right now– figuring out if it's a good response.

VC That's a great question and that's kind of the whole evaluation where this comes in, and that goes back to when we started Galileo in 2021 as well. My teams at Google were spending weeks in Google Sheets just looking at the responses from models, and we were like, “Yes, no, maybe,” and there was no quantitative way to determine what's good data quality. And that was one of the first evaluation breakthroughs that we made with a metric for NLP customers. Similarly, what we're seeing now is that half of our team is ML research, so we built out metrics which could detect potential hallucinations. And what I mean by that is it's not a fact-checker. It's actually almost like a lie detector on the model to tell you how confused are you as you're coming up with a response. And again, it's similar to a human being where if you ask me a question about quantum computing, I don't know much about it but I'm going to maybe give you an answer that sounds smart because I read a little bit, but I'm basically BS-ing. But if there was a lie detector on me, you could tell that this person is not super confident, even though the answer sounds about right. So that's one set of metrics that we provide our customers with, which has around 85% correlation with human feedback. So that's a good first line of defense. But on top of that, as a developer, you have to define what ‘good’ means to you for your use case, and then build out custom metrics as well to figure out whether the response is actually a good one or not. Beyond that, Ryan, I feel like you still need subject matter experts in the loop to just look at about 50-100 different responses that there are potential hallucinations and kind of then tweak the system from there. I think the combination of those three things using these automated evaluation techniques as well as human feedback and some kind of custom metrics, that's a very powerful combination to be able to get to a fairly good system.

BP So you mentioned you had a few evals on the NLP side that appealed to customers. Tell me over the last year, how has the industry evolved and what do you find customers are coming to you now asking? 

VC So Ben, it's been super fascinating, because for the first year and a half of our journey, number one, we were the only players in the language model operation space, if you will. And so we had this open field where we were going to these big banks and everyone else and they were like, “Sure, nothing like this exists.” But the evaluation problem was essentially a data quality problem because data was the primary ingredient inside these AI systems. What's happened since then is that all of these customers started coming back to us starting, let's say, around mid-2023 is when they started coming to us and mentioning that, “Look, we're exploring these larger language models, but then the system is becoming more complex. It's not just the data anymore. In fact, we're not even labeling this data, and thank goodness for that because the labeling is a mess. We're just throwing context data here. There's this new thing called RAG and there's these prompts I need to think about. There's the data, there's all of these things, and can you help us now evaluate not just the data, but all of these different bespoke parts of the system and tell us where things are going wrong?” But also now our hair on fire issue is the output from the model is not a classification model anymore that's just giving out a label, it's actually a sequence to sequence model so it's generating stuff for me. And I don't know if it's right or wrong, so can you evaluate the inputs and the outputs? And so we were like, “Okay, let's grow with you and let's build these metrics out, but also a product out such that you can do this really easily.” And that's what led to what we call the generative AI studio that we launched mid last year, which now around 90% of our revenue is just that in the last year. It's been crazy to see the uptick. 

RD You say you can sort of figure out if it's confused. Are there ways to look into the brain of the LLM to read its mind and observe what's going on while it's figuring stuff out? 

VC It depends on the LLM. The good thing with NLP, Ryan, was that almost everything was open source. BERT is open source and all the other variants are open source, and so we could literally look at how it's learning epoch over epoch and as it’s converged towards a label and we could see how it's getting confused and how it's thinking. Versus the challenge sometimes of these closed-source LLMs is that they don't give you anything about how they're learning behind the scenes. That's where it gets a little bit challenging. We overcame that in some ways, and we have a methodology called ChainPoll. We wrote a paper about that last year. However, the good news is that open source is reviving in the world of AI again, and honestly, Meta is doing a great job of spearheading that and Google is following suit with Gemma as well. LLaMa 3 has been huge. So those models on the other hand, you have a lot of access to exactly how it's learning and how it's making those decisions, and that makes it really easy to get a sense for what was the uncertainty. There's an uncertainty score that you could get for every token as it's coming up with its response. You can also get a sense for how perplexed the model was on the input, because sometimes the question is stupid, but sometimes it doesn't understand parts of it. And so the combination of all of those pieces are what we take in as inputs, put that into our complex math that leads to these metrics that we then provide back to our customers. 

BP I'd like to hear a little bit about the product suite and how it works. If somebody comes to you, do you charge them through an API and a usage basis? Are you creating bespoke solutions for them or letting them choose from templates? And what kind of ongoing relationship is this in a SaaS sense versus a transactional relationship that can happen once and then they can move on? 

VC That's a great question. So we've kind of grown with our customers. So initially, most people were just doing POCs and just kind of feeling the waters of Gen AI to see if this is even applicable at scale. Obviously, there's been so much talk about it that I feel like it's a little overblown at this point in terms of that LLMs can do everything. So people had to have a little bit of a ‘come to reality’ moment with what it can do well. So I feel like most of 2023 was that where they wanted to evaluate different models and see what the output is. So the first product we launched, Ben, was called Galileo Evaluate, which is more on the development side to help you debug and iterate as quickly as you can, powered by all of these hallucinations, security and data privacy metrics that we have built out. Moving on from there, we started seeing people going to production more and more, and that's when we launched Galileo Observe, which is a real time monitoring solution which can give you alerts and what have you around all of these different pieces in the system, but also your own custom metrics. What we started to see from there was that customers came back to us and said, “Wait, these bespoke hallucination metrics and security metrics you have, they're actually really good, but by the time my user gets a really bad, harmful response, the cat's out of the bag already and my brand is harmed, my user is harmed, it's terrible for everyone.” And so based on that feedback, we worked with them and we launched a third product inside the Gen AI studio and we're calling it Galileo Protect. It actually launched publicly today. 

BP Congrats. 

VC Thank you. We're super excited about that because that is the real time interception of the model's response so that if it's harmful, it doesn't reach the user.

BP So you can monitor for something toxic or something that has PII, just as generalized examples, and then intercept that. Do you then send it back to the model and generate another response, or what does the end user see once you've encrypted something? 

VC Great question. So what we do there is, so our users are developers and they create a bunch of these rules. The rule could be based on a bunch of things– PII plus hallucination, etc, etc. Once it's triggered, there are certain actions you could take. The simplest action is just completely override this and instead just say something like, “Sorry, I can't answer this question.” That's super simple. The second one is redact. So if it's PII or certain things like that, redact just that information, but send everything else. The third one, which we've seen is very interesting, is allow them to trigger a product workflow accordingly. So as an example, if there's a hacker who's trying to prompt inject your system, and it says very clear malicious prompt and we detect that and the responses were, “Can you just suspend that user?” So can you call an API on your side as a developer and just suspend that user immediately instead of having to do that ad hoc later on. So we've seen many different kinds of actions emerge as a result of this, which has been fascinating.

BP And the user who got to purchase the car for $1. They're not here in good faith. 

RD I think it's a really interesting approach. I think most people use a sort of complex system prompt, but this positive man in the middle attack, this firewall where you catch the prompts before they get to the LLM, what led you to that approach? 

VC Honestly, our customers were increasingly talking about that. They've been mentioning that ever since we showed them our suite of proprietary metrics that we've built out and how powerful they were. We've been working on this for many, many months now, and the issue was, “How do you launch this and make sure that it's super low latency?” It has to be ultra-low, like milliseconds of latency to the system. It also has to be super low cost. And the issue is, if you talk to most teams that are trying to do some kind of evaluation today, they'll tell you that, “Ah, we do XYZ, but we have an LLM in the loop somewhere to evaluate the response of the LLM.” And that doesn't scale, it's very expensive. And so we had to kind of go back to the drawing board, build these new kinds of foundational models that could provide this in a cheap, low cost, low latency way, which could scale and then build the system out. So that's kind of what led to it, but our customers were asking about this since about 10 months ago. 

BP You raise an interesting point which I'd love to dig into a little. I don't know if this is your specialty, but when I look at the state of the art models that are coming out, both from open source and from big corporations, it feels a bit like someone discussing an improvement they've made to an F1 racing car. Maybe it's that much better on the suite of benchmarks we all accept, but it's pretty irrelevant to the average business which still hasn't figured out what the business use case is or how to prevent toxicity or where they're getting value from their Gen AI app. So in that, a lot of people are now considering if a smaller model is better for me or a slightly slower latency because the costs are starting to add up of inference. How do you see people weighing those two things– the power of the model versus the cost and the speed? 

VC Great question. I think you're hitting at a really important point where I think of this as the maturity curve of the market. So way back in 2023, if you went to customers and said, “You don't have to use OpenAI. You don't have to. It's powerful, but you can use a LLaMa model.” I think it was LLaMa 2 at the time, which had already launched. “You can fine tune with your own data. It's not that hard. You have the data. You can own everything end to end, and it'll be cheaper for you to go down that route.” We, in fact, also launched this thing called the Hallucination Index in 2023 to try to really let people know about this, which basically showed how OpenAI is good from a hallucination mitigation perspective to quite an extent, but LLaMa models weren't too far behind. Now, if you read the pros and cons of accuracy to cost there, it's a no-brainer that you should be going with something like a LLaMa model or a Mistrial or something like that. So I agree that every day you see a new model coming out, and I call this ‘model flexing’ by large companies. We saw the exact same thing happen when Google had launched BERT and we started to see DistilBERT and ColBERT and a bunch of others come out. And they're like, “Mine is better because of this one small thing on this help chart that Stanford's come out with.” And there's this really great piece by The New York Times recently called ‘AI Has a Measurement Problem’ that is very good because what we're seeing is that these tests that people are launching their models with can be gamed. And everyone passes with flying colors on these tests, but when our users are actually using these models, they don't quite work really well for all of those use cases. And so now they're stuck in the middle, and again, all the more they need evaluation for this. But I feel like it's harmful for people to just think that here's a new model and therefore it's going to be better than anything that's there before. It's a lot of marketing that's leading to that.

RD I think Hugging Face has a really interesting approach to it where there's leaderboards, which are people evaluating responses. And you can't really game people. 

VC Yes, the Wikipedia model. 

BP All the way down, baby. Forget the MMLU. That thing is a piece of junk. 

VC You’re absolutely right, though. You need crowdsourced information. I really like the Hugging Face leaderboard, honestly. All of these models show up there, otherwise you would never have known about them. But when we talk to enterprises though, Ryan, I've seen that it's mostly that they're following their cloud provider and they're like, “Ah, we use Azure, so therefore OpenAI. We use Google, therefore Gemma.” And I think that's the problem. I wish we were all talking more about, “Here are the 10 alternatives that are way cheaper and equally accurate if you can tweak the system in the right way.” 

BP So, let's say you're talking to a customer and you're saying, “Look, OpenAI is the one you've heard of and ChatGPT is the one recognized as the most powerful, but honestly, you could fine tune on a LLama 2 with your own data, and then not only would it know about your system, which is great, and be in-house, so solve some maybe security or privacy risks, but it would be cheaper.” I'm just curious, what are the costs of self-hosting and self inference versus hitting an API and paying for tokens? Can you lay that out for me in a very ballpark way? 

VC So there's a couple of trade-offs there. One is just ease of use. If you're just getting something out of the box, hit an API, why not? So it depends on how quickly you want to go to market and where you are at. So that's number one. But from a cost perspective, it's a lot because now you're paying for that service of just them having done everything for you. It's white glove, and now you're just calling an OpenAI API call. I don't think it's there yet, unless you're, I guess, a huge hyperscaler. I don't think it's there yet for you to just productionize with that. If you're Twitter, you can't be using OpenAI's models, that's probably why they created their own models. At that scale, you just can't be using that in production. It doesn't make sense. I think that was the whole ‘Oh, shit’ moment that people had in 2023 where they're like, “Oh, we created this thing. It works, but oh, we can't launch this. This is too expensive.” And that's when fine-tuning kind of made a comeback a little bit, and RAG especially made a huge comeback, and vector stores became popular. But in terms of hosting it yourself, there are pros and cons that you need to think about there as well. The issue is, if you're using a LLaMa model, you need to fine-tune that with your own data, and getting ground truth data is still a problem. It was at Google, it still is. The difference is that earlier with NLP we needed, let's say, 10-20,000 rows of data that we needed to annotate with a labeler, versus now it's maybe a thousand at most, much lesser. But you still need to gather that data. Number one, that's hard for a lot of enterprises to do. But number two, again, you have to go and annotate that and hire labeling companies, and that's always a mess. Nobody likes to do that, it's just very expensive. So that's where the pros and cons come in. And so when we talk to customers, we ask them, “You have this alternative, why aren't you going down that route?” And they'll tell you, “I know it's cheaper, but honestly, if I go ahead and get ground truth data, label that, try to figure out if it's good labels or bad labels, et cetera, it's going to take me three months in just doing that versus my competitor is going to build something else before that so I'm just going to go with this off-the-shelf model instead.” 

RD You just throw money at the problem. 

BP It's interesting. From a Stack Overflow perspective, this all makes a lot of sense and I get excited about it because, I don't know if you would agree, but it seems like people have openly said from OpenAI and Facebook, Microsoft Research and Mistral, that data quality trumps all. More than architecture, more than hardware, data quality trumps all. So if you have our product, Stack Overflow for Teams, inside of your company, you've been sort of preceding the work of crowdsourcing and community deciding which data is good, which answer is accurate, adding labels, adding votes, adding metadata to it, and so you might be able to do some of what you're talking about with less of that catch up effort. 

VC That's exactly right. Because now you have all of this interesting data to actually work with, which becomes your proprietary information. And honestly, you can use that very easily with just RAG-based use cases, which is what a lot of companies are doing, honestly. You want to build a chatbot for your developers internally to find the right information quickly? You have all of Stack Overflow's internal data to do that. 

BP I haven't really dabbled in the world of image or video or audio, just text generation and code generation, but in those worlds, what's most useful to your organization usually is understanding either your code base or your knowledge base and then being able to answer your employees’ questions about that. I guess that's a little different from a customer-facing thing which might exist, which is like, “Help my customer create a new marketing template,” or, “You've got to study this FAQ and be our customer service bot.” Those are a little bit different, but for those prime sort of internal AI assistant use cases, it seems like RAG is the best-in-class solution for now, and then having your data preorganized as Q&A couplets that are ranked is a pretty good way to go. 

VC That's true. Although Notion recently published this blog about how they've been going all-in on generative AI for the last year. They partnered with Pinecone, I think it was, and they've gone to production with RAG behind the scenes. And we're seeing that increasingly amongst our customers too, where they're not fine-tuning a model, and it's not either/or, you can do both, but they just built their POCs using a RAG system and they're productionizing with that. It's not super cheap, but it is becoming more the de facto way of going to market increasingly.

BP It seems like there's also some flexibility to that in the sense that you might have to update your vector database, but if every week you want to update what the RAG source of ground truth is, you might be able to do that versus constantly having to think about fine-tuning the model or switching to the next model, whatever it may be.

VC Exactly, exactly. It's more live, almost. I guess the only thing that we've been hearing about in that situation is that people want to know more about the drift that's happening in their context data. Especially if they have a live feed of context data which is powering their RAG, what changed and how can they know about that quickly. And so providing some kind of semantic drift becomes really important at that point. 

BP Great. Well, Vikram, anything you want to talk about that we didn't get to? 

VC I'm glad that we are talking more and more about two things. One is the problem of hallucinations and how we can overcome that, the problem of how there's a huge attack surface that's basically opened up also with Gen AI. There's tons of Reddit threads around how they can hack the system so it's good to talk about that. And also third, the idea that people should move and think more about the dozens of models that they can use, and not just be stuck with thinking that there's one Pangea and one Panacea that can just solve for everything.

[music plays]

BP All right, everybody. It is that time of the show. Awarded two days ago to Petr Janeček: “How to create a null array to empty a list.” A great question saved by a terrific answer with a score of 20 or more and helped around 50,000 people. So some knowledge and some curiosity combined, giving lots of folks the answer they need to solve their coding problems. As always, I am Ben Popper. I'm the Director of Content here at Stack Overflow. Find me on X @BenPopper. Shoot us an email, podcast@stackoverflow.com. We have had guests. We have had topic suggestions. We have had guests suggested by write-in. So all three, tell us what you want to hear on the show and we will listen to you. And if you enjoyed today's show, evaluate it. Leave us a rating and a review, because it really helps. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me with podcast ideas, article ideas, snarky comments, you can find me on X @RThorDonovan.

VC Vikram Chatterji, founder and CEO at Galileo. If you want to find me, LinkedIn is probably the best bet. And if you want to know more about Galileo, go to www.rungalileo.io. You can find more information there. Reach out to us. We are friendly people, we'll write back to you immediately and have a quick chat.

BP Terrific. All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]