The Stack Overflow Podcast

A leading ML educator on what you need to know about LLMs

Episode Summary

Machine learning scientist, author, and LLM developer Maxime Labonne talks with Ben and Ryan about his role as lead machine learning scientist, his contributions to the open-source community, the value of retrieval-augmented generation (RAG), and the process of fine-tuning and unfreezing layers in LLMs. The team talks through various challenges and considerations in implementing GenAI, from data quality to integration.

Episode Notes

Check out Maxime’s three-part LLM course

Part 1 “covers essential knowledge about mathematics, Python, and neural networks.”

Part 2 “focuses on building the best possible LLMs using the latest techniques.”

Part 3 “focuses on creating LLM-based applications and deploying them.” 

Read Maxime’s blog.

Follow Maxime on GitHub or LinkedIn.

Nikhil Wagh earned a Lifeboat badge by explaining how to Efficiently compare two sets in Python.

Episode Transcription

[intro music plays]

Ben Popper Better, faster, stronger AI development with Intel’s Edge AI. Visit intel.com/edgeai to accelerate your AI app development with Intel’s Edge AI resources. Access open source code snippets and guides for YOLOv8 and PaDiM models. To deploy seamlessly, visit intel.com/edgeai now. 

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. Your hosts are here, Ben Popper and Ryan Donovan. We work on the Content Team at Stack Overflow and we are in the thick of trying to learn all things Gen AI. It's a bit overwhelming out there, new research papers, new models, new modalities every day, so we are lucky today to have a great guest, Maxime Labonne. I hope I pronounced your name correctly– my French is pretty poor. He has been playing around with a lot of this stuff, building with a lot of this stuff, and also creating instructionals and ways for other folks to get involved, I believe. So Maxime, welcome to the podcast. 

Maxime Labonne Thank you. Thank you for having me. No worries, your French is terrific. 

BP Merci. So tell us Maxime a little bit about yourself, how you got into the world of software and technology, what you do for a day job, and then your passion, it seems, for what's going on in the world of ML.

ML So I started this journey with AI during my PhD. I studied cybersecurity, and during my PhD, it was really machine learning applied to cybersecurity. After that, I decided to continue and expand my field a little bit, so not just about cybersecurity, but also about computer networks in general. That was at Airbus, and then I joined a team at JP Morgan that is more focused on large language models because that was something I was really interested in, and that's also the reason why I made the switch. 

BP Cool. Let's talk about some of the stuff you do in public. I found you through Twitter. You were sharing resources that you were putting online for folks to learn themselves about how to train or work with or quantize or do various things with LLMs. So can you tell us what’s some of the stuff you've been producing over the last couple of years or months and who do you hope is out there interacting with this? Is this about contributing to the open source community, to knowledge? Is this about bringing in people to work on projects you're working on? What's it all about?

ML I think that you’re referring to the LLM course. It's a list of curated resources to get into LLMs. It's split into three different roadmaps. There's one about the fundamentals, like what you need to know in terms of math and Python to get into this field. There's another one that is more focused on the science behind building these large language models. And finally, there's one dedicated to engineering which is more about RAG pipelines and more about deployment, so it covers pretty much all the major topics to get into this field. And other than that, I also do different kinds of open source contributions. It can be tools that I make for myself and then I'm like, “Oh, it's really nice so I'm going to release them for other people,” and that's been quite popular. And also creating models using fine-tuning techniques, and more recently merges. That's also a big chunk of my activities now. 

BP Nice.

Ryan Donovan Just looking at it, it looks pretty comprehensive, but I appreciate that you start immediately with the math of it, the linear algebra, the calculus. My first experience with neural nets was in a college course and it was just a sum function, and I had just taken Calculus for Poets and no linear algebra, so I had no idea how to comprehend it. How valuable would you say the math is to getting an understanding of LLMs? 

ML I would say it's not that useful, actually, and I'm a bit bothered by the fact that it's the first roadmap. I think people should know that if they struggle with something related to math, here are the resources, but please do not start by learning the math before all the cool stuff, because otherwise I don't think that you'll be motivated enough to finish the course. So I don't think it's particularly useful, because you can build RAG pipelines, you can deploy these models, you can do a lot of stuff with these models without requiring any math. Math is more for the research stuff. If you start reading research papers, this is where it becomes useful. But I would say avoid the math until you cannot avoid it anymore. 

BP Right. I think it's really interesting. We've been trying to run through sort of a thought process– you're an organization that wants to bring in Gen AI, what are you going to do? Are you going to create your own foundation model? Well then you probably need some people who are experts in math and you're going to have to learn how to do some pretty complicated stuff. It's also going to be expensive. You're going to need specialized people. You're going to need a ton of compute. You're going to need great data and a training run. We don't have numbers, but we were on a phone call earlier this morning with somebody. A big training run for a heavyweight AI company could be $150 million. That's not an investment you want to make as an experiment to see if Gen AI is going to be useful inside of your company. But then to your point, there are other ways to go about it. What was the company that was acquired by Databricks that builds models for people? 

ML MosaicML. 

BP Yeah, you could go to MosaicML and say, “Here's my AWS bucket. Can you take one of your great models or one of the ones I pick and do it for me?” And they say they'll handle all of that plus the orchestration on the back end. Or you could, if you have a team of good ML data people internally like you do, read a couple of courses and get it going within a playground and see if it works. And then RAG is really the cheat code. You don't have to learn any of this stuff. Get your data organized, do some embedding in a vector database maybe, and then just point it at the right API with some maybe prompt engineering that you've done around it and you can get some pretty interesting results. I don't know, what do you think about that breakdown? Am I missing anything? Are there ways people should look at this in terms of how they want to bring it to their organization that you would say I'm missing something important or you don't agree? 

ML No, I think it's a really good overview. I would say that a RAG pipeline is probably the first thing that you want to do because you can basically use APIs that already exist like GPT-4 and you do a bit of prompt engineering. You can retrieve some information to add it to the context and then it becomes better automatically. And then if you're serious about it, if you really want to build not just one model but the best model that you can, fine-tuning becomes really, really useful. Sometimes I see a lot of people saying, “Oh, RAG is better than fine-tuning,” or, “Fine-tuning is better than RAG,” but in the end we want to use both at the same time because this is where you are going to maximize the performance. 
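
For readers who want to see what that "retrieve, add to context, call an API" loop looks like, here is a minimal sketch in Python. It assumes the sentence-transformers library for embeddings and the OpenAI client for generation; the document chunks, model names, and prompt wording are placeholders for illustration, not anything specific recommended in the episode.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
client = OpenAI()                                   # assumes OPENAI_API_KEY is set

# Your internal knowledge, pre-split into chunks (placeholder content)
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def rag_answer(question: str, top_k: int = 2) -> str:
    # Retrieve the chunks most similar to the question
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top = scores.topk(min(top_k, len(documents))).indices
    context = "\n---\n".join(documents[int(i)] for i in top)

    # Add the retrieved context to the prompt so the model answers from your data
    prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```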

RD I've heard that the only reason why you'd pick RAG over fine-tuning or not use both is if you want to forget things. Have you run into instances where somebody doesn't actually want to use RAG or shouldn't use RAG or shouldn't use fine-tuning?

ML Not really, to be honest. Maybe you could not use fine-tuning if you really need GPT-4. In that case it can be a bit tricky or costly, but otherwise I would say it's always a good idea no matter what you want to do. It's always better if you add more context to these models, so that's the RAG part, and it's always better if you retrain these models on the task or the domain that you want to apply them to. So it's a bit of a free lunch in terms of performance. 

BP And can you break down for me a little, when you say fine-tuning, what does that mean? We were talking to somebody the other day about this and they said fine-tuning versus pre-training. There's a whole spectrum of it and they were saying maybe you just unfreeze one layer or blah, blah, blah, and I don't know what that means. When you say fine-tuning and unfreeze a layer, what are we doing exactly? 

ML So the fine-tuning aspect, it's important to mention, as you said, that it happens after the pre-training. So first you pre-train these models and that's what you mentioned earlier. It can be very, very costly because you need a lot of data and a lot of compute. And after this pre-training phase, you get a base model. So it can be Mistral 7B, it can be LLaMa 2, but this is a base model that has been trained to predict the next token in a sequence. It's not like ChatGPT where you interact with it and you can ask questions and it's going to answer your question. It's more like the keyboard you have on your phone that can predict the next word given the input. So this is very important to understand because the fine-tuning process allows you to go from this base model to a fine-tuned chat model, and in fine-tuning you have two main techniques. The first one is called ‘supervised fine-tuning,’ where you're going to retrain your layers. You can freeze some of them because they tend to be very redundant, and it's going to be a lot more efficient if you do not retrain the entire model but retrain only the parts that matter. And to do this instruction fine-tuning, you'll have an instruction data set with pairs of instructions and answers. And when this is done, you can have another layer of fine-tuning on top. It's called ‘preference alignment,’ which is also called ‘reinforcement learning from human feedback,’ and what it allows you to do is to give preferences to the model. So you give one instruction and two different answers. One is the preferred answer and the other one is the answer that you do not want to see. And this is very useful if you want to censor the model so it doesn't talk about how to make drugs, for example.
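
To make those two stages concrete, here is a rough illustration of the data each one consumes. The field names and text below are invented for illustration; real instruction and preference data sets (Alpaca-style instruction pairs, RLHF/DPO preference pairs) follow the same basic shape.

```python
# Supervised fine-tuning: pairs of instructions and answers (Alpaca-style shape)
sft_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained to predict the next token...",
    "output": "LLMs learn next-token prediction and are then adapted to follow instructions.",
}

# Preference alignment (RLHF/DPO-style shape): one prompt, a chosen and a rejected answer
preference_example = {
    "prompt": "How do I synthesize an illegal drug?",
    "chosen": "I can't help with that, but I can share information about drug safety.",
    "rejected": "Sure, here is a step-by-step guide...",
}
```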

BP It's important for people to understand the most cutting edge AI out there is also Mechanical Turk. It was fine-tuned with people going in to say, “Say this, don't say that,” and that's why ChatGPT is such a pleasant experience, right? 

ML Absolutely. It's really used quite a lot. And you always need more data sets, more and more data sets to train these models.

BP So I understand the reinforcement learning with human feedback completely and I can understand the idea of saying, “Okay, I want to unfreeze parts of the model and fine-tune it with slightly different data or instructions.” But when you say ‘unfreeze a layer,’ in my mind, a neural network is this deep set of layers of different connections. How do you know which layer corresponds to which part of its understanding or reasoning or intelligence? How do you know which ones to unfreeze and which ones to instruct? 

ML I think it's a bit experimental. Now we have recipes and we know which one is the best. A few years ago, I would have said, “Oh, just keep the last layers, the most important ones.” Now what we do is that with the transformer architecture, you have one block that is repeated over and over again. And in this block, you have the self-attention mechanism and you also have traditional feedforward networks. It turns out that these feedforward networks, you want to freeze them and you want to only retrain very small matrices instead. It's a technique called LoRA. And these very small matrices, they're going to approximate the full-rank matrices in your layers. And this technique allows you to be very, very efficient, and instead of retraining 100% of the parameters, you're only going to retrain about 1% of them. So the cost goes down and it's also a lot faster and you do not lose that much accuracy doing that. 
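
As a rough sketch of the low-rank idea Maxime describes, here is a toy LoRA-style layer in plain PyTorch. The rank, scaling factor, and layer size are arbitrary choices for illustration; in practice, libraries such as PEFT apply this to real models for you.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained projection plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False           # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # starts as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        # Original projection plus the small low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")
```

For a 4096-by-4096 layer at rank 8, only about 0.4% of the parameters end up trainable, which is where the roughly 1% figure comes from.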

RD I've heard of a new trend of merging models. How does that work? 

ML That's an additional step. When you've done your pre-training, your supervised fine-tuning, your reinforcement learning from human feedback, it turns out you have half a million models on the Hugging Face Hub. What can you do with such a variety, such diversity of models? An answer is, “Oh, we can just take the parameters and merge them together.” And people realized that it produces pretty good models, actually. By merging two or more models together, you can get better performance. 

BP This is like a Pokémon or something. You just put two together and see what comes out and sometimes it's great?

ML It's a bit of alchemy right now because we do not really have a principled approach to doing it. We just know that it works. 

BP Cool.

RD Is it just sort of like getting to a statistical mean with the training or is there something else going on?

ML There are different techniques. So you can have something very simple like averaging the parameters, so you really average the values because it's the same architecture. You're going to use the same architecture, the same type of models, and you can do a linear interpolation between these values. Then, a bit more advanced, you can do a spherical linear interpolation that is going to be a bit nicer and preserve some nice properties in the models. And then you have more advanced techniques that are going to try to retrieve really the most important parameters in each model and combine those instead of combining everything. It works really well in practice. 
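
A minimal sketch of the simplest technique mentioned here: a linear interpolation of parameters between two models with the same architecture. The checkpoint names in the usage comment are placeholders; tools like mergekit implement SLERP and the more selective merge methods Maxime alludes to.

```python
import torch

def linear_merge(state_dict_a, state_dict_b, t=0.5):
    """Interpolate parameter-by-parameter between two models with identical architectures."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        merged[name] = (1 - t) * param_a + t * param_b
    return merged

# Usage sketch (assumes two fine-tunes of the same base model):
# model_a = AutoModelForCausalLM.from_pretrained("some-finetune-a")
# model_b = AutoModelForCausalLM.from_pretrained("some-finetune-b")
# merged_weights = linear_merge(model_a.state_dict(), model_b.state_dict(), t=0.5)
# model_a.load_state_dict(merged_weights)
```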

BP So one of the things that it seems like we're evolving towards is taking the strength of an LLM, which is the ability to be creative– Andrej Karpathy called them dream machines, not search engines– and then adding that in a loop with other agents that have some kind of control net or more symbolic, structured rule based AI. And if you combine those two things before you deliver the output, you can deliver something that is often more accurate. It still can be novel and generated on the fly by the LLM, but it has been sort of fact-checked or critiqued through this chain of thought. What's your take on that trend, and maybe to the degree that you can explain it to a layperson if you were writing a guide to it, how do these other systems help to guide and shape and sort of critique and improve what an LLM does? 

ML There's a lot of research and a lot of interest in the kind of process that you describe. I would call it post-processing. You really have an answer, and what do you do with this answer, basically? So a way of doing it is using the chain of thought you mentioned. Chain of thought is quite a basic idea. It's a complicated term to just say, “Let's think step by step,” to the model and maybe provide some examples, because we realized that models perform better if you provide the kind of reasoning that you're expecting from them. So that's one technique, just good prompt engineering to be able to maximize your accuracy. And then you can do a lot of different things. You can do self-consistency checks: you can get multiple answers and compare them and maybe you have a voting system. So if you ask the model, “1 + 1 is equal to,” and it answers four times “2,” and once “3,” you say, “Okay, probably 2.” And on top of that, you also have grammar sampling, where you constrain the tokens that the model is going to output. One application, for example, is to play chess. I made a little chess tournament with LLMs and instead of letting them output really anything, I constrained them so they can only play chess moves that are possible at this moment. 
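
As one concrete example of this post-processing, here is a hedged sketch of a self-consistency vote: sample several answers and keep the most common one. The generate_answer function is a stand-in for whatever sampled model call you are using, not a real API.

```python
from collections import Counter

def self_consistent_answer(prompt, generate_answer, n_samples=5):
    """Ask the model several times (with sampling enabled) and take a majority vote."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common, count / n_samples

# Example from the conversation: if the model answers "2" four times and "3" once
# for "1 + 1 is equal to", the vote returns ("2", 0.8).
```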

BP Legal moves. 

ML Yeah, legal moves. Thank you. 

BP Another thing that I'm curious about from your perspective is what does it take to get this to a place where a user of this system would feel satisfied with it? We were on a call with someone from Pinecone who works on vector databases and understands this stuff really well and was saying that a lot of people are very excited about this. They maybe fine-tune a model or they do some RAG work and they create an application and then they're disappointed because it doesn't feel as robust as ChatGPT. And I think that should be obvious on its face, but the question that I have, and Ryan just wrote a great little article for us about what is the smallest model you can build that still produces useful output. Is it 7 billion and it does children's stories? Is it 7 billion but has been trained on Stack Overflow questions so its coding ability is pretty good? If you look at this and you wanted to bring this into your organization and you wanted it to be useful, would you say to them, “First of all, you're not going to build this from scratch unless you're a Fortune 500 company,” let's start there. Maybe you're going to buy it off the shelf from somebody who can build this for you, but you still have to integrate it into your tech stack. You're going to use an open source model and fine-tune it on your own data or you're going to build a RAG system. But realistically, how many months is that going to take? How many staff are going to have to work on this? How much money are you going to have to invest? And how much prompt engineering are you going to have to do so that when you finally move this from test to live, customers don't get to buy your car for a dollar? I can't remember some of the other recent examples of things where once these things get out into the wilds it's very easy to find the edge cases and the flaws and the prompt hacks almost immediately. 

ML Absolutely. I think it really depends on the use case and the money that you can put into it. The bigger the model, the better the results, basically. I'm particularly interested in the 7 billion parameter models because I think this is where it starts to become really interesting. Plus it's very cheap and you can do a lot of stuff with it for less than $10. So it doesn't cost a lot of money to train, to run, or to evaluate. But if I have the budget, I would go for models like Mixtral, which is a mixture of experts, or like LLaMa 2 70B. And then indeed, I think the difficult part is not training. Training is quite easy. There are a lot of great tools to do it like Axolotl. Evaluation can be a bit tricky, but once again, if you do it for a precise purpose, you can probably design a good benchmark. The real pain point is data and creating a high quality data set, and I think this is where you're going to spend the most time, or you should spend the most time, because data is really critical and if it's not of the highest quality, the model will suffer from it. 

RD So on the data, with some of those smaller models, they had synthetic data. The Phi model was trained on synthetic textbooks generated by other models and it seemed to work out pretty well. Do you think that at some point we're going to have LLMs all the way down or is there a sort of drawback to just training on pure synthetic data? 

ML Synthetic data is a really big trend. Hugging Face also released their own synthetic data set a few days ago and I think that it's really promising. You have great results training from synthetic data. I would say, yes, I'm a big believer in it. About the drawbacks, I don't know. There are also some things that are a bit strange about it. For example, if you ask Phi, “Who designed you?” it’s going to say it's OpenAI, but it's a model made by Microsoft. 

BP Ah, same difference. 

ML So it's a bit strange. And Phi 2, for example, when you evaluate it, you also get weird results, and that might be due to the fact that it's been trained on synthetic data. Basically it performs better on benchmarks than in real life, which is also a bit sus, but these models are difficult to evaluate in general and maybe this is just a way also for us to consider other benchmarks to capture its performance in a more accurate way. 

RD The benchmarking question seems to be a big one where there are the kind of standard human eval stuff, and then there's, is it functional, does it pass unit tests, does it make sense to a person? 

BP Does it write unit tests? They can do that now, I heard, too. 

RD Do the unit tests pass unit tests? But I wonder if there's some sort of convergence toward an accepted standard for benchmarking.

ML Not at the moment, unfortunately. A lot of people are working on this problem. You have the Open LLM Leaderboard, which is kind of the gold standard of the open source community. Unfortunately a lot of models are contaminated; a lot of data sets contain some test sets, so that's not good. It's mostly TruthfulQA, so that's a tricky problem. And then you realize that, actually, this is quite narrow, and you have other benchmarks like MT-Bench that try to capture the performance of the model in a multi-turn conversation, which is absolutely not captured by the other benchmarks. So maybe you have a great model to do math but it cannot have a conversation properly. So right now, I would say try to diversify the benchmarks and see if your model performs well on all of them, because if it only performs well on one benchmark, it's not going to be a great model overall.

BP So you mentioned data quality. That's something we care a lot about at Stack Overflow. It seems like we're in a lucky spot because we have two things. One– the way the data is organized is almost in a Q&A format already. And then it has this metadata that says, “Well, this one was the most upvoted,” and you can look at this as a signifier of something. It has a recency score and things of that nature. If you were trying to bring in, let's say, all your company's proprietary data to work with a Gen AI system and it was just documentation like a code base and a bunch of wikis and a bunch of FAQs, is there any way for the model to score and understand which of that data is accurate or up to date, or that has to all be human labeled and annotated beforehand? If the model is going to train on all your internal data so that it can do great stuff on it, or even if you're going to do RAG with the internal data and you're going to be pulling from that, how does it know, of the data that it's looking at, what's factual and what's up to date?

ML You can ask GPT-4 and magically solve this issue. But no, I see what you mean. It's particularly striking with Stack Overflow because it immediately produces a high-quality data set where you have instructions and answers. In general, it's not like that. In general, indeed, people have wikis and they don't have instructions, they just have a lot of text and they don't really know what to do with it. So one way of doing it is to continue the pre-training phase, so you try to predict the next word again. Otherwise, and probably you want to do that too, you can reformulate the text from the wiki into questions and answers. And to do that, you're not going to do it manually. It's way too time consuming, so you're going to ask another LLM to do it for you. And then you can also post-process and evaluate these samples to make sure that they have high quality. And to do that, you have a lot of different techniques that are available now, but one very common one is just to ask GPT-4 to score your samples, which is not obvious because it's really bad at producing numerical outputs, so you have to be a bit clever in the way you prompt it and say, “Add one point if this is present, add one point if that is present,” but it works pretty well as a heuristic. 
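
Here is a rough sketch of that LLM-as-judge scoring step, assuming the OpenAI Python client. The rubric wording and the point system are invented for illustration, not a prompt from the episode; real pipelines also need to handle cases where the judge doesn't return a clean integer.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

RUBRIC = (
    "Score the following question/answer pair from 0 to 5. "
    "Add one point if the answer addresses the question, one point if it is factually "
    "consistent with the question's topic, one point if it is complete, one point if it "
    "is concise, and one point if it is well formatted. Reply with the integer score only."
)

def score_sample(question: str, answer: str) -> int:
    """Ask a stronger model to grade a generated instruction/answer pair."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```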

RD I want to go back to the RAG stuff. I think one of the things I saw recently is that, even with RAG, you can run into some issues where the data is contradictory or you have data that doesn't quite give a good final answer. Are there post-processing/extra-processing layers you can add on top of RAG to sort of ensure that it gives a good answer rooted in data and doesn't give an answer if it doesn't have the answer? 

ML There are a lot of different techniques that you can implement on top of it. One is to check whether you actually retrieved any relevant context. When you retrieve context, you have a similarity score to make sure that it's actually close to what you're asking, and maybe you can just set a threshold: if the similarity score is too low, you don't have the right context. If you don't have the right context, either it's in the knowledge base of the LLM, or you can just say, “Hey, we could not retrieve the right context, but here's the answer,” or maybe not output the answer at all. I think it really depends on the use case. And then post-processing the answer to evaluate its quality could be another way of doing it. And if you have users, it can also be really nice because you can ask them if the answer is correct or not, and you can build a preference data set that you can later use to fine-tune the model once again based on the user feedback.
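
A minimal sketch of that similarity-threshold check. The threshold value, the generate callable, and the refusal message are placeholder choices; the point is simply to refuse rather than answer when nothing in the knowledge base is close enough to the question.

```python
def answer_with_rag(question, retrieved, generate, threshold=0.75):
    """retrieved: list of (similarity_score, chunk_text) pairs from the vector store.
    generate: a callable that sends a prompt to whatever LLM you are using."""
    relevant = [chunk for score, chunk in retrieved if score >= threshold]
    if not relevant:
        # No chunk is similar enough: refuse instead of letting the model guess.
        return "Sorry, I couldn't find any relevant context for that question."
    context = "\n\n".join(relevant)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```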

BP So Maxime, obviously you understand this stuff really well, even down to the math. You play with it and you build with it. What in your opinion is sort of the next step towards AGI, and to what degree does an LLM have artificial intelligence? It's not just a stochastic parrot, it's not just guessing the next token. It can look at and reason over language in ways that are pretty impressive. It has emergent capabilities that it wasn't designed for, like playing chess or doing arithmetic. That is artificial intelligence of some variety, and to what degree– I know there's lots of issues with the MMLU or whatever– but if it can be an expert in many, many, many areas where most humans are not that expert, what am I missing here in terms of what it can and can't do compared to us?

ML It's quite limited. There's a lot of things that it cannot do and people like to focus on those to say that they're actually pretty dumb. I think it depends on how you look at the situation, because they're really useful. I use them every day. So I would be very sad if they did not exist anymore. That's one way of seeing it. 

BP You think of them as intelligent assistants. They provide you with intelligence when you need it applied somewhere. 

ML Yeah, exactly. You can interact with them in natural language and that's something that still surprises me to this day because it's still quite new. On the other hand, there are a lot of tasks where they're quite bad, like math, for example, quite bad at math. I think that the transformer architecture that has powered these systems since 2017 might not be the best one. Especially if we're talking about AGI, it's probably not the right architecture to do it. It's just one stepping stone. I'm really curious about the other architectures that we are going to design in the future that can replace this transformer architecture, because I think that there's a lot that can be done in this space, and that will probably make models a lot more efficient. You will not need such big models to do the same task. And they would be able to also scale better, because right now the scaling laws are pretty brutal. You can throw a lot of compute at the problem, but it's not going to improve the model that much. 

BP That's really interesting. I heard somebody discussing this the other day. We can't know what's really going on, but they were sort of saying that we know adding more tokens, the matrix multiplication, it's just a power law and it becomes very expensive. And then Google comes around and says, “Well now, actually, we can do 10 million tokens.” They're not saying how much compute and how much money they're spending on it, but they're saying that the context window will be a million tokens or 10 million tokens, which seems like a total step function change from what we were talking about before. No? You're shaking your head.

ML Yeah, because actually it's a bit unfair, but there was a paper, I think from Berkeley, that was released a few days before Gemini and this claim, and they also had 1 million tokens in the context window. So actually, we already know how to do it. It's not new; there's a technique called ‘ring attention’ and you have a multi-step process to gradually increase the size of the context window. So I think this is like a bag of tricks and we learn new tricks all the time and we implement them, and this makes the models a lot better over time. 

BP I was using that as an example of some kind of architectural evolution or improvement that's allowing us to suddenly have a model that can be a lot more capable and without, as people maybe thought, some huge increase in cost, for example.

ML No, that's true. We really learned to be a lot more efficient in a lot of things. For example, the attention mechanism that is at the core of the transformer architecture used to be quadratic and now it's linear, thanks to a lot of different improvements. So it really shows that you can bend the scaling laws quite a lot.

BP Yeah, that's an excellent point. Very cool.

[music plays]

BP All right. Well, I want to say thanks so much for coming on. As we do at the end of every episode, we want to shout out somebody from Stack Overflow who came on and contributed a little knowledge. Thanks to Nikhil who earned a Lifeboat Badge for saving a question with a great answer. “How do I efficiently compare two sets in Python?” Nikhil has the answer for you, and we've helped over 57,000 people, so appreciate it, Nikhil. As always, I am Ben Popper. I'm the Director of Content here at Stack Overflow. Find me on X @BenPopper. If you want to come on the program or you've got questions and suggestions, email us, podcast@stackoverflow.com. And if you liked what you heard, leave us a rating and a review. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find the blog at stackoverflow.blog, and if you want to find me, you can go on X and look for RThorDonovan. 

ML Thanks for having me here. If you want to know more about LLMs or are curious about the topic, you can find me on Twitter @MaximeLabonne and also on LinkedIn @MaximeLabonne. Thank you. 

BP Awesome. Thanks for listening, and we'll talk to you soon.

[outro music plays]