Ben and Ryan talk with Jonathan Frankle and Abhinav Venigalla of MosaicML, a startup trying to make artificial intelligence efficient and accessible for everyone by lowering the cost, time, and complexity it takes to train a powerful AI model.
MosaicML is a platform for training and deploying large AI models at scale. Explore their docs, check out their blog, and keep an eye on their open roles.
Jonathan Frankle is the Chief Scientist at MosaicML and an incoming Assistant Professor of Computer Science at Harvard.
Abhinav Venigalla is the NLP Architect at MosaicML.
Today’s Lifeboat badge winner is singmotor for rescuing How to remove columns with too many missing values in Python from the dustbin of history.
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, your host here, joined as I often am by my colleague and collaborator, Ryan Donovan, Editor of our blog, commander of the newsletter, occasional podcast host.
Ryan Donovan How are you doing, Ben?
BP I'm okay. So at Stack Overflow we have a ton of focus these days, a whole tiger team internally, thinking about how we can make the most of the recent wave of generative AI models, LLMs– large language models, and what we can do with our data that would help to better serve folks on the public platform who are coming with questions and looking for answers, or help folks who are using our Teams product and have a big knowledge base inside of their company. So that means a lot of considerations of, “Well, how do we train a model? Who should we go with? Which model should we use? Where is our data safe? Should we build vector databases and vector search? And what are the tradeoffs in terms of cost and latency, quality, all kinds of things like that?” So one of the companies that I saw making some news in this space, and obviously being utilized by lots of folks, was Mosaic, and so we wanted to invite them on. Today, we're lucky enough to have two guests with us: Jonathan, who is the Chief Scientist over at Mosaic, and Abhi, who is an NLP Architect. So I wanted to say, welcome, both of you, to the Stack Overflow Podcast.
Jonathan Frankle Thank you for having us.
BP So Jonathan, let's start with you. What brought you into this world? I think a lot of people maybe outside of tech didn't think too much about AI or certainly about chatbots until recently, but how long have you been working in this field and what led you to where you are today?
JF Yeah, so I've been in this field about five years now, maybe a little more than that at this point. And I kind of wandered my way into it during my PhD at MIT a few years ago where I was working on cryptography at the time, and just the idea that you could accomplish tasks that are very hard to actually write a computer program to solve got me very excited– this idea of approximation, the idea that neural networks can do all this fuzzy computation. And I ended up finding my way into Mosaic a little ironically. I got a cold email from our now CEO saying, “Nice paper, want to do a startup?” and that's how I ended up at Mosaic. So definitely check your spam folder, everybody. That's my recommendation to you about having these fun experiences.
BP Okay. arXiv is the new job board apparently. And so as Chief Scientist, what does your day-to-day look like? Give folks who are listening just a little bit of background on where Mosaic is positioned within this rapidly changing ecosystem and what are you focused on? Are you managing a team? Are you doing research? Are you thinking about how to train the next set of models? What is your day-to-day?
JF So my day-to-day is usually I wake up and then I start going to meetings and then 6 o'clock rolls around and I finally… No, so I supervise our research team. So we have 15 or 20 researchers now focused on trying to understand not only how to train the latest, greatest stuff, but how to do so in a really easy and cost effective way. Our company's slogan is to make deep learning efficient and accessible for everyone, and we really truly mean everyone. That means driving down cost and that means finding paths through the space that anyone can use and be successful. Our overall business to a first approximation you can think of as training large models for contract. We aren't trying to train one model to rule them all. We're a big believer that the world will be dominated by 100,000 models, each with a specialized domain, lots of specialized models and proprietary data doing very specific things. We're not believers in the idea of one model to rule them all. And so with that in mind, our goal is to enable if there are going to be 100,000 models, 100,000 people have to be able to make those models. So cost has to be low enough that people can do this, the tools have to be good enough that people can do this, and the kinds of things that you can get away with when you're at a big research lab when everybody's got a PhD in Machine Learning and you can tolerate some pain, that's not going to cut it at scale. So we think a lot about how to just reduce those barriers to entry and make things work really, really well for everybody.
BP Very cool. And Abhi, how about yourself? What brought you into this world and what do you focus on day-to-day?
Abhi Venigalla So actually out of college I went to a company called Cerebras, researching and working on ML hardware. And I was working with the algorithms team there, thinking about different ways, actually very similar to Mosaic, to reduce the cost of training models using the hardware. I actually had a similarly interesting intro to Mosaic. I met Naveen on Twitter and then kind of had my interview during that whole Covid era of Zoom interviews and stuff. And it ended up being really interesting because a company focused purely on algorithms and research to reduce training using any hardware, like whatever turns out to be best, that was kind of the pitch that I joined on. So the company has grown a lot since then. We started off doing kind of basic efficiency research on lots of different models. Now we're really focused on the end-to-end of helping people actually train their own LLMs that are useful for them. And so a lot of what I work on nowadays is trying to optimize not just the training efficiency of our stack, like, “How many dollars does it take to train a 70 billion parameter model?” but also trying to deliver in a way such that even data scientists at larger corporations who aren't necessarily ML research scientists can still actually train these models on their proprietary data and build private models that are good for them.
RD So when we hear about the newsmaker LLMs, we always hear that they take millions and millions of dollars to train x number of trillion parameters or whatever it is. How do you all actually reduce the costs? Smaller models, is it better algorithms? What's the secret sauce– that you can give away?
AV Yeah, I would say in terms of driving down costs a big portion is actually just a little bit of myth busting and making the software more accessible. About half a year ago, sometime in September/October, we started publishing these kind of tables of, “Here's the real cost to actually train, say, a GPT-3 quality model or a Chinchilla quality model.” And even just using the right software in the right way, it turns out it's not millions of dollars. It's often hundreds of thousands or less. On top of that, our research team does actually work on a lot of the algorithms, basically to figure out if we can actually optimize in fewer steps or less data or smaller models trained longer.
JF Yeah, I think we've put out a lot of pretty astounding looking speed up numbers over the years, everything from ResNet-50 on ImageNet for 7x cheaper, recently did Stable Diffusion 2.0 for I think 6x cheaper than what Stability quoted it at. And I wish I could tell you the actual secret thing that made all that really fast. The answer is 5% here, 5% there. It's really blood, sweat, and tears of just an incremental improvement, an incremental improvement, and it adds up really quickly to get to these huge numbers. And I think you'll see the same thing from us on large language models over the next few months. It's going to be a little of this, a little of that, and the costs are going to drop pretty significantly to get to great capabilities.
BP Because I'm the content guy, I did go check out your blog before this and I loved the thing– train Stable Diffusion for $160,000. Then you put the big, ‘now on sale X’ through it, now train Stable Diffusion for just $50,000. So bringing down the cost between part one and part three of that blog post even.
JF Yeah, it actually went down 25K in the first few days after that blog post. We found a bug that had made things even slower. So it just keeps coming.
AV It's really funny too, every time we publish this, someone ends up reaching out to us either through Twitter or Slack saying, “Hey, did you try this? I found another speed up, and another one.” So it just keeps adding on.
BP All right. So I know, like you said, you can't give away the secret sauce and some of it is just sort of this accretion of small things here and there, but let's go high level for a second. A client comes to you, they work in finance or they work in law or they work in Stack Overflow. We have a big dataset and we want to understand what's the best way to get an in-house model where we feel like our data is secure. And I think one of the other things you said is we own the weight, so we're the one who's going to be able to, if we wanted later, to even port this to another vendor. What are the basic steps we're going to go through to get that done? And without giving away how you do it, where in those steps are you looking to maximize efficiency or cost or speed or various things like that?
JF Yeah, so Abhi has a checklist for this, so there's a legit process.
AV Yeah. So I think a lot of times we try to scope with a customer what the constraints are in terms of inference. What size model are you looking for? A 70 billion parameter model or a 7 billion or 3 billion or so on? And then also questions of what's your training budget and how much data you have. And we've done this process several times, and each time with a new customer you learn more and we just get better and quicker at it. What I'd say is that we usually try to scale you up towards the largest hero run that you're targeting. And along the way, we're measuring the eval metrics you care about, we're letting you know ahead of time what the costs are going to be, what the right optimization settings are, and so on. So by the time it gets to the actual hero run it pretty much goes smoothly. And we really emphasize this kind of stability with our training. As we describe in our latest MPT-7B blog post, thanks to our whole stack, we're actually able to train these models without any human intervention, and that might be a bit out of left field for people who aren't training LLMs every day, but a lot of times the hardware crashes or you have network issues or optimization problems. We spent a lot of time trying to debug all those so you can actually hit enter and go to sleep. Nobody needs to be babysitting the run and it actually gets to where you want to go.
RD I think one of the things a lot of folks are thinking about, and we're thinking about too, is the proprietary data that people want to train the LLMs on. How do you make sure that that's not visible to you or your team, that it’s protected?
JF It's pretty simple. We'll run it on your GPUs and so you don't have to worry about it. We can set up all of our infrastructure within your cloud VPC, or even on-prem if you really want to. So you don't even have to worry about it. And if you don't have GPUs, we have some of our own that you can use, and the idea is that we never store any of your data. We've worked very hard to write high performance streaming libraries so that your data will get streamed in, trained on the model, the checkpoints will get streamed out. Nothing is stored locally and everything is ephemeral. So even in those scenarios where you really need to use our infrastructure, we've still taken every precaution to make sure your data is safe. I don't want to store your data because that's risky for me. I'd rather you get an awesome model and I have as little involvement as possible.
BP All right, so let me phrase this a different way. You're sort of saying, “Well, it depends on what you want– 3, 7, 70 billion parameters,” and stuff like that. But what if the client doesn't know? One of the things that's been interesting now is seeing as we look at different approaches and what came out of Facebook with LLaMA/Alpaca, a smaller model that uses a pre-trained set from an earlier bigger model can sometimes get to parity. So if the client doesn't know, do you give them a recommendation or do you have a viewpoint on sort of like, “All right, well, based on your dataset and your budget, here's what we would do.” Do you sort of give them a menu of options or some guidance about how they would do it? That's question one. And question two is, what do you make of the really rapid changes in the size, cost, and complexity needed to create a model that can be quite capable because it's standing on the shoulders of giants, if you will– the training that's come before it?
JF To that first question of how do I tell someone how big to go, it's actually really nice. If you start to look at the scaling laws, the cost of training the next step up increases quadratically. So you can kind of climb the ladder. Start with the small one, see if that's useful to you. If you're getting good results, you're seeing some progress and you're seeing return on investment, try the bigger one and spend a little more. And if you're seeing return on investment in that, try the next bigger one and spend a little more. I don't ever want a customer to come to me and say, “I'm going to spend $10 million on this model right off the bat.” What I'd much rather see is to take it one step at a time. Let's run into all the issues at the smallest possible scale we're going to run into them. There are always devils in the details with datasets and evaluation and all sorts of other fun stuff that is the real data science and machine learning problem. And keep going until you decide it's no longer worth it to you. And so far, I don't think we've had a customer who stopped yet. They keep finding value and going bigger. I'm sure that won't be true forever. There is some point at which it's no longer worth it, but stop when you stop seeing return on investment. And hopefully for both of us, it's a long way away because that means we're doing good business and you're getting something worthwhile.
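The quadratic growth mentioned here can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the conversation: it assumes the common approximation that training a dense transformer takes about 6 × parameters × tokens FLOPs, and a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. Under those two assumptions the token count grows with the model, so compute grows with the square of the parameter count. The function names and the ladder of model sizes are hypothetical.

```python
# Rough sketch of why each step up the model-size ladder costs quadratically more.
# Assumptions (not stated in the conversation):
#   - training compute C ~= 6 * N * D FLOPs (N = parameters, D = training tokens)
#   - a Chinchilla-style budget of about 20 training tokens per parameter

TOKENS_PER_PARAM = 20  # assumed rule of thumb


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return 6.0 * n_params * n_tokens


def compute_optimal_flops(n_params: float) -> float:
    """Compute when the token count grows in proportion to the model size."""
    return training_flops(n_params, TOKENS_PER_PARAM * n_params)


if __name__ == "__main__":
    # Hypothetical ladder of model sizes, smallest first.
    ladder = [1e9, 3e9, 7e9, 30e9, 70e9]
    base = compute_optimal_flops(ladder[0])
    for n in ladder:
        flops = compute_optimal_flops(n)
        print(f"{n / 1e9:>4.0f}B params: {flops:.2e} FLOPs "
              f"({flops / base:,.0f}x the 1B run)")
```

Doubling the parameter count under these assumptions quadruples the compute, which is why starting small and climbing only while the return on investment holds up is a cheap way to find problems early.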
BP And Abhi, just to sort of clarify the second question, I've seen a lot of interesting posts out there of people saying, “Hey look, I got GPT-4 running on a Pixel Phone, on a laptop. I was able to sort of take this thing that when, as Ryan pointed out, the company was originally creating it, required years and tens of millions and all this human reinforcement, but you can make a copy of it, a scaled down version of it, something that was trained on the training of it, and get some interesting results.” So what do you make of that and where's that leading us?
AV Yeah, totally. I think especially with seeing lots of people taking the LLaMA models or taking data trained off of ChatGPT responses and stuff, and building these smaller versions that work quite well. I think what we're seeing is that people have different independent niches that they care about. And when it comes to those independent niches, you're able to accomplish the quality you want in a much cheaper, more economical fashion. You don't necessarily need the one model that can do everything, because most people aren't trying to do everything in their applications and stuff. Some people just want a chatbot assistant. Some people want something that can write stories for them or something like that. I think some challenges that come with all of those parts of the ecosystem, the LLaMA models and even the ChatGPT data, is that a lot of it is not commercially viable. I think it's really exciting that people are building on top of these, but one of the things that we're trying to do is actually deliver some of these products in commercially licensed ways. So the MPT models for instance– we open sourced those for free and they are commercially viable, almost kind of like a drop in replacement for the LLaMA set. And you can imagine too that, especially with the inference service that we just launched a few weeks ago, we could also potentially deliver some of these things where, yes, you actually can train off of the generated data from our models and stuff like that too. So I think we're really excited about all these directions. We're trying to figure out how we can go from the prototype phase where lots of people are playing at home to actually enterprises using these things.
RD You talked about folks scaling up their models. When you had the difference between a 7 billion parameter and 70 billion parameter, is the change visible? Do new abilities become more accurate? And what exactly are the watermarks where you see that?
JF I think there's a bit of a misconception embedded in that question, which is saying 7 billion or 70 billion doesn't tell you enough information to evaluate the model in any way. It's entirely possible to spend more on a 7 billion parameter model than a 70 billion parameter model, train it for more flops and have a better result. Abhi has started doing this thing where he just describes not the size of the model or even the length of the training run, he just describes it in dollar amounts now. He'll say, “We're going to train a $50,000 model,” and whether you train a 70 billion parameter model for a certain number of steps or a 7 billion parameter model for way more steps, honestly the differences aren't that huge. Generally Chinchilla tells us there's a sweet spot for the right model size for any amount of training, but the penalties aren't that big for going off of that. And so 7B or 70B, now you’ve got to tell me what data was it trained on and how long am I training? If you train a 70B for one step, my 7B that I trained for a trillion steps is going to be way better. So there's context there.
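The point about describing a model by its dollar cost rather than its parameter count can be made concrete with a small calculation. This is only a sketch under stated assumptions: it uses the common approximation of about 6 × parameters × tokens training FLOPs, and the GPU price and sustained throughput are hypothetical placeholders, not MosaicML figures. The function names are made up for illustration.

```python
# Sketch: the same dollar budget buys very different training runs.
# Assumptions (not from the conversation):
#   - training compute C ~= 6 * N * D FLOPs (N = parameters, D = training tokens)
#   - the price and throughput figures below are hypothetical placeholders


def flops_for_dollars(dollars: float,
                      dollars_per_gpu_hour: float,
                      flops_per_gpu_second: float) -> float:
    """Total FLOPs a budget buys at a given price and sustained throughput."""
    gpu_seconds = dollars / dollars_per_gpu_hour * 3600.0
    return gpu_seconds * flops_per_gpu_second


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Training tokens a FLOP budget affords for a model of a given size."""
    return flops_budget / (6.0 * n_params)


if __name__ == "__main__":
    budget = flops_for_dollars(
        dollars=50_000,
        dollars_per_gpu_hour=2.0,      # hypothetical price
        flops_per_gpu_second=1.5e14,   # hypothetical sustained throughput
    )
    for n_params in (7e9, 70e9):
        tokens = tokens_for_budget(budget, n_params)
        print(f"{n_params / 1e9:.0f}B params: about {tokens:.2e} tokens "
              f"for the same spend")
```

For a fixed spend, the 7 billion parameter model sees ten times as many tokens as the 70 billion parameter one, so the parameter count alone says little until you also know how long and on what data the model was trained.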
RD That's interesting because when I hear about language models they're always x number of parameters. So why do people think that parameters are valuable metrics?
JF Because they used to be back in the days before we really understood the dynamics of this. And really the LLaMA models changed the conversation a bit by showing us that we could do things that were somewhat suboptimal and still get really great results, but results that are in a much smaller package that we can really use for inference. So I wouldn't call it a paradigm shift, but it was nice for everybody to realize that just because there's something optimal doesn't mean that it's that bad to be suboptimal. And being optimal for one thing– training, can be very suboptimal for another thing– inference.
BP Yeah. I saw an interesting little chart from Andrej Karpathy that was talking about how you would evaluate the utility of a model compared to, say, talking to a human being. And it was kind of like, “Zero shot prompting: asking a stranger a random question. Multi-step prompting: asking an expert a reasonably thought out question. Shot prompting with chain of reasoning: now you're really talking to a college professor who can help you.” And then, this isn't a separate sort of modality, but then they were saying that the amount of fine tuning, the amount of flops, the amount of reinforcement learning with human feedback is going to get you better qualities. So what I think you're saying is that we can't just talk about size anymore, or even flops anymore. There's different things that can lead to a great outcome, and maybe it depends also on what you're looking for– something that's super generalized to the whole universe of a chatbot that knows everything or something that's really good at law and is going to help these lawyers save x number of hours a day. So do y'all have models internally that are sort of yours that you're building that you think are interesting which then people can use in a commercially viable way as opposed to just training other folks’ stuff?
JF Yeah, we do both.
AV Yeah. And I'll just say that the first one of these that you're seeing is kind of the MPT one that we released. So some of our early customers would train their whole models from scratch. You can see Replit doing this for code and stuff. But there's another group of customers who would love to start from a relatively strong starting checkpoint and continue from there, and that's kind of what our MPT-7B model is.
JF Yeah, I kind of think of the 7B as a bit of a demo track. We train a lot of models for contract which means many of them don't see the light of day. They're being used for internal use cases at a lot of big companies. So how do people know that we can actually train good models? We put out that 7B to say, “Hey look, we're serious.” The 7B I'll say is the baby of the family. We shared that one with the world. It definitely has some bigger, badder siblings that are available for our customers, but really it's a demo. A lot of folks do want to train from scratch. They want to have complete control over the pre-training data. They may disagree with one choice that we've made. Everybody disagrees on these choices because nobody really knows what the right thing to do is legally or just from a quality perspective. And it's nice for us to be able to show the world what we've got and give customers the option.
BP And that one, it's licensed for commercial use, and I think another thing you had said was that it can handle a ton of tokens. So if somebody says, “Well, I want to bring in my company's entire documentation for this codebase or something,” it might be able to handle it, whereas that's not always true for some of the other models that are out there.
JF Yeah, we're built for heavy-duty fine-tuning. I hate the word fine-tuning. I've started calling it further pre-training at this point, because when you have a hundred billion tokens, is it really fine-tuning at that point? And they're built for long context lengths. We chose to use ALiBi in such a way that basically you can use as long of a context as you can fit on the GPU. I think internally we've played up into the 80,000’s.
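For readers unfamiliar with ALiBi (Attention with Linear Biases), the idea is to drop learned position embeddings and instead subtract a penalty from each attention score that grows linearly with the distance between query and key. Because nothing in the model is tied to a maximum position, the context length is limited mainly by memory. The sketch below follows the published ALiBi formulation for a power-of-two number of heads; it is a plain-Python illustration, not MosaicML's implementation, and the function names are hypothetical.

```python
# Minimal sketch of ALiBi attention biases in plain Python.
# Assumption: the number of heads is a power of two, for which the slopes
# form a geometric sequence starting at 2 ** (-8 / n_heads).


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to attention scores for one head with causal masking.

    bias[i][j] = -slope * (i - j) for keys at or before the query (j <= i);
    future keys (j > i) are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    print("slopes:", slopes)
    # The same function works for any sequence length; no position table
    # has to be resized or retrained.
    for row in alibi_bias(4, slopes[0]):
        print(row)
```

The bias matrix is computed from the sequence length at run time, which is what lets a model trained at one context length be run at a longer one, subject to GPU memory.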
RD So I wanted to get your thoughts on some of the speculative stuff I've heard, because every time I open up my Google newsfeed there's people talking about artificial general intelligence or wringing their hands about the singularity. What's y'all's take on the general intelligence speculation? Are we developing sentient computers?
JF Abhi knows that I have very strong opinions on this one, so I will share them and Abhi can walk this back so I don't get tomatoes thrown at my windows. The metaphor that I hear among my friends a lot is that using neural nets to get to AGI is like building a ladder to the moon. Just because we're making progress and getting high in the sky does not mean that that's how you get from Point A to Point B. That is the polite version of what I think of the conversation.
AV Yeah, I feel similarly to Jonathan. I'd say a more tractable thing is just that it's getting really hard to evaluate these models, I think. Maybe that's a slightly worrisome thing for me. We've gone to a point where trying to ask general knowledge questions, we're running out of datasets for that. At some point we have to validate our models just by comparing outputs from model A to B. And eventually, I guess, we'll have to compare model A to model B to humans as well, probably soon. So I think that part is tricky, but maybe we're not asking challenging enough things. You have to come up with more and more challenging tasks to evaluate them properly. But I don't worry too much about the AGI endpoint.
BP All right, I have one more question. It's interesting to hear you talk about how you've shown off, Jonathan, as you mentioned, sort of the baby, and customers can come in and see other versions. And Abhi had sort of said, “Let's not discuss this in terms of just flops, let’s talk about it in terms of cost.” And when Google showed off their stuff at I/O, they had these four models. They were all built on the same one, but they were different sizes for a smartphone versus maybe a home studio setup versus an enterprise customer or something like that. Another thing they mentioned that I've heard can kind of produce interesting jumps in performance behavior and emergent abilities is multimodal. Do y'all work on anything that's multimodal or just LLM and just text-based?
JF Oh, we're all over multimodal. I know that our LLMs tend to get the limelight. We have a stellar computer vision team. You just mentioned Stable Diffusion for under 50K, that's a multimodal model right there. GPT-4 multimodal is text and image to text. This is text and image to image. It's not a far cry from one to the other, to say the least. I don't want to say too much more than that, but we are very lucky that we have phenomenal computer vision researchers and phenomenal NLP researchers, and boy would it be a shame if we didn't work together.
BP Okay, very cool.
AV There is one other thing I think worth mentioning here, which is that I think we're at a bit of a challenging moment for the open source and academic research community. I've seen a lot of amazing labs that in past years shared incredible innovation back and forth and really enabled us to get to this point where we have this incredible technology, labs that I personally spent a lot of time at during my formative years, close down. They're not going to talk about what they do. There aren't a lot of us left in the industry world that are still doing things openly. There's us, maybe our friends at Stability and Together and Hugging Face. I think Mosaic is one of the biggest open industry labs left, and that's because we have 15 or 20 researchers and at least we're still open. So it is a challenging moment and I think it's really important for us that we continue to share what we learn, we continue to put things out there and try to support the community, especially now. As far as I'm concerned, I just think we need to keep in mind how valuable the open source community is, and I hope nobody here underestimates the power of the open source community in academic research. To my friends at some of the bigger labs, don't underestimate the open source world, and it's much easier to be a part of it than to fight against it.
BP Well, that's a good message for Stack Overflow, so we'll end on that.
BP All right, everybody. As we do this time of the show, I want to shout out a member of the community who came on and helped to spread some knowledge. A Lifeboat Badge was awarded to singmotor on May 13th. They came and found a question with a negative score, gave it a new answer, now that answer has a score of 20 or more, and the question has a score of 3 or more. “How do I remove columns with too many missing values in Python?” Well, if this has been a problem for you, singmotor has the answer and has helped over 45,000 people, so appreciate you dropping the knowledge. I am Ben Popper. I am the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. Reach us with questions or suggestions for the podcast, email@example.com. And if you like the show, leave us a rating and review. It really helps.
RD I'm Ryan Donovan. I am the Editor of the blog here at Stack Overflow, which you find at stackoverflow.blog. And if you want to find me on Twitter, I'm @RThorDonovan.
AV I'm Abhi Venigalla. I'm the NLP Architect here at Mosaic. And if you're interested in learning what we do, check out our blog at mosaicml.com.
JF And I'm Jonathan Frankle, I'm Chief Scientist. And what Abhi said.
BP Awesome. All right, everybody. Thanks for listening and we will talk to you soon.
[outro music plays]