Ryan Donovan and Ben Popper sit down with Jamie de Guerre, SVP of Product at Together AI, to discuss the evolving landscape of AI and open-source models. They explore the significance of infrastructure in AI, the differences between open-source and closed-source models, and the ethical considerations surrounding AI technology. Jamie emphasizes the importance of leveraging internal data for model training and the need for transparency in AI practices.
Together AI is a platform for building with open-source and specialized multimodal models. Check out their docs.
Connect with Jamie on LinkedIn.
Shoutout to user aryaxt, who earned a Stellar Question badge for asking MySQL Data - Best way to implement paging?
[Intro music]
RYAN DONOVAN: The 15th annual Stack Overflow Developer Survey is live, and we want to hear from you. Every voice matters, so share your experience and help us tell the world about technologies and trends that matter to the developer community. Take the 2025 Developer Survey now. The link will be in the show notes.
Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ryan Donovan, your humble host, and I'm joined today by Ben Popper. Welcome Ben.
BEN POPPER: Ryan, always a pleasure to be here, love hosting the podcast with you. Today we're gonna be chatting with some folks from Together AI, and there's a few pieces of the conversation I'm particularly interested in– combining different models, the value of open-source sort of in this ecosystem, and what people are thinking about when it comes to infrastructure and model development. You and I were on a call pretty recently with some of the folks from the DeepMind Gemini team, and they said something that I hadn't really heard before, which is like, we're not that worried about compute anymore. We're coming up with new techniques, we're able to merge models, and I was surprised by that.
RYAN DONOVAN: Without further ado, let's introduce our guest, Jamie de Guerre. He's Senior VP of Product at Together AI. How are you doing, Jamie?
JAMIE DE GUERRE: Doing great. Ryan, Ben, thank you so much for having me. I'm really excited to be on the podcast today. I'm a big fan.
RYAN DONOVAN: So top of the show, we like to get to know our guests, you know, see how they got into software and technology.
JAMIE DE GUERRE: As we were saying just before the show, I grew up in Canada and went to university initially thinking I would go into business, but found it to be kind of this wishy-washy introduction to everything without doing anything concrete. I really loved the computer science class I took that first year, so I studied computer science in school and realized that was my passion: being technical and working with technology. I ended up moving to the States right after university to join Microsoft as my first job, then spent 10 years working at startups and almost 10 years working at Apple on AI and machine learning related areas. Most recently, I joined Together AI as the Founding SVP of Product with someone I'd worked with for a long time, Vipul Ved Prakash, our CEO, as well as some of our other founders, like Chris Ré, who I met at Apple. It's been a really exciting ride for the last two and a half years here at Together.
RYAN DONOVAN: So in the initial reach out, we talked about connecting together 50 open-source AI models and outperforming commercial LLMs. That's a big claim. So how does that work?
JAMIE DE GUERRE: Together AI was really founded on the premise that in the future there wouldn't be one big model to rule them all. We felt that with the way AI research evolves, the AI community would innovate, and that the model creation process and the ability to reach high accuracy with these models would come from more and more places. Not only from big closed-source labs, who are doing incredible work and in many cases are also contributing to the open-source community, but also from open-source labs, open-source researchers, and the open-source community.
With that change, essentially, you'd have more and more leading models achieving really high accuracy, and the ecosystem that would build around open-source models would mean that you get more variations and combinations that achieve even higher quality. And I think that we're seeing that today, you know, in the last few months you've had a tremendous number of open-source models released that are by and large at the same level as the leading closed-source models. And Together AI makes all of the leading open models available on our platform.
And increasingly, like you're mentioning, we're seeing customers use them in combination to optimize for different things. You may want to optimize for the lowest cost, you may want the lowest latency or fastest performance, you may want the highest accuracy. And by using a set of models with a set of different systems together as an agentic system, we see customers achieving those goals the best.
BEN POPPER: And so when customers are coming to you, they're typically hoping to run a custom model that maybe is using in-house data. And so they don't want to just use something off the shelf or access the big frontier lab models through an API. Like why come to you in the first place, I guess is the question.
JAMIE DE GUERRE: Together AI was built on the goal of providing the fastest performance and best efficiency of any AI platform. You know, as we look at that thesis where the future is not one big model to rule them all, it's many, many models, where does the value accrue in that world? The models essentially become a little bit more commoditized. I don't wanna say totally commoditized because there's incredible work happening on those, but if there's many models that can achieve similar goals and many of them are kind of open-source, we think that a lot of value accrues to the infrastructure and the system that's hosting those models. One of the reasons for that is hosting and operating a generative AI application is not just a little bit more expensive than a traditional service that an enterprise might be hosting or a startup might be hosting.
If you look at sort of the CRUD stack of a web server or a set of web servers and a set of databases, a single user request to a generative AI application, compared to a user request to that stack, is something like 10,000 to 100,000 times more expensive. Making it more efficient to leverage that infrastructure and providing faster performance, we think, would be incredibly crucial, and that's kind of our reason for being.
And so to your question of why companies come to us: as they get past sort of a prototyping and experimental stage and have something that is really getting traction with real usage, they start to run into a lot of challenges. There's challenges with dealing with the performance or rate limits from the model provider. There's challenges with the operating costs. There's challenges with the consistency of accuracy and behavior.
And all of these start to be different reasons that they look at having more control over their investment in generative AI. You know, if you're just using a closed-source model off the shelf, you're kind of outsourcing it, you're beholden to whatever happens there, and many times it's fantastic and sometimes it starts to have issues that you want to be able to improve, and the ways you can improve it are limited. You switch to open-source models, you start to invest in having a team that really has expertise around this, and you can better control the outcome. You can better control the accuracy through things like, you know, RAG and fine-tuning. You can better control the model performance through a number of techniques that we help customers with.
RYAN DONOVAN: With these open-source models, I know some of them are open-weights and not necessarily open-source, but I'm sort of confused about the difference between the two. Is it possible for a model to actually be fully open-source?
JAMIE DE GUERRE: You know, one of the values of openness is really around research and learning. So if you think about what would truly be the most open approach for a model release from that perspective, it's essentially making everything you did reproducible.
And so the most open model release would be, first, that the weights of the model themselves are open, so anyone can access the model, download it, and use it in different environments with a permissive license. Second is actually open data: releasing the data, or the data recipe, that you used to create the model. So if it was using publicly available data sources, you say the mixture of those data sources you used. And then third is actually being open-source, and I think that really refers to the training algorithms that were used: here is the code that we used to build this model.
And so now as a researcher, I can take this model release holistically and say, I can go and replicate this and improve upon it to continue to spur further research and innovation in the open community. Now, are all of those things necessary to be valuable in the open community? Absolutely not. That's the maximal, most open way to do a model release that really enables researchers to reproduce it. But many organizations are not looking to do pre-training of models. They just want to be able to download the model and run it in an environment that they get to choose, and have that ownership and control of the model hosting. And simply being open-weights achieves that as well.
BEN POPPER: So you talked a lot about customers being interested in choice, avoiding vendor lock-in. I think a lot of people think now about choosing a model the same way they think about choosing a cloud provider, maybe going hybrid, being able to shift. On the compute side, however, it seems like there's still really just one player in town, and I was reading through your website announcing some recent work you've done with big name clients like Salesforce and Zoom, and sort of touting the advantages that they might have working on your NVIDIA cluster.
So NVIDIA is the name when it comes to compute, but you also talked about how Together had built infrastructure, kind of like as a startup, what would we want out of AI infrastructure? Because you had done it from the ground up, you were able to do it in a way that was significantly faster and significantly cheaper than some of the big hyperscalers, I won't name names, but you know, four times faster, 11 times cheaper, et cetera, et cetera.
Can you talk a little bit without, you know, revealing the secret sauce, about how you would be faster or cheaper than folks who presumably can afford to be at a much larger scale than you when it comes to compute?
JAMIE DE GUERRE: Together AI has four founders. Three of those were university professors leading research labs at their institutions. And so we are a very research-led company, and I think this is very unique from most companies that are providing GPU infrastructure. They tend to be infrastructure companies, and maybe financial engineering companies to be able to acquire that compute, but not AI researchers. Today we have, I think, something like 70 Ph.D.s out of a 200-person company. A tremendous portion of our work goes into this area of AI research, specifically on systems optimization. One of our team members is a good example of that: Tri Dao is our Chief Scientist. He was the inventor of a technique called FlashAttention, which is used by all of the leading AI labs today and is pretty well respected as one of the biggest speed-ups in gen AI model training and inference over the last five years.
And we've really built out a function to build more and more of those internally that we use in our systems, and they may not always be released in the open. This includes things like, at the lowest level, kernels for things like the attention algorithm or GEMMs or other operations that are used in both training and inference.
And by writing faster kernels for these key operations in the inference or training process, you can make the whole thing much more efficient. Above that, we build our own– what I would refer to as engines. So we built our own inference engine; we don't use an off-the-shelf engine like, say, vLLM or SGLang. And through building our own inference engine, we've achieved other efficiencies and we've been on the forefront of adopting a lot of techniques that are now starting to become more widely adopted, things like disaggregated serving, where you separate the GPUs used for decode from the GPUs used for prefill during the inference process. At a little bit higher level than that, you start to get into the model itself, or the models that are used.
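For readers who want to see the shape of that prefill/decode split, here's a toy sketch in Python using Hugging Face Transformers. Both phases run in one process here; in a real disaggregated setup the two functions below would live on separate GPU pools and hand off the KV cache between them, and the model choice is just an assumption for illustration.

```python
# Toy illustration of the prefill/decode split that disaggregated serving builds on.
# This is a single-process sketch, not a serving system; the real technique runs
# prefill and decode on separate GPU pools and ships the KV cache between them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small model as a stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def prefill(prompt: str):
    """Process the whole prompt at once: compute-bound, produces the KV cache."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, use_cache=True)
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    return ids, next_id, out.past_key_values

@torch.no_grad()
def decode(ids, next_id, kv_cache, steps: int = 20):
    """Generate one token at a time: memory-bound, reuses and extends the KV cache."""
    for _ in range(steps):
        ids = torch.cat([ids, next_id], dim=1)
        out = model(next_id, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    return tok.decode(ids[0], skip_special_tokens=True)

print(decode(*prefill("Disaggregated serving separates")))
```

The point is that prefill is one big compute-bound pass while decode is many small memory-bandwidth-bound steps, which is why serving them on separately provisioned hardware can pay off.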
And so we've been a leader in doing a lot of work around speculative decoding, which is this idea of having a small model that can predict the tokens the larger model will generate, but in a way that lets the larger model verify them in very little time, as opposed to having to do the generation itself. And we believe we have the best speculative decoding system in the world for doing that. And then at the full model level, we also do a lot of research around quantization and distillation and other techniques that allow you to achieve on-par accuracy with a smaller model or a quantized model.
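To make speculative decoding concrete, here's a minimal greedy-verification sketch in Python with Hugging Face Transformers. It's illustrative only: the gpt2/gpt2-large pairing, the exact acceptance rule, and the absence of sampling are simplifying assumptions, not Together's production system.

```python
# Minimal greedy speculative decoding: a small draft model proposes tokens,
# the large target model verifies them in one forward pass (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_name, target_name = "gpt2", "gpt2-large"   # stand-in model pairing (assumption)
tok = AutoTokenizer.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name).eval()
target = AutoModelForCausalLM.from_pretrained(target_name).eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    target_len = ids.shape[1] + max_new_tokens
    while ids.shape[1] < target_len:
        # 1. The small draft model proposes k tokens cheaply (greedy, one at a time).
        draft_ids = ids
        for _ in range(k):
            logits = draft(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 2. The large target model verifies all k proposals in a single forward pass.
        tgt_logits = target(draft_ids).logits
        tgt_choice = tgt_logits[:, ids.shape[1] - 1:-1, :].argmax(-1)  # target's greedy picks

        # 3. Accept the longest matching prefix, then take the target's own token
        #    at the first disagreement, so every step adds at least one new token.
        n_accept = 0
        while n_accept < k and proposed[0, n_accept] == tgt_choice[0, n_accept]:
            n_accept += 1
        ids = torch.cat([ids, proposed[:, :n_accept],
                         tgt_choice[:, n_accept:n_accept + 1]], dim=1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("The fastest way to serve large language models is"))
```

The win comes from step 2: the target model checks all k draft tokens in a single forward pass, so when most proposals are accepted you get several tokens for roughly the price of one large-model step.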
Using FP8 or even INT4 is a common technique, but doing so in a way that achieves the same accuracy is quite difficult, and we have a lot of research that goes into that. So taken together, all of these techniques mean that we're working to get the most efficient use of the infrastructure possible for our customers, giving them faster performance and lower cost.
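As a rough illustration of what weight quantization does (a sketch, not a production FP8/INT4 pipeline; the per-row scaling and rounding scheme here are assumptions), this is the basic quantize/dequantize round trip whose error those accuracy-preserving methods try to minimize.

```python
# Illustrative symmetric per-output-row int4 weight quantization (a sketch only).
import torch

def quantize_int4(w: torch.Tensor):
    """Quantize a 2-D weight matrix to signed 4-bit integer values, one scale per row."""
    qmax = 7  # symmetric int4 range; we use [-7, 7] for simplicity
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight matrix from int4 values and scales."""
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
# This reconstruction error is what accuracy-preserving quantization research minimizes.
print("mean abs error:", (w - w_hat).abs().mean().item())
```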
And then one final thing I'll mention really quickly is that we also work really hard to keep the cost of the infrastructure itself low. I think this is quite different from the large hyperscalers. They weren't building those data centers for generative AI; they were building them around commodity CPUs, and their infrastructure is set up and optimized in a very different way, so we think we can gain efficiencies there that lower cost as well.
RYAN DONOVAN: I wanna talk about the lowest level: the chips. I talked to a VC at a conference a couple months ago who said that GPUs are not the future of AI processing, that it's gonna be ASICs or field-programmable gate arrays, custom chips tuned to each individual model. Is this something that you've heard about or thought about– any research around that?
JAMIE DE GUERRE: This is something we think a lot about, and by and large our approach is that we will use whatever technology gives the best cost efficiency and cost performance. Today, we still feel that that is NVIDIA. There are FPGAs and custom silicon that can give you very high token generation speed for something like inference.
However, one, the cost efficiency is not there, we don't think. And two, as you scale to really large models, or as new models come out, it takes a long time to adapt to be able to run those models, or in some cases they're not able to run them at all. So for example, when DeepSeek R1 came out, a lot of the providers using custom ASICs were not able to host the model.
They're instead hosting a Llama 70B model that was distilled from the R1 model, and what we've seen is that it achieves very different accuracy characteristics, very different quality characteristics, even though these are still transformer-based models. In many cases they're simply not able to run these larger models, or models with different architectures.
We also don't feel confident that transformers are the future. We think there are likely going to be different architectures at the lowest level of these models that achieve better performance characteristics with high accuracy over time. And the custom ASICs and FPGAs are often being built to be transformer-specific.
BEN POPPER: Training models relies on vast amounts of data, and it seems like we may be coming up against the limits of what's publicly available. You work with lots of clients who I'm sure have industry specific use cases or internal company specific use cases, and who may then be able to train against that internal knowledge base.
What's your perspective on the idea of model training being data-constrained versus compute-constrained going forward, and how could companies leverage their own internal data to great effect?
JAMIE DE GUERRE: Yeah, I think it's a great insight that you're pointing to here. The pre-training process of these models is done on really massive amounts of data, kind of the whole internet.
The next stage, sort of the post-training or even inference-time compute, requires relatively much smaller data. A lot of organizations have a wealth of private data inside their organization that is really important to the application they want the AI to serve, and being able to take a really large pre-trained model and then customize, fine-tune, or post-train it with that internal data is a really significant opportunity. The large foundation models from the closed-model providers do allow you to fine-tune, but it's a very limited level of fine-tuning: you provide thousands of examples and they create a LoRA adapter that sits next to the main model for efficient hosting. But there's a very different type of post-training that uses hundreds of billions of tokens, as opposed to those small fine-tunes, and that's where the data organizations have internally really comes to bear.
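For a sense of what the lightweight end of that spectrum looks like, here's a minimal LoRA fine-tuning sketch using the Hugging Face peft and transformers libraries. The base model, hyperparameters, and the internal_docs.txt file are placeholder assumptions for illustration, not a recommended recipe.

```python
# Minimal LoRA fine-tuning sketch (illustrative; model, data, and settings are placeholders).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-3.1-8B"            # any open-weights base model (assumption)
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: train small low-rank adapter matrices instead of the full weights.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # typically well under 1% of the base model

# Your private, domain-specific text would go here (placeholder file name).
ds = load_dataset("text", data_files={"train": "internal_docs.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("lora-out")            # saves only the adapter weights
```

Because only the low-rank adapter matrices are trained, the saved artifact is small enough to host alongside the base model, which is the trade-off Jamie describes versus heavier post-training runs.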
And if they can bring that data to bear and customize a model with it, as well as deploying it with RAG to give access to that data at runtime, we see organizations able to achieve higher overall accuracy on the task they want the application to perform than by using these large closed models.
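And for the retrieval side, here's a minimal RAG sketch: embed a handful of documents, retrieve the most relevant ones for a question, and pass them to a hosted model as context. The embedding model, the hosted model name, and the in-memory index are assumptions for illustration, not a production retrieval stack.

```python
# Minimal RAG sketch: embed, retrieve by cosine similarity, answer with retrieved context.
import numpy as np
from sentence_transformers import SentenceTransformer
from together import Together

embedder = SentenceTransformer("all-MiniLM-L6-v2")      # embedding model (assumption)
docs = ["Our refund window is 30 days.",                # stand-ins for internal documents
        "Enterprise plans include SSO and audit logs.",
        "Support hours are 9am-6pm ET on weekdays."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    return [docs[i] for i in top]

client = Together()  # reads TOGETHER_API_KEY from the environment
question = "Do enterprise plans support single sign-on?"
context = "\n".join(retrieve(question))
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",    # any hosted model (assumption)
    messages=[{"role": "user",
               "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}])
print(resp.choices[0].message.content)
```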
So I think that there's gonna be a tremendous amount of that over time. I think organizations that see generative AI as a strategic imperative to their business are going to invest in having some expertise in-house around AI. They're going to invest in sort of building the muscle of creating the ideal AI system for their task, and that's going to not just include prompt engineering, it's going to include RAG, it's gonna include fine-tuning, it's going to include using multiple models for different purposes in an agentic system. I think a lot of that will get built around open-source models so that they have more control and ultimately ownership of the resulting model created.
RYAN DONOVAN: You talked about the enterprise data, that's a reason why a lot of folks go with these open-source models, right? They want to make sure their data doesn't get exfiltrated into some closed-source model.
When you're hosting these models, what do you think about to keep that data from getting out of the model, or to keep that model from leaking out somewhere? How do you protect their enterprise data?
JAMIE DE GUERRE: For those types of enterprise customers, we do private hosting of the model with private access, so the model's not accessible by other organizations or developers. We also commit to not retaining any information: the prompts from the end users are not stored, and the outputs the model generates in response to requests are not stored. We also have options to deploy it in an on-prem or VPC type environment where that data's essentially not even leaving their environment.
BEN POPPER: One thing that we wanted to touch on a little bit was some of the ethical dimensions of AI. How can companies leverage AI as a strategic asset, you know, while also considering the ethics of it? At Stack Overflow, there have been multiple sort of public blogs and pronouncements about what this looks like in a human-centric way. If we have a community creating knowledge and we have big frontier labs training on that knowledge, how do we make sure that there's sort of a value chain where everybody can feel like we're creating a healthy ecosystem? How do you think about that? I guess both inside of Together, and do you have conversations like that with your customers?
JAMIE DE GUERRE: We do have conversations about this inside Together. Some of my colleagues are focused on this more than I am. Our Chief of Research, Ce Zhang, is quite passionate about it, and I know our CEO, Vipul, is also out in the community talking about it. I think that the frontier labs are also, you know, starting to do work in this area, like licensing data from, say, Stack Overflow, and creating new paradigms to make sure that all parts of the value chain that go into creating these models are thought through.
I think that more is needed here from both the industry and potentially regulation; I think this is an area where that can be helpful. The industry is starting to adopt solutions where publishers of data can opt in or out of their data being included in model training, and having that choice obeyed by the model training is important. I think there are some solutions starting to be articulated for that.
I think this is also an area where the fact that there's more transparency into how a model is created in open-source gives organizations the ability to audit it more. And so one of the things we talk a lot with these organizations about is as they have a model review board, as they're considering the different open-source models, we really work to make all the information available to them about how that model is created.
And with these open-source models, there's usually a paper saying, you know, here's the data that we used, here's the approach we used to train it, and other things that allow them to audit that. And one final thing I'll say is that we've also tried to publish a dataset that gives organizations flexibility and lots of different signals on how they select slices of data from open data sources. If they're going to pull data from public sources into their training, having the signals on those datasets available helps them create policies and choose which data is appropriate to include and not include. That's a dataset we call the RedPajama dataset, and it lets organizations choose slices of data based on different metadata and signals around it.
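As a hedged sketch of what slicing a corpus by quality signals can look like in practice, here's a small filtering example. The column names below are hypothetical stand-ins rather than the actual RedPajama schema, so check the dataset card for the real fields.

```python
# Hedged sketch of filtering a corpus by metadata/quality signals before training.
# The "lang", "word_count", and "dup_fraction" fields are hypothetical stand-ins.
from datasets import load_dataset

ds = load_dataset("json", data_files={"train": "corpus_with_signals.jsonl"})["train"]

def keep(example) -> bool:
    # Example policy: English only, reasonably long documents, low duplication.
    return (example["lang"] == "en"
            and example["word_count"] >= 200
            and example["dup_fraction"] <= 0.1)

filtered = ds.filter(keep)
print(f"kept {len(filtered)} of {len(ds)} documents")
```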
RYAN DONOVAN: Is that a Llama 4, a llama joke there?
JAMIE DE GUERRE: It was a Llama joke. This is going back to Llama 1, even, actually. So we first chose the name “Red Pajama” when Llama 1 came out, and Llama 1 was actually quite closed. It wasn't a fully open-source model or open-weights model. And we wanted to help the community replicate Llama 1's capabilities in the open.
So we created this dataset called “Red Pajama” back in, I guess, early 2023 maybe, and we've continued to support and expand it since then. For those that don't know, there's a children's book about a llama in red pajamas, Llama Llama Red Pajama, and several of us have kids. There are also great rap versions of it on the internet if you haven't seen those, kind of famous people reading the book basically as a song.
RYAN DONOVAN: What are you excited and or hopeful about for the future of open-source AI or generative AI in general?
JAMIE DE GUERRE: One thing to keep in mind is that we're in such early days. Most organizations do not have generative AI deployed at scale for some main part of their focus as a company, and I think over the next few years we're going to see that change massively. So I'm really excited for the adoption of generative AI. Along that path to it getting deployed at scale for lots and lots of applications, I think one of the biggest shifts we're gonna see is more importance placed on the system that is hosting the models, as opposed to the model itself.
You know, today there's so much conversation about which model you're using, but even when you talk about something like OpenAI versus Anthropic or Gemini, you're increasingly not comparing models; between those three, you're comparing a full system. OpenAI has done a great job recently releasing the ability for their thinking models to use tools live in the interaction, and that is not a model itself; that's a set of tools and a whole system for how the model knows how to use those tools and other things. Anthropic has done a great job with the artifacts capability. So increasingly, I think there's going to be a lot of importance on the whole system: call it an agentic system, call it an AI system, whatever you want.
This is an area we're investing a lot in at Together AI: moving beyond just the single-model notion, thinking about the RAG system involved, thinking about things like code evaluation, having a lightweight virtual machine that's a secure environment to evaluate code generated by the model and then adjust the response based on that, computer use, and many more things. So this is, I think, where a lot of the focus will go over the next couple of years.
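As a toy sketch of that generate-run-review loop, here's a short example using the Together Python client. It's illustrative only: the hosted model name is an assumption, and the "sandbox" is just a subprocess with a timeout rather than the kind of isolated lightweight VM Jamie describes.

```python
# Toy "generate code, run it in isolation, feed results back" loop (illustrative sketch).
import subprocess, sys, tempfile
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"   # any hosted model (assumption)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def run_code(code: str) -> str:
    """Run generated Python in a separate process with a timeout (toy 'sandbox')."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=10)
    return proc.stdout + proc.stderr

task = "Write a Python script that prints the 10th Fibonacci number. Return only code."
code = ask(task)
# Strip markdown fences if the model wraps its answer in them.
code = code.strip().removeprefix("```python").removeprefix("```").removesuffix("```").strip()
result = run_code(code)
# Feed the execution result back so the model can check or revise its own output.
review = ask(f"The script you wrote produced:\n{result}\nIs that correct? Answer briefly.")
print(review)
```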
[Outro music]
RYAN DONOVAN: All right, everyone, it is that time of the show where we shout out somebody who came onto Stack Overflow, shared some knowledge, dropped some curiosity. Today we're shouting out the winner of a “Stellar Question” badge, somebody who asked a question so good that a hundred users saved it. Today's winner is aryaxt– not sure how to pronounce that. The question was “MySQL Data– What's the best way to implement paging?” If you are curious, I'll put it in the show notes.
I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you liked what you heard, or didn't like what you heard, email us at podcast@stackoverflow.com, and if you wanna reach out to me, you can find me on LinkedIn.
BEN POPPER: Hey everybody, I am Ben Popper, one of the hosts of the Stack Overflow Podcast. You can find me on LinkedIn, shoot me a message, or hit me up on X at Ben Popper.
JAMIE DE GUERRE: Hi everyone, Jamie de Guerre. I'm the Founding SVP of Product at Together AI. You can find me on Twitter at JamieDG. And if you haven't tried out Together AI, I encourage you to try it. Just sign up at api.together.ai, and we'd love to hear your feedback.
RYAN DONOVAN: All right. Thanks everyone for listening, and we'll talk to you next time.
[Outro music]