The Stack Overflow Podcast

Cloudflare Workers have a new skill: AI inference-as-a-service

Episode Summary

Rita Kozlov, Senior Director of Product at Cloudflare, joins Ben, Ryan, and veteran cohost Cassidy Williams for a conversation about Cloudflare’s new AI service, what her day-to-day is like, and the mind-blowing “physicality” of the internet.

Episode Notes

Cloudflare is a cloud provider used by almost 20% of all websites. Developers new to Cloudflare can get started here.

Cloudflare recently launched Workers AI, an open, pay-as-you-go AI inference-as-a-service platform that lets developers run machine learning models on the Cloudflare network from their own code. Developers can get started here.

On a related note, read Ryan’s article exploring the infrastructure and code behind edge functions or check out his conversation with Vercel CTO Malte Ubl.

Retrieval augmented generation (RAG) is a strategy that helps address both LLM hallucinations and out-of-date training data.

Connect with Rita on LinkedIn.

Connect with Cassidy through her website.

Shoutout to Stack Overflow user Bamieh, whose answer to What does the function call app.use(cors()) do? earned them a Lifeboat badge.

Episode Transcription

[intro music plays]

Ben Popper Imagine a workstation where your devices seem to disappear, keeping you in a state of flow for hours. Imagine a superior typing experience and a mouse crafted for comfort. Now add smart illumination, programmable hotkeys, smart software, and connection to up to three devices. Discover MX Master Series. Crafted for performance, designed for coders. Find out more on logitech.com.  

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan, Editor of our blog, and back in the house, our favorite hostess with the mostess, Cassidy Williams. Hey, Cassidy. 

Cassidy Williams Hello! I'm glad to be here. 

BP Yeah, it's been too long since we've had you on the show. I'm glad you're here. So today we are going to be chatting with Rita Kozlov, who is a Senior Product Director over at Cloudflare. If you haven't heard of them, which I find hard to believe, they run 20 percent of the web and have a lot of insight into what's going on with global internet activity. So without further ado, Rita, welcome to the Stack Overflow Podcast. 

Rita Kozlov Thank you, thank you. Thank you so much for having me. 

BP Of course, our pleasure. So for folks who are listening, tell them just a quick background. What was your journey into the world of software and technology, and how'd you find yourself in the role you're at today?

RK Sure. So originally, like many people in high school, I had very different ambitions. I was going to study international affairs, and my parents tried to talk me into it: “Okay, just take a computer science class. It'll be fun.” And so I ended up taking more than a class and studying computer science, and I started out my career originally in software engineering, which was really great, but I found myself constantly being the one engineer who's chit-chatting with everyone, which everyone was probably getting a little annoyed with. And for me, I felt like I wasn't utilizing my entire skill set because I love solving technical challenges and problems, but I also love spending time talking to customers and talking to people. And someone on my team had suggested checking out a role called a solutions engineer, which I had actually never heard of, and they were like, “Well, you'll get to talk to customers, but you still get to be the representative of the technical view and guide them in the right direction and use your technical acumen that way.” So I thought that that was really interesting and I started looking at positions or places that had that position, and that was how I originally landed at Cloudflare. And Cloudflare was at an interesting point in its lifetime at that point, actually, where we were growing really quickly, we were onboarding for the first time some of our really, really large customers, and every customer came with their own customization of what they wanted. They wanted to cache things, but only if this header is available, and then cache it but redirect it to this particular country or region. And so, like every company that size, the way that we solved it was that there was a gnarly if statement somewhere in our edge code that enabled this use case, and we wanted to make that a lot more scalable and allow customers to self-serve. So we released a product called Workers, which allows any developer out there to run code that gets deployed directly to our network. And coming from a software engineering background, the second I tried it and was able to get code up and running without ever having to touch an Nginx config or get a server running or anything like that, I thought this was the coolest thing ever and I was just like, “I want to work on this.” So that was how I started in product at Cloudflare, and it's been an amazing journey working on Workers and growing our developer platform since then.
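
For readers who want to picture the kind of per-customer logic Rita describes, here is a minimal sketch of a Worker that caches a response only when a particular header is present and redirects visitors from one country to a regional site. The header name, country code, and URLs are illustrative assumptions, not any actual customer configuration.

```ts
// Hypothetical per-customer edge logic: redirect visitors from one country to
// a regional site, and only cache responses when the origin opts in via a header.
export default {
  async fetch(request: Request): Promise<Response> {
    // Workers expose the visitor's country on request.cf.
    const country = (request as any).cf?.country as string | undefined;
    if (country === "DE") {
      // Illustrative regional redirect.
      const path = new URL(request.url).pathname;
      return Response.redirect(`https://de.example.com${path}`, 302);
    }

    // Pass the request through to the origin.
    const response = await fetch(request);

    // Hypothetical opt-in header controlling edge caching.
    if (response.headers.get("x-allow-edge-cache") === "true") {
      const cacheable = new Response(response.body, response);
      cacheable.headers.set("Cache-Control", "public, max-age=3600");
      return cacheable;
    }
    return response;
  },
};
```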

BP Wonderful.

Ryan Donovan It's interesting, I wrote an article a while back about edge functions and the infrastructure they run on, and talked to Vercel, and their edge functions run on Cloudflare Workers. So can you tell people a little bit about how Cloudflare Workers work and where they actually live? 

RK Absolutely. So if you've used a functions-as-a-service offering before, like Lambda, or if you're familiar with Vercel or Netlify functions, Workers works very similarly. So you write your code, whether that's in our built-in IDE or locally, and you deploy it to Cloudflare. The only difference is that generally when you deploy one of these functions, it will get deployed in a particular region that you have to specify. So if you're using a traditional cloud provider, probably the first thing that you do is you choose US-East-1 and your function will get deployed there. And under the hood, the way that it works is a container will get spun up every single time you make a request. So the way that Workers works a little bit differently there is, first of all, it gets distributed globally. So if you make a request in Perth, Australia, your Worker is going to run in Perth, Australia, and so that makes things go really, really fast, regardless of where users are. Whereas typically, if you have a function that's running in US-East and you have a user in Australia, you're going to pay that round-trip latency. So that's one difference. The second difference is under the hood, instead of running containers, we run something called V8 isolates, which is very, very similar technology to what you're using if you're using a Chrome browser right now and have a bunch of tabs open, which means that they're very, very lightweight. So things like cold starts are non-existent, they just run a lot faster, but the model I think would be very familiar to a lot of people. 
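
As a rough sketch of what that model looks like in practice, here is about the smallest possible Worker in the module syntax. There is no region to pick anywhere in the code or the deploy step: once pushed with Wrangler, the same handler runs in whichever Cloudflare location is closest to the request.

```ts
// The smallest useful Worker: respond from whichever Cloudflare location the
// request landed in. There is no region selection step anywhere in this code.
export default {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    return new Response(`Hello from the edge! You asked for ${pathname}`, {
      headers: { "content-type": "text/plain" },
    });
  },
};
```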

BP Sorry, how do you know I have so many tabs open? Can you see my screen right now? 

RK Yeah, you're actually sharing it.

BP Uh-oh. Sorry about that. 

CW That's really embarrassing. I think that that technology is so cool, but it also really makes you remember that there are physical machines running the internet out there. I feel like a lot of times you kind of forget that as you're developing and stuff. You're just like, “Ah yes, the cloud. It's out there. I just have to deploy it and ship it.” But when you realize that reducing that latency means actually physically being close to your users, it's both a step forward and a step backward mentally for me, where I'm just like, “Wow, machines should probably be close by. What a concept.” 

BP And then you think about all the work the big tech companies did to lay undersea cable and put all these points of presence everywhere and it's just overwhelming. 

RK The physicality of the internet blows my mind. Since the moment I started working at Cloudflare, the fact that we're able to record this podcast right now across so many different locations and there's just wires connecting us, I still can't entirely wrap my head around it, even though that's what I work on.

BP I know. It's so weird because when there is a weird delay that makes the conversation a little difficult, I'm like, “Oh, this is a bad day.” Most days I'm talking to someone in India or Australia and the conversation can flow pretty seamlessly. So I think that covers a lot of nice basics of how you got to where you are and what Cloudflare does. Can you talk a little bit about what your day-to-day is like? 

RK Let's see. My day-to-day can be wildly different. So I got back a couple weeks ago from two weeks on the road where I was traveling around Europe and talking to a bunch of our developer customers there. So I was spending an entire day at a customer onsite talking about what's their current stack, what are parts of their stack that would be interesting to move to Cloudflare, and having those conversations. So a day can take that form. Other days I'll be spending time with our product and engineering teams figuring out what are the most important initiatives, what do we need to be building next, what are things that people are blocked on and how do we unblock them? So a day can be any mix and match of those things. And I love spending time with my product team as well. We have really great product managers, so also just being people's manager and guiding them through their own personal day-to-days.

BP Nice.

RD I wanted to follow up on that physicality of the internet. When I started looking into it, it always seemed like somebody was building on top of somebody else. Does Cloudflare actually have data centers somewhere or is there somebody else who owns the metal? 

RK Cloudflare has actual data centers in over 300 cities around the world. I think it is a common assumption these days and a question that we get is, “Oh, you guys must be built on top of AWS,” but I can reassure you that that's not the case. 

CW It's wild that you could go into a data center and be like, “Wow, the internet's here.” What a concept. 

RK A great reminder to me of the physicality of the internet recently was when we just released our new AI offering, Workers AI, which allows our developers to run models on top of GPUs running on Cloudflare. And so as we were getting ready to release this, there were SREs flying around the globe with suitcases with GPU cards in them, dropping off those GPUs in different locations around the world.

BP Woah, that sounds like a James Bond film. I like that. Let's dive into it. You have this new product. Does it have a name? 

RK It's called Workers AI. 

BP Workers AI. So I think that folks are very familiar with the idea of cloud computing. I'm anywhere in the world, but I can call on one of these great cloud providers to provide the back end to make sure everything is really fast. My company needs to scale; we can do that with them. How is that different in the era of a full stack AI application? What are the new things, GPUs being one, obviously, that you need to provide to customers, and how did you try to build that into Workers AI? 

RK So one of the things that has been interesting to us– we were talking about Workers before and this concept of running serverless compute. So the idea is that, as a developer, I don't need to know very much about operating systems or how to configure servers or even manage capacity. I write my code, and ideally, as I get more traffic, things scale up. If I get less traffic, they scale down. And what we found was that so far in the AI world, things were very much provisioned the way that we were provisioning general applications about 10 years ago, or before the concept of serverless became a thing. So if you want to run your own models today, you really have to think in terms of VMs and go provision things in terms of, “Okay, well, tomorrow I think I might have a thousand users that are making AI queries at the same time, and so I'm probably going to need to provision, let's say, 100 GPUs for that.” And when it becomes nighttime and you're no longer getting that traffic, you still have 100 GPUs that are sitting there, they're just not doing anything. So what we wanted to do was to bring the serverless idea into the world of AI and GPUs and allow customers to very much do the same thing that we see them doing with Workers today, which is run the models that you need to run in order to be able to build the application that you're trying to build, and we will take care of the rest of it. And what's been really interesting to see is the demand side of it in terms of the fact that every company right now is thinking about how do I add AI into my applications. Paradigms are shifting very, very quickly and so we just view that as something that is now an essential part of the stack. 
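
To make that concrete, here is a minimal sketch of what calling Workers AI from a Worker can look like, assuming an AI binding named `AI` has been configured for the project; the model identifier and the exact shape of the `run()` inputs are assumptions for illustration rather than a prescribed API.

```ts
// Minimal sketch of serverless inference, assuming an AI binding named `AI`
// has been configured for this Worker and that the model identifier below is
// available in the catalog.
export interface Env {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<unknown> };
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    // One inference call; there are no GPUs to provision, scale, or tear down.
    const answer = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      prompt: "Explain what a V8 isolate is in one sentence.",
    });
    return Response.json(answer);
  },
};
```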

RD AI has such a data back end. How does serverless apply to that? Is it still stateless? Is there some sort of serverless database paradigm with it? 

RK That's a really, really great question. So everything definitely has its analogies, and so generally, if you make a query to AI, it is going to be stateless. And so if you want to ask it about your own suite of products, you will have to give it all of the information about your suite of products before you ask your question, which obviously is highly inefficient. And so in the world of AI, the equivalent of a database is a vector database, and so that's something that we actually released alongside Workers AI. Our solution for it is called Vectorize, and basically what it allows you to do is take your product database, index it using AI, and then save that to a vector database, so that when you ask, “What all is available that's a blue shirt?” it can instantly answer, “Well, we have our Ryan variety of shirt, we have this other shirt,” and give you all of the options that match your request. 
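
Here is a rough sketch of the indexing-and-querying pattern Rita describes, assuming an AI binding for embeddings and a Vectorize index bound as `PRODUCTS`; the binding names, embedding model, and product data are illustrative.

```ts
// Sketch of the index-then-query pattern: embed product descriptions, store
// them in a vector index, and later match a shopper's question against them.
// Binding names, model identifier, and data are illustrative.
export interface Env {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<{ data: number[][] }> };
  PRODUCTS: {
    insert(vectors: { id: string; values: number[]; metadata?: Record<string, string> }[]): Promise<unknown>;
    query(vector: number[], opts: { topK: number }): Promise<unknown>;
  };
}

const catalog = [
  { id: "sku-1", description: "Ryan variety shirt, blue, cotton" },
  { id: "sku-2", description: "Classic tee, red, polyester" },
];

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    // 1. Index: embed each product description and save it to the vector index.
    for (const item of catalog) {
      const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [item.description] });
      await env.PRODUCTS.insert([
        { id: item.id, values: embedding.data[0], metadata: { description: item.description } },
      ]);
    }

    // 2. Query: embed the question the same way and find the closest products.
    const question = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: ["blue shirt"] });
    const matches = await env.PRODUCTS.query(question.data[0], { topK: 3 });
    return Response.json(matches);
  },
};
```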

CW That's so cool. And for the underlying models that run there, is that something that was built in house, or does it use open source solutions? 

RK So we use open source models today that are available through Cloudflare's model catalog. So you can get started with those in literally a couple of lines of code. And we've actually been really excited about all of the development that's been happening with open source models. So thanks to Meta, anyone can have access to basically their own personal LLM now. Their model is called Llama 2, which also brings me great joy because I love llamas, and we've been really excited to work with Hugging Face as well. They've been making so many other open source models available to so many folks out there. 

BP Nice. I saw Nvidia was also listed as a partner. Obviously, you were just talking about GPUs. Can you give us a back-of-the-envelope, ballpark picture? Say you have a thousand users hitting your new AI app, asking the AI a question about historical figures so it responds as if it were Abraham Lincoln or whatever. And then you don't want those GPUs idling at night, but you're paying for them. So how do you think about compute cost in an era of GPUs? Is it flops? Is it seconds? And what does it cost? I'm just really curious how people make that economic calculation. 

RK It's been interesting talking to a lot of customers. I think, first of all, everyone is still narrowing in on how they're thinking about it. In my experience, I talked to maybe three different customers today and each of them was thinking about it differently. So if you're using OpenAI, the way that you think about it is tokens, and tokens roughly translate to words, but not exactly. So if a word can have two different meanings, for example, and is used twice, then those are kind of two separate words. But that's for language models, and we allow you to run more than language models. We allow speech-to-text models, image-to-text models, image generation models in the future, and so that didn't seem like quite the right metric to us either. And the other model out there is counting things in terms of hours, but generally the way that people understand hours of compute is, again, very much the VM model, where an hour is an hour regardless of whether you're using things or not. And so we came up with our own little metric that we use called neurons, which corresponds to basically how long an execution takes. But at the end of the month, you only pay for how many neurons you use rather than for idle compute sitting around. So if you had 100 inferences happen today, and then, since not every feature you launch is successful, zero requests after launch day, you pay for those hundred total that month rather than having a fleet of VMs sitting idle. 

BP Yeah, that makes sense. And just so I understand, most of what folks are relying on you for now is the inference part. Somebody asks a question of a model that's been trained, and then the model gives a response. Are they also asking Workers AI for training or fine tuning or some other aspect of how Gen AI works? 

RK Yes, so we definitely get requests for both. I think as far as training goes, that is one workload that doesn't really make sense to run on a distributed network. You do want that to run on really, really large clusters, probably a lot of GPUs at once, to make it highly efficient and highly parallelizable. Those are workloads that are going to be more well-suited for centralized clouds. Fine tuning, though, I think is an interesting question, and the answer here is a bit twofold. So there are two ways to kind of get your models to be personalized: one is through having a vector database and what's commonly referred to as a RAG architecture, and so that's a way to give AI context on your particular use case. For most use cases, we find that to be a much more effective way of having a model be customized to your need, but there are going to be certain instances where, for example, you want the tone to be custom to you, or there's certain jargon that you want the AI to be aware of, and so that's where fine tuning comes into play. We do plan to support fine tuning eventually. We don't right now, and actually one of the really interesting things for us is that, at any given time, some of our GPUs are busier than others. So right now it's daytime in the US as we're recording this, and so our GPUs are going to see a lot more traffic, but later it's going to be nighttime and the US is going to be quiet. And so for these non-latency sensitive tasks like fine tuning, we can actually use these idle GPUs and do it in a much more cost efficient way because it's effectively otherwise unused compute.

BP Nice. I'll hop in here and just say, for those of you who don't know RAG, which is an acronym that pops up every day for us now, it stands for Retrieval Augmented Generation. You're asking the model, “Hey, I have a question about code, but don't look at everything you've read across the Internet. Just look at Stack Overflow questions with accepted answers and then send me back that answer in XYZ format or send it back as JSON,” or whatever. We'll throw a few links in the show notes. We've written about RAG a bunch. 
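
Putting the two previous sketches together, a RAG flow on this stack might look roughly like the following; again, the binding names, model identifiers, and input shapes are assumptions rather than a prescribed implementation.

```ts
// Compressed RAG sketch: embed the question, retrieve the closest documents
// from a vector index, and pass only those documents to the LLM as context.
// Binding names, model identifiers, and input shapes are assumptions.
export interface Env {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<any> };
  DOCS: {
    query(vector: number[], opts: { topK: number; returnMetadata?: boolean }): Promise<{
      matches: { id: string; score: number; metadata?: { text?: string } }[];
    }>;
  };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const question = new URL(request.url).searchParams.get("q") ?? "How do I parse JSON in JavaScript?";

    // Retrieve: find the stored snippets most similar to the question.
    const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });
    const results = await env.DOCS.query(embedding.data[0], { topK: 3, returnMetadata: true });
    const context = results.matches.map((m) => m.metadata?.text ?? "").join("\n---\n");

    // Generate: answer using only the retrieved context, not the whole training set.
    const answer = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      messages: [
        { role: "system", content: `Answer using only this context:\n${context}` },
        { role: "user", content: question },
      ],
    });
    return Response.json(answer);
  },
};
```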

CW I think the most exciting thing about this is just how accessible it's made all of this type of development. Cloudflare has made things accessible in general. I remember a friend of mine shipped something to it for the first time and she literally said, “Why did I get a computer science degree? This was so easy. I didn't have to configure anything. It just shipped.” And this is enterprise-grade stuff that in previous years would have taken quite a big business to build, whether that's large AI-driven applications or just any applications in general. Now, anybody can just try it and ship it and see how it goes. I just wanted to toss out a kudos to the team for making that type of stuff possible. 

RK I mean, that's music to our ears. Our team tries really hard at this. As you said, it's really hard to think about things in terms of internet scale and wrap your head around that, and so we try as much as possible to abstract that away and make it as straightforward as possible. So it's really great to hear that type of positive feedback. 

RD So we've talked to a few folks lately who are working on edge devices and IoT with AI on them: smart cameras, self-driving cars, et cetera. A lot of that is inferencing. Can you walk us through how that sort of inferencing works with Cloudflare's AI?

RK Yeah, absolutely. So I think a really interesting question is just in general where AI is going to run in the long term, and I do think that it's probably going to be somewhere on the spectrum of device, edge, and centralized cloud. So if, for example, you have an iPhone (and I love to show people pictures of my cat), if you go into your photos and look up ‘cat,’ it'll pull up cat pictures. That’s an inference that's running directly on your device. But Apple devices, for example, are very much designed around keeping everything local as much as possible, but also you charge your iPhone every night before you go to bed, and a similar thing with your watch. There are a lot of devices out there that do count on not being recharged for maybe weeks at a time, especially when it comes to industry, where you might have devices out in the field. What we found with IoT is that there are generally two constraints. There's the battery power constraint, and then there's also just how powerful they are. You're not going to put an entire GPU into every single device. And so that's where being able to run things at the edge, so as close to the edge device as possible but not on the device itself, becomes really, really useful, because you're able to update the software a lot more frequently and you're able to conserve the battery power. The alternative is connecting to a centralized cloud, but that might take 200 milliseconds, so at the edge you're not keeping that connection alive for a super long time, and you're still getting the best of both worlds. 
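
A sketch of that offload pattern: a constrained device POSTs a camera frame to a nearby Worker, which runs an image classification model and returns only the labels. The model identifier and its input format are assumptions for illustration.

```ts
// Sketch of device-to-edge inference: a battery-constrained device POSTs a
// camera frame; the nearby Worker classifies it and returns just the labels.
// The model identifier and its input shape are assumptions.
export interface Env {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<unknown> };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("POST an image", { status: 405 });
    }
    // Keep the connection short-lived: read the bytes, classify, respond.
    const bytes = new Uint8Array(await request.arrayBuffer());
    const labels = await env.AI.run("@cf/microsoft/resnet-50", { image: [...bytes] });
    return Response.json(labels);
  },
};
```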

BP I think it depends. If you have an application like a chatbot where people are willing to wait two or three seconds to get their response, that's one thing. If you have a smart camera or self-driving car that needs to make a decision quickly without too much latency, maybe for safety reasons or customer satisfaction, that's another. You have to think, “I want things in my city,” or how big is this city and how many users does it have, and then sort of plan your points of presence accordingly. But it sounds like Cloudflare has a lot of experience working on the metal side, so hopefully no self-driving car accidents.

[music plays]

BP All right, everybody. It is that time of the show. I want to shout out a user who came on Stack Overflow and helped to spread a little knowledge: a Lifeboat Badge, awarded November 3rd to Bamieh for answering “What does the function call app.use(cors()) do?” Did I say that right, Cassidy? Cors in parentheses? 

CW Yeah, cors. Cors is a curse. 

BP Bamieh came along and supplied an answer that was accepted after six years, and we've helped over 12,000 people, so appreciate that and congrats on your Lifeboat Badge. I am Ben Popper, I'm the Director of Content here at Stack Overflow. You can always find me on X @BenPopper. If you have questions or suggestions for the podcast, email us: podcast@stackoverflow.com. And if you like what you hear, leave us a rating and a review. That would be very kind. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me, you can find me on X @RThorDonovan. 

CW I'm Cassidy Williams. I'm CTO over at Contenda. You can find me at @Cassidoo on most things. 

RK I'm Rita. I'm Director of Product at Cloudflare. You can find me on Twitter @RitaKozlov_ and I encourage you to check out Workers AI. If you want to run your AI workloads on serverless GPUs, you can do so by going to ai.cloudflare.com. 

BP Sweet. All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]