The Stack Overflow Podcast

Looking under the hood of multimodal AI

Episode Summary

Ryan chats with Russ d’Sa, co-founder and CEO of LiveKit, about multimodal AI and the technology that makes it possible. They talk through the tech stack required, including the use of the WebRTC and UDP protocols for real-time audio and video streaming. They also explore the big challenges involved in ensuring privacy and security for streaming data, and the techniques that address them, namely end-to-end encryption and obfuscation.

Episode Notes

Multimodal AI combines different modalities—audio, video, text, etc.—to enable more humanlike engagement and higher-quality responses from the AI model. 

WebRTC is a free, open-source project that allows developers to add real-time communication capabilities, built on an open standard, to their applications. It supports video, voice, and generic data.

LiveKit is an open-source project that provides scalable, multi-user conferencing based on WebRTC. It’s designed to provide everything developers need to build real-time voice and video applications. Check them out on GitHub.

Connect with Russ on LinkedIn or X and explore his posts on the LiveKit blog.

Stack Overflow user Kristi Jorgji threw inquiring minds a lifejacket (badge) by answering their own question: Error trying to import dump from mysql 5.7 into 8.0.23.

Episode Transcription

[intro music plays]

Ben Popper Take charge of your cloud workloads. Control your performance, stack and costs with Equinix dedicated cloud. Sign up at deploy.equinix.com with code CLOUDCONTROL –all caps– for a $300 credit. Get your control freak on with Equinix. 

Ryan Donovan Hello everyone, and welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, your host for this episode, and I'm joined by a great guest today: Russ d'Sa, co-founder and CEO of LiveKit. Today we're going to be talking about multimodal AI and what it takes to run that– the infrastructure, all the tech stack, all the good stuff that you geeks are looking for. So Russ, welcome to the show. 

Russ d'Sa Thanks so much for having me, Ryan. I'm really excited to be here. Huge fan of Stack Overflow, huge user over my entire, I think, 20-year engineering career. 

R Donovan Always like to hear that. Speaking of your 20-year engineering career, can you give us the flyover of how you got into software and technology and how you got to where you are today?

R d'Sa For sure. I think that my fascination with technology and programming dates way back. My dad was very early in the industry in the '80s and '90s. He was part of startups, founded a few startups in the semiconductor industry, spent some time in the GPU space, DSL, and then eventually made his way through flash storage and batteries, lithium-ion. And so he was all across the board, different parts of the stack, and I kind of just grew up steeped in this environment. I was born in Canada, but my dad ended up moving the whole family out to the Bay Area in the early '90s. I grew up going to CES and COMDEX with him and joining VC events with him when I was in high school. And I think very early on I started to have a fascination, not just for entrepreneurship, but like a lot of programmers from my era, I wanted to make video games. I grew up playing multiplayer DOOM and Duke Nukem 3D and Warcraft and things like that, and always wanted to make them my own and customize them. So what started with using level editors in those games took me to C++ when I was a teenager. And then, fast forward to now: I graduated college and went straight into wanting to start a company, just because I saw that's what my dad was doing. I was in the fifth batch of Y Combinator in 2007, which, funny enough, is where I met my co-founder of LiveKit today. We've known each other for 18 years, but we were working on separate companies back in that batch, and ever since I've been kind of oscillating between starting companies and joining companies, starting a company and then joining a company.

R Donovan Nice to bounce back and forth: learn a little, put it into practice, learn a little, put it into practice.

R d'Sa Totally. 

R Donovan So today we're going to be talking about multimodal AI and what's behind it. We talk about generative AI a lot, and the multimodal aspect of it seems to be the future. Instead of just language or images, it's all of them at once. What does a tech stack look like for that multimodal setup?

R d'Sa I think that you're right and a lot of people are starting to think about multimodal AI now and how does that fit into the interfaces that we're starting to get accustomed to as we interact with AI. For me, I think this exploration and thinking about it really started right at the end of 2022. So GPT-3 had, I think, just come out and it was this kind of huge leap from the predecessors that OpenAI had put out, and I said, “Okay, well instead of typing to the model, if you speak to it and you convert that speech into text on the server and you stream that into the LLM and then stream out its output and convert that output from the LLM, that text, back into speech, if you do that really fast and stream all the things back and forth over the network, you could actually have a convincing human-like conversation with an AI model.” I think we've all been talking for a while about how you build Samantha from Her. So I built a demo using LiveKit’s stack which I'll get into and the technology behind it and why it's important for this kind of experience, but I built that demo and it was really amazing. It did feel like you were talking to a human. The latency was a bit higher because the technology for synthesizing speech and converting speech into text wasn't as good as it is now back at the start of 2023, but good enough that you could see that this was going to happen. And then a bit later in the year, OpenAI starts to work on voice mode for ChatGPT. We worked with them on that and that was using the WebRTC protocol. And so that gets a little bit into this tech stack, and what you realize, broadly speaking, is that if we're building AGI, the goal of that is almost building a synthetic human. It's like building a computer that can think like, respond like, behave like a human. If that's ultimately what we're going to build, and if you assume that that exists one day, how will you most likely interface with that human-like computer? The natural inputs and outputs that you and I have are our eyes, our ears, and our mouth. The eyes of a computer are cameras and the ears are microphones, the mouth is a speaker. And so as AI gets more and more intelligent, instead of us adapting ourselves to the computer where we type on a keyboard and move a mouse around, the computer starts to adapt to us and our native IO, and our native IO is eyes, ears, and mouths. And so it's voice and it's video, or voice and computer vision. That's really kind of where multimodal AI is going to go and where I think all of AI is going to go over time, or predominantly. It's not like you don't text your friends. You're still going to text with AI here and there, but the vast majority of people communicate with voice and their eyes. And so the technology required for that is quite different than the technology that we have today. The web was designed for sharing documents. That's ultimately what it was designed for when it first came about. It's HTTP– hypertext transfer protocol. So the internet wasn't really designed for sending high bandwidth voice and video, and over time, what's happened is that we've started to do that over the internet, and protocols have been created to do that over the internet. HTTP is built on top of TCP and TCP doesn't work for high-bandwidth data that you want to send in real time because of head-of-line blocking. If packets get lost, you have to wait for those packets to get retransmitted and you don't have any control over that process because it's baked into the underlying TCP protocol. 
It's an assumption that that protocol makes. There's another protocol called UDP where you don't have to block for packets to be retransmitted. You have more control over that process. You can do retransmissions through a negative acknowledgement, a NACK, but you don't have to. You have programmatic control over that as a developer. And so UDP is really the protocol underneath, at the lowest level, that you want to use to be able to send audio and video data. On top of UDP is another protocol that exists in browsers, that came out in 2012, called WebRTC, and WebRTC is the protocol that you're using when you use Google Meet. You're using sort of a version of it when you use Zoom as well, and even other applications that stream audio and video, that's what you're using, ultimately. And WebRTC is built on top of UDP and provides some facilities around encoding and decoding media. It provides other facilities for being able to access or get a handle to a media device like a camera or a microphone. All of that is built into WebRTC: how to handle congestion over the network and things like that. But the limitation of WebRTC is that it was designed as a peer-to-peer protocol. And what that means is that when you send media to another person somewhere in the world, you're sending that media directly to them over your internet connection, to wherever they are in the world.
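
To make the UDP point concrete, here is a minimal sketch (not LiveKit's code) of a receiver that detects a gap in sequence numbers and sends a NACK asking only for the missing packets, rather than stalling the whole stream the way TCP's built-in retransmission would. The packet format and port are assumptions made up for the example.

```python
# Minimal sketch of application-controlled retransmission over UDP (not
# LiveKit's implementation). Packets carry a 4-byte big-endian sequence
# number followed by the payload; the port and format are assumptions.
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5004))   # illustrative port

expected_seq = 0
jitter_buffer = {}             # seq -> payload; late or reordered packets wait here

while True:
    packet, sender = sock.recvfrom(2048)
    seq = struct.unpack("!I", packet[:4])[0]
    payload = packet[4:]

    if seq > expected_seq:
        # Gap detected: ask the sender to retransmit only what's missing.
        # The application decides; for latency-critical audio it could
        # instead skip ahead and conceal the loss rather than wait.
        for missing in range(expected_seq, seq):
            sock.sendto(struct.pack("!cI", b"N", missing), sender)

    jitter_buffer[seq] = payload
    expected_seq = max(expected_seq, seq + 1)
    # Contiguous frames in jitter_buffer would now be handed to the decoder.
```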

R Donovan A one to one instead of a one to many.

R d'Sa Exactly. And while WebRTC can support multiple participants –you could have someone join and we can be in a call with three other people– when I want to share my video with those three other people, I'm uploading a copy of my video, effectively, to each individual person over my internet connection. There aren't too many home internet connections that can handle uploading even three copies of high quality video. And so peer-to-peer WebRTC, kind of the raw base WebRTC that's in a browser, or as how the protocol is defined, just doesn't really scale all that well beyond very, very small sessions. And so the way that people end up scaling that, the first order of scaling, is you create a server, a media server, that sits in a data center somewhere and acts as if it's a WebRTC peer. And so everyone has kind of a one-on-one peer-to-peer call with that server. We're all sending our video and our audio to that server in a data center somewhere, and that server is acting like a fancy router. And what it's doing is it's figuring out, “Okay, who needs to access whose streams,” and it's measuring the network and figuring out, “Okay, what resolution can this particular network or this link handle to this specific user,” and sending a copy of a lower resolution stream to one user if their network is congested, and then a higher resolution copy to another user if their internet connection is fast enough and is pretty good. And that's really what this server in the middle does. It's kind of this coordinator and packet forwarder. There's a second order scaling mechanism that happens because that server is hosted in a data center somewhere. And the downside to that is that if you're connecting from, say, Israel, and I'm connecting from San Francisco, and that server is hosted in Ireland, then the person in Israel is connecting all the way to Ireland to send their data to that server, and I'm connecting all the way from San Francisco. Wherever that session originates, on whichever server, wherever that server is located, everybody is connecting to that server over the public internet, no matter where you are. And the public internet is pretty noisy. The road system is an analogy that you can think of: some highways are blocked up, like the 880 during rush hour here in Northern California. Some roads aren't as wide; they don't have as many lanes. And so you have to figure out how to navigate that, and the performance can sometimes be suboptimal. And so there's a second order scaling mechanism here, which is that instead of having a single server that hosts a session and everyone connects to it, you now create multiple servers all around the world, and they kind of mesh together to form a fabric. And that fabric allows me in San Francisco to connect to the West Coast US data center, and the person in Israel to connect to a data center in that region, and then streams are forwarded through the private internet backbone between servers, sometimes with relays in the middle, but you have this less congested, ultra-fast link between someone in Asia and then someone in the US.
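
As a rough illustration of that “fancy router” (a selective forwarding unit), here is a sketch of the per-subscriber decision it makes: each publisher uploads a few simulcast layers once, and the server forwards the highest layer that fits each subscriber's estimated downlink bandwidth. The layer names and bitrates are invented for the example, not LiveKit's actual values.

```python
# Illustrative sketch of a selective forwarding unit's layer choice: the
# publisher sends a few simulcast layers once, and the server picks, per
# subscriber, the best layer that fits that subscriber's estimated downlink.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    bitrate_kbps: int          # approximate bitrate of this encoding

SIMULCAST_LAYERS = [           # what one publisher uploads to the server
    Layer("low", 150),
    Layer("medium", 500),
    Layer("high", 2500),
]

def pick_layer(downlink_kbps: float, headroom: float = 0.8) -> Layer:
    """Choose the highest layer that fits within the subscriber's bandwidth,
    leaving headroom so congestion doesn't immediately cause packet loss."""
    budget = downlink_kbps * headroom
    fitting = [layer for layer in SIMULCAST_LAYERS if layer.bitrate_kbps <= budget]
    return max(fitting, key=lambda l: l.bitrate_kbps) if fitting else SIMULCAST_LAYERS[0]

# One publisher, three subscribers on very different networks (kbps estimates
# coming from congestion-control feedback):
for estimate in (300, 900, 6000):
    print(estimate, "kbps ->", pick_layer(estimate).name)
```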

R Donovan It's almost like edge computing. You're setting it up very locally. 

R d'Sa Exactly. It's an edge network where every user is connecting to the closest edge and then forwarding their streams over the private internet backbone. There are a few benefits to doing that. The first benefit is that you get that ultra-low latency, because everyone's connecting at the closest edge and you're going through the private internet backbone. And then the other part here is around packet loss. The majority of packet loss actually happens between a user's computer and the ISP. So the faster you can get them connected to the backbone, by putting the server as close as possible to them, the faster you can react when there's packet loss and send a negative acknowledgement from that edge server letting the user know they need to retransmit. So there's a benefit in packet loss scenarios too. And so for LiveKit specifically, we work on this infrastructure. We’re an open source company, and our open source project is all these SDKs that you can integrate across any device, and then the media server– that server that you can put in a data center and have users connect to in order to facilitate these sessions. That next order of scaling, where you want to have a global internet fabric or network fabric, that's what our LiveKit Cloud is. We have a cloud hosted solution where we've built hooks into the open source server to allow you, and allow us, to create an orchestration system. We have a proprietary orchestration system that spins up these servers all around the world, allows them to form a mesh fabric, figures out which link in which region might be having connectivity issues, creates redundant paths to route around those issues, all kinds of things like that. It's also a multi-cloud system, so we don't depend on any single cloud provider. We might be running DigitalOcean in the EU and we might be running Oracle Cloud on the East Coast, and we effectively run an overlay network across all of these, measure their connectivity, measure their latencies between one another, and route packets through the fastest possible path from source to destination. And so this kind of network forms the backbone for not just letting humans connect with other humans, but also for humans to be able to connect with machines as well. 
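
A toy sketch of the mesh idea: edge regions measure latencies to one another, and media is forwarded along the lowest-latency path through the backbone, so a congested or broken link can be routed around. The regions and millisecond figures below are made up, and this is only a conceptual stand-in for the routing decision, not LiveKit Cloud's orchestration code.

```python
# Conceptual sketch of routing media across an edge mesh: pick the path with
# the lowest total measured latency. Region names and latencies are invented.
import heapq

LINKS = {                      # measured one-way latencies (ms) between edges
    ("us-west", "us-east"): 60,
    ("us-east", "eu-west"): 70,
    ("us-west", "eu-west"): 140,
    ("eu-west", "me-central"): 55,
    ("us-east", "me-central"): 130,
}

def neighbors(region):
    for (a, b), ms in LINKS.items():
        if a == region:
            yield b, ms
        elif b == region:
            yield a, ms

def fastest_path(src, dst):
    """Dijkstra over the measured latencies; returns (total_ms, path)."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, region, path = heapq.heappop(queue)
        if region == dst:
            return cost, path
        if region in seen:
            continue
        seen.add(region)
        for nxt, ms in neighbors(region):
            heapq.heappush(queue, (cost + ms, nxt, path + [nxt]))
    return float("inf"), []

# e.g. a caller on the US West Coast talking to someone in the Middle East:
print(fastest_path("us-west", "me-central"))
```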

R Donovan I'm interested to think about what the streams look like when they're processed by these large language models or multimodal models. Because traditionally, like you said at the beginning, you take an audio stream, you convert it to text, and you convert it back to an audio stream. So if you just have an audio stream of someone speaking, are you converting all of that by parameterizing it, embedding each second of audio? 

R d'Sa It's really interesting, because I think the first version of ChatGPT voice is really going through the existing infrastructure. You have the LLM that only speaks text, and what happens is you have a LiveKit SDK on the device, the human-operated device –let's call it a phone– and when you speak, those audio bytes are encoded with the Opus codec, they're going over the network, they're getting decoded on the server, and then those audio bytes are getting processed to convert the speech into text, and also to figure out endpointing– when is that user done speaking, when have they said what their prompt or their query is? Once endpointing is triggered and you've figured out that the user is finished speaking –you've been converting that speech into text in parallel– you now pipe it through to the LLM and then have the LLM start to generate output. As that output is getting generated and streaming out of the LLM, you're looking for some kind of boundary where you can start to generate speech for it. Usually that's a sentence boundary, because in the middle of a sentence it's going to sound kind of strange– if you're generating audio and playing that audio before a sentence is done, you might not get the intonations right and things like that. So you kind of tokenize at the sentence boundary, and you're generating speech from that text coming out of the LLM as it's streaming out. And once you have a sentence generated, you immediately start to encode it as Opus on the server side, stream it back to the client device –the phone– then it gets decoded and played out. And so that pipeline tends to have quite high latency just because there are many steps in that pipeline. Sometimes you're hitting external services. It depends on what is hosted within the same data center and what isn't, or what you're using a cloud service for. And if you look at what's happening with advanced voice mode, which just started to roll out, what's happening there is that the model is taking in audio directly and then spitting out audio directly as well. If you think about it, the first version of this, the V1, makes sense in that this is what was possible and what we could do at this moment, but in my opinion, I think that modality switching doesn't really make sense to do external to the model. The reason I think that is that when these kinds of questions come up about what the future of AI is and how it's going to work, I tend to look at how humans work as the marker for what will probably happen. My brain can take in audio. It can take in visual information, and I can write, I can speak, I can draw, I can move. And oftentimes –and I think this is interesting because we haven't seen this happen yet– I can also do some of these things simultaneously. I can be on the phone having a conversation with you while I'm cooking dinner, frying something in a pan on a stove.
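
Here is a small asyncio sketch of the middle of that V1 pipeline: buffer the LLM's streamed tokens and hand each complete sentence to TTS, so the first audio can go back to the client while later sentences are still being generated. The stream_llm_tokens and synthesize functions are hypothetical stand-ins (the speech-to-text and endpointing steps are elided), not any specific vendor's API.

```python
# Sketch of sentence-boundary streaming in a speech pipeline: flush the LLM's
# token stream to TTS at sentence boundaries so playback can start early.
# stream_llm_tokens() and synthesize() are hypothetical stand-ins.
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?][\"')\]]?\s$")   # punctuation + trailing space

async def stream_llm_tokens(prompt: str):
    # Stand-in: pretend the LLM streams its reply token by token.
    for tok in "Sure. The fastest route is over the backbone. Anything else?".split():
        await asyncio.sleep(0.02)
        yield tok + " "

async def synthesize(sentence: str) -> bytes:
    # Stand-in for a TTS call returning encoded (e.g. Opus) audio bytes.
    await asyncio.sleep(0.05)
    return sentence.encode()

async def speak_reply(prompt: str) -> None:
    buffer = ""
    async for token in stream_llm_tokens(prompt):
        buffer += token
        # Flush at sentence boundaries so intonation stays natural and the
        # first audio reaches the client before the whole reply exists.
        if SENTENCE_END.search(buffer):
            audio = await synthesize(buffer.strip())
            print("send to client:", len(audio), "bytes of audio")
            buffer = ""
    if buffer.strip():                      # any trailing partial sentence
        print("send to client:", len(await synthesize(buffer.strip())), "bytes of audio")

asyncio.run(speak_reply("transcribed user question goes here"))
```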

R Donovan You have multiple streams. 

R d’Sa Exactly. I have multiple streams. I have multiple output channels and I can take in multiple input channels as well. I can be watching the Olympics while I'm having a conversation with someone or while I'm typing and doing work at the same time. Maybe I can't do all those things as well as I could if I was just focused on it, but I can automatically multitask on different modalities at once. And this all happens within the brain. The nervous system is definitely carrying all these signals and then the brain is able to process which ones and decide what kind of output channels it wants to have. And so what I think ultimately is going to happen with these foundational models as well is that there won't be these external modality switching steps. It's all just going to happen natively inside of the brain itself or the LLM. 

R Donovan That's really interesting to think of it in just the output channels and input channels. It makes me think of the possibility for other output channels like gestures or movements. How hard would it be for an LLM to transfer the semantic understanding it has for language to gestures?

R d'Sa It's really interesting, because there's been some new research coming out that seeks to apply what LLMs have shown to be really good at –training on sequences of text and then probabilistically or statistically generating output sequences of text– to other types of sequences, using those same techniques, whether that's the transformer or other models as well. So you mentioned audio to audio directly– those are also sequences. If you train on thousands, hundreds of thousands, maybe even millions of hours of speech, then in theory you can also generate streams of speech and sequences of speech, given some kind of speech prompt as input. Of course, I don't think that it entirely means that we will only take in speech and spit out speech. I think that text still has its place, just because for the internet, most of the knowledge of the world is still in text. And so you have to figure out how you kind of jointly train this model such that all of the learnings from text are in the same embedding space as the audio that you're training on. But in terms of how you apply this to all these different modalities, to gestures as well, this is something that you're starting to see robotics companies like Figure, and the Tesla bots, start to do. I was reading Elon's biography a few months back, and there was this one part towards the end of it where they talk about how the self-driving car for Tesla is trained. And they said that there's one component of it where they take all of the video that the cameras have been recording on the Tesla and they train a model, a transformer, that really predicts what the next most likely frame is. And I got these chills for a second, because I had this realization that if you could take millions of people and put two cameras in front of their eyes 24/7 and two microphones to their ears 24/7, and you could track everything that they say, you could predict the next most likely frame of reality, of existence, if you had enough data. It's kind of weird to think about, but the model, or the technique for how to do it, I think we understand and we know now. Maybe the compute to do such a thing is not there yet, but the techniques to do it are there. It's just that the data plus the compute is all you need now. 
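
The “same technique, different sequence” point can be shown with a toy next-token objective: whether the integers below stand for text tokens, audio codec tokens, or quantized video frames, the model is just trained to predict element t+1 from everything up to t. This is a shape-level PyTorch sketch with made-up sizes, not a real training recipe for any of the systems mentioned.

```python
# Toy next-token objective over a generic discretized sequence (text tokens,
# audio codec tokens, or quantized video frames all look the same here).
import torch
import torch.nn as nn

VOCAB = 1024                  # e.g. codebook size of an audio/video tokenizer
DIM, SEQ, BATCH = 128, 64, 8  # toy sizes for illustration

embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(DIM, VOCAB)

tokens = torch.randint(0, VOCAB, (BATCH, SEQ))            # stand-in "frames"
causal_mask = nn.Transformer.generate_square_subsequent_mask(SEQ - 1)

hidden = encoder(embed(tokens[:, :-1]), mask=causal_mask)  # predict t+1 from <= t
logits = head(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()                                            # one illustrative step
print(float(loss))
```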

R Donovan That is setting up a potentially dystopian situation where somebody is predicting every possible next frame and people are getting arrested because of what somebody's eyeglasses saw.

R d'Sa Have you watched the TV show Devs? 

R Donovan I have, yeah. 

R d'Sa I started to think a little bit about Devs when I started to think about this, and it is dystopian for sure. Some takes on it are dystopian, definitely. 

R Donovan I want to go back to the nuts and bolts of the streaming. A lot of the streams will be coming out of individuals talking to the computers or doing video, and there are obviously privacy and security concerns around that. And I think a lot of the video streaming folks have been looking at how to make these things secure without having the data recoverable from devices. Is there a way you've found to do that? I think some of them have been doing stutter-stop streaming to make it secure and unrecordable, but are there other techniques? 

R d'Sa Well, I think there are some obvious techniques or obvious ‘solves’ to this. It really depends on, for your use case, how sensitive you are to the security of the data, or how sensitive the users are to their data being secure, their queries being secure. One thing that LiveKit supports out of the box is end-to-end encryption. So if you want guarantees that nobody can person-in-the-middle your data and that it can only be consumed by you, the end user providing the data, and the service that you're interacting with, that's totally possible. You can end-to-end encrypt this information. In theory, nobody can get access to it other than these two parties. Then there's the case where you don't totally trust the service that you're providing this data to, and you don't want them to be able to get your raw data. A few different mechanisms are possible for that. One is that you can potentially convert your query into some kind of compressed format that somewhat obfuscates the original data. So to give you an idea, you could convert it into embeddings on the client before you send it to the server. That's one possibility, where there's some embedding space and there's a smaller local model whose job is strictly to convert from this modality to a vector of numbers and then send that over the wire. That doesn't totally remove the information that you're trying to pass to this service, but it does remove the raw information. It prevents certain things, like someone taking my voice and cloning it without my permission. And so that's one technique that I think people are still exploring. It's not something that we support today, and it's still an area of research how to do that and whether that's the right approach. Then there's, of course, what Apple is doing with Apple Intelligence. Backing up before Apple Intelligence, there's the purely local approach, where you take the LLM and you put it directly on the device. That has trade-offs as well. Of course the information isn't going to leave your device, but the trade-off is that it may not be the most powerful model, or it may not be an expert at something in particular that you need help with. There are going to be limitations to what that model is capable of doing, but you will get best-in-class privacy and data control guarantees from that particular approach. Then there's the new Apple Intelligence approach. I won't pretend to know exactly how it works, but there you're trusting Apple, to a degree, to send your data to a server, and they market that information as being protected from Apple employees. Nobody else can look at it, but the server is receiving this information, and then in some way, if it makes an external query to OpenAI or another model one day –I think ChatGPT is the only one for the first version– it's somehow not passing the raw data but passing some representation of the data. And I'm not actually even sure I'm understanding it totally correctly, but that's another approach. It really depends on who you're willing to put your trust in as an end user and what can be done with that information that you're providing, and based on whichever one it is, what the right approach is.
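
To make the end-to-end encryption idea concrete, here is a conceptual sketch (not LiveKit's actual E2EE or key-exchange scheme) in which each encoded media frame is sealed with a key that only the two endpoints hold, so the server in the middle only ever forwards ciphertext it cannot read. It uses AES-GCM from the Python cryptography package; the frame format and key handling are assumptions for the example.

```python
# Conceptual sketch of end-to-end encrypting media frames: the forwarding
# server sees only ciphertext. Uses AES-GCM from the `cryptography` package;
# the frame format and local key generation are illustrative assumptions.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# In practice the key comes from an out-of-band exchange between the two
# endpoints; here we just generate one locally for the demo.
frame_key = AESGCM.generate_key(bit_length=256)

def seal_frame(key: bytes, frame_index: int, opus_frame: bytes) -> bytes:
    """Encrypt one encoded audio frame; the nonce must never repeat per key."""
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, opus_frame, str(frame_index).encode())
    return nonce + ciphertext          # the server forwards this blob untouched

def open_frame(key: bytes, frame_index: int, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, str(frame_index).encode())

blob = seal_frame(frame_key, 0, b"\x01\x02fake-opus-bytes")
assert open_frame(frame_key, 0, blob) == b"\x01\x02fake-opus-bytes"
```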

R Donovan I think it's definitely something that's going to be tested both in the technology and in the courts, because of both the privacy aspect and the chance that somebody might figure out there's a PII issue with people's voice signatures and the like. 

R d'Sa Totally.

[music plays]

R Donovan Well it's that time of the show again, ladies and gentlemen, where we shout out somebody who came on to Stack Overflow, dropped a little knowledge, shared some curiosity, earned a badge for their work. So today we're shouting out a Lifejacket Badge– it's the smaller version of the Lifeboat. This badge goes to Kristi Jorgji for answering, “Error trying to import dump from MySQL 5.7 into 8.0.23.” So if you were curious about that as well, we have an answer for you. My name is Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. If you liked what you heard today, please leave a rating and review. And if you have other ideas for guests, topics, want to come on the show, email us at podcast@stackoverflow.com. 

R d'Sa I'm Russ d’Sa. I'm the CEO and co-founder of LiveKit. We build real time infrastructure for audio and video applications. And you can find us on twitter.com/LiveKit and also on our website, Livekit.io. 

R Donovan All right. Thank you very much, everyone, and we'll talk to you next time.

[outro music plays]