The Stack Overflow Podcast

Is this the real life? Training autonomous cars with simulations

Episode Summary

Ben Popper interviews Vladislav Voroninski, CEO of Helm.ai, about unsupervised learning and the future of AI in autonomous driving. They discuss GenAI’s role in bridging the gap between simulation and reality, the challenges of scaling autonomous driving systems, the commercial potential of partial autonomy, and why software is emerging as a key differentiator in vehicle sales. Vlad spotlights the value of multimodal foundation models and how compute shortages affect AI startups.

Episode Notes

Helm.ai licenses AI software throughout the L2-L4 autonomous driving stack, which includes perception, intent modeling, path planning, and vehicle control. They’re hiring!

Connect with Vlad on LinkedIn.

Stack Overflow user user3330840 won a Lifeboat badge for their answer to "My commits appear as another user in GitHub?".

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow, joined today by Vlad Voroninski who is the CEO and co-founder of Helm.ai. We're going to be chatting about unsupervised learning, the future of AI, and how these things apply, especially in realms like autonomous driving, aviation, and robotics. So Vlad, welcome to the program. 

Vlad Voroninski Thanks for having me, Ben. 

BP Why don't you tell folks a little bit about your background. You are an academic as well as an entrepreneur?

VV Yeah, I actually got interested in computer vision and self-driving cars back in my undergrad days at UCLA as part of the UCLA Computer Vision Lab, which was at the time competing in the DARPA Grand Challenges, which more or less kickstarted the autonomous driving space. My first real interest in college was pretty much in that area, specifically in computer vision, but I decided to focus on mathematics for quite some time before coming back to this space. That was intentional, essentially, because mathematics was the bottleneck I perceived for AI research. So I did spend about 10 years in academia. 

BP Nice. When I was a journalist at The Verge, I did a deep dive into Velodyne LiDAR, and they had competed in some of the early DARPA challenges, and then Velodyne kind of became state-of-the-art for a lot of self-driving cars. And then I went to work for DJI for a while, which is one of the preeminent drone companies, and they were kind of at the bleeding edge of deploying things into the world that were getting into the hands of consumers that were autonomously tracking, autonomously obstacle-avoiding, doing return to home maneuvers, so I got to be around a lot of that stuff and always thought it was super cool. In my experience back then, the technologies that were important– for example, LiDAR– were being compared to other sensors like radar or sonar or cameras, and what DJI was doing relied on edge processing, but none of this related at all to the AI that we're talking about these days: Gen AI, LLMs. The stuff that you're working on trying to eliminate the gap between a simulation where a car can drive a million miles here and then take that and apply it to reality, is that machine learning, deep learning, interesting AI but not Gen AI, or are you starting to also work with some of the technologies that we hear so much about in the news? 

VV When we initially started out– this was now eight years ago when we founded Helm– it was still the early days for technologies like unsupervised learning. At that point it was still very much, I would say, an open problem, and we saw the solution to that problem as necessary to eventually get to L4, so fully autonomous driving. A lot of our competitors tried to go directly after fully autonomous driving on very aggressive timelines, and we were not really concerned with that because we did not think those timelines were realistic, but we did see a lot of value in solving certain deep tech frontier research problems like unsupervised learning. So that's what we focused on in the early days, motivated by some of the work that I'd been doing in academia in applied mathematics. And within a couple of years we made a lot of progress and actually trained, I think, the world's first foundation model for semantic segmentation. So basically, for any given image, understanding what every pixel means. It's a very detail-oriented computer vision task. We had a very scalable pipeline for that as far back as 2017 or 2018, and that's what allowed us to get a lot of interest from the automakers, because we were able to demonstrate higher quality perception technology than some of our competitors, including even big names. So in the early days the focus was on bringing in techniques from applied mathematics, leveraging more sophisticated mathematical modeling in conjunction with deep learning, and it's a technology we call ‘deep teaching’ that we've been developing for many years now. In the last couple of years, there's been an inflection point in generative AI. Generative AI became part of the zeitgeist, so to speak, for everyone, but for any AI researcher, generative modeling was already a known area. I had actually written papers on generative AI as far back as 2018 or so, so it wasn't exactly a new thing in that sense. But there were certain architectures and training techniques that scaled extremely well, so if you couple that with a lot of compute, you get really impressive results. What we've been focusing on for the last couple of years is combining our deep teaching technology with generative AI to formulate a much more scalable version of generative AI. That's essentially what deep teaching is good at: making unsupervised learning processes a lot more efficient and a lot more scalable, and that's what allows us to achieve higher levels of realism. So you mentioned LiDAR, for example. There's been an interesting debate about LiDAR over the years: how does it compare to the other sensors, and is it critical or not? We do believe LiDAR has a lot of value to add to the autonomous driving stack, but we take a vision-first approach. That's because vision has the most information. It's the most information-rich sensor. One thing we've done recently that I think demonstrates that well is we trained a foundation model called WorldGen, which is able to simulate the entire stack: the camera data, the LiDAR, the path of the ego-vehicle in physical coordinates, the perception stack, et cetera. In particular, based on existing camera data, it can actually simulate the LiDAR stack. If you can get the LiDAR information from the vision information, I think that's a good demonstration that vision is more information-rich, because it's not going to be possible the other way around. You can't really go from the LiDAR information back to vision.
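To make the "LiDAR from vision" idea concrete for readers, here is a minimal sketch of the general technique sometimes called pseudo-LiDAR: a predicted per-pixel depth map is back-projected into a 3D point cloud using pinhole camera intrinsics. This is not Helm.ai's WorldGen, which is not public; the `predict_depth` function and the intrinsics values below are hypothetical placeholders.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map (meters) into a 3D point cloud
    using a pinhole camera model. Returns an (N, 3) array of camera-frame
    points, i.e. a LiDAR-like representation derived purely from vision."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth

# Hypothetical usage: `predict_depth` stands in for any monocular depth or
# generative model; the intrinsics are example values, not real calibration.
# depth = predict_depth(camera_image)  # (H, W) array of depths in meters
# cloud = depth_to_point_cloud(depth, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
```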

BP So do you have a ground truth to validate that? You have this simulation, you have vision only, and what you're saying is that just from that raw vision, you can simulate what the LiDAR would have seen and therefore show off how powerful vision is and also understand how it performs in simulation. Is there a ground truth like, you drive in a car, you record the vision, the radar, the LiDAR, then you take that back, remove the radar and the LiDAR, just give it the vision, have it run through the system and see if it matches the original? 

VV Exactly. That's exactly right in terms of the ground truth approach and how you can directly measure sort of fidelity scores for what you're able to predict there. 
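The episode doesn't specify which fidelity metric Helm.ai uses, but one common way to score a simulated LiDAR sweep against a recorded one is the symmetric Chamfer distance between the two point clouds. The sketch below assumes NumPy and SciPy; the loader functions in the usage comments are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(sim_points, real_points):
    """Symmetric Chamfer distance between two (N, 3) point clouds:
    the mean nearest-neighbor distance from each cloud to the other.
    Lower is better; 0 would mean the simulated LiDAR matches the recording."""
    d_sim_to_real, _ = cKDTree(real_points).query(sim_points)
    d_real_to_sim, _ = cKDTree(sim_points).query(real_points)
    return d_sim_to_real.mean() + d_real_to_sim.mean()

# Hypothetical usage with a logged drive:
# real = load_lidar_sweep("drive_0042/sweep_0100.npy")   # recorded ground truth
# sim  = simulate_lidar_from_cameras("drive_0042", 100)  # vision-only prediction
# print(f"Chamfer distance: {chamfer_distance(sim, real):.3f} m")
```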

BP That's cool. I'm the kind of person who believes that a really, really well-designed computer system is going to be overall better than human beings who are sometimes drinking or sometimes sleepy or sometimes texting, but we've seen time and again that when an autonomous vehicle is involved in an accident, even if it wasn't the autonomous vehicle that started it– another vehicle crashed into somebody, who then got hit by the autonomous vehicle because it didn't slow down fast enough– that immediately creates a political and regulatory clampdown on autonomous driving. All that being said, how much can you learn in a simulated world, how do you solve for edge cases, and then how do you go out and test that in the real world? 

VV I agree, of course, that autonomous driving has the potential to be much safer than human driving, because a properly engineered AI system is not going to have those kinds of failure modes, whether that's distraction or aggressive driving or drunk driving. But where autonomous driving systems can be limited is their overall knowledge of the world and being able to handle all those corner case scenarios. So with foundation models like WorldGen and the other foundation models we've trained, the idea is to simulate situations, from the perspective of the entire stack, that are definitely not typical but are nevertheless realistic. In contrast, if you look at the approach of collecting fleet data from some large number of vehicles, even if it's something like Tesla's fleet with millions of cars, the rate of occurrence of interesting corner cases has this exponential drop-off: as your system improves, what is considered an interesting case becomes more and more rare by definition. So the price you pay for collecting relevant data to improve your system actually increases exponentially as you improve, and that's not a good property to have for this kind of development. Whereas if you're able to create simulations that are just as realistic as the real world, essentially having a virtual fleet instead of a real fleet, you're able to simulate only the interesting cases. So instead of an exponential property you have a linear property, essentially, and all you need is compute and the right software. And of course, the big question is, can you close the gap between reality and simulation? With generative AI and all the advances, combining deep teaching and generative AI, it does seem to be possible.
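A rough back-of-the-envelope illustration of this scaling argument: the real-world miles needed to observe a corner case grow with its rarity, while the cost of generating it in a realistic simulator grows only with the number of examples you want. All numbers below are made up for illustration and are not from the episode.

```python
# Back-of-the-envelope sketch of fleet-data cost vs. simulation cost.
# Every number here is a placeholder chosen only to show the shape of the curves.

def fleet_miles_needed(rarity_miles_per_event, n_examples=1_000):
    """Real-world miles a fleet must drive to observe n_examples of a
    corner case that occurs once per `rarity_miles_per_event` miles."""
    return rarity_miles_per_event * n_examples

def simulated_cost(n_examples=1_000, gpu_hours_per_example=0.1):
    """With a sufficiently realistic simulator you generate only the
    interesting cases, so cost grows with the example count, not rarity."""
    return n_examples * gpu_hours_per_example

for rarity in (1e3, 1e6, 1e9):  # common, rare, extremely rare corner case
    print(f"1 event per {rarity:.0e} miles -> "
          f"{fleet_miles_needed(rarity):.2e} fleet miles vs. "
          f"{simulated_cost():.0f} GPU-hours of simulation")
```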

BP The things that capture people's attention the most are an AI that can hold a conversation like a human being, or an AI that can make a photorealistic image, or an AI that can generate video off the cuff, but you can see how at least those latter two would be super useful in creating a high-fidelity simulation and closing the gap between reality and sim. Let me ask, how widely deployed is your system, and to what level of autonomous driving is it being used by auto manufacturers or other players in the space who are trying to move that industry forward? 

VV We work with a number of major automotive companies. I can't speak to the specifics of those engagements, how many vehicles, and all that, but we do have a mature product that is production-bound. We're developing an urban perception system for surround-view cameras, and in terms of the level of autonomy we're targeting, we decided early on that we would not be a pure-play, fully autonomous driving company, because that has a lot of issues, as we've seen from what other companies have done. Whereas with partial autonomy, with something like Tesla FSD as the prototypical example, there's a lot of demand for those kinds of systems immediately and they can be commercialized immediately. 

BP So it makes more sense as a startup to have a super polished and capable and robust Level 3 versus trying to get to Level 5. Invest in an area where the consumer demand is there and it can actually be implemented today, and then sure, maybe over the long term you're getting to that Level 5. 

VV Exactly. But one important thing is that the way we approach autonomous driving is essentially a unified approach across the different levels of autonomy. What I mean by that is that there are approaches to, say, partial autonomy that don't scale up to full autonomy, and vice versa. For example, if you're relying too much on supervised machine learning, if that's the critical thing you're using, it's not going to scale up to fully autonomous driving. And on the fully autonomous side, if you're relying too much on LiDAR and HD maps, it's not going to scale down to a partial autonomy system. It's going to cost too much. So what we pursue is a vision-first approach that also leverages foundation models, with deep teaching and generative AI, to scalably resolve corner cases. For any given production requirement, whether it's L2, L3, or L4, once you define the hardware stack and the operational domain, we're able to build the optimal software solution for that using the same pipeline. 

BP You mentioned Gen AI, and one of the interesting things that I saw which we haven't really seen come to fruition but which companies like NVIDIA and Google showed off, was this idea that when you add an LLM into the robotics stack, there's a new facility in engaging with the world. Now you can say, “Hey, I've got a lot of toys on the table. Can you grab the extinct animal and bring it to my kid?” and it picks up the dinosaur. Or, “The table looks messy. Can you make sure the silverware and the plates are set up for dinner?” and it goes and does that because it's a multimodal model so it understands language and it understands vision. When you said before, semantic analysis, and when you're talking about what your model does, does it not just see the world and know, “Okay, when I see something that looks like a bike, this is how I should respond to that,” but it also has an understanding of, “This is a bike. This is a person. This is a car. This is a stoplight,” in the way an LLM might have an ‘understanding’ of that?

VV There's a couple of ways to think about it. Suppose you feed in only video data of the world, and the question is: can you use just video data to learn about the environment? If you fed enough data into these foundation models, would they learn to replicate highly realistic behaviors for all the different agents? I think the answer is roughly yes, provided you had the right data, but there are of course always going to be situations where the data you have is not as plentiful as you might like. So I think bringing in multimodal foundation models, leveraging text information in conjunction with vision information, makes a ton of sense, because there are certain concepts that are better represented in text that you can then bring into the visual representation and really benefit from. I do think that's valuable from many perspectives: increasing the accuracy of the simulation, using it to describe what's actually happening with the driving system, et cetera. So I think we're going to see more and more of those kinds of foundation models for sure.
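Vlad doesn't name a specific multimodal model, but as one concrete illustration of bringing text-described concepts into the visual representation, a CLIP-style vision-language model can score how well natural-language driving concepts match a camera frame. The model checkpoint, concept labels, and image path below are illustrative only, not anything Helm.ai has said it uses.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# CLIP is only an illustrative stand-in for a multimodal foundation model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = [
    "a cyclist crossing the road",
    "a pedestrian waiting at a stoplight",
    "an empty highway at night",
    "a construction zone with traffic cones",
]
image = Image.open("dashcam_frame.jpg")  # hypothetical camera frame

# Score how well each text-described concept matches the image.
inputs = processor(text=concepts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for concept, p in zip(concepts, probs.tolist()):
    print(f"{p:.2f}  {concept}")
```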

BP That's cool. Some of the early folks at the cutting-edge labs at some of the big tech companies said, “Mark my words, in 10 years my kids will not be driving.” It seems like that won't come to pass. I still have a few years until my kids are teenagers, so maybe I'll get lucky and they won't have to drive. But give me your sense of where we're at now, what you're excited about, and what you hope to see in the next year or two in terms of breakthroughs. There's a dichotomy here: there are these autonomous taxi systems operating in certain cities where things are really well mapped out and the speeds are low enough, and they provide enormous benefit to people, certainly people with different abilities. But unless you're in one of those cities, autonomous driving has probably fallen to the back of your mental radar. It's not in the zeitgeist in the same way. So how do you look at it? Have we made progress? Are you excited for what you see in the year to come? And what is your company going to focus on, for example, to continue to drive –no pun intended– to push things forward? 

VV I do think there are going to be some exciting developments in the space for sure, both in terms of those kinds of Waymo-type approaches as well as the Tesla-type approaches. From the perspective of, given a specific hardware stack, how do we build the optimal software that fits into it for a particular operational domain, I think there's been quite a lot of progress on that problem, so that is not necessarily the bottleneck anymore. Six years ago, in the early days, it was still a big bottleneck because supervised learning was the predominant method, and that just didn't scale to the problem. For those fully autonomous fleets, I think the challenge is, of course, how to make them safe enough, but also how to make them commercially viable, because the associated cost is still far too high in those contexts. What I think is really exciting is that you're going to see many automakers, by necessity, pursuing an approach that much more heavily leverages partial autonomous driving as the primary commercialization strategy, similar to Tesla. That is ultimately Tesla's approach: they launch an FSD system that is not fully autonomous yet, but they hope to get to a certain point where they can just flip a switch and their millions of cars become autonomous simultaneously. That's very powerful. Other automakers are not positioned to do exactly the same thing in the sense that they just don't have that fleet, and they're years behind in building that kind of fleet. But with the advent of generative AI, they're going to be able to leverage simulation, partnering with companies like Helm, et cetera, to get there much faster while still pursuing that partial autonomy approach, because software is going to be one of the key differentiators for vehicle sales in the coming years. So I am excited to see that come to fruition. I think it's going to be a big change for the space, because in the earlier days there were a lot of false promises about timelines, and I think that caused certain investments to really not do well. But I think pretty much the entire autonomous driving industry now understands that that was not the right approach, and they're pivoting toward this kind of Tesla-style approach, leveraging these new technologies that are cropping up at a good time. 

BP That's interesting. Let me ask you a question. As other major auto manufacturers deploy more and more semi-autonomous or autonomous helper features into their cars, are they gathering data from those fleets that they can then use to train with? Every time I get out on the road in my car and I turn on lane assist and follow, is that data that is going back to the auto manufacturer and being used to help? 

VV Absolutely. With any kind of fleet deployment, you want to leverage the data that's being collected, or at least the interesting cases that you collect. But that's going to be valuable in proportion to the size of your fleet and it just takes a long time to get to a certain fleet size. So we envision being able to launch systems that are highly competitive with something like Tesla FSD without having those large fleets. I think that that's going to be important for vehicle sales.

BP I was looking up WorldGen as we were talking and I got to your blog, and there was something in there about how Helm was able to do some pretty impressive autonomous work with only a camera and one GPU. Is the massive interest in building frontier models and staying ahead of the competition from the major tech companies making it more difficult for a startup like yours to get access to GPUs or find the compute you need to train? Does having those folks do these things at huge scale make it cheaper and easier for everyone to do it at scale, kind of like cloud computing did, or are you going to be on a list waiting two years to get the A100, H100, or whatever next great chip is going to be what we use for AI?

VV Firstly, the one camera, one GPU, that's from, I think, 2018. That was our first demo where we had an autonomous system going down a steep and curvy mountain road, and we showed very favorable performance compared to an existing AI system. These days we work with a full stack system that has multiple sensors, but I wouldn't say that compute shortages have really impacted us. I think that generally the cloud providers want to work with cutting-edge AI companies, but we have not really had issues securing compute. It depends on how much compute you're trying to secure. If you're trying to get thousands of H100s, then maybe it becomes a different conversation. But we've seen NVIDIA’s stock price go crazy and I think that they're doing as much as they can to keep pace with demand. So it hasn't really impacted us from that perspective, but it's something that you just have to pay some attention to and make sure everyone is aware of your needs. But it has not been a blocker. 

BP All right, last question before I let you go. What is your own embrace of autonomous driving like? Are you out there in an FSD-equipped car letting it do its thing, both appreciating it and studying it, or how autonomous are you on the road? 

VV We perform a lot of autonomous driving testing on our fleet, so that's part of the regular testing framework, where we're both gathering data for validation purposes and testing out software updates pretty frequently. We also do competitive benchmarking, so to speak, so I've definitely used the FSD system and other autonomous stacks. I very much do embrace those capabilities, and it's interesting to see the progress that various companies have made. I'm definitely living and breathing in that world, so to speak. 

BP Cool. 

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out a user who came on Stack Overflow and helped spread a little curiosity or knowledge. A Lifeboat Badge was awarded to user3330840 for providing a great answer to “My commits appear as another user in GitHub?”. You want to get credit for those commits. If this has happened to you, we've got a good answer for you; it has helped over 4,000 people, so congrats again to user3330840 on the Lifeboat Badge. I'm Ben Popper, Director of Content here at Stack Overflow. Find me on X @BenPopper. Email us with questions or suggestions for the program at podcast@stackoverflow.com. We take listeners on as guests, we talk about things folks suggest, you can always write for the blog; we like fans to participate. And if you liked today's episode, leave a rating and a review. It really helps. 

VV My name is Vlad Voroninski. I'm the CEO of Helm.ai. We're an autonomous driving company pushing the state-of-the-art with frontier foundation models. Definitely if you're interested in the space, visit our website, we're always hiring. It's just helm.ai. 

BP All right, if you're listening and you're thinking about getting into the space or work in deep learning, check it out, maybe they’ve got a job for you. Thanks for listening, everybody, and we will talk to you soon.

[outro music plays]