Ben and Ryan chat with Babak Behzad, senior engineering manager at Verkada, about running a pipeline that vectorizes 25,000 images per second into a custom-built vector database. They discuss whether the speed is due to technical brains or brawn, the benefits of on-device vs. off-device processing, and the importance of privacy when using image recognition on frames from a video camera.
Verkada is a cloud-based video security company.
Back in the innocent days of 2021, we spoke with a company that makes smart dashcams. See how far video and image processing has come.
Congrats to Reg for earning a Lifeboat badge for their answer on What is the difference between JSP and Spring?
[intro music plays]
Ryan Donovan The Stack Overflow community has questions, our CEO has answers. We're streaming a special AMA with Stack Overflow CEO Prashanth Chandrasekar on February 26th over on YouTube. He'll be talking about what's in store for the future of Stack, and you'll have the chance to ask him anything. Join us.
RD Hello, everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan here at Stack Overflow, and I'm joined by my once and future colleague, Ben Popper.
Ben Popper Ryan, I'm back. I'm back from the dead.
RD If you love something, set it free.
BP Exactly. Hi, I'm Ben Popper, podcast host here at Stack Overflow, former Head of Content, now over at builder.io, but will never relinquish my death grip on the microphone here at Stack Overflow. Too much fun, too many great memories.
RD Too much fun. Well, hope it's a fun one today because we are joined by a great guest: Babak Behzad, Senior Engineering Manager over at Verkada, and we're going to be talking about AI, of course, but specifically AI processing of images, super fast pipelines, and what it takes to build your own vector database. So Babak, welcome to the podcast.
Babak Behzad Thank you. Thanks for having me. I'm super excited to talk to you about this.
BP So from a highest level, let's just give people a little bit of context. I know you worked at a bunch of different startups doing a bunch of different kinds of technology over the years. We can get into that later. What is it you're doing today and what is your company focused on?
BB So Verkada, as you know, is a leader in the video security industry, especially on the cloud, and our mission at Verkada is protecting people and places in a privacy-sensitive way. Our founders started by building cameras for video security, and now we have a platform, basically a set of different devices from alarms to access control, intercom, guest management, and obviously cameras, to protect people and places. We have over 28,000 customers using Verkada day to day to protect the safety of their people and places.
BP You mentioned you have 28,000 customers. Can you just describe briefly, are we talking principally about commercial customers and large-scale facilities, are we talking about high-end residences, or are we talking about your regular mom and pop store? What does the average customer look like for you?
BB So at Verkada we are actually very lucky to have a diverse set of customer use cases. We have gyms like Equinox and the YMCA, we have mom and pop shops, as you mentioned, like Megan's Organic Market, we have local Goodwills, and the global footprint of retailers like Canada Goose. Our biggest customer segment is schools, so a lot of schools in the United States are now using Verkada to protect their students. We also have a lot of hospitals and healthcare, and we have a big footprint in manufacturing. In fact, our AI models are very helpful for manufacturing use cases– checking that people are wearing safety vests and other PPE. And so AI search is actually very useful for them as well.
RD Interesting. So let's talk about the image processing– creating vector embeddings for a huge number of images. What is it– 25,000 images per second? What kind of AI processing is that– you're creating the vector embeddings for these? And what kind of tech does it require to move that quickly?
BB So let me give you a little bit more background. I joined Verkada mainly because of a very exciting project we call AI search, or AI-powered search. Basically, we have more than 1 million cameras around the world now that are streaming, and obviously we can't send all the video footage back– it would use a lot of network bandwidth and a lot of data. But our cameras are smart and they do object detection on the device. If customers enable people analytics or vehicle analytics, we detect those objects on the camera and send them back to our back end. It's very easy nowadays to run the smaller models on devices, but AI search, which is a product we've been building since last year at Verkada, uses the latest large language models and vision language models, which we'll get into the details of, and running those on cameras is not possible at this point, so that's why we have to send all these images back. As you mentioned, in peak hours we have more than 20,000– I actually checked yesterday, and our peak is around 30,000 images per second– so we have to run these images against the AI models, get the vector embeddings, and store them in our vector database. Later on in the product, customers can search for anything in natural language. So you can search for a person wearing a hoodie, or the color of the hoodie, or any identifier for people or vehicles. It actually works really well with vehicle makes and models. It has an understanding of the world, because the model was trained on two billion images and their captions, so it knows a lot about the world. Being able to build this pipeline was a big challenge for us, and so was making the search very fast.
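[Editor's note: a minimal sketch of what an ingestion worker in a pipeline like the one Babak describes might look like: cameras push detected object crops, workers batch them, embed them with a CLIP-style model, and write vectors plus metadata to a vector store. The names crop_queue, embed_batch, and vector_store are hypothetical placeholders, not Verkada's actual code.]

```python
# Hypothetical ingestion worker: batch incoming crops, embed them, store vectors.
import numpy as np

BATCH_SIZE = 256

def ingestion_worker(crop_queue, embed_batch, vector_store):
    while True:
        # Pull up to BATCH_SIZE crops so the GPU always sees full batches.
        batch = [crop_queue.get() for _ in range(BATCH_SIZE)]
        images = [item["crop"] for item in batch]

        # One forward pass for the whole batch -> (BATCH_SIZE, dim) embeddings.
        embeddings = embed_batch(images)
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

        # Store vectors alongside the metadata needed for later filtering.
        vector_store.add(
            vectors=embeddings,
            metadata=[{"camera_id": item["camera_id"], "timestamp": item["timestamp"]}
                      for item in batch],
        )
```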
BP So when you say it was trained on images and captions and then it's sending you back images as frames, does it understand motion? If I said, “Find me a guy who's walking versus a guy who's running,” does it understand that, and does it understand it frame by frame? Because you said an image language model– obviously we know about multimodal models, but is it images or video or both?
BB That's a great question. So we leverage a model that was published by OpenAI in 2021 called CLIP– Contrastive Language-Image Pre-training– and this model was trained on images and their captions, as I mentioned. OpenAI never released the training data, but there are open source versions of this model trained by the open source community, and we picked one of those called OpenCLIP. The model is only capable of understanding single images, and that's why on the camera and the devices we actually detect using motion and YOLO models, which do look across multiple frames. The camera detects the people, takes a crop– actually a high quality version of that subject, either person or vehicle– and sends it back, and we only index those crops. That said, because it was trained on a lot of data and it's a very capable architecture, if you search for a person running it actually shows a lot of people who are running, even though it doesn't look at multiple frames– it understands the attributes associated with running, like the clothing or the posture of a running person. It works pretty well, but we advise our customers not to use it for action detection. We actually have some projects under development right now to run CLIP or similar models on multiple frames, but the one we launched last year was only on static images.
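[Editor's note: for the curious, this is roughly what embedding an image crop and a text query with OpenCLIP looks like. The model name and pretrained checkpoint below are common public examples from the OpenCLIP documentation, not necessarily what Verkada runs in production.]

```python
# Illustrative only: embed an image crop and a text query, then compare them.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("person_crop.jpg")).unsqueeze(0)
text = tokenizer(["a person wearing a red hoodie"])

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)

# Cosine similarity between the crop and the query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())
```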
BP Got it.
RD I want to get back to the processing speed. I wonder if you think processing 25,000 images per second is primarily because of clever algorithms, or do you just have a lot of hardware behind this?
BB That's a great question. So one of the big challenges we had was actually keeping the cost of this project under control, because, as you know, we run everything in the cloud, so we have GPU instances. As you mentioned, we had to use a combination of clever algorithms and also throw some GPU instances at the problem to make the pipeline possible. Just as examples on the algorithm side– we do a lot of deduping. We make sure that we don't index duplicate images. We do the pre-processing earlier in the pipeline so that GPU utilization stays at almost 100% all the time. We also use the latest GPU optimization techniques. NVIDIA has a very good open source inference server called Triton, and it supports a lot of optimizations, things like TensorRT, which quantizes the models and lets them run on smaller GPU instances.
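[Editor's note: the episode doesn't describe Verkada's actual deduplication algorithm, but here is one simple, illustrative way to skip near-duplicate crops before they ever reach the GPU: compare a cheap downscaled grayscale signature of each crop against the previous crop for the same object track.]

```python
# Illustrative dedup sketch (not Verkada's algorithm): skip a crop if it is
# nearly identical to the last crop indexed for the same object track.
import numpy as np
from PIL import Image

_last_signature = {}  # track_id -> small grayscale signature

def cheap_signature(image: Image.Image, size: int = 16) -> np.ndarray:
    """Downscale to a tiny grayscale thumbnail and standardize it."""
    small = np.asarray(image.convert("L").resize((size, size)), dtype=np.float32)
    return (small - small.mean()) / (small.std() + 1e-6)

def is_duplicate(track_id: str, image: Image.Image, threshold: float = 0.95) -> bool:
    """Return True if this crop is almost the same as the previous one for the track."""
    sig = cheap_signature(image)
    prev = _last_signature.get(track_id)
    _last_signature[track_id] = sig
    if prev is None:
        return False
    # Correlation between standardized signatures; near 1.0 means "basically the same crop".
    corr = float((sig * prev).mean())
    return corr > threshold
```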
BP So we were talking a little bit about the level of this and the fact that this is being done on the edge. I know we want to get into what it takes to do this and the vector database that you stood up from the ground. I'm curious, do you foresee that the future of this kind of hardware and software will become nanoscale models that can run on the edge and then converse with a larger model in the cloud as needed? You can now put DeepSeek on an iPhone and get some pretty incredible results. The open source community seems to always be taking the best of what the big AI labs do and finding ways to get close to parity plus shrink it down. What's your perspective on that?
BB I pretty much agree with you. Apple is also following the same kind of pattern, where smaller models run on-device. That's exactly what we are thinking too. As I mentioned, the object detection model is currently very capable, almost state of the art, and we are able to run it on our devices, which is why we moved it there. It can run at multiple frames per second, no network bandwidth required, and it's also a lot more cost effective. But currently, to support AI search and leverage these latest technologies in large language models and vision language models, we have to run the bigger models, the CLIP models, in the cloud. I can definitely see these models getting distilled and smaller, and then we can run them at higher frame rates on-device, and maybe later on, as you mentioned, they trigger an action, and then a bigger model helps identify something our customers are interested in, to send alerts or power search.
RD I think anytime I talk to folks who are doing AI processing of camera images, there's always a concern about how much privacy you're preserving. I've talked to folks who make the in-car ones, and they don't pick up faces, they don't pick up license plate numbers. Are there any things you actively try to avoid recognizing?
BB So as I mentioned in our mission, we take privacy very seriously at Verkada, so there were a lot of hard requirements for this project. For example, one was that everything has to be hosted on our own cloud– we are not sending images or anything to OpenAI or any third-party API. Everything is also controllable by our customers, so you can disable any of these people analytics or vehicle analytics at the camera level, at the device level, or at the org level. The AI search feature, which uses this CLIP model, is also a separate feature that you can turn on or off based on your needs at the organization level. We also have very granular permissions, so for example, only organization admins can use this feature, rather than site admins or site viewers. On the implementation side, we have multi-tenancy, so on the data governance side every one of our organizations is stored completely separately on its own shards in the vector database that we built in-house, so there won't be any sharing of data. All the embeddings are encrypted so nobody has access, and every organization has its own key. So we took all of these security measures into account when we designed and implemented this project.
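[Editor's note: to make the multi-tenancy idea concrete, here is a rough sketch of per-organization shards and per-organization encryption keys. Fernet is used only as a familiar example; Verkada's actual key management and encryption scheme is not described in the episode.]

```python
# Sketch: each org gets its own storage prefix (shard) and its own key.
import numpy as np
from cryptography.fernet import Fernet

org_keys = {"org_123": Fernet.generate_key()}  # in practice, managed per org by a KMS

def shard_path(org_id: str, shard_id: int) -> str:
    # Every org writes to its own prefix in the blob store.
    return f"embeddings/{org_id}/shard-{shard_id:05d}.npy.enc"

def encrypt_embeddings(org_id: str, vectors: np.ndarray) -> bytes:
    return Fernet(org_keys[org_id]).encrypt(vectors.astype(np.float32).tobytes())

def decrypt_embeddings(org_id: str, blob: bytes, dim: int) -> np.ndarray:
    raw = Fernet(org_keys[org_id]).decrypt(blob)
    return np.frombuffer(raw, dtype=np.float32).reshape(-1, dim)
```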
BP So let's get into the juicy details here. This is what the audience came to hear. They want to know about building and running your own cloud and your own vector database, not the strategy that most startups would adopt because there's a lot of facility and ease and leverage for scale and piggybacking off what the big players have already done, but for something where the data is as sensitive as this, it kind of makes sense. So talk a little bit about the decision to do it, the planning that went into it, and how it actually got architected and works in production.
BB For sure. So as I mentioned, privacy was definitely one of the biggest considerations in designing the system. Scalability was number two, and cost effectiveness was another one. Because of these, we decided to host all the models in-house. In fact, as I mentioned, Verkada does not send any of the customers' images or videos outside. So all these models are hosted in our own cloud, and that adds a level of complexity because now we have to host all these machine learning models on our own GPU instances. But it was a great engineering challenge, and we have a lot more control over it and can optimize it. So that's one of the things we had in mind– hosting the models locally. As for the vector database itself, as you mentioned, we could have leveraged a third-party vector database in the cloud like Pinecone, but again, because of privacy, we decided not to send the data out, so we first started looking at open source. We looked into Qdrant and Weaviate, which are some of the rising vector databases out there that LLM users especially use. We actually built a POC with both of them, but very soon we realized that our use case is very different, because most of these vector databases are mostly read intensive. You ingest your documents, let's say for RAG purposes, and then you read a lot– you just send the vector and try to search. Our use case is a bit different because of the scale of all these images coming in, so it's a lot more write intensive. With that in mind, we started looking at the architecture of these databases and got some inspiration from how they are implemented, but because of privacy and cost effectiveness we decided to build it in-house. We can discuss the design in detail, but we basically went with a very simple design– we use NumPy arrays to store the vectors because they're very efficient for that, and we leverage a blob store for storing all this data, with a lot of caching and a lot of optimizations on top of it, both for indexing and for searching. And now we have our own vector database, which is working really well so far. That's on the vector database side, and then we did a lot of optimization on the search side to keep the caches warm as customers are searching. The implementation of the vector database, as I mentioned, uses NumPy and dot products to find the closest vectors. We implemented both ANN– approximate nearest neighbor search– and exact nearest neighbor, but we opted for higher recall to show the most accurate results to customers, because most of our customers' use cases are like finding a needle in a haystack. They are doing an investigation; they're looking for something specific. So we decided not to use approximate nearest neighbor, because we just want to find the closest vectors to what customers are searching for, and currently we actually do the exact matrix multiplication to find the vectors.
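[Editor's note: a minimal sketch of the exact-search approach Babak describes: store normalized embeddings in a NumPy matrix and rank by dot product. The blob-store loading, caching, and sharding he mentions are omitted here, and the class name is just for illustration.]

```python
# Minimal exact (non-approximate) nearest neighbor search over normalized vectors.
import numpy as np

class ExactVectorIndex:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata = []

    def add(self, vectors: np.ndarray, metadata: list) -> None:
        # Normalize so a dot product equals cosine similarity.
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, vectors.astype(np.float32)])
        self.metadata.extend(metadata)

    def search(self, query: np.ndarray, k: int = 50):
        if self.vectors.shape[0] == 0:
            return []
        query = query / np.linalg.norm(query)
        scores = self.vectors @ query           # one matrix-vector product
        k = min(k, scores.size)
        top = np.argpartition(-scores, k - 1)[:k]  # exact top-k, unordered
        top = top[np.argsort(-scores[top])]        # order by score
        return [(float(scores[i]), self.metadata[i]) for i in top]
```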
RD It's interesting, we've talked to a lot of folks who either just have a vector database or they're a company that's bolting the vector onto other things, and I wonder how hard is it to add vector database capabilities? Is it just like you add a column that's an array of 768 floating point numbers and call it a day or is there something more complicated to it?
BB So again, it depends on the use case. For us, the vector is actually first class in this vector database. We have a bunch of metadata– things like which camera this is, or the timestamp, when did this happen– and we store that metadata with the vector so that we can filter very, very quickly. But for us, the vector was the main thing we store, and that's why we can also do the search very quickly. Depending on the use case, I've seen other companies use Elasticsearch and just add a vector field type, or pgvector, which lets you do it in Postgres. But as I mentioned, our use case was mostly to build a vector store that is really good at scaling for writes, and then also make the search fast. One other thing I wanted to mention: for investigation purposes, the expectation is not real-time search. We are not optimizing for sub-second search latency; we are optimizing for a few seconds of latency, and that also had a big impact on the design. Customers doing investigations are okay with waiting a couple of seconds to get their results, while in some other use cases you are looking for sub-second latency.
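[Editor's note: a hedged sketch of the metadata pre-filtering idea, building on the toy index above: mask down to the rows matching the camera and time window, then score only those. The field names are illustrative, not Verkada's schema.]

```python
# Filter by metadata first, then do the exact dot-product ranking on the survivors.
import numpy as np

def filtered_search(index, query, camera_ids, start_ts, end_ts, k=50):
    mask = np.array([
        (m["camera_id"] in camera_ids) and (start_ts <= m["timestamp"] <= end_ts)
        for m in index.metadata
    ])
    if not mask.any():
        return []
    candidates = index.vectors[mask]
    candidate_meta = [m for m, keep in zip(index.metadata, mask) if keep]

    query = query / np.linalg.norm(query)
    scores = candidates @ query
    k = min(k, scores.size)
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return [(float(scores[i]), candidate_meta[i]) for i in top]
```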
BP Right. What do edge cases look like in a vision model like this? It's interesting what large language models can and can't do, the things that trip them up that are simple to a human being but complex to them, even if the language model is at a PhD level in 17 different topics. What does that look like on the vision side? I'm at a school, I want to know when a child is leaving and entering the building, but suddenly I have a very diminutive adult but they’re bald and have a beard. Is the model going to say, “Hey, that's a child because they're this size,” or are they going to say, “Hey, that's an adult because they have a beard.” What does an edge case look like?
BB That's a great point. Let me also just mention one thing before answering your question. One thing that we realized, especially with the CLIP model– and as you know, a lot of these vision language models are actually using CLIP to vectorize images– is that we first tried to run it against the whole thumbnail from the camera. And because there is so much detail, it actually fails to see and pay attention to every single detail of the image. The winning point for us was that, because we had the object detection running on-camera, we run the CLIP model only on the crop of the person, let's say, or the vehicle. So there is a much narrower set of things for the CLIP model to pay attention to, and that's why, for example, if you give it a cropped picture of a vehicle, it's really good at identifying the make, model, and color, but if you give it the whole thumbnail from the camera, it may not even see the car, and definitely won't be able to find the make.
BP Before we get to the edge cases, that reminds me of something funny. We're getting into an era– I’ll timestamp this Friday, January 24th. Many AI providers are now rolling out these operators where the idea is they'll use the computer for me. And I heard something very funny that at least earlier iterations tended to get easily distracted by advertisements. You'd send them off to do something, but if you showed them the whole web page and it was like, “Buy me boots,” and there's an advertisement, they're going straight to the ad. So you’ve got to be careful with that.
BB Exactly, but again, one other thing that we have seen a lot in our experiments is the effect of the background. Sometimes you can search for a person wearing red and it shows you a person not wearing red, but maybe there was something red in the background, and it's not great at recognizing that it has to tie that red color to the clothing. I actually wanted to bring up something that I think was very important for us to build. Because of seeing these kinds of cases, we started thinking of fine-tuning CLIP. That actually worked very well for person attributes like clothing color, wearing glasses, these kinds of things– especially because a lot of our customers want conjunctive queries or negative queries. As I mentioned with the PPE use case, a lot of our customers asked us, can you support 'not'– a person not wearing a safety vest, or a person not wearing a helmet. So we spent a full quarter researching and figuring out the best way to support this, and as I mentioned, fine-tuning is definitely one valid way. But we came up with a very smart way to support this, which is using LLMs to break down the query into atomic queries, then search these atomic queries separately using the CLIP embeddings– because the atomic queries make each query simple, smaller, looking for only one thing– and then combine the results in a smart way to support 'and' and 'not.' One other complexity about this problem that we solved is how to set the threshold on the returned results. Based on the query you search for, the scores you get from the model for all these images have a different distribution. If your query is very fine-grained, the scores look very different than if your query is very general. So where do you put that threshold of saying, “From this score up, these are the results I need to return”? We spent a lot of time on what we call dynamic thresholding– how do we set the threshold dynamically. Just to give you more context, if you do, say, face search for identifying a face, you get the embedding of the face, and then the threshold is mostly just one static threshold for all humans– you say, “If the score is more than this threshold, this is the same person.” But for CLIP, or the AI search product that we worked on, these score distributions change based on the query, so we had to spend a lot of time on finding the threshold based on the query. That's a very important thing to consider when you're working with these large models in a ranking or retrieval problem.
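[Editor's note: a simplified illustration of the atomic-query idea. An LLM breaks a query like "a person wearing a helmet but not a safety vest" into positive and negative atomic queries, each is scored with CLIP embeddings, and the scores are combined. Here decompose_query stands in for the LLM call, and the per-query threshold of mean plus two standard deviations is just one plausible heuristic, not Verkada's production formula.]

```python
# Illustrative 'and'/'not' search via atomic queries and dynamic thresholds.
import numpy as np

def decompose_query(query: str):
    # Placeholder for an LLM call returning (positive_atoms, negative_atoms).
    return ["a person wearing a helmet"], ["a person wearing a safety vest"]

def dynamic_threshold(scores: np.ndarray) -> float:
    # Each query produces a different score distribution, so threshold relative to it.
    return float(scores.mean() + 2.0 * scores.std())

def search_with_negation(query, image_embs, embed_text, k=50):
    positives, negatives = decompose_query(query)

    # Conjunctive 'and': an image must match every positive atomic query.
    pos_scores = np.min([image_embs @ embed_text(p) for p in positives], axis=0)
    keep = pos_scores >= dynamic_threshold(pos_scores)

    for n in negatives:
        neg_scores = image_embs @ embed_text(n)
        # 'not': drop images that also strongly match the negated atomic query.
        keep &= neg_scores < dynamic_threshold(neg_scores)

    idx = np.where(keep)[0]
    idx = idx[np.argsort(-pos_scores[idx])][:k]
    return idx, pos_scores[idx]
```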
RD Are people able to search images with another image?
BB That's actually a great point. That was a follow-up we released after the initial launch. We got feedback from customers that a text-based query is awesome, but they also sometimes have an image of a person they're looking for, or something similar. So we worked on what we call the 'more like this' feature– you can either upload an image, or, because we return a diverse set of results, you can point at one of the search results and say, “No, I'm looking for more like this result.” What we do in that case is we already have the embedding of that image– either you upload it and we get the embedding from the model, or if it's a returned result, we already have its embedding. We also have the text embedding of your query, and the good thing about these embedding models is you can do embedding algebra on them. So we take a weighted average of the text embedding and the image embedding, and it actually returns more images that are more like that person.
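[Editor's note: the "more like this" embedding algebra is roughly this simple. The 50/50 weighting below is an illustrative choice, not the weighting Verkada uses.]

```python
# Blend a text query embedding with a selected image's embedding, then search again.
import numpy as np

def more_like_this(text_emb: np.ndarray, image_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_emb = image_emb / np.linalg.norm(image_emb)
    combined = alpha * text_emb + (1.0 - alpha) * image_emb
    return combined / np.linalg.norm(combined)

# The combined vector is then used as the query against the same vector index,
# e.g. index.search(more_like_this(text_emb, image_emb), k=50).
```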
BP One thing that I've seen which I thought was really interesting is the idea that people who are concerned with facial surveillance or government surveillance can use a countermeasure where they wear a certain pattern, almost like a QR code or some kind of abstract design, and it really throws off a system that's used to looking at your standard human being in clothes and cars. Is that something you have to think about?
BB Yeah. So not for AI search specifically, but for our face search feature we have seen very low quality faces and difficult angles, and some of the examples you're talking about can definitely throw off the embedding– they don't work very well. The current solution we have is to run a face quality model on top of our face embedding model. As we get the faces, we run them through a face quality model whose whole purpose is to give you a measure of how high quality the image is– not just in terms of pixel quality, but also angle, and things that can throw it off, like wearing a face mask or glasses. And we only index the images that pass that quality threshold. But I know what you're talking about, these adversarial examples– at least I haven't encountered them yet at Verkada.
RD I think they called it the dazzle makeup. It's interesting that you talk about the face quality model and then there's another model on the camera to track objects. How many different models do you have running at any given time?
BB So for our face recognition system– again, customers can enable or disable it based on their needs– we run three models, one after the other, and we actually have an implementation that can run on our latest generation of cameras, which are more capable of running CV models, all local, all on-device. The initial one is the YOLO people detection, and after that there is face detection. The face detection model actually does both face detection and quality together, so not only does it give you the bounding box around the face, it also gives you a quality score. After that, the face embedding model gives you a vector characterizing that face. We also have these models running on the back end using the NVIDIA Triton server I mentioned. There are three of them, and Triton allows you to do ensemble models, so you can run them one after the other, keeping GPU utilization high and increasing throughput. We have to do that to support our older generation cameras, and as I mentioned, we have the CLIP model running for our AI search. We also have another feature at Verkada called attribute models, which actually predates AI search– custom-trained classifiers for things like gender appearance, is it a male or a female. We have another one for upper body color, which gives you nine different colors, one for lower body color, and one for wearing a backpack or not. For vehicles, we have the same for the type of car– SUV, sedan, truck– and also the color of the car. The beauty of the advancements in AI, and of CLIP and these large language models and vision language models becoming more general, is that they save us from needing to train one classifier for each attribute we want to support. If we didn't have the CLIP model, we'd basically have to have our machine learning engineers listen to customer feedback and keep adding and training new classifiers for every single feature our customers ask for, but now we are actually leveraging CLIP. And one thing, because I know your audience is interested in the user experience: the initial product around AI search was just to give customers a natural language text box so they could search for anything, but then we realized that a lot of our customers are actually used to using filters. So we are soon launching more filters backed by the CLIP model– specific filters for make, for model, for identifiers of a car, like whether it has a roof rack, a rear-mounted wheel, or damage to the body. These are things our customers have been asking for. They can search for them in natural language and we use the CLIP model to find the results, but for the user experience we also decided to make them specific, called-out filters in the product as well, so that it's more discoverable for them.
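[Editor's note: a rough sketch of the three-model face chain described here: person detection, then face detection with a quality score, then face embedding, with the quality gate deciding whether a face gets indexed. The model functions and the threshold value are placeholders; in production this runs on-camera or as an NVIDIA Triton ensemble rather than as a Python loop.]

```python
# Hypothetical chained face pipeline with a quality gate before indexing.
QUALITY_THRESHOLD = 0.7  # illustrative value

def crop(image, box):
    # Placeholder: return the sub-image inside `box` (x1, y1, x2, y2) of a NumPy frame.
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def process_frame(frame, person_model, face_model, face_embed_model, index):
    for person_box in person_model(frame):                 # YOLO-style person detection
        person_crop = crop(frame, person_box)
        for face_box, quality in face_model(person_crop):  # face detection + quality score
            if quality < QUALITY_THRESHOLD:
                continue                                    # skip blurry, occluded, or odd-angle faces
            face_crop = crop(person_crop, face_box)
            embedding = face_embed_model(face_crop)         # vector characterizing the face
            index.add(embedding, {"quality": quality})
```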
BP I'm curious– my experience with computer vision started with DJI drones, which I used to review when I was a journalist, and then I went to work for DJI. For one of the most recent units they put out, the selling point is no more learning a remote control– just tell it what to do, it sees the world and it'll do whatever you ask. So you just talk to it, or you have these glasses on and you say, “Go forward, go left, go right,” and it follows your commands. What kind of stuff are you excited about in the coming year in terms of advancements on the vision security and AI side?
BB So throughout my career– it's been almost 10 years now– I've been trying to apply AI to different problems. I started with customer support and marketing, and when I saw the opportunity at Verkada, that video security can actually benefit a lot from AI, I got very excited, and a year and a half ago when I joined Verkada I was absolutely right that there is a lot of potential to apply AI to video security to protect people. I think this AI search and these initial embedding models like CLIP are just the first step. I'm very excited about, as we already mentioned, activity recognition models– the ones that work on a temporal basis as well and can understand multiple frames. You mentioned the physical models– I think those are also very exciting for our use case: a model that can actually understand that this is a camera, this is the scene it has seen, and then identify a threat or something abnormal in the scene and alert on it. In general, the thing that excites me most at Verkada is that we can make these cameras, and also our other products like alarms, a lot smarter and a lot more aware of threats and situations, to prevent them or respond immediately as soon as a threat happens.
BP I've been watching the operators and I'm imagining now it sends a signal, it’s like, “Pretty sure this guy on the line is not wearing the hard hat. Just checking in,” and then the human verifies before you dock that guy's pay, but it's almost communicating with you and then you're going to have to dial the temperature up and down, is it a little too nervous, is it sending too many false positives or whatever, but really interesting. “Seems like kids are leaving school through this back gate. School's not over yet. Might want to look into that.” That's such an interesting idea.
RD So we talked a lot about the sort of video AI image models, but how much do LLMs play a role in this?
BB Again, in addition to VLMs, LLMs are also helping us a lot to build these products. It's much easier nowadays to extract the time from the user's query, because again, most of our use cases are investigations, so they know roughly when it happened and where it happened, and using LLMs you're able to extract these, match them to the customer's camera or site names, and narrow the search down to exactly where we should be searching.
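[Editor's note: one simple way this kind of query parsing can look: prompt an LLM to pull structured filters out of the natural language query before the vector search runs. The call_llm function is a placeholder for whatever in-house model endpoint is used, and the JSON schema is an assumption for the example.]

```python
# Illustrative sketch: extract structured search filters from a natural language query.
import json

PROMPT = """Extract search filters from the query as JSON with keys:
"description" (what to search for visually), "site" (site or camera name, or null),
"start_time" and "end_time" (ISO 8601, or null).
Query: {query}"""

def parse_query(query: str, call_llm) -> dict:
    raw = call_llm(PROMPT.format(query=query))
    filters = json.loads(raw)
    # e.g. {"description": "person in a red hoodie", "site": "Main Entrance",
    #       "start_time": "2025-01-20T08:00:00", "end_time": "2025-01-20T12:00:00"}
    return filters
```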
[music plays]
RD Well, it's that time of the show again where we shout out somebody who came onto Stack Overflow and dropped a little knowledge. Today we're shouting out a Lifeboat Badge winner. Congratulations to Reg who found a question that had a score of -3 or less and dropped an answer so good that it got 20 or more points. So the question is: “What is the difference between JSP and Spring?” If you're curious, we'll leave the question in the show notes. I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you want to reach out to us, you can reach out at podcast@stackoverflow.com. And if you liked what you heard, drop a rating and a review.
BP We want to know is this user's name pronounced Reg, Regex, Reggie? It's hard to know. Reg-X, okay, got it. I'm Ben Popper. I'm one of the hosts of the Stack Overflow Podcast, do some work these days over at builder.io. Find me on LinkedIn, hit me up with questions, happy to collaborate, anything to do with software development and making cool stuff.
BB I'm Babak Behzad. I'm leading the search team here at Verkada, a leading cloud-based video security company, and I'm excited about AI, machine learning, and engineering.
RD Well, thank you very much, and we'll talk to you next time.
[outro music plays]