The Stack Overflow Podcast

Fighting comment spam at Facebook scale

Episode Summary

Louis Brandy, VP of Engineering at Rockset, joins us for a deep dive into the architectural similarities across AI, vector search, and real-time analytics, and how they’re all at play in shaping the infrastructure to fight spam.

Episode Notes

Rockset is a real-time search and analytics database. Explore their docs and developer tools here.

We here at Stack Overflow recently implemented our own vector search. Here’s a technical deep dive into how we did it.

Louis is on LinkedIn.

Three cheers for Lifeboat badge winner user7610, who rescued “C++ application terminates with 143 exit code. What does it mean?” with a solid answer.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content over here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan. How’re you doing, Ryan?

Ryan Donovan I'm good. How’re you doing, Ben?

BP Pretty good. So we have a great guest coming on today: Louis Brandy, who is the VP of Engineering over at Rockset. We're going to be chatting about some things that have been big on the blog and the podcast recently like vector search and AI/ML. But I'm also super excited for the chance to talk a little bit about spam fighting, which is something that is this giant war, this giant effort that's always happening in the background, bajillions of emails getting sent and blocked and slipped into your inbox every day, and maybe how some of these emerging disciplines, real-time analytics, and AI with vector combine with spam fighting. So Louis, welcome to the Stack Overflow Podcast.

Louis Brandy It is great to be here. Thank you for having me.

BP Of course. So for folks who are listening, just give them a quick background. How did you end up at the role you're at today, and a little bit of a short CV if you don't mind, of some of the stuff you've done in your career.

LB Absolutely. So I'll give you the criminally short version. My first role out of school, I was actually doing face detection and face recognition. So I started off as a software engineer in the ML world, pre-deep learning, so this is back in the dark era. I actually got out of that business, I sort of transitioned away from that. So that company was eventually acquired by Google, but I didn't go to Google, I went to Facebook. So I worked at Facebook for about 10 years, roughly in thirds. So the first third of that was spam fighting; this is the spam fighting era for me. I worked primarily on infrastructure for fighting spam. So I wasn't fighting spam directly every day, I was more building the infrastructure that we would use to fight spam so I have a lot of opinions on how you build a scalable spam fighting system and what you need for scalable spam fighting. And that actually ties into real-time and some of this AI stuff we'll be talking about, so it’s a serendipitous kind of background.

BP So back in the old days before ML, does that mean you were hard coding rules about what a nose, a face, ears, and eyes look like to do face detection?

LB No, we were doing boosted trees type stuff for face detection and face recognition. So it was still machine learning, but it was sort of in the kind of decision tree world, not the neural network world. I'll be honest with you, I'm an AI implementer throughout my career. I've never been a scientist, so getting into the theory of it all I don't really know, at least not enough that I would pretend to be an expert on it. But I've done a lot of the implementation of these kinds of things like how many frames a second can I get out of this inference versus that kind of inference. That was sort of more my thing than the other way around.

BP Gotcha, gotcha. Okay, I interrupted. Please take us through the rest of this short career survey.

LB I then did a bunch of core infra at Facebook, so I went from spam fighting infra to core infra. And here I meant more just general stuff, so I was building libraries for services. So I worked on Facebook's RPC system Thrift and service routing and things like that, load balancing. And when you're doing these core libraries, my baby, my example service was always the spam fighting services because those are the ones I knew well. So we were building libraries for services, but I was a service owner. Former service owners had a lot of experience and I did things like I was doing a lot of our core C++ libraries on Facebook. For example, I was on the C++ standards committee and did a lot of core library stuff. This leads me then eventually to Rockset, so Rockset is a real-time analytics and search platform. We're adding vector search, we're going to talk about this. These fields seem distinct. For example, we talk about spam fighting, we talk about AI and vector search, we can talk about real-time analytics. They're actually much closer than you'll think. Architecturally, these systems all look very similar once you peel off some of the top level first class citizens. They look very similar under the hood. But that's what I'm doing today. So I'm the VP of ENG at Rockset. We are a real-time analytics platform. I don't know how familiar people are with the various buzzwords. It can sound very buzzwordy if you don't know what any of these words mean, but basically very high rate ingest with very low data latency so that you can query recent fresh data as quickly as possible, and obviously, fast queries are super important in these kinds of systems. And that ties very quickly into vector questions and AI questions, and for spam fighting this kind of stuff is super important. Low data latency is super critical for spam fighting. Not all problems require real-time, so then obviously we're maybe not the best solution. 
If you’re querying yesterday's data, we may not be the right thing for you. Anyway, that's the short version of my life story.

BP Very cool.

RD As somebody who's had an email address for 20+ years, the battle against spam, how has that changed over time?

LB It's funny because email spam in particular is its own subdomain and sub art that I actually don't know a ton about. I mean, I know a lot for an outsider, but I wouldn't want to answer your question like a truant. I never have fought email spam. Obviously Facebook didn't have email spam, that wasn't a thing that we dealt with at Facebook. So I'm dealing more with bots and account takeovers and this kind of abuse and posts, these kinds of things more at Facebook.

RD Well, we have a blog too and we get a lot of comment spam. Is that more it?

LB Yeah, so that's much more my life. Yes, comment spam is much more of the things we've dealt with.

BP I want to hear about battling comment spam or signup user spam, but I also want to say that my favorite, however many years it lasted, it didn't last forever, was when Facebook decided that it would cost a dollar to send anybody a private message because that would obviously remove a lot of the ability to do spam at scale. But as a journalist, that was the best thing that ever happened to me because I could reach out to any potential source for a dollar and get right into their inbox. It was heaven. I was loving it.

LB I have no comment. It's very effective. If you want to talk about the theory of spam fighting, literally costing spammers money– very rarely directly, usually it's indirectly costing them money– is how you beat spam. It is a financial war you're in. So obviously charging them a dollar to send spam is extremely effective at killing spam. It's the most direct way possible.

BP So tell us some of your favorite stories or anecdotes from your time battling chat or bot spam, and then we can take it from there maybe to some of what you've been working on at Rockset for clients and the lessons learned from that past and how they apply now.

LB Here's what I would say about spam fighting. The people who do it, the people who are there dealing with it all the time, it's this weird thankless work where the more successful you are, the less people know about what you're doing every day. So you're built a little different if you love that stuff. And again, I got to build the infrastructure more so. In some ways I get to build stuff so I felt lucky in a sense, where they were just tearing things down. They were tearing other people's crap out of things. My favorite story, I think, is that there was a very successful attack that had happened over some years. I mean, at the time it was successful, and they built this very clever– it looked like YouTube. They had a way to make a scam that looked like YouTube. And I was wondering how anyone could fall for this, and then when I saw the link and I saw the page it went to and I saw how it worked, I was like, “Oh my God. Anyone would fall for this, it's so clever, the scams.” I remember having this very visceral moment of thinking this is just enough where I might not fall for this, but literally anyone in my family would fall for this, it was very clever. And this is back in the days where you could do something called Self-XSS. I don't know if people know about this kind of stuff. So you've probably heard of XSS, but you probably don't know about Self-XSS. That's where they would trick you into pasting JavaScript into your own address bar and running it. And they would be like, “Hey, prove you're not a human. Click on all the robots. And then when you're done, press Control A, Control V, Enter,” which selects your address bar and paste, and then you'd run a bunch of JavaScript in your address bar and now you're hacked, your account has been taken over in various ways.
It's extremely clever and I remember just looking at this going, “Oh, on a bad day, I bet you a lot of people would fall for this, even savvy people would fall for this.” So browsers have improved a lot here, by the way. This is why you have to enable development tools in your browser. It's not on by default because it's actually pretty easy to prey on people via development tools if they're on by default. So that's one story. We could do more, but that's a good one.

RD You're right, the browser has improved. It seems like there was a lot of attacks that were just giving you a wonky link that would run SQL or JavaScript or something and that shouldn't happen from the address bar.

LB Yeah, exactly.

BP I don't understand how the unsubscribe flow is not constantly exploited. I feel like every day I'm trying to unsubscribe from something that I'm tired of and I end up at a random website and I'm clicking a bunch of buttons and then hitting “Okay.” And what if the whole unsubscribe was just a flow to get me to accept something or download something. I go so mindlessly through all these unsubscribe flows.

LB So I haven't kept up with this, but there was a period of time where unsubscribe as a general rule was the opposite of that. It was a subscribe button, unsubscribing meant your email was active. I'm sure this is still true with relatively shady parts of the internet, but if you go click an unsubscribe link, oh, that's active. Congratulations, your email address is worth more now when I sell it. And so I'm wary of unsubscribe links. To this day, I'm still wary. I don't know how it feels today because I don't click them anymore. I don't click unsubscribe links.

BP You just block and filter it as spam.

LB I just filter, I filter.

RD So with the new vector databases and generative AI, obviously that makes it easier to kind of understand what spam looks like, but are attacks getting more sophisticated? Is it getting harder? You're able to squash one attack and they get a little better?

LB So my take on this is that because I have this background, I've been talking to people here now where I get to talk to a lot of other companies who are trying to deal with problems like this or fraud or spam or things like that in this space. And the thing I quickly realized is that this is extremely bespoke. Meaning your website, in this case, it doesn't have to be but in general, has a flow and there's a reason spammers are there. The tools and the methods and a lot of things are the same, but it's very customized and the only reason that makes sense is because a lot of these places are big enough now where the financial rewards are enough that it makes sense. There's literally a person somewhere writing code to trick your users into doing what I want them to do to make me money and that works. So that's scary I think because it's not so easy to generalize always. Obviously there's places where it's very easy to generalize, but not always. So in that sense, I think it's extremely sophisticated. But I'll be honest, we’re talking 10 years ago now that we're talking about the Self-XSS stuff and this tech I was talking about earlier, that stuff was shocking to me. I was like, “Whoa, this is way beyond. This isn't just shady.com don't click the evil link. This is way more clever than that.” So I think without a doubt it's more sophisticated than it's ever been. And to be fair, the tools to fight it also are, and it's this interesting game where I can kill 99% of it and my site still feels gross even if I'm wiping out 99%, so it's a tricky problem.

BP Yeah. I was listening to a story yesterday about the idea that gen AI might be a powerful tool for enabling a business email compromise because you could feed it a big spreadsheet of LinkedIn profiles and it could send each person a more personalized and targeted email versus one generic spear-phishing message. But it feels like the opposite is also true. I'm sure with, like you said, real-time data analytics, you might be able to spot an influx of similarly worded things arriving to a system at the same time and therefore alert people.

LB I'm terrified of this. We're in a brave new world because I don't even know what spam is anymore in this world. At some point your spam is so personalized that it's useful. Then what is it even at that point? I don't even know what that is.

BP Right. Spam kind of implies that it's automated, but what if each one is personalized? It's not even spam.

LB What if you spam comment useful comments to my blog? What does that even mean? I don't even know. What if you drown the internet in useful content? Some of the very fundamental principles start to break down. I don't even know what we call this anymore.

BP This is a fair exchange. You make my blog look very popular, we have a lot of really engaged and thoughtful commenters and they all have at least one affiliate link, but power to them.

RD I've seen in the last few months spam comments go from “What a great blog,” and then a janky link, to having a gen AI summary of the article. And it's obviously not somebody actually commenting, but you're right, they're going to get to a point where it's just giving me a good comment.

LB Exactly. And you can imagine at a site like Reddit, it's very easy, I would imagine, to build accounts for free that have very good reputation. Historically, if I was Reddit –obviously I never worked at Reddit– but I would be tracking the user's reputation over time. New users or users that write low quality comments, I can action them much more aggressively than people who write big, beautiful things that get lots of upvotes. But now I suspect it's super easy to build a relatively decent behaving robot on Reddit, by the thousands. So I'm terrified.

BP This is a very interesting idea. It's the benign bot who's cultivating. It's kind of like the sleeper agent cultivating good karma for a while before you activate them.

LB Exactly. It's cheap. It's cheap to increase all my robots’ reputation and then have them go evil. They go evil later. All that means is that I plant seeds now and then every day I'm planting new bots that are accruing reputation and I'm harvesting old bots with high reputation and I can do that forever. So that's a very straightforward model that is terrifying. I'm worried about us.

BP So in the pitch we talked a little bit about vector search. You're at this unique intersection of real-time data infrastructure, high powered anomaly detection, machine learning. Tell us about some of the things you've been working on at Rockset that you think are interesting that you can discuss publicly. What is the tech stack? How are you helping clients? Just a little bit of the things that are motivating for you. And it's also interesting, is this your first gig as a VP of engineering or was that something you also did at Facebook?

LB I was Director of Engineering at Facebook for a bunch of years, so I've been doing various kinds of management and team building and stuff like that for a while. To answer your other question, Rockset was born in a real-time analytics space, and in the world of real-time analytics, you have data that's flowing in constantly. I want to query that data, I want to query fresh data, is the real-time component of this. Often I want to query it quickly, so I need to index this data. So indexing and query freshness is a tricky trade-off. The more I index your data, the longer it takes for that index to materialize and make it queryable. And so the heart of real-time is trying to get all this correct and the trade-off right so your data latency is really low, your queries are very fast, hopefully you don't spend any money. Obviously there are different architectures in this space, so you can index data slower and more efficiently and build more of a throughput optimized system, but then it ends up so you get kind of the extreme big data systems, the BigQuerys or Hives of the world where you can query yesterday's partition, and those are built to be very efficient per byte in the way you're running them. And so we live in the space of real-time, so we're up here where if Hive is way too slow, you want to query the last minute's data, the last 10 minutes’ data, that's the space we live in. Rockset also has been built from the beginning to be very mutable, meaning a lot of these systems are append-only. You dump logs into them, there's no updates or upserts. It's only inserts, basically, insert-only workloads. So mutability is very important within Rockset as well.
And this matters, so we're going to get to this when we talk about vectors because vectors actually are kind of an extreme version of a lot of these problems, meaning– I don't know how much people know about vector algorithms, but vector search indexes are extremely, at least historically, these very monolithic static things. You take all your vectors, you organize them, and it takes forever. If you take a million vectors, it takes you tens of seconds or more of intense CPU work to organize all these vectors into some structure. You can now search that structure relatively efficiently, but if you add a new vector, there's no way to add a new vector to it and preserve the goodness. Every time you add a vector kind of naively to it, it ruins this fast search property.

BP You have to reindex.

LB So you have to reindex. And that's this other big, giant, expensive, asynchronous thing. So it's actually like the real-time indexing problem on steroids. It's literally the same problem of, “Well, I can index more frequently. That's going to be more expensive, or I can try to come up with ways to do incremental indexing so that your fresh data is available quickly.” This is what real-time databases are, this is what we do. A vector database built just for vectors and a real-time analytics database, they're way more similar. You were naturally led towards very similar architectures for the same reason. One of the reasons why real-time databases are adding vector search capabilities is because it's a natural fit to the architecture we already have in place.
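
One common way to soften the reindexing cost described here is to keep freshly ingested vectors in a small side buffer that is scanned exhaustively, and to fold that buffer into the main index only during periodic rebuilds. The sketch below is a hypothetical Python illustration of that idea, not Rockset's implementation: `FreshVectorStore`, `rebuild_threshold`, and the plain dictionary standing in for an expensive approximate index are all assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class FreshVectorStore:
    """Toy store: a 'built' index plus a small buffer of not-yet-indexed vectors."""

    def __init__(self, rebuild_threshold=1000):
        self.indexed = {}   # stand-in for an expensive, static vector index
        self.buffer = {}    # fresh vectors, always scanned by brute force
        self.rebuild_threshold = rebuild_threshold
        self.rebuilds = 0

    def add(self, key, vector):
        # An update replaces any older copy, so the store stays mutable.
        self.indexed.pop(key, None)
        # New data is queryable immediately because every search scans the buffer.
        self.buffer[key] = vector
        if len(self.buffer) >= self.rebuild_threshold:
            self.rebuild()

    def rebuild(self):
        # In a real system this is the slow, CPU-heavy step run off the query path.
        self.indexed.update(self.buffer)
        self.buffer.clear()
        self.rebuilds += 1

    def search(self, query, k):
        # Merge candidates from both parts; this toy version ranks both exactly.
        candidates = list(self.indexed.items()) + list(self.buffer.items())
        candidates.sort(key=lambda item: distance(query, item[1]))
        return [key for key, _ in candidates[:k]]
```

The design choice is the usual freshness trade-off: a larger buffer means fewer expensive rebuilds but more brute-force work on every query.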

RD So what is the sort of architecture that you had that lets you do that really fast– reindex or append?

LB So the first thing to understand is that Rockset is built fully managed. So we get to be clever because you don't have to manage it. We're not sending software to you, we're managing it. And so things are split up. There's multiple services involved with your Rockset instance and we're running fully managed. And so basically you have an ingestion pathway, and there's a compute storage separation, which is the first step of this. So in other words, when you're ingesting data, there's a bunch of compute happening and it's writing to disaggregated storage. We also have a compute-compute separation. So in other words, you are ingesting into one pool of compute machines that are writing to storage, but you can be querying with a different set of machines. You can actually have more than two. You can have many compute tiers aimed at one storage tier. So a big part of this process is that I have essentially –to be fair, I don't have to set it up this way but it would typically be set up this way– you have dedicated ingest-compute. So essentially I have the ability to run ingest-compute however I'd like, and it's to some degree isolated from any query compute, and therefore I can do this idea of– if I have to do an asynchronous, I prefer not to do asynchronous rebuilds of the index, but if I have to I can, because it won't affect my queries. I have this big giant batch job I need to run on my database. I really don't want to do that during query spikes, for example, so let's run that in the evening. All these problems don't exist in Rockset. We can do this continuously and we can manage the ingest CPU to be throughput optimized if necessary, while keeping the query CPUs latency optimized, because these are in different pools of machines. But, that's a short version, I think. That's one of the things we do to try to make the ingest as good as possible. Again, I hinted at the theory here which is this throughput versus latency optimized systems. 
It's the same system basically, but sometimes you want to optimize for throughput and sometimes you want to optimize for latency. Often queries you optimize for latency and for ingest you’re optimizing for throughput. It doesn't have to be that way, but that's typically one of the ways it gets set up, and that's what lets me do these incrementally. We also, for whatever it's worth, have put an enormous amount of effort into not needing to do asynchronous rebuilds of indexes. So incremental indexing insofar as you can do it, we do as aggressively as possible. And so this often becomes a very interesting algorithmic challenge. If I have a bunch of strings flowing in and you want to do fuzzy string matching, how do I incrementally index new strings so that fuzzy string matching can happen? Vectors present a massive problem in this space. I actually think it's one of the very hard problems in vector search that people don't appreciate, which is the incremental indexing problem. So we spend a lot of time as well making sure that incremental indexing is as first-class a citizen as possible.

RD So what would help people understand the incremental indexing problem for vector search?

LB There are different ways to answer this. For computer science people, my answer here is to imagine you build a binary search tree, a balanced binary search tree, and then you want to add random new elements to it. What will happen is it will become unbalanced. This is the moral equivalent of exactly why vector search indexes can't be incrementally updated. Basically the tree, so to speak, becomes unbalanced and therefore you go from a logarithmic lookup time to a linear lookup time. That's roughly the problem with vector indices. It's actually worse often with these vector lookups because higher dimensional spaces tend to degrade faster than lower dimensional spaces– the so-called curse of dimensionality, so it actually deteriorates quicker often. I don't know what quicker means necessarily, but in some sense quicker, and this is why it's very hard to incrementally index most of the state-of-the-art vector search algorithms. We should be clear, by the way, I don't think we made this clear. When I say vector search algorithms, I always mean approximate search. Exact vector search is actually a known impossible problem. In low dimensions, you can do space partitioning and you can get faster. Sorry, to be even more clear, I can always scan all the vectors. I can always do a linear lookup. I can compare you to every other vector and decide who's closest. So that's the baseline, brute force. To do better than that, in low dimensions for vectors you can do better, in high dimensions you really can't. You can't do better. Again, curse of dimensionality. Basically for high dimensional vectors, your only chance of doing better than brute force is these approximate algorithms, these approximate nearest neighbor algorithms, and these are the ones that will deteriorate very quickly with incremental additions.
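
The tree analogy can be made concrete. The sketch below (plain Python, with hypothetical helper names) builds a balanced binary search tree from sorted keys, then adds new keys with a naive insert and no rebalancing. Purely random inserts would keep a tree roughly balanced on average, so the example uses a skewed stream of ever-larger keys, which is the case where naive insertion pushes lookups from logarithmic toward linear.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def build_balanced(sorted_keys):
    """Build a balanced BST from a sorted list (the offline 'index build')."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    node = Node(sorted_keys[mid])
    node.left = build_balanced(sorted_keys[:mid])
    node.right = build_balanced(sorted_keys[mid + 1:])
    return node


def naive_insert(root, key):
    """Insert without rebalancing (the 'incremental update'); iterative to avoid deep recursion."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def height(root):
    """Number of nodes on the longest root-to-leaf path, computed iteratively."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best


root = build_balanced(list(range(1023)))
height_after_build = height(root)      # 10 levels for 1023 keys

for key in range(1023, 1523):          # 500 new keys, each larger than the last
    root = naive_insert(root, key)
height_after_inserts = height(root)    # every insert extends the rightmost path
```

After the 500 skewed inserts the longest path holds 510 nodes instead of 10, so a lookup for a recent key walks a chain rather than a tree.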

BP This is really interesting. Actually, Ryan and I were just on a call yesterday with some of the folks at MongoDB talking about adding vector search to Atlas, and they were discussing a lot of similar problems. They were saying, “We can't use K-nearest neighbor. We have to, as you pointed out, do something that's a bit of an abstraction of that.” And then also I think maybe to a bit of what you were saying that one of the benefits we're trying to add is that you can have two different kinds of data sitting side by side. Your vector search, which as you point out, takes a really long time to index, but then maybe also some of your ordinary unstructured data. And so you can have this mix of semantic and lexical search, and it sometimes is actually more effective than adding every new thing as a vector.

LB 1000%. In fact, I have a screed that I need to write up, I want to make up a blog post. There's at least two really hard problems in vector search, and this is after you solve the problem of just searching back. There's this really fun algorithmic PhD problem of approximate nearest neighbor searches and vectors. But even after that, even if you go download a really cool hierarchical navigable small worlds library and you solve that problem, you run face first into two very hard problems that I think are super critical. The first is this incremental index one, the one we've been talking about, and the second is precisely the one you mentioned, which I think mostly in the industry is being called metadata filtering. I think us and MongoDB probably wouldn't call it metadata filtering, because that's just what we do. What we already do is filter your data. The where clause of a SQL query is filtering, that's what it is. But in a vector world, it's this additional thing you add on where I have a vector, but that vector has some metadata associated with it, and I want to do filtering based on that. Give me the 10 vectors closest to this one where the price is less than 25 and the country is USA. You want that kind of a query.
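
The query described here ("the 10 vectors closest to this one where the price is less than 25 and the country is USA") can be written as an exact, brute-force search that filters first and ranks second. This is a minimal sketch in plain Python with made-up rows and field names (`vector`, `price`, `country`); it shows what the query means, not how a production index would execute it.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_nearest(items, query, k, predicate):
    """Exact search: keep rows that pass the metadata predicate, then rank by distance."""
    matching = [item for item in items if predicate(item)]
    matching.sort(key=lambda item: distance(query, item["vector"]))
    return matching[:k]


# Hypothetical catalog rows: an embedding plus ordinary metadata columns.
catalog = [
    {"id": 1, "vector": [0.1, 0.9], "price": 19.0, "country": "USA"},
    {"id": 2, "vector": [0.1, 0.8], "price": 35.0, "country": "USA"},
    {"id": 3, "vector": [0.2, 0.9], "price": 12.0, "country": "CAN"},
    {"id": 4, "vector": [0.9, 0.1], "price": 24.0, "country": "USA"},
]

results = filtered_nearest(
    catalog,
    query=[0.1, 0.85],
    k=10,
    predicate=lambda row: row["price"] < 25 and row["country"] == "USA",
)
result_ids = [row["id"] for row in results]   # only rows 1 and 4 pass the filter
```

Row 2 is very close to the query vector but is excluded because it fails the price filter, which is the guardrail behavior the where clause is there to provide.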

BP The example from MongoDB was, “I want yummy restaurants that have this ambience, but the filter is New York,” and New York is just not a vector at all.

LB And that problem is super hard for the same reason we talked about earlier, which is that you have this prearranged index and it doesn't know about these filters. And so you have this very common problem, in this case with the restaurants, of, “Okay, give me the hundred nearest restaurants to this ambiance. Okay, but now filter that to New York.” It's like, “Well, none of the ones I found are in New York.” And so you have to over fetch. So it's like, “Okay, well a hundred wasn't enough, so let me ask that thing for a thousand. Hopefully I get some New York restaurants. Okay, well that didn’t work. Let me ask for 10,000. Maybe I'll get some New York restaurants.” So combining metadata filtering with vector search, this kind of thing sometimes called hybrid search, I actually would go a step farther because you said it was sometimes effective. I actually think it's always better. There's very few situations where the product experience isn't better by having these kinds of guardrails. If I go on Amazon and I say, “Show me stuff under $25,” I don't want to see something that's $35 even if the vector says I really love it. I mean, maybe that's a good experience but it seems weird to me.
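
The over-fetch loop described here (ask for 100, then 1,000, then 10,000) is easy to sketch. In the Python below, `vector_top_n` is a hypothetical stand-in for a vector index that knows nothing about metadata, and the data is contrived so that the only New York restaurants sit far from the query; the names and the tenfold growth factor are assumptions for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vector_top_n(items, query, n):
    """Stand-in for a vector index lookup: nearest n rows, ignoring metadata."""
    return sorted(items, key=lambda item: distance(query, item["vector"]))[:n]


def post_filter_search(items, query, k, predicate, initial_fetch=100, growth=10):
    """Fetch nearest neighbors, filter afterwards, and over-fetch until k rows survive."""
    fetch = initial_fetch
    rounds = 0
    while True:
        rounds += 1
        candidates = vector_top_n(items, query, fetch)
        survivors = [item for item in candidates if predicate(item)]
        if len(survivors) >= k or fetch >= len(items):
            return survivors[:k], rounds
        fetch *= growth


# 10,000 hypothetical restaurants; only the five farthest from the query are in New York.
restaurants = [
    {"id": i, "vector": [float(i)], "city": "New York" if i >= 9995 else "Elsewhere"}
    for i in range(10_000)
]

found, rounds = post_filter_search(
    restaurants,
    query=[0.0],
    k=3,
    predicate=lambda row: row["city"] == "New York",
)
found_ids = [row["id"] for row in found]
```

With this data the search needs three rounds, and the last one has effectively scanned every row, which is why filtering after the vector lookup gets expensive when the filter is selective.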

RD Yeah, it seems like an upsell.

BP Trust us, you'll want to splurge on this.

LB Right. So I think this metadata filtering problem is not to be taken lightly and I actually think this is important. I think that if you think of it as an afterthought for your system, it's not going to be good. It's not an easy thing to bolt on later. It needs to be something you build with first principles. And for Rockset, this is baked right into the heart. For us, a vector search index is just another index in our system. So you write full SQL, you mix and match your vectors and your SQL query, your conventional type SQL where location equals New York and price is three stars instead of five stars, and all those indexes participate together and we have our optimizer that will choose them appropriately. You get into clever situations where vector search is actually the wrong thing to do. Let me give you a really interesting example that will show you how databases can mess up really quickly. Imagine I have a database with a million restaurants but there's only five in New York, and you said to me, “Hey, show me the coolest restaurants in New York.” If I go do a vector search, it's actually pointless because there's only five. So the correct way to run that query is to just do the simple SQL query that's like, “Hey, give me the five restaurants in New York and I'll rank them myself and show them to the user.” I don't need any vectors for this. I can bypass the vector entirely. So there's this hidden hard problem here about how do you optimize and use different kinds of indexes to get the queries you want. So this is all stuff that Rockset is built to do very efficiently, and these problems are very hard and they're very traditional database problems when viewed through this lens.
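
The "only five restaurants in New York" case can be sketched as a toy planning rule: if statistics say very few rows pass the filter, fetch those rows through the ordinary index and rank them exactly, skipping the vector index. Everything below is a hypothetical Python illustration (the `choose_plan` rule, the `brute_force_limit` threshold, and the per-city counts are assumptions), scaled down to 10,000 rows so it runs quickly; a real optimizer weighs far more than a single count.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def choose_plan(estimated_matches, brute_force_limit=100):
    """Toy cost rule: if few rows pass the filter, skip the vector index entirely."""
    if estimated_matches <= brute_force_limit:
        return "filter_then_rank"
    return "vector_index_then_filter"


def run_query(rows, city_counts, city, query, k):
    plan = choose_plan(city_counts.get(city, 0))
    if plan == "filter_then_rank":
        # Use the ordinary index on city, then rank the handful of rows exactly.
        matches = [row for row in rows if row["city"] == city]
        matches.sort(key=lambda row: distance(query, row["vector"]))
        return plan, matches[:k]
    # Stand-in for a vector index lookup followed by the metadata filter.
    ranked = sorted(rows, key=lambda row: distance(query, row["vector"]))
    return plan, [row for row in ranked if row["city"] == city][:k]


# Hypothetical table: 10,000 restaurants, only five of them in New York.
rows = [
    {"id": i, "vector": [float(i)], "city": "New York" if i % 2000 == 0 else "Chicago"}
    for i in range(10_000)
]
city_counts = {"New York": 5, "Chicago": 9_995}   # statistics a planner might keep

ny_plan, ny_rows = run_query(rows, city_counts, "New York", [4100.0], k=10)
chicago_plan, _ = run_query(rows, city_counts, "Chicago", [4100.0], k=10)
ny_ids = [row["id"] for row in ny_rows]
```

Both branches return the same answer for any given query; the plan only changes how much work is done to get there.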

BP I like that a lot. Ryan and I had an interesting conversation with some of the folks who were working on OverflowAI, similarly talking about how we should do search, and one of the things that they were saying is that we always want it to be hybrid because there are a lot of instances, like if somebody comes to me with just three keywords, where vector search is probably not going to give you a good answer. Or they're dropping in the exact text of an error message and they want to go straight to that question. And so, to me the most interesting thing you said was how do you design it so it knows when to use which part so it's adaptive in the right way.

RD The metadata problem seems like it wants to do both. You want a lexical search and a semantic search.

LB If you take this exact same discussion we've been having with restaurants and you apply it to text search, it transliterates almost exactly. So keywords or substrings and these kinds of searches with semantic search is exactly the same metadata filtering plus vector kind of thing. A semantic search is typically, but doesn't have to be, powered by some kind of vector nearest neighbor lookup, but you're always going to want this. “Well, okay, but I want a specific keyword, or I want this phrase.” That's exactly hybrid search and metadata filtering in a vector context. And it's interesting because if you go talk to database PhDs or people who've studied databases for a long time, this is an extremely hard and well-studied problem in databases. I have an inverted index, I have a column index, and now I have a vector search index and a fuzzy text index. How do I choose them? How do I reorder the operations? So join reordering and all these other things that database optimizers have been studying for 30 years at this point. You have a new player and they know how to do this. So if you go look up cost-based optimizers, you'll find pages and pages of scholarly work, and every big database has one. So databases have a theory here, they know how to do this. They'll do things like selectivity estimates, so they'll apply the most stringent filters first so that later parts of the process will not do as much work, and so then they'll build selectivity estimators for your vector index. And there's massive piles of theory. So again, I guess my point in all of this is that this is a really hard problem itself, it’s also a very interesting and hard problem, and it's one of those where this is exactly why bolting it on doesn't work. If you bolt it on, you've just wandered into a massive database problem, you just didn't even know. So someone will wander in and be like, “Look, here's the database textbook. 
Here's a grad class that you should have taken,” and get to work, because you accidentally have wandered into a known hard problem.
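
The selectivity idea Louis describes can be illustrated with a toy sketch. This is not how Rockset or any real cost-based optimizer is implemented: the `Filter` class, the sampling-based `estimate_selectivity`, and the restaurant fields below are all hypothetical names chosen for illustration. It only shows why running the most stringent filter first reduces the work left for later filters.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class Filter:
    """A named predicate over a row (hypothetical helper for this sketch)."""
    name: str
    predicate: Callable[[Row], bool]


def estimate_selectivity(f: Filter, sample: List[Row]) -> float:
    """Fraction of sampled rows that pass; lower means more stringent."""
    if not sample:
        return 1.0
    return sum(1 for row in sample if f.predicate(row)) / len(sample)


def order_filters(filters: List[Filter], sample: List[Row]) -> List[Filter]:
    """Put the most stringent filter (lowest estimated pass rate) first."""
    return sorted(filters, key=lambda f: estimate_selectivity(f, sample))


def apply_filters(rows: List[Row], filters: List[Filter]) -> Tuple[List[Row], int]:
    """Apply filters in the given order; also count predicate evaluations."""
    evaluations = 0
    survivors = rows
    for f in filters:
        evaluations += len(survivors)
        survivors = [row for row in survivors if f.predicate(row)]
    return survivors, evaluations


# Made-up data: 1 in 10 restaurants is Thai, 1 in 2 is open.
rows = [{"id": i, "thai": i % 10 == 0, "open": i % 2 == 0} for i in range(100)]
is_open = Filter("is_open", lambda r: r["open"])
is_thai = Filter("is_thai", lambda r: r["thai"])

naive_result, naive_cost = apply_filters(rows, [is_open, is_thai])
plan = order_filters([is_open, is_thai], sample=rows[:20])
planned_result, planned_cost = apply_filters(rows, plan)
```

Both orders return the same rows, but the planned order evaluates fewer predicates because the stricter filter shrinks the candidate set early. A real optimizer would estimate selectivity from index statistics rather than by scanning a sample, and would weigh the cost of each index as well.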

BP If you want to give an example from any of the case studies that are out there of something you've worked on recently that you think really highlights that sweet spot of that real-time data infrastructure and anomaly detection, I would love to hear it.

LB Yeah, so let me give you a very simple example of something that I think highlights vectors and some of these hard problems we’ve talked about, and something that Rockset is very good at and that I'm proud of. There's a company called Whatnot; Whatnot does buying and selling and they have a very straightforward problem. The right way to think of Whatnot is that they're doing livestreams of buyers and sellers. So somebody is literally showing off something on the internet and you can talk with them, interact with them like a Twitch stream or anything like that, but it's buying and selling. It's auctioning, but live. They have a very concrete classic recommendation problem: “You might be interested in looking at this streamer,” because that's trying to match a buyer to a seller. That is a vector search problem born and bred. Show me things like this thing, recommendation systems. But they have this very real-time component, a metadata filtering component where I need the stream to be online at the minute. It's not good to show me something that isn't online, that defeats the whole purpose of the site. So they have this very simple setup of, “Do this vector recommendation, but also do the metadata filter where the user happens to be online at this moment.” Their actual system is much more complicated than that, but that is the very heart of these hard problems all getting merged together. I need vector search, I need metadata filtering, and I need real-time updates on who is online. It can't be yesterday's data, it needs to be now’s data. And so this is a very simple example of something where all these hard problems come together in a way that is super difficult to solve with almost anything. This is the sweet spot for us. It ticks all the boxes, so it’s a perfect use case for Rockset, one that works really well and that would be very difficult to make work with almost any other system. So I'm proud of that one and I think that highlights some of what we've been talking about.
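
The shape of that query, "nearest neighbors, but only among streams that are online right now," can be sketched in a few lines. This is a minimal illustration, not Whatnot's or Rockset's actual system: the `Stream` class, the `recommend_online` function, and the two-dimensional embeddings are assumptions made for the example, and it uses a brute-force scan where a production system would use an approximate nearest neighbor index.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Stream:
    """A livestream with an embedding and a live status flag (hypothetical)."""
    stream_id: str
    embedding: Sequence[float]
    is_online: bool


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_online(query: Sequence[float], streams: List[Stream], k: int) -> List[Stream]:
    """Metadata filter first (online only), then rank survivors by similarity."""
    candidates = [s for s in streams if s.is_online]
    ranked = sorted(
        candidates,
        key=lambda s: cosine_similarity(query, s.embedding),
        reverse=True,
    )
    return ranked[:k]


# Made-up streams; "b" is very similar to the buyer's interests but offline.
streams = [
    Stream("a", [1.0, 0.0], True),
    Stream("b", [0.9, 0.1], False),
    Stream("c", [0.0, 1.0], True),
    Stream("d", [0.7, 0.7], True),
]
buyer = [1.0, 0.0]

before = [s.stream_id for s in recommend_online(buyer, streams, k=2)]

# A real-time update: stream "b" goes live, so the next query must see it.
streams[1].is_online = True
after = [s.stream_id for s in recommend_online(buyer, streams, k=2)]
```

The second query returns a different answer only because the status flag changed between calls, which is the real-time requirement in miniature: the filter has to read the current state, not a snapshot from yesterday.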

BP Cool.

[music plays]

BP All right, everybody. It is that time of the show. We want to shout out a community member who came on and helped spread some knowledge. A Lifeboat Badge was awarded seven hours ago to user7610 for helping to save a question from the dustbin of history. “C++ application terminates with 143 exit code. What does it mean?” If you've ever wondered about that 143, as 30,000 other people have, well then user7610 has you covered. Congrats on your Lifeboat Badge and spreading some knowledge around Stack Overflow. As always, I am Ben Popper, Director of Content here at Stack Overflow. Hit me up on X @BenPopper. Email us with questions or suggestions, podcast@stackoverflow.com. And leave us a rating and a review if you like the show, because it really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog, and you can slip into my DMs on X @RThorDonovan.

LB And I am Louis Brandy. I am from Rockset, rockset.com. And thank you guys for having me.

BP Of course, thanks for coming on. All right, everybody. Thanks for listening, and we'll talk to you soon.

[outro music plays]