The Stack Overflow Podcast

How Stack Overflow fends off scraping bots

Episode Summary

Josh Zhang, a staff site reliability engineer at Stack Overflow, tells Ryan and Eira how the Stack Exchange network defends against scraping bots. They also cover the emergence of human botnets, why DDoS attacks have spiked in the last couple of years, and the constant balancing act of protecting sites from attack without inhibiting legitimate users.

Episode Notes

As Josh explains, DDoS attacks aim to take down a website, while bot scrapers try to gather as much data as possible without getting caught.

Josh Zhang is a staff site reliability engineer (SRE) at Stack Overflow. Connect with him on LinkedIn.

ICYMI: In 2022, Josh wrote an article for our blog about how Stack defends itself against DDoS attacks.

Stack Overflow user Serge Ballesta won a Lifeboat badge for answering What does |= mean in C++.

Episode Transcription

[intro music plays]

Ryan Donovan Welcome to the Stack Overflow Podcast, a place to talk all things software and technology. My name is Ryan Donovan, I edit the blog here at Stack Overflow, and I am joined by my colleague, Eira May. How are you doing today, Eira? 

Eira May I am doing pretty well. How are you? 

RD I am doing well enough. Today we have another member of our Stack Overflow family in the house, Josh Zhang, who's a Staff Site Reliability Engineer. He's going to talk about how we've been defending against all the bots trying to scrape the site. So welcome to the show, Josh. 

Josh Zhang Thank you. 

RD So you wrote a blog post a while back about handling DDoS attacks, so I assume you're kind of our man on defense. Can you talk a little bit about what your day-to-day is here and how you got to that?

JZ Sure. So I'm one of the reliability engineers for the public platform. Historically, Stack had two teams for reliability engineering: one we used to call the cloud team, which handled Stack Overflow for Teams, the SaaS product, and another team that handled the public platform. I think Ben, one of the directors of engineering, actually said it very well: if you're paying us, that's one team, and if you're not paying us, that's another team. I'm on the ‘not paying us’ team. Mainly that's because the site is hosted in a physical data center. I'm in New York and I oversee the New York data center, which hosts Stack Overflow, the Stack Exchange network, and things like that. We are moving to the cloud and there's a lot more involved, so the teams are blurring, but that's how I started.
As far as the DDoS stuff, Stack has apparently always had DDoS attacks. I remember talking to some of the founders and the OGs while they were still here and asking them, “Hey, did we used to get attacked this much?” Because it was around 2022-ish that we had a string of pretty public site outages because of DDoS attacks, and I remember asking around that time, “Did y'all get stuff like this before?” and they were like, “Yeah, almost every other day, they were just small scale.” They would be maybe a troll or individuals, nothing too crazy. They might use something like Low Orbit Ion Cannon, which is a DDoS tool you run on your own computer and deploy on a couple of machines, but at individual scale, nothing too bad. Starting in 2022, though, we started getting hit by big botnets. One of the attacks actually almost broke Cloudflare's record for total throughput. I think the record at the time was, don't quote me, 150 or 50 gigabits, and we were at 147 or 47. We were just under, and that was something we just got hit by.
The tools originally developed in-house at the company to mitigate DDoS weren't quite up to the challenge of dealing with very large-scale botnets. I wouldn't say a state actor, though I suspect some of them might have been, but basically very large-scale botnets, because they get really wide. So that's how I got tossed into the deep end of getting good at defending, understanding the traffic patterns, and things like that. We matured our processes, my team deprecated the in-house tools, and we started using more off-the-shelf commercial stuff from our CDN and things like that. Now we're on Cloudflare, and their DDoS protection suite is pretty much one of the best in the industry. That's helped a lot, but we also have to do other things internally, because it's not just about blocking certain traffic. Another thing unique to our site is that because our users are technical, a lot of them will be using code against the site instead of just browsers. If we were to completely block that, well, we'd have a lot of angry community members. So it's finding ways to protect us while still allowing the site to be the site, which now brings us to scrapers. Nowadays it's interesting, because malicious traffic isn't just DDoS anymore. Now it's scraping against your consent, or even creating fake email accounts and things like that. That's all now, I would say, under the same umbrella to an extent.
Because DDoS attacks for the most part just hit your site with a bunch of traffic. They might have tricks here and there to make them more effective, but I'm going to shoot myself in the foot when I say that that's old news. It's not solved, but we've gotten used to it, and if one big enough to actually take us down comes again, we have mitigations and we kind of know what to do.

EM It's a known quantity, is that right? 

JZ Yeah. We've done enough of it now, and I think the internet as a whole kind of understands it; there are multiple layers of protection that everybody has that can basically deal with it. In the worst cases, you might blip for a minute or so, or you might slow down, but generally speaking, unless it is literally a state actor using the might of their government to try to bring you down through DDoS, you can most likely mitigate it within a reasonable amount of time without too much impact. But now there's scraping, and even, recently, what I call a human botnet, where people are creating fake user accounts. It used to be through scripts, but that's why the internet created CAPTCHAs, because they prevent scripts and bots from creating accounts. Well, it's an arms race now. There are tools that circumvent that, and some of the more low-tech approaches will literally have a person in front of multiple tablets just going click, click, click, click.

RD So the DDoS attacks basically try to transfer as much as possible, but the bot scrapers are also transferring a lot of gigabytes of data. Are there similarities, or are there patterns that you notice that differentiate them? 

JZ Yes. Well, a DDoS tries to take you down by sending you a lot of traffic. The bot scrapers want to send you as much traffic as they can without getting caught, because if they bring you down, of course, they can't get any data, and they need to be quick, because if you catch them, you're probably going to block them somehow and basically stop them. Bot scrapers are interesting in that there seem to be a lot of fly-by-night companies that are just selling their services to the bigger companies, scraping on their own using code they've written here and there. I've seen one company that charges a fraction of a cent per page, and big companies will buy it, because if you think about it, a lot of companies and a lot of public sites basically flat out say, “You're not allowed to scrape our traffic,” either through the traditional means of robots.txt or other things, and the big companies don't want to expose themselves to potential bad press or maybe a lawsuit. So the easy way around it is, “Oh, I'm just buying my data from this other company. I don't know where it comes from.” I think before there wasn't a lot of money in that, but now of course there is, with the AI boom and large language models and things like that. So I am seeing a lot of smaller companies popping up whose whole purpose is to scrape as much data as they can, because there's profit to be made. And ultimately, interestingly enough, Stack Overflow has historical traffic data that shows how much we normally get on a regular day, and in the past year or so, it's increased by quite a bit. You kind of go, “Wow, this is great. Site traffic, more users. This is awesome.” But no, no, it's mostly just bots. 

RD That's right. The robot invasion has started. 

JZ Right. And stopping them is not straightforward. There's this thing called JA3 fingerprinting. It's really interesting in that the people who created it, their initials were JA and there were three of them. What it does is basically try to identify groups of browsers by a signature of how they make their connections and things like that. Originally it was used for anti-DDoS, because a very expensive botnet will have a lot of IP addresses, and they might only need to send, let's say, five requests per IP to take you down. That's not enough to rate limit based off of IP address. JA3 was created to mitigate that, because it spans IP addresses. If you're running a botnet, the bots will generally all have the same JA3 fingerprint, and that way you can say, “Okay, I'm going to limit based off this fingerprint,” and then you can respond quicker with rate limiters and things like that. The same thing that works for DDoS fortunately still applies to these scrapers, in that you're not going to go through a ton of effort to create multiple unique agents that do scraping. You're going to write a set of code that works, deploy it out to probably a cloud provider, and let it go do its work. That's all traceable. They all have similar or the same JA3 fingerprints. What's interesting, though, is that like any arms race, there are now tools that try to get around that, because of course that's how it works, and now there's a JA4 that tries to close that gap further. I'm already seeing it, because there will be botnets that have multiple JA3 signatures but fortunately have the same JA4. Your mitigation methods used to be something you could kind of set and forget for at least a quarter, maybe half a year, but now, man, every month or two, if you're not constantly on top of it making adjustments, people will slip under the radar, whether it's DDoS or scraping. And especially with these smaller companies, they're scrappy. There's money to be made. 
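
To make the idea concrete, here is a minimal sketch of rate limiting keyed on a JA3 fingerprint rather than on an individual IP, so that a botnet spread across many addresses still shares one budget. The window size, threshold, and field names are illustrative assumptions, not Stack Overflow's actual rules.

```python
# Sketch: rate limit on a shared JA3 fingerprint instead of per-IP.
# WINDOW_SECONDS and MAX_REQUESTS_PER_FINGERPRINT are assumed values.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60                    # sliding window length (assumption)
MAX_REQUESTS_PER_FINGERPRINT = 300     # budget shared by every IP with this fingerprint

_recent: dict[str, deque] = defaultdict(deque)   # ja3_hash -> request timestamps

def allow_request(ja3_hash: str, client_ip: str, now: float | None = None) -> bool:
    """Return True if this request fits under the per-fingerprint budget.

    client_ip is accepted only to highlight that the decision ignores it:
    rotating IPs does not reset the budget, but changing the fingerprint would.
    """
    now = time.time() if now is None else now
    window = _recent[ja3_hash]
    # Evict timestamps that have aged out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_FINGERPRINT:
        return False
    window.append(now)
    return True
```

The point of keying on the fingerprint is that rotating IP addresses is cheap for an attacker, while changing the fingerprint of thousands of identical bots takes real effort.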

RD So with all these signatures, and the bots trying to work around them and look as human as possible, what are the ways that you're able to differentiate them from actual legitimate traffic? 

JZ Besides just plain fingerprinting, I think the biggest thing is having very good, very detailed logs, because you really won't be able to detect individual requests. Another factor is that with bigger, older companies that have their own ISPs and run their own data centers, like Facebook, you know where their traffic is coming from. You would immediately know. Cloudflare provides something called a bot score, where basically, based off their secret sauce, they can tell you the likelihood of a request being human or not. So if I were to write a very simple rule, I could say, “Oh, if it's coming from a Facebook IP address and it's very likely a bot, sorry, just stop that right there.” Reasonably easy. But all these new fly-by-night companies are running off of AWS, Azure, and Google Cloud, and I can't just stop those, because we run GitHub Actions. If I just stop Azure, our own GitHub Actions won't work anymore. So that's where deeper log analysis and trying to find patterns comes in. One of the very obvious patterns, like I mentioned in the beginning, is that scrapers are trying to get as much data as they can without getting caught. You can try to randomize what you're scraping, but generally, if you look at it at the macro level, it's a pattern. And if you can identify that pattern, then you go, “Okay, this is definitely a scraper,” and then you narrow down on what you're going to use to identify it: it's coming from AWS, it has this fingerprint, it's this time of day, it's trying to hit this rate, or some combination of those. And that's where, sadly, this is still an art, not a science. If it were a science, we could just write an algorithm, turn it on, and call it a day. It's also compounded by the fact that our community will run scripts against our site too, but luckily most of them are not running at the scale of a scraper. So that's kind of where we're developing our anti-scraping efforts. 
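
As a rough illustration of the kind of simple rule Josh describes, here is a sketch that blocks traffic only when two signals agree: the request comes from an ASN where you would never expect real users to browse from, and a CDN-style bot score says it is very likely automated. The threshold, field names, score scale, and example ASN are assumptions for illustration, not real configuration.

```python
# Sketch: block only when an "unexpected origin" signal and a bot signal agree.
from dataclasses import dataclass

LIKELY_BOT_THRESHOLD = 30      # assumed cutoff: lower score = more likely automated
SELF_HOSTED_ASNS = {32934}     # e.g. AS32934 (Facebook), used purely as an illustration

@dataclass
class RequestMeta:
    asn: int                   # autonomous system the request came from
    bot_score: int             # 1-99 likelihood score from the CDN's bot detection

def should_block(req: RequestMeta) -> bool:
    """Block only when both signals agree, so legitimate users are not caught."""
    return req.asn in SELF_HOSTED_ASNS and req.bot_score < LIKELY_BOT_THRESHOLD

# Automated-looking traffic from that ASN gets stopped...
print(should_block(RequestMeta(asn=32934, bot_score=5)))    # True
# ...but the same ASN with a human-looking score passes through.
print(should_block(RequestMeta(asn=32934, bot_score=90)))   # False
```

Requiring both signals is what keeps a rule like this from catching legitimate users or your own CI traffic on shared clouds, which is exactly the problem with blunter blocks on whole AWS or Azure ranges.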

RD I know we have the Cloudflare protections. I think we wrote about some proxy protections. What other defenses do we have to recognize traffic and shoot down the bad guys? 

JZ So for the longest time, Stack Overflow didn't use caching; responses were quick without it. Not much of the site was cached. We're leveraging that a lot more now. That's more for DDoS, but by adding more layers at the edge, which is basically our CDN, Cloudflare, we also get a good central point of ingest for the data itself. We get very, very verbose logs from them into Datadog, which is our monitoring tool, so much so that it can actually start costing a lot of money just to analyze the data. But the good part is that we used to have probably three or four different sets of logs, either from the CDN or our load balancers or our origins, which are the web servers themselves, and then we'd try to piece together the different pieces of data to look for a pattern. And depending on where the data sits, that could even be a SQL query that takes maybe two or three seconds just to run, which is a huge pain. Now, with most things if not everything having to go through the CDN, we're able to aggregate all those logs and create monitors, and based off those monitors, we can start creating more general rules on top of very targeted rules. As a general rule, you want to work your way down a funnel. At the widest level, we'll basically try to target maybe a JA3 signature or maybe an ASN. An ASN is a registered set of IP addresses on the known internet, so you know where the traffic came from. And you work your way down, all the way down to maybe an IP address. An IP address used to be what you would stop or block, but the thing is, it's too easy to move between them, and especially with things like AWS, you have a ton of IP addresses. So I almost don't rate limit or block based off that anymore, because it feels fruitless; it becomes whack-a-mole. I think the more advanced stuff I want to start developing is maybe a reactive block, where based off certain patterns, it will just temporarily, heavily rate limit certain traffic so that we don't have to watch it anymore. But the thing is, the people at the other end are always watching and adapting, so I don't think creating that silver bullet is really tenable. 
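
Here is a small sketch of that funnel, under the assumption that each request arrives annotated with its ASN, JA3 hash, and IP: broad rules are checked first, per-IP rules last, and a reactive temporary block can be dropped onto a fingerprint that suddenly spikes. All thresholds, durations, and rule sets below are hypothetical.

```python
# Sketch: "work your way down the funnel" from broad signals to narrow ones,
# plus a reactive temporary block on a spiking fingerprint. Values are assumed.
import time
from dataclasses import dataclass

TEMP_BLOCK_SECONDS = 15 * 60                  # assumed length of a reactive block

@dataclass
class RequestMeta:
    ip: str
    asn: int
    ja3_hash: str

BLOCKED_ASNS: set[int] = set()                # widest part of the funnel: whole networks
RATE_LIMITED_JA3: set[str] = set()            # narrower: a shared client fingerprint
BLOCKED_IPS: set[str] = set()                 # narrowest, and the easiest for attackers to rotate

_temp_blocked: dict[str, float] = {}          # ja3_hash -> expiry of a reactive block

def decide(req: RequestMeta, now: float | None = None) -> str:
    """Return 'block', 'rate_limit', or 'allow', checking the broadest rules first."""
    now = time.time() if now is None else now
    if req.asn in BLOCKED_ASNS:
        return "block"
    expiry = _temp_blocked.get(req.ja3_hash)
    if expiry is not None and now < expiry:
        return "rate_limit"                   # reactive block still in effect
    if req.ja3_hash in RATE_LIMITED_JA3:
        return "rate_limit"
    if req.ip in BLOCKED_IPS:
        return "block"
    return "allow"

def react_to_spike(ja3_hash: str, requests_last_minute: int, threshold: int = 500) -> None:
    """Temporarily clamp a fingerprint that suddenly exceeds the assumed threshold."""
    if requests_last_minute > threshold:
        _temp_blocked[ja3_hash] = time.time() + TEMP_BLOCK_SECONDS
```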

EM Seems like every advance that is made just sort of buys a bit of time until there's a workaround. Necessity is the mother of invention.

JZ It's really an arms race, and now that there's money in it, there are more people on the other end trying to knock down your doors. So this is an interesting time. I read an article where Cloudflare said that malicious traffic is increasing to about 7% of all internet traffic now, and that's wild. 

EM It seems like that has really spiked just in the last couple of years. Would you say that that's the timeframe where you think that leap has really been accelerated? 

JZ Absolutely. I don't know what's causing it at the macro level, but malicious traffic on the internet has definitely increased. Once we solved our DDoS issues in 2022 and 2023, it was quiet for maybe half a year, and then we had to tweak things again. But now, like I said, a month or two is about as long as any one tweak can stop the bad traffic. 

EM That probably keeps you up at night, I would think. 

JZ I find this interesting because it's an adversarial relationship, but there's another person at the end of it. So it's not outwitting, but it's thinking about what your opponent is going to do and trying to figure it out through all the logging and things like that. So it's an interesting problem to try to solve. 

RD You’ve got a multiplayer game. 

JZ Yes, and I do enjoy playing it and other things like that. Maybe that's the itch that it's scratching.

EM That's the connection. So we briefly mentioned the rise of Gen AI tools perhaps contributing to some of this malicious traffic. Is that just because you think people are using them as a kind of force multiplier to help them build these malicious tools faster and adapt more quickly? Or what do you think is going on there? 

JZ I can definitely see that, but I personally think it's a money thing. Thinking back over my career, I remember when viruses were just for trolls and maybe for infiltration, but then all of a sudden you could ransom somebody's data. I remember the first instance of ransomware, and I was like, “Wow, this is the first time people can make money writing malicious software,” and then, of course, that exploded. Ransomware is just something we have to live with nowadays. And maybe that's just it. Right now there's money in scraping data and creating fake accounts for various reasons, and because there's money there, everybody wants in on that piece of the pie. 

RD That totally makes sense. So to take the devil's advocate view, maybe somebody asks, what's actually wrong with scraping sites? That's what browsers do every day: you basically download the web page locally. What's wrong with scraping? 

JZ So for us specifically, our license is a community license: anything the contributors on the site create is free for other people to use, but you must have attribution. A lot of these ML companies that scrape our data without our permission will then use it to improve their models and make money, but there's absolutely no attribution. So on one end, there's, “If all the ML models out there basically use our data and put us out of a job, there will be no more people putting new data in, and then the cycle ends.” On the other end, we have people who will scrape and then create clones, fake clones, for all kinds of different reasons. One, they might just want to steal our traffic and serve ads, or two, they're using it as a way to actually fake Stack Overflow and then steal users' data. So there are multiple angles on why people want that data and why we want to prevent it. The idea is that we're not trying to limit knowledge, that's the antithesis of Stack Overflow, but we also need to make sure it's being reused properly, which means actually for community use. With our data dump, we're trying to at least put language in there that prevents big companies from downloading and using it without attribution. Because without that, not only are they making money on their own and not giving back to the site that's presenting the data to them, but eventually, if we have zero traffic left, maybe somebody else will pop up, but we don't know. The internet is a different place than when the founders created Stack Overflow. 

RD If somebody's just stealing all the data and not giving us credit, what's to keep people coming back here? 

JZ Exactly. And if you think about it, once we're fully in the cloud, every request costs a calculable amount of money. So if you're scraping without our permission or without seeing ads or things like that, well, you're just costing us money, and we have upkeep, not just salary, but the actual cloud compute, because that's how all of that works now. 

RD Obviously there are bots that help the internet run, like search crawlers, but this is costing us money and it's not the human interaction that I think everybody on the site is looking for.

JZ It's interesting. I didn't realize this until we had a discussion with different peers in the industry, but Reddit's robots.txt is now just “disallow all,” and they actually got support from the EFF on that. That was an interesting move, in that they're still providing the ability for good crawlers, like the Googles of the world, to crawl them, but generally, if you hit their robots.txt and actually respect it, you're expected not to crawl their site at all.
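
For reference, a disallow-all policy like the one Josh describes is only a couple of lines, and a well-behaved crawler can check it with Python's standard library. The file contents, bot name, and URL below are generic examples rather than Reddit's actual robots.txt.

```python
# Sketch: how a polite crawler checks a "disallow everything" robots.txt.
from urllib.robotparser import RobotFileParser

# A disallow-all policy, as a list of lines the parser can consume.
robots_txt = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(robots_txt)

# A polite crawler asks before fetching; under this policy the answer is always no.
print(parser.can_fetch("ExampleBot", "https://example.com/questions/12345"))  # False
```

Of course, robots.txt is purely advisory: it only stops crawlers that choose to respect it, which is why the fingerprinting and rate limiting discussed earlier still matter.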

[music plays]

RD Well, it's that time of the show again where we shout out a member of the community that came in, saved a question from the dustbin of history and got themselves a Lifeboat Badge. Today, we're shouting out Serge Ballesta for answering, “What does |= mean in C++?” So if you're wondering what that obscure symbol means, we have an answer for you. I've been Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you liked what you heard, leave us a rating and review. It really helps. 

EM And my name is Eira May. I'm a writer at Stack Overflow, and I just published a piece about ghost jobs. So if you have had experience being ghosted on the job market by a ghostly job listing, I would love to hear more about that. So you can find me online on most of the things @EiraMaybe. 

JZ I'm Josh Zhang, Staff Site Reliability Engineer at Stack Overflow, and you can find me on Meta Stack Exchange. 

RD All right. Thank you everybody, and we'll talk to you next time.

[outro music plays]