Or Lenchner, CEO of Bright Data, joins Ben and Ryan for a deep-dive conversation about the evolving landscape of web data. They talk through the challenges involved in data collection, the role of synthetic data in training large AI models, and how public data access is becoming more restrictive. Or also shares his thoughts on the importance of transparency in data practices, the likely future of data regulation, and the philosophical implications of more people using AI to innovate and solve problems.
Or Lenchner is the CEO of Bright Data, a web data platform that offers ready-made datasets, proxy networks, and AI-powered web scrapers. Developers can get started with their docs here.
ICYMI, read our blog post about the knowledge-as-a-service business model and how it will guide the future of our paid platform.
AI answers alone aren’t knowledge.
Connect with Or on LinkedIn.
Stack Overflow user guizo earned a Populist badge by explaining “How can I minify JSON in a shell script?”
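For anyone curious about that badge-winning question, the usual approach is to parse the JSON and re-serialize it without whitespace. Here is a minimal sketch that assumes only `python3` is on the PATH (tools like `jq -c` accomplish the same thing):

```shell
# Minify JSON by parsing it and re-serializing with no extra whitespace.
echo '{ "name": "podcast", "hosts": [ "Ben", "Ryan" ] }' \
  | python3 -c 'import json, sys; print(json.dumps(json.load(sys.stdin), separators=(",", ":")))'
# prints {"name":"podcast","hosts":["Ben","Ryan"]}
```

Because the JSON is actually parsed rather than stripped with regexes, whitespace inside string values is preserved correctly.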
[intro music plays]
Ben Popper Announcing AssemblyAI's new Speech AI model, Universal-2. With 21% higher alphanumeric accuracy and a 24% improvement in proper noun recognition, you get even more precise transcriptions. Start now with $50 in free API credits at assemblyai.com/stackoverflow.
BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow, worst developer in the world, joined as I often am by my colleague and compatriot, Ryan Donovan, blog editor extraordinaire and guy who sends the newsletter, although you don't even do that these days. I say that because it's not even your job anymore.
Ryan Donovan No, no. Historically.
BP Historically your job. We're going to have a great conversation today. We're going to be chatting with Or Lenchner. He's the CEO over at Bright Data, and we're going to be talking about what I would say is one of the existential questions in the Gen AI era: Are we running out of training data? Are companies going to need to use synthetic data? If they did, what would the impact of that be? Those are the big questions– are we going to run out of data, are we going to get to make more data, how are we going to do it? And then there are a lot of other interesting questions: Are there restrictions around how data is collected? What's fair game for AI and what's off-limits? What's the new robots.txt from the search era, but now for AI? Where's regulation headed? And Bright Data itself has been involved in some lawsuits about this and has thoughts on what it means for your organization to protect yourself now, with your dataset and your rules about what can be used. So Or, welcome to the Stack Overflow Podcast.
Or Lenchner Ben and Ryan, thank you very much for having me.
RD Of course.
BP Or, tell us just a little bit about yourself. How'd you get into the world of software and technology? Take us back to that first computer or that first line of code and then what brought you to the position you're at today at Bright Data?
OL So you said that you're the worst developer, at least at Stack Overflow, so you just found the next worst developer. I'm not actually a developer. I write very little code, only when I have to, and it usually doesn't work.
BP Okay, same.
OL So this is how I started. I'm Or Lenchner. I've been the CEO of Bright Data for around six and a half years, but I've been with the company since 2015. I'm a product guy. This is what I like to do. This is what I've been doing for my whole career. This is what I'm doing today; it's just that the product is a company, which is a complicated product in itself, but this is how I perceive my job. I started by just building stuff online, usually without the ability to actually build it, so I had to pay someone else to build it, but I always came with the ideas. It went pretty well. I built a few cool websites that started generating some revenue, and in roughly 2015, I joined Bright Data when it had just been established by the two co-founders, because I saw that something was happening: these guys and the company they built were solving a problem that was starting to be big, and was going to be huge, which is access to data on the internet. First it's access, then it's the connection, then it's the organizing infrastructure.
BP This is something we know intimately here at Stack Overflow. We're talking today– it's Thursday, October 3rd. We just published a big blog series all about our position on what we call ‘knowledge as a service.’ We have a knowledge community. They're creating new data. People are coming and asking and answering questions. This is now, we know, very valuable training data for AI models. We've publicly announced partnerships with OpenAI and with Google, training data for their next GPT or the next Gemini. And what we strive for, and as we say in the blog post, is a virtuous relationship. You pay us for training data, we invest that in our community, the community continues to create new knowledge, and when people get answers from an AI system, it's not a black box that just gives them an answer. It says, “This answer came from this user who answered this question on Stack Overflow,” and there's a link. So at least there's some attribution back to the humans. So we sort of tried to sketch out a bit of our own thesis and our own stance and policy on this. But Or, let's go big picture here. The question at the top of everybody's mind and just to sort of take your temperature, are we running up against data constraints? Are we running out of data? I know people are creating synthetic data for various reasons, but is that because we're running out or because it has its own advantages and disadvantages?
OL So I absolutely don't think that we're running out of training data, or data at all. I think that it's just very, very difficult to map the internet. When companies create their first LLM, or envision their LLM, they first go recruit smart talent, then they go and buy a few GPUs to have enough compute, and then they need training data. They all go to the usual suspects, those open source datasets that everyone can take and start using to train. That's cool, that's fine, it's working. But then it's not good enough, because if everyone is using the same compute and the same training data, eventually you'll get very similar results, and then these companies need to find new sources of data. It's easy to think you've consumed the whole internet, because you don't know what the whole internet is, and then you start to look at synthetic data. We see a lot of it at Bright Data. We actually have a very good ability to see and map the internet just based on the activity of our tens of thousands of customers. So I feel confident enough to say that that's a fact– we're not running out of training data. We just don't know where it is. As for synthetic data, I do think it can work sometimes, usually when it comes to visual stuff and LLMs that are doing visual things. I still haven't seen any evidence that you can create good enough synthetic data to train models that will write code or answer math questions. It's just not there. It's there to create a nice image of three people doing a podcast.
RD I've seen some papers talk about synthetic textbooks and creating quality, targeted LLMs with synthetic data, but I've also heard about things like model collapse, where you're training on AI-generated data and the model falls apart. Is there a sweet spot for that, or is synthetic data always going to lead to a faulty model?
BP And just to put a little more context on this, both Ryan and I have written about this– Ryan wrote a great blog post for us, and I'm working on another– about this seminal paper in the field, “Textbooks Are All You Need,” where they say, “Look, you can train a model on synthetic data based on all of these accepted, well-known and clearly validated Stack Overflow questions and answers, plus this really clean code base, and that model, with less data and fewer parameters, is as good as a much larger model that's using a bunch of open source GitHub data that's full of bugs or missing context or has dependencies.” And so we're intimately familiar with this one just because it happens to use Stack Overflow and kind of shines a positive light on us. Okay, if you use Stack Overflow data, which is well-organized by human beings and kind of has reinforcement learning with human feedback built in because people bothered to vote and rate these answers, then you can build this Phi model that's actually quite good at coding or math, even though the size of the model and the size of the training set say it shouldn't be so good.
OL I absolutely agree. Maybe we should first define what synthetic data means, because what you just described, for me, that's perfectly fine training data. It was created by professionals, it was curated, it was validated. That's absolutely fine. And I also see, from the dataset requirements customers bring to Bright Data, that the quality and the accuracy of the data in many cases is way more important than the scale of the data. So we're totally aligned on that. When I'm talking about synthetic data, I'm talking about a machine creating something and validating it itself, and then you're going into this infinite rabbit hole where only bad things can happen.
BP Right. So this is the other thing Ryan referenced, which is model collapse– this idea that, for all we know, more and more of the data that is being scraped– blogs or images or audio or otherwise– is now AI-generated. We know that the internet is filling up with Gen AI generated content and that if that stuff is not human-validated, if the intention is quantity over quality or speed, then it's garbage-in garbage-out. The more poor quality synthetic data in the training set, perhaps the poorer the overall outcome when you're finished training the model.
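The garbage-in garbage-out loop described here can be seen in a toy simulation (a simplified illustration, not how real LLM training works): fit a "model"– here just a Gaussian– to samples drawn from the previous generation's model. With no fresh human data entering the loop, the fitted distribution's spread steadily decays.

```python
import random
import statistics

# Toy illustration of "model collapse": each generation's model is a
# Gaussian fit to samples drawn from the previous generation's model,
# with no fresh real-world data ever re-entering the loop.

random.seed(0)
mu, sigma = 0.0, 1.0              # generation 0: the "real" data distribution
n_samples, n_generations = 20, 200

initial_sigma = sigma
for _ in range(n_generations):
    samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
    mu = statistics.fmean(samples)      # refit the model on its own output
    sigma = statistics.pstdev(samples)  # the fitted spread is biased low, so it shrinks

print(f"spread after {n_generations} generations: {sigma:.4f} (started at {initial_sigma})")
```

Each refit slightly underestimates the true spread, and with nothing from outside the loop to correct it, the error compounds– a statistical analogue of a model trained on its own outputs gradually losing the tails of the original distribution.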
OL We see that as well. This is why it's not just about collection of data today, it's also about validating data, it's also about annotation and labeling. So sometimes if it's annotated and labeled right, even if it's synthetic or even if you have no idea if it's synthetic or not, it can work. These are absolutely challenges that we'll see, I think, in the next couple of years. Maybe regulation will catch up. Everyone and every government is talking about it and also the big players are trying to be involved, especially to kind of set up the rules in advance. It's not in advance, it's actually too late I would say, but to try and tackle all of these issues. From our point of view as a major data supplier, we get much more specific requests than before. So if it was, and it kind of still is but it depends who the customer is, “Just give me everything that you can because I just need to consume as much as I can,” we see a lot more of, “I have a working model now. It's good enough. Now I need to fine-tune it, and to do the fine-tuning, I need this specific data from that data source. Help me get it.” We're seeing more of that.
RD Like you said, it does feel like regulation is coming a little late. It's been a bit of a Wild West and a lot of folks with the data are sort of catching up and restricting who can use it and there's copyright lawsuits going around. Are there going to be issues for future LLMs with the potential of copyright infringement lawsuits with data suppliers putting restrictions on what you can use?
OL I think yes, but my belief is that it's going to lean a lot more towards the technology than the content creators. That's just an opinion. The reason I'm saying that is because the technology exploded so fast in the last year and a half– I would say since GPT-3, when everyone started using it and understanding what it means– that it's kind of hard to go back in time and change things. The regulator in this case, I think, will have to be slightly more flexible than in other instances in history, just because it's out there and you can't control it anymore, so it's more about trying to contain what's currently happening than trying to change the reality.
BP I want to back up for a second here, because I want to get into the legal aspect, and I know you've been involved in some legal cases. I'll preface this by saying that you've been involved in some legal disputes with large social networks or large tech companies, and some of the companies we're talking about are clients. But what is the history of the company? Because it's really interesting: when I go back to the website, it started out with IP proxies and scraping data for retail intelligence. And so I'm sure, like you said, Bright Data is very good at looking at, analyzing, and updating global web-scale data, but the initial idea and business had nothing to do with generative AI. So can you give us a quick rundown– how did the company start, how has it evolved, and how did that take it to a position where it can be deeply involved in this new conversation around Gen AI and the sort of titanic web-scale data scraping they do?
OL Happy to do that. It's a good time to explain who we are and what we're doing. So as I said, around 2014-15 the company was established with one focus in mind, and that's still the company's focus; it's just that the industries and the use cases are evolving around it. Our sole focus is to enable our customers, which is pretty much every company in the world, to gain access to publicly available web data. That's the tagline. What it means is that the internet is probably the largest source of data in human history, and most of it is public. And when I say public, I mean it in the most simple way– open a browser, go to a website, and if you see the content, it's public. If there's additional content that you don't see and you have to be logged in, sign up, or pay to pass a paywall or anything like that, that's not public. So we're talking about the public aspect of the data. Most of the internet is that– public– but it's actually becoming harder every day to access that public information. If you think about the real world, think about a public library: you should be able to go in and read a book. It shouldn't be hard. On the internet, it's extremely hard to do at scale, to see the different content of the same webpage. We're three people here. If I share a link and we all click it together, we will see slightly different content. We might see different ads. If it's a product page, we might see different prices, different reviews, different shipping times, whatever it is, just because we're different people. It's not like driving the same highway and seeing the same billboard in the real world. And understanding what's going on on the internet from different perspectives and different points of view– that's not an easy thing to do, especially not at large scale, and this is what Bright Data is.
So we started, as you said, with very advanced proxy networks that allow our customers– and for that product, the more sophisticated customers with pretty good technical skills– to reach a website from a different location around the globe, again, to see the relevant content that they want to see. On top of that, over the years, we've built more and more products, and that really expanded our technical and product offering– for example, helping our customers run a headless browser that will execute actions on a website. If you want to search for a keyword on a website, you need to load JavaScript and execute actions and things like that. On top of that, we have the complete data offering, so if you just want to buy a dataset that we collected and curated for you, or you want to code your own crawler on top of our IDE, it is really a broad offering. I'll just finish by saying that, on top of that, we realized that most of our customers don't even want the data. They just need the data in order to get to the insights and the conclusions, so we also offer insights solutions. A year and a half ago, we realized that we had invested a decade in building the best infrastructure for AI companies to collect data from the web, even though we started with very different use cases, as I explained. And so we saw a lot of AI giants, and today also many smaller AI companies that need data, coming to us to use our existing infrastructure to get the data that they need. We also have this really unique ability to map the internet, just because of our huge scale. I'm talking about tens of billions of requests to the internet every single day.
BP You raised an interesting point– should the three of us have equal access to information? If we go to a public library– Ryan and I live in the United States, where people are always trying to take different books out of libraries, but that's a conversation for another time– same for the internet: should we be able to go to the internet and find stuff? And I think it's interesting. I feel very passionate these days, because some of what Stack Overflow was founded on is at the heart of this. It used to be, “Okay, you have a coding question. You can either go to a blog where you're not sure if you trust it, or you can hit a paywalled site that's no longer accessible to everyone because not everyone can afford it. So we're going to create a free web forum where everybody can contribute and the knowledge will always be free.” Okay, great, now it's equally accessible to everyone. But I read a recent report put together by some folks at MIT, basically saying that because of the intensive data scraping that started happening from a lot of these AI companies, a lot of formerly open public datasets and websites are now being walled off, and that when they look at some of the most commonly used and sizable datasets, they see 25% has been put behind some kind of barrier and another 45% has now been restricted by terms of service. And what does this mean for the health of everybody? Can researchers access it, can the public access it, could a startup that doesn't have the resources to pay a licensing fee access it? I think these are really interesting and important questions, and I would love to get your perspective, Or.
OL I think that in the last year and a half or couple of years, when the AI boom started, there was a lot of oversteering in different directions by different stakeholders in the industry, just because everything happened so fast. And I think that the balance eventually will come back. We saw a lot of web scraping of public information that wasn't respectful, actually, and some will argue it wasn't ethical. We're huge advocates for ethical scraping. I talk a lot about that; we publish a lot of papers around it– what's good, what's not okay to do. Even if there are no legality issues, there are still good, bad, and best practices. We preach that at every stage. We saw a lot of unfair, unethical scraping, and the reaction was an oversteering from the website owners: “Okay, I'll block everything.” But then what do you do? You also block Google; they can't index your website anymore. That also doesn't make sense. Okay, so I'll just open everything again. So we saw these really hard shifts to the right, to the left, and everything was a mess. I'm starting to see that it's getting calmer right now and the balance is being restored, because it has to be a win-win situation. As you said, if someone is over-scraping a website without respecting the website and the contributors on the website, well, it won't last for long, and then what will you do? So it has to be a win-win situation. I think that some websites will actually be able to do a sort of rev share between the content creator, the company that wants to consume the content, and the website in the middle. I don't think it will be the common practice that we'll see. It's complicated. I think that we'll see more of what you mentioned– Stack Overflow also did direct licensing deals– but actually, I think that going back to basics is always the safe place to assume everything will land eventually, and that's public versus nonpublic.
There is a real meaning to the term ‘public information.’ It's either public or not. Everything in between is really meaningless. So there are risks, or challenges for the business, in putting all the data into the public domain, because maybe someone else will use it since it's public. Fine– you can decide that it won't be public. And I think that the two court cases that we won in 2023 kind of proved that, pretty much for the first time, and it was in California, an important venue for these discussions. Everything in the middle can work– if it's licensing the content or finding a way to compensate each other and communicating between the data owner and the company that wants to license the data, that's fine– but eventually the boundaries are public or not public, and the websites need to decide, and they can also experiment.
RD So do you think the regulation, when it does finally come, do you think it'll help resolve these disputes or do you think it'll get in the way and make things worse?
OL I think it will definitely help resolve them. A similar but different situation was private information on the web– PII. A few years ago, the GDPR in Europe, and to some extent the CCPA regulation in California, settled it. You can argue whether it makes sense or not, but I love it because it's clear.
BP We could argue about how onerous the burden imposed on companies is or employees who have to take copious HR and PII and GDPR training, but certainly it convinces companies that they should take it seriously. That is for sure.
OL But at least you know. You know what you should and shouldn't do. You can disagree with it, but at least it's not grey; it's not in the middle.
BP I can agree with the laws in principle and still resent all of the training I have to do every year.
OL Agreed. Same here.
BP So we've covered off on a lot of the big questions, but one of the things that you put out there that we could talk about was how to protect yourself now with the data you use for the rules that will come later. I think this is a super interesting question. So for all the developers who are listening and the CIOs and CTOs and people who are at an organization who may have either a great dataset that they're concerned they don't want to become part of just the public commons, or they're saying, “Listen, we might use a foundation model for this, but we also want to build our own internal LLMs, small size but more specific. How do we make sure that that data is unique to us so that when we get out there and offer it in the market, it's not just something that belongs to everybody?” Talk a little bit about the advice you would give to organizations who are grappling with these questions in the absence of regulation.
OL So I always follow the same rule. It has worked really well for us. It kind of sounds trivial, and maybe a bit altruistic, but it's true: transparency is very, very important. It doesn't matter what side you're on– the side that is collecting data or the side that is generating the data– document everything and just put it out there. If you go to Bright Data, we have a trust center, and every question you have is already answered there about the practices, the scale, the why and the how. And we found it to be very, very important. First of all, if you put it out there, you make sure that internally it will work this way, because sometimes you get a legal opinion, or just a decision you made because you think this is how things should work, but you're not really serious about it. And you should be, especially in this industry– know that things will happen, that there will be regulation on AI and on datasets and things like that. Put it out there. You will also get really good feedback just by putting it out there, and you'll find that when you're transparent about why you do things and how you do them, the other side will be very accepting of that, will actually be willing to share feedback, and will have a lot more patience if you did something wrong, just because you're honest about it. And sometimes things aren't clear, so you just need to make a decision and commit to it. Do that, that's fine, but be transparent about it. It's pretty broad, but that's what's been working for us for so long.
BP Cool. All right, Or. So I'm just going to throw a couple of weird questions out. The first one is, we're talking about synthetic data and we're saying that in some instances, if you are really careful about how you select the training data that you use to then create synthetic, it could be great. If you say, “Only the top 10,000 Stack Overflow questions with accepted answers,” a huge bunch of humans have already done the work of validating this data in many different ways, so it's great training data. If you say, “Go out and read every blog post that was published today and use that,” maybe half of them were written by AI and now we're on our way to model collapse. But I do see research coming out that indicates that Gen AI can produce novel output. If we're talking about folding proteins, it can come up with new ideas. I saw a study recently that said that if you take a hundred PhD-level researchers and you show them proposals for new ideas for research of where should science go next, they, by a large majority, choose the ideas generated by AI as being more interesting than the ideas generated by other human beings. So this is more that we're just going off on a philosophical tangent here, but is AI creating new novel things of value? And if so, can that become part of our ecosystem– the internet, all the data, the art, the culture, the thought in a healthy way? Or if not, what guardrails do we need? What considerations do we need? What human in the loop do we need for that kind of stuff?
OL I like to look at these things by trying to perceive AI, in general, as a faster calculator than the human brain. So I wouldn't call them or refer to them as novel ideas. It's just faster to get to the novel ideas that the human brain would get to in X amount of time– years, centuries, whatever. So I don't think it's actually creating new knowledge. Folding proteins– it's compute, it's a lot of effort going into it, but if you had a huge, super smart, fast brain or collective brain, that would happen as well. So it's not that AI will come up with a new protein.
BP Right. This is a term I've used. Ryan, I don't know if you've used it, but this is a term I used that a friend said to me, that the AI is a thought calculator. It's not coming up with new ideas unless you ask it to. It's not coming up with new ideas in a vein that it wasn't instructed to, but it can take ideas, reason, language, math, code, and work through them, and to your point, Or, when they score really high on the math Olympiad on a novel set of questions, it's like, “Good for them.” But if you read the fine print, the AI scores really well when it gets to generate 10,000 answers and then have a subroutine that checks those answers and then have a subroutine that votes for the best answer among the answers that checked out. So if my brain could do that, I would score pretty well on this test, too.
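The generate-then-verify loop sketched above can be written down in a few lines. This is a toy illustration with a made-up "solver" and an exact checker; real systems pair an LLM sampler with a learned or programmatic verifier:

```python
import random
from collections import Counter

# Toy sketch of "generate many, check, then vote": propose lots of candidate
# answers, keep only the ones a checker accepts, and return the most common
# survivor. The task here is trivial (sum a list), so the checker can be exact.

random.seed(42)
problem = [3, 1, 4, 1, 5]

def noisy_solver(nums):
    """A deliberately unreliable 'model': usually right, sometimes off by a bit."""
    answer = sum(nums)
    if random.random() < 0.4:                  # 40% of samples come out wrong
        answer += random.choice([-2, -1, 1, 2])
    return answer

def checker(nums, answer):
    """Verifier subroutine: independently re-derive the result."""
    return answer == sum(nums)

candidates = [noisy_solver(problem) for _ in range(10_000)]
verified = [a for a in candidates if checker(problem, a)]
best, votes = Counter(verified).most_common(1)[0]
print(best, f"({votes} of {len(candidates)} samples survived the checker)")
```

With an exact checker, only correct candidates survive and the vote is trivial; the hard engineering question in practice is what to do when the verifier itself is noisy.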
OL I will use ‘thought calculator.’ I really like that. But I think that the dangerous part here is that the human brain has an ego. If we just put the ego aside, it's actually a really good extension of our brain to have a thought calculator. So let's say it's actually innovative– why not? Maybe we can reach some amazing stuff if we think this way.
BP One hopes, for medicine or fusion or whatever. We need some more ideas, solving climate change. All right, the last thing I wanted to do before we go is, we talked a little bit at the beginning. I said, “What's your backstory,” and you said, “I'm not really a coder. I came into this more just as a problem solver.” But when I Googled you, there's this amazing excerpt. It says, “Or’s path to becoming a CEO included cleaning toilets, dropping out of school, and telling a company founder that he did not like his product.” So could you just say that? If you could just say, “That's how I got started,” because I like that version better.
OL Indeed, that is how I started. I started by cleaning toilets in the cinema, but that's not that techie, so I moved on. But actually I learned a lot from cleaning toilets. Maybe I should write a LinkedIn post about that.
BP Definitely.
OL And I always liked to build products. I started with building my own products online. One day I happened to use a product– it was a VPN product that I tested and actually used; I needed it. And I had a lot of feedback, and I found that the founder happened to be also the founder of Bright Data. So I sent him an email with my disappointment in how the product operated, but also with recommendations for how it should work. And I think the rest is history. He liked that. He also liked the fact that I gave solutions, not just complaints. We met, and within a few hours of meeting– I had a few companies of my own at the time– I decided to sell all of them, which I did in a couple of weeks, to the partners that I had, just to come work with these guys and try to solve the bigger problem that we're solving today.
BP I’ve got to say, you need to always tell that story. If I ask, “What's your origin story?” you should just go with that one, because it's really good.
OL So I cleaned toilets. This is how I started.
BP Also that you insulted this guy and that he liked that you weren't a yes man and then that you sold everything and went to do it. This is how movies are made.
OL You know what? I'd like to hope– and I actually think– that this is how I operate today as well. I'm looking for the insults; that's how you get better.
RD You heard it here first. Send your insults.
[music plays]
BP All right, everybody. It is that time of the show. Let's shout out someone who came on Stack Overflow and added a little knowledge to the library we all access. Congrats, Guizo, awarded a Populist Badge seven hours ago. Your answer was so good that it got more upvotes than the accepted answer. “How can I minify JSON in a shell script?” Guizo has a great answer for you, earned himself a Populist Badge, and helped 30,000 folks with the same question. As always, I am Ben Popper. You can find me on X @BenPopper. Email us with questions and suggestions for the show: podcast@stackoverflow.com. If you want to come on and be a guest or you want to hear us talk about something, let us know. And if you enjoyed today's program, why don't you subscribe so you can get more episodes in the future.
RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me on the internet, you can find me at LinkedIn.
OL And I was Or Lenchner, CEO of Bright Data. You can always reach out to my LinkedIn, that's Or Lenchner. Feel free to go to brightdata.com to learn more about Bright Data. And if you need web data, you know where to find us.
BP Awesome. All right, everybody. We'll put those links in the show notes. Thanks for listening, and we will talk to you soon.
[outro music plays]