Ben chats with Shayne Longpre and Robert Mahari of the Data Provenance Initiative about what GenAI means for the data commons. They discuss the decline of public datasets, the complexities of fair use in AI training, the challenges researchers face in accessing data, potential applications for synthetic data, and the evolving legal landscape surrounding AI and copyright.
The Data Provenance Initiative is a collective of volunteer AI researchers from around the world. They conduct large-scale audits of the massive datasets that power state-of-the-art AI models with a goal of mapping the landscape of AI training data to improve transparency, documentation, and informed use of data. Their Explorer tool allows users to filter and analyze the training datasets typically used by large language models.
Shayne and Robert are the authors of a new study, Consent in Crisis: The Rapid Decline of the AI Data Commons, the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training sets.
Connect with Shayne via his website.
Connect with Robert via his website or on LinkedIn.
Stack Overflow user George Hawkins earned a Populist badge for his answer to "How to get base url in angular 5?".
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow. When Stack Overflow was started in 2008, it was really an attempt to create a place where software developers, engineers, and technologists could share knowledge without a paywall. At the time, there was something called Experts Exchange where you might find answers to your coding questions but you'd have to pay up first, or you might find them inside a long, rambling blog post where it was hard to say if the answer was correct. So Stack Overflow's idea was to harness the wisdom of the crowds, make this forum, give people free internet points (who doesn't love reputation gamified a little?), and then create these knowledge artifacts. Over time that's built up to, I believe, over 20 million, probably far more than that, question and answer pairs with accepted answers, with more added each day. Now, one of the biggest changes in the world of technology over the last year and a half has been the explosive growth in focus around generative AI and large language models, and large language models can be quite good at generating code, and some of them are trained on Stack Overflow data. We, for example, announced partnerships with OpenAI and with Google for Gemini, where they will train their models on our data, and in return, we receive resources that we can then funnel back into the company and the community. The idea there is that hopefully there becomes a virtuous cycle between the humans who create the knowledge and the AI that is trained on it. Today, I have two great guests: Shayne Longpre, an MIT PhD student, and Robert Mahari, who is both an MIT PhD and a Harvard JD. They are here to discuss a paper that came out, "Consent in Crisis: The Rapid Decline of the AI Data Commons," which looks at the decline in availability of some of the biggest datasets that were once public and served as the training data for these foundation and frontier LLMs, or Gen AI models. The two of them work with an organization called the Data Provenance Initiative that is auditing, looking at, and thinking about the future of public data on the internet and how it will work with respect to AI systems and their training corpora. So without further ado, Shayne, Robert, welcome to the Stack Overflow Podcast.
Shayne Longpre Great to be here, Ben. Thanks for having us.
Robert Mahari Thanks for having us, Ben.
BP So first of all, maybe tell folks how it is that you got involved with this particular project and what was sort of the genesis of this paper.
SL So we've been working for a while on tracing the provenance, the sources, and the licenses of datasets that are used in training AI models, and also the representation, diversity, and heterogeneity of that data. It turns out these datasets really are what's driving the impressive capabilities of modern AI models, and surprisingly, people don't have a good grasp, including the engineers, of the composition of that data: where it came from, what the potential issues with it are, whether legal, ethical, or just whether it's the appropriate composition for what you want to do. And so we wanted to bring more transparency to that and also shed light on the trends, because many people weren't realizing the legal and ethical issues until lawsuits like The New York Times's, but actually there have been issues for a while, and some of our studies shed light on this.
RM From my perspective, I think it's interesting to think about where this originated because we've now developed four or five different related works on data provenance. And we really started in May 2023 when, like Shayne said, we started being interested in some of these questions. And in my view, it might be worth thinking about who is the audience for this work, and on the one hand, there are the people who are developing these Gen AI tools who have an incentive to be law-abiding and reduce risk and also have the highest quality data. And for that community, I think we wanted to try to provide access to information on provenance that would enable them to make more informed decisions. But there are also other communities here. You have the legal community which is increasingly interested in some of these questions as lawyers, as legal practitioners trying to understand what does all of this mean, where does fair use apply, what about copyright and all these questions. And while we can't provide legal advice through our papers, I do think that some of the empirical insights we get are really relevant to those lawsuits. Similarly, the regulators who are more and more thinking about AI regulation, I think some of the insights that we've gathered have been able to shed light on key aspects of AI regulation. Happy to talk more about that. And so it's been really great to put together a team of computer scientists and lawyers who are interested in this work and to get insights that are so broadly applicable.
BP So you mentioned The New York Times lawsuit. Obviously, that will proceed through the courts and be tested, it seems, and may set interesting precedent. Would you compare this in some ways to the early days of search, when a lot of questions arose around whether big search engines should be licensing this content or paying some kind of transactional fee every time they surface an excerpt of it? And as search engines became more and more proactive about surfacing large bits of content, the question became, are they driving traffic in the same way they used to, or are they simply answering the user's query right on their search page and, in that sense, getting more and more value from the content that they index without providing commensurate value to the organizations that are out there creating the news and the content?
RM A lot of the legal cases that are being cited in the Gen AI lawsuits are actually from that era of search. In many ways, the legal questions are similar. It's about to what degree you can use this content in order to create a product that is quite different from the original purpose of the content. And if we're looking at some of the tool usage and the web search capabilities of Gen AI, you could imagine driving traffic to the original artifacts, but of course that's not the only way that people are using the content, and sometimes it is a kind of direct replacement: you have situations where someone won't access the original content because they can get the answer or the information through a generative tool. So without getting too much into the legal weeds, there are two types of discussions happening. The first is about whether training an AI model, before you've generated any outputs, causes any sort of legal issue, and there we've been drawing a distinction between fine-tuning data and pre-training data. There are other distinctions worth drawing, but the interesting one is that, by and large, pre-training data was created for some purpose other than training AI models, whereas fine-tuning data is generally created for the sole purpose of training AI models. And so it seems like some of the factors that people would consider in the fair use analysis would weigh more in favor of finding fair use for pre-training data than for fine-tuning data. But there are going to be exceptions, especially when you start looking at one of the factors of the fair use test in the US, which relates to the market impact. Of course there are going to be some pre-training providers for whom the market impact will be much more significant than for others. So those are some of the considerations on the first side, and then you have the output side. When a specific output is generated, does that specific output, if it's similar to something that was in the training data, cause a copyright issue? And there it seems like you have a whole separate analysis, which is about how you are using that output. So if I'm an artist and my art makes its way into the training data, and you then generate an artifact that is really similar to a specific work of art that I've created, if you're only using that to show your friends, or maybe print it out and hang it on your wall but you don't show anybody else, that is a different situation than if a company is using it in their marketing material. And so there's a separate, very fact-intensive inquiry that's needed that, frankly, I think we shed a lot less light on through our work.
BP Robert, just briefly, can you give the folks who are listening a quick overview of fair use doctrine and how that's applied in the past to data on the internet and how it relates to these new questions that are arising around large language models?
RM So the first thing to say is that fair use is an American concept, and one of the big challenges we'll have in the AI space is how we gain global consensus on some of these legal questions. The fact is that a lot, but not all, of the major generative AI foundation model companies are located in the US, so it makes sense pragmatically to start here, but there are other regulations elsewhere. Fair use is a doctrine that essentially allows you to use a protected work in some way that is very different from the original work's purpose and doesn't really compete with the original work. A good example is satire. If you have a play and I write a satirical article about that play, even if I use quotations from the original play or maybe even images from it, if I use those in a way that is completely different in the purpose I'm trying to achieve and doesn't really compete with the market for the original, et cetera, then that's more likely to be deemed fair use. Another example is education, if I'm using copyrighted material for education. And so the big question is whether using content that was created for dissemination on the internet to train AI models amounts to fair use, and there has been a lot of spilled ink on that question, both academically and in various lawsuits.
BP When I think about fair use, I often think of this test, like you said: is it transformative in some way and is it competitive in some way? If it's both transformative and noncompetitive, it will probably pass the fair use test, and you gave a good example there with art. Then again, let's take this back to some of the questions internally. What we're asking is, if I am at Stack Overflow and a system has been trained on our data and then it goes out and generates code for folks, we specifically stated in our sharing of the data over the years that it should be used for educational purposes, academic research purposes, and at-home personal purposes, but not commercial ones. So if you're charging for code gen and you've trained on our data, then maybe there is an issue, especially if the solutions that you're providing are essentially identical to the ones you would find on Stack Overflow. And so what we've been saying, and this came up before in some of what you said, is that, first, we want to work out licensing deals; second, we want to return value to the community; and lastly, we want attribution if possible. You provide code, but then you provide citations to the questions on Stack Overflow that allowed this answer to be created, and you provide links so that folks can go and check the ground truth, which is actually important. We know that LLMs and Gen AI tend to hallucinate, so you get an answer, but you can also check it against the accepted answer on Stack Overflow, which has been vetted by thousands of human beings. But Shayne, tell me how you're thinking about this particular question.
SL I think maybe I can bring some light empirical analysis to this question. In our paper, probably one of the most controversial plots is the last one, which I'll quickly disclaim by saying it has several limitations. First off, it's from a dataset called WildChat, which consists of logs voluntarily contributed by real users who are using AI models. They contributed their logs, and this is one of the few naturalistic data sources that we have to tell us how people are potentially using these models. The disclaimer is that, knowing that they're volunteering their logs, knowing that it's contributed in sort of an academic context, and given that it's one model in one place and time, it doesn't represent all of generative AI, and we don't know how representative it really is. But for this one LLM use, in this case ChatGPT, we actually manually went through hundreds, even thousands, of the logs and tried to representatively measure what the different uses were. And interestingly, Ben, can you guess, do you know what the number one use was?
BP I'm from Stack Overflow, so I'll say answering coding questions, but I don't know. Code generation?
SL So coding was one of the top few. Actually, about 8-10 percent of all queries were for code, which surprises me because it's probably about 50-70 percent of what I do. The biggest, by a large margin, was actually creative composition. So people writing fiction, poetry, role playing different things. And the second one, at least in this context, maybe somewhat surprisingly, on a different tangent, was sexual role play. But in any case, we can actually see to some extent how people might be using these models. And it's interesting that something like current affairs and news was very low, but that might be because ChatGPT has a training cutoff date, whereas something like coding was pretty significant, as were creative uses. And so in the case of Stack Overflow, you're absolutely right. It's likely that these models are being used for cases that draw directly on Stack Overflow knowledge or data, or at least on other sources, like GitHub, that it might have been pulled from.
BP This is one of the areas where you might see an issue. This is my opinion based on nothing but my gut and not legal expertise, but if people were generating poetry or fiction or role playing scenarios for a Dungeons & Dragons campaign, and what they were saying was, "I need you to help me fill in something based on these adjectives and evocative descriptions I give," that feels different from saying, "Write me three pages like a Robert Patterson novel, and then I'm going to go do the rest," so that together we're going to basically write a Robert Patterson novel, in which case the artist might fairly claim, "It's read my works and now it's recreating them for you, and so this is an issue." Now you could say, how is that really different from saying, "Write me a spy thriller that feels like a bestseller, and then together we're going to create this book"? But of course, what do people do? They read bestsellers and then they try to imitate and take from there what they think is best. I guess one of the questions then becomes, does the fact that the models can do it at scale, and can produce it at scale and velocity, change things? If an artist has to go out and read a hundred detective novels and then think about how to synthesize that style and then sit down and do it themselves, that's one thing. If an AI system can read the entire internet and then produce thousands of novels a day as long as you have the money to pay for the inference, is that different? But let me let you speak to the question of attribution.
SL There are so many different interesting tangents here, Ben. On attribution, when it comes to many of these tasks, it's lacking just because the methods we try to develop in order to trace what examples in the training set influenced a generation, something called influence functions among others, are underdeveloped, and it's not clear, even though people have been working on this for a long time, that we'll have any clear certainty into what distribution of examples influenced a certain generation and why, or how to measure that in a clear way. So the real proposed panacea, which it isn't, but the one that people really love, is RAG: Retrieval Augmented Generation. Because if you retrieve different web pages at test time, when people are really using the model, it could be Stack Overflow, it could be different news items, it could be actual fiction itself, that gives people the opportunity to point directly to and cite back to these sources and even flow traffic to them. And we're even seeing modern AI developers introducing new crawlers on the web, which is where they collect their data. Some are for training data, but the newer ones are for search and retrieval. And so they're separating out these two uses, because they're hoping that maybe some artist doesn't want to be in the training data, but maybe they want to be referenced at retrieval time because they think that would drive more traffic to them or they could be cited or attributed, and it remains to be seen how this will develop.
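[For illustration, here is a minimal Python sketch of the retrieval-augmented generation pattern Shayne describes: retrieve sources at query time, hand them to the model as numbered citations, and return the links so readers can check them. The document list, the toy keyword retriever, and the answer_with_citations function are hypothetical, and the call to a language model is left as a placeholder rather than any specific vendor's API.]

    from collections import Counter

    # Hypothetical sources; in a real system these come from a live web search or index.
    documents = [
        {"url": "https://stackoverflow.com/q/123", "text": "Use providedIn root to make an Angular service a singleton."},
        {"url": "https://example.com/news/ai", "text": "Publishers are updating robots.txt to restrict AI crawlers."},
    ]

    def score(query: str, doc_text: str) -> int:
        # Toy retriever: count overlapping words between the query and a document.
        q = Counter(query.lower().split())
        d = Counter(doc_text.lower().split())
        return sum((q & d).values())

    def retrieve(query: str, k: int = 2) -> list[dict]:
        # Return the top-k documents by overlap score.
        return sorted(documents, key=lambda doc: score(query, doc["text"]), reverse=True)[:k]

    def answer_with_citations(query: str) -> tuple[str, list[str]]:
        # Build a prompt that asks the model to cite the numbered sources it was given,
        # and return the source URLs so they can be surfaced as citations.
        sources = retrieve(query)
        context = "\n".join(f"[{i + 1}] {s['text']} (source: {s['url']})" for i, s in enumerate(sources))
        prompt = (
            "Answer the question using only the numbered sources below, citing them as [n].\n"
            f"{context}\n\nQuestion: {query}\nAnswer:"
        )
        # Generation step intentionally omitted: pass `prompt` to whatever LLM you use.
        return prompt, [s["url"] for s in sources]

    prompt, citations = answer_with_citations("How do I make an Angular service a singleton?")
    print(prompt)
    print("Citations:", citations)

[Because the sources are gathered at inference time, attribution falls out of the pipeline itself instead of requiring influence functions over the training set.]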
BP We wrote a blog post about building that kind of system internally, and we have Stack Overflow for Teams, our knowledge base, which is a SaaS product we sell, and that's exactly what it does. It performs a RAG type of search and then it can provide citations. And when I go and ask a cutting-edge LLM, "Hey, can you give me an update on the situation over here with these geopolitical actors?" often it'll say, "Searching the web, analyzing, verifying sources," and then it will include citations. Or I like to ask it about academic research, like, "Can you tell me what the most cited papers in brain machine interfaces were for the last month, and include a couple of links to the papers," and then it does it. In that case, it's using the reasoning and natural language capabilities it gained through its training to understand and reply, but it's basically performing a web search and responding with what it finds in order to provide you with the answer. And so that allows you to understand what data it used to create the response at inference time, whereas, as you pointed out, otherwise it's sort of a black box. We don't know, out of everything it trained on, what caused it to generate that response. And these models are not deterministic. Five people could ask it the same question and get five slightly or even very different answers in return. Robert, do you want to weigh in on this?
RM Well, I wanted to take a big step back and ask what the law is trying to achieve here. And there's an answer to that, and this is a quote: to promote the progress of science and the useful arts. So the question, at a really high level, is this: on the one hand, we have science and, I think, useful artistic outputs coming out of generative AI. We also have, completely unrelated to generative AI, people creating creative works, doing science, all sorts of interesting things. And so it seems like a trade-off. It does seem like we're eroding some of the rights that people have traditionally associated with their works to allow someone else to create these products. So it's this interesting high-level question. We get into the nitty-gritty of the four-factor fair use test and so on, but there is this high-level North Star that, at least in theory, is what we're trying to pursue.
BP Is that jurisprudence and precedent in the United States you're saying where we have this North Star of trying to promote the arts and sciences while also respecting copyright and licensing?
RM So the North Star comes from the constitutional basis for intellectual property rights which is this promotion of science and the useful arts. That's why we have IP law in the first place. That's why we're willing to give artists and creators and inventors, in the case of patents, this limited monopoly over their works.
BP Right, we have IP laws so that people feel that they have suitable protections to pursue the arts and sciences and not be ripped off, and we also try to have a fair use doctrine so that it's not so limiting as to outlaw satire or education. So that's the balance we're striking.
RM That's the balance. And with patents, we want people to disseminate their inventions. We don't want everything to be a secret. So you publish your patent, everybody knows what the invention is, but you get this monopoly for a limited time so that you can essentially recoup the investment you made. And copyright doesn't expire for a long time, but it does expire after a while and then things become public domain and you can use them any way you want.
BP I think you said something like there are 1,800 major data sources, and you cited a few really big ones that you know are public that might be changing. How do you know what to look at to say, "This, we think, is the corpus that it's drawing on"? And then the follow-up question would be, knowing what the sources are and knowing now how some folks are changing, you're talking about potentially a decline in the AI commons. What would be the impact of that? So let's start with: how did you decide what you believe is the AI data commons? How do you decide on that if the training data of the models is often not released when they're shared?
SL So the data commons, as we define it, is largely publicly accessible, crawlable data. It's the web. And we use Common Crawl, a nonprofit, as the basis for this, because literally millions of AI models have been trained off of this resource, and not just text models; that crawl and its links have also been used to train multimodal models, so images, videos, speech, audio, all these things, and that is the foundation for modern foundation models. And so we look at the composition of that crawl, which is likely not too dissimilar from what big companies will also have access to, and we sampled the web domains contributing the most tokens, or words, to the dataset: about 2,000 from three major datasets, then 10,000 as a random sample to tell us what the permissions and preference signals look like on the web writ large. And so we actually looked underneath popular pre-training datasets to the websites from which they're taken and did the audit on those.
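[For illustration, a rough Python sketch of the sampling Shayne describes: tally the tokens each web domain contributes to a pre-training corpus, then audit the head domains plus a random slice of the rest. The record layout, the tiny in-memory corpus, and the whitespace token count are hypothetical stand-ins, not the Data Provenance Initiative's actual pipeline.]

    import random
    from collections import Counter
    from urllib.parse import urlparse

    # Stand-in for a web-scale pre-training dump: each record is a source URL plus document text.
    corpus = [
        {"url": "https://en.wikipedia.org/wiki/Data", "text": "Data is a collection of values ..."},
        {"url": "https://stackoverflow.com/q/123", "text": "How do I parse JSON in Python?"},
        {"url": "https://smallblog.example.net/post", "text": "Some thoughts on gardening."},
    ]

    # Tally how many tokens each web domain contributes (crude whitespace tokenization).
    tokens_per_domain = Counter()
    for record in corpus:
        domain = urlparse(record["url"]).netloc
        tokens_per_domain[domain] += len(record["text"].split())

    # Audit the head: the domains contributing the most tokens.
    head = [domain for domain, _ in tokens_per_domain.most_common(2000)]

    # Plus a random sample of the long tail, to estimate what the web looks like writ large.
    tail = [domain for domain in tokens_per_domain if domain not in set(head)]
    random_sample = random.sample(tail, min(10_000, len(tail)))

    print("Head domains to audit:", head[:5])
    print("Random-sample domains:", random_sample[:5])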
BP I think I understand now. You're saying that you decided you could get a good bead on what data they're using by first looking at the Common Crawl, which is enormous and encompasses a large portion of the internet, then pulling out the constituent parts of that that are extremely large, then also taking a random sampling, and, if you get the chance, looking at fine-tuning data. All of that makes sense to me. So now the follow-up question, and the premise of the paper, is: understanding that this is now happening, some folks are closing off their data. They're saying, "We don't want AI models to train on this," similar to the way in the past some people might have said, "We don't want this data indexed for search." What is the thrust of the paper? What may happen to the data commons, or the AI data commons, and what would the results of that be?
SL So in less than a 10-month period, a significant portion of the data in really popular pre-training datasets was closed off. In particular, about a quarter of the most popular, best-maintained, highest quality data was retracted, at least if you respect the preference signals expressed in robots.txt. That's the one file on the internet readable by both machines and humans that essentially indicates whether or not we want our data to be used. If you look at the terms of service the web pages are putting up, the restrictions are much higher. But the unspoken tragedy is that while the terms of service are communicated in natural language, which allows you to express exactly what you do or don't want done with your data, they're not machine readable, so machines rely on robots.txt and ignore the terms of service. And so a significant amount of data is rapidly going away if you respect these signals, which many are respecting and others might not be but might have to in the future, and we expect that to continue rapidly. And when it comes to AI language models, the scale of data, and especially high quality data, really matters, so this will affect the performance, maybe the factuality, maybe the composition and representativeness of the underlying data that's powering these models.
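[For illustration, a small Python example of how a crawler checks the robots.txt preference signals Shayne describes, using the standard library's robotparser. The robots.txt content here is made up; GPTBot and CCBot are real crawler user agents used only as examples. Note that nothing in this mechanism reads the terms of service, which is exactly the gap discussed above.]

    from urllib.robotparser import RobotFileParser

    # Made-up robots.txt: block GPTBot everywhere, block CCBot from the archive,
    # and allow every other crawler.
    robots_txt = """\
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /archive/

    User-agent: *
    Allow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    for agent in ("GPTBot", "CCBot", "SomeResearchBot"):
        allowed = parser.can_fetch(agent, "https://example.com/archive/article.html")
        print(f"{agent}: {'allowed' if allowed else 'blocked'} for /archive/article.html")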
BP Gotcha.
RM The other thing to mention here is that you have robots.txt, where websites express preferences, but you also have other ways by which websites can make their content harder to access. And it seems like in a world where there is a lot of regulatory ambiguity and mixed adherence to robots.txt and the terms of service, a lot of websites are essentially resorting to self-help, making it harder to access this content. In the aftermath of publishing this work, we've spoken to a lot of folks who are in the web crawling business, and it seems like it is becoming much harder to get access to this data, putting robots.txt aside. And this is somewhere where my personal thinking has really evolved, because when we started this work a year ago, I was of the opinion that, at a societal level, it doesn't really make sense to have a licensing regime for pre-training data, because how would you even administer it? You would impose such costs on smaller companies that are trying to innovate, and on researchers, that only the large companies that can afford to pay royalties and licensing fees would be able to compete, and this would be a worse outcome for the world. But what we're seeing is that in the absence of licensing revenue and an ability to essentially direct what kind of content is used and how, websites are making access to their data harder, which means that we're back to square one. There are some companies that can afford to license this content, but for smaller players, for innovative players, for researchers, access is becoming harder, and you have this negative externality for everybody. Forget about AI: everyone who wants to use the web is encountering more restrictions, more paywalls, more impediments to accessing the web openly. And so, like I said, my thinking has evolved to where it seems like without licenses, we're actually in a very similar world as with licenses. And what is really needed here, and I think this to me is the thrust of the paper, is clarity on what's going on, and we need to give creators a way to signal their preferences and to enforce those preferences, because otherwise they'll resort to self-help, and it'll be decentralized and fragmented in a way that's worse for everybody.
BP I wonder if the world of software development may provide a path here, which is to say that open source licensing can be quite specific, and you could imagine a similar regime being built for fiction or even for imagery, where folks say, "There are limitations that I'm setting on how you can fork this, or how you can train on it, or what kind of output you can provide if the query requests something of mine by name, but you have to respect the license I've set out," kind of like how the Apache license works for certain kinds of open source code that can be reused by companies for commercial purposes as long as they follow the strictures. But y'all are researchers, so what are the second-order effects happening to academics and folks who want to look at this data, which is now being walled off?
SL I think it's really important to highlight that the current regime is devolving into a crawler arms race, which means that the crawlers are becoming more sophisticated and working hard to get as much data as they can, and the websites are fighting back, not just with robots.txt preference signaling, but also by imposing technical barriers to try to differentiate and block bots from humans. And because there's one pipeline that all the bots go through, whether they're for researchers or for non-AI-related uses, maybe product catalogs, or price comparison websites, or financial market research, web archives, or academic research, they all hit a system that really is not robust to the modern, complex diversity of uses. And we really see that in our paper. I won't go into too much detail, but especially the smaller players, the creators and website domains, are having a really tough time tracking hundreds of different AI crawlers, figuring out which ones to block, what they're being used for, and whether they're anticompetitive for them or would actually direct traffic to them, and as a result, they are trying to block just about everything, or being much more aggressive, for fear of what it might do to their platform. And so how does this play out? It plays out in a way where the large companies and creators, something like Stack Overflow even, can get licensing deals, or they can invest in strong, robust blockers against other crawlers, or have a team of technicians that can figure out which crawlers to accept and which ones to block, but this doesn't work out for small creators who don't have any of those resources and don't have the ability to license that data or block access to it. Think small artists, small publishers, things like that. And it doesn't work out for researchers, because tens of thousands of academic articles have been written using the web and these web archives, but right now researchers' ability to access data is the first thing that's going to go, because they don't have the ability to license it or access it easily, and we've talked to many who are struggling with that.
[music plays]
BP All right, everybody. It is that time of the show. We want to shout out someone who came on Stack Overflow and provided an answer or shared a little bit of their curiosity. Awarded two days ago to George Hawkins– the Populist Badge. That's when you provide an answer to a question that has an accepted answer, but your answer is so good that it outscores the accepted answer by at least 2x. So George, thanks for your answer to the question, “How to get the base URL in Angular 5,” and congrats on your Populist Badge. I am Ben Popper. You can find me on X @BenPopper. If you have questions or suggestions for the show, you want to come on and be a guest, or you want to hear us talk about something, email us, podcast@stackoverflow.com. And if you liked the show today, the nicest thing you could do for me is hit that subscribe button or leave us a rating and a review.
RM My name is Robert Mahari. It's great to be here. I do research on computational law at the MIT Media Lab and at Harvard Law School. If you want to know more about my work, you can check out my website at robertmahari.com or on LinkedIn. And if you want to know more about this work, check out our papers, Consent in Crisis, or the whole Data Provenance website.
SL Thanks for having me, Ben. It's been a pleasure to be on the podcast. I'm Shayne Longpre, I'm a researcher at MIT. And you can find my work at shaynelongprey.com or our work on the Data Provenance Initiative at dataprovenance.org. Please don't hesitate to reach out if you'd like to chat more.
BP Awesome. We'll put those links in the show notes. All right, everybody. Thanks for listening, and we will talk to you soon.
[outro music plays]