The Stack Overflow Podcast

Tragedy of the (data) commons

Episode Summary

Ben chats with Shayne Longpre and Robert Mahari of the Data Provenance Initiative about what GenAI means for the data commons. They discuss the decline of public datasets, the complexities of fair use in AI training, the challenges researchers face in accessing data, potential applications for synthetic data, and the evolving legal landscape surrounding AI and copyright.

Episode Notes

The Data Provenance Initiative is a collective of volunteer AI researchers from around the world. They conduct large-scale audits of the massive datasets that power state-of-the-art AI models, with the goal of mapping the landscape of AI training data to improve transparency, documentation, and informed use of data. Their Explorer tool allows users to filter and analyze the training datasets typically used by large language models.

Shayne and Robert are the authors of a new study, Consent in Crisis: The Rapid Decline of the AI Data Commons, the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training sets.

Connect with Shayne via his website.

Connect with Robert via his website or on LinkedIn.

Stack Overflow user George Hawkins earned a Populist badge by explaining How to get base url in angular 5?.