RTN No 80: Data Creators vs Data Thieves 🥊


Tuesday, March 25th


IN DATA NEWS

OpenAI must face part of Intercept lawsuit over AI training

OpenAI lost a bid to dismiss a lawsuit alleging it misused news articles published by The Intercept to train ChatGPT. This is a win for media outlets, although the same New York judge dismissed The Intercept's claim that OpenAI unlawfully distributed its articles after removing their copyright information.

😌 Data creators add one to the win column for the regular people.

🥷🏽 Data thieves must come face-to-face with their crimes. Well, partially.

The Intercept (i.e., the data creators) successfully showed the data trail of how its articles were de-copyrighted by OpenAI (i.e., the data thieves), which used algorithms like Dragnet and Newspaper to strip copyright management information from around the main text of its articles. The Intercept argued that, as a result, OpenAI violated the Digital Millennium Copyright Act by training on The Intercept’s works. This part of the lawsuit is moving forward because the data creators’ lawyers told a compelling data story well.

So, what’s copyright management information?

According to the Copyright Alliance, “[c]opyright management information, or CMI, is information about a copyrighted work, its creator, its owner, or use of the work that is conveyed in connection with a copyrighted work. For example, CMI would include the copyrighted work’s title, ISBN number or copyright registration number; the copyright owner’s name; the creator’s name; and terms and conditions for use of the work.” Let’s say it more plainly: all content has context, and that context can’t be removed without the explicit permission of the content authors/owners.

And, wait…are there algorithms that can remove this contextual information?

Yep. Dragnet, a machine learning-based content extraction method, is an open-source Python software library, available here. Published in 2013 at the World Wide Web Conference, Dragnet was at the forefront of extracting the main body of an online article, along with user comments, from multi-panel webpage layouts that mix text, images and videos.
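To make the mechanics concrete, here’s a deliberately tiny, standard-library-only sketch of what content extractors in this family do: keep the article body and throw away the “boilerplate” around it. This is illustrative only — the real Dragnet uses trained machine-learning models, not a fixed tag list. Notice what falls out along the way: the footer’s copyright notice, exactly the kind of CMI at issue in the lawsuit.

```python
# Toy sketch of a Dragnet-style content extractor (illustrative only).
from html.parser import HTMLParser

# Tags that typically hold navigation, legal notices and other boilerplate.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Collects text that appears outside common boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.boilerplate_depth = 0  # nesting level inside nav/footer/etc.
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.boilerplate_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.boilerplate_depth:
            self.boilerplate_depth -= 1

    def handle_data(self, data):
        if self.boilerplate_depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = """
<html><body>
  <nav>Home | About</nav>
  <article><p>The actual story text.</p></article>
  <footer>© 2025 Example Outlet. All rights reserved.</footer>
</body></html>
"""
extractor = MainContentExtractor()
extractor.feed(page)
print(" ".join(extractor.chunks))  # the copyright footer is gone
```

The extracted text keeps the story but silently drops the copyright line — which is the whole problem when the output is then used as training data.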

Newspaper3k is a more modern variation of Dragnet. It’s also an open-source Python software library, available here. Its robust feature set includes a multi-threaded article download framework; news URL identification; text, top-image and all-image extraction from HTML; keyword, summary and author extraction from text; Google trending-terms extraction; and support for 10+ languages (English, Chinese, German, Arabic, ...).
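To give a flavor of one feature on that list, keyword extraction from text can be approximated in a few lines with stopword filtering and word counts. This is a rough, standard-library-only sketch, not Newspaper3k’s actual algorithm, and the sample text and stopword list are made up for illustration.

```python
# Rough approximation of keyword extraction: drop stopwords,
# count the remaining words, return the top-k most frequent ones.
# (Illustrative only -- not Newspaper3k's real implementation.)
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "for", "on", "that", "its", "with", "from", "before"}

def extract_keywords(text: str, k: int = 3) -> list[str]:
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

sample = ("Copyright management information travels with a copyrighted work. "
          "Stripping copyright information from a work before training "
          "violates the work's copyright protections.")
print(extract_keywords(sample))
```

Real extractors layer on stemming, phrase detection and scoring, but the core idea — frequent non-stopwords are probably what the article is about — is this simple.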

And yes, the intent of these algorithms may not have been to facilitate illegal action, but they’ve too often been applied to perpetuate algorithmic harms and digital-rights infringements. AI (research) companies must then deal with lawsuits, bad PR, reputation and customer-loyalty issues, sanctions and monetary penalties.

Ok, 🤯…why don’t data thieves stop stealing then?

It comes down to the extreme cost of training these AI models. The cost of training frontier AI models has grown 2–3x per year for the past eight years, suggesting that the largest models will cost over $1B by 2027. Check out this Statista chart for a snapshot of recent AI training costs for DALL-E, ChatGPT and Gemini. Litigation costs far less than AI model training. These companies also have too much fiscal investment in monetizing generative AI outputs. For example, Google reportedly spent nearly $200M training Gemini and generated $730M in revenue from it in 2024.
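The back-of-the-napkin math is easy to check. Assuming a ~$200M frontier training run in 2024 (roughly the Gemini figure above) and even the low end of the reported 2–3x annual growth — both illustrative assumptions, not sourced projections — a minimal sketch:

```python
# Project frontier-model training cost forward from an assumed 2024 baseline.
# Both numbers below are illustrative assumptions, not sourced figures.
base_cost_2024 = 200e6   # dollars; roughly the reported Gemini training spend
growth_per_year = 2.0    # low end of the 2-3x annual growth estimate

cost = base_cost_2024
for year in range(2025, 2028):
    cost *= growth_per_year
    print(f"{year}: ${cost / 1e9:.1f}B")
```

Even at 2x growth the projection crosses $1B in 2027 ($0.4B → $0.8B → $1.6B); at 3x it gets there a year earlier.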

Morality and ethics are placed on the curb when the math results in these numbers. But don’t forget, data creators won this round and are likely to win more. 😌 For those interested in knowing more about copyright lawsuits with AI (research) companies, a running list is available.



HAPPENINGS & APPEARANCES

  • [ 📗 VIRTUAL BOOK FAIR 📗] Women in AI Ethics™ (WAIE) is delighted to announce our first virtual AI Ethics book festival this Friday, March 28, 2025. This festival reflects our core mission and key role as a catalyst in the global movement toward inclusive and ethical AI. While AI presents many benefits, there is an urgent need to elevate voices that ensure the harms to society are minimized and the benefits of AI are distributed equitably. Learn more HERE.
  • [🤸🏾‍♀️ LET’S GET DATA FIT 🤸🏾‍♀️] Actionable Intelligence for Social Policy (AISP)'s A Toolkit for Centering Racial Equity Throughout Data Integration Version 2.0 and Companion Workbook were released on February 28, 2025. This collaborative project with 100+ contributors, including DataedX Group, shares more than 50 new examples of Work in Action from across the data lifecycle along with strategies for collecting and disaggregating Race, Ethnicity, Language, and Disability (RELD) and Sexual Orientation and Gender Identity (SOGIE) data. The Toolkit was originally released in 2020 and has been used by hundreds of organizations and agencies seeking to acknowledge and mitigate the harms and bias baked into data, data infrastructure, and government data practices. Learn more HERE.
  • [✨DATA COURSE ✨] Available on LinkedIn Learning, you can get a snackable overview or refresher on data modeling by taking Practical Database Design: Implementing Responsible Data Solutions with SQL Querying. Get started HERE.

LAUGHING IS GOOD FOR THE SOUL

Stay Rebel Techie,

The DataedX Team

Thanks for subscribing! If you like what you read or use it as a resource, please share the newsletter signup with three friends!

DataedX Group

Removing the digital debris swirling on the interwebs. A space you can trust to bring the data readiness, AI literacy and AI adoption realities you need to be an informed and confident leader. We discuss AI in education, responsible AI and data guidance, data/AI governance and more. Commentary is often provided by our CEO, Dr. Brandeis Marshall. Subscribe to Rebel Tech Newsletter!
