Tuesday, March 25th

IN DATA NEWS

OpenAI must face part of Intercept lawsuit over AI training

OpenAI lost a bid to dismiss a lawsuit alleging it misused news articles published by The Intercept to train ChatGPT. This is a win for media outlets, although the same New York judge dismissed The Intercept's claim that OpenAI unlawfully distributed its articles after removing their copyright information.

😌 Data creators add one to the win column for the regular people.

🥷🏽 Data thieves must come face-to-face with their crimes. Well, partially. The Intercept (i.e., the data creators) successfully showed the data trail of how its articles were de-copyrighted by OpenAI (i.e., the data thieves), which used algorithms like Dragnet and Newspaper to strip copyright management information from around the main text of its articles. The Intercept argued that OpenAI, as a result, violated the Digital Millennium Copyright Act by training on The Intercept's works. This part of the lawsuit is moving forward because the data creators' lawyers told the compelling data story well.

So, what's copyright management information?

According to the Copyright Alliance, "[c]opyright management information, or CMI, is information about a copyrighted work, its creator, its owner, or use of the work that is conveyed in connection with a copyrighted work. For example, CMI would include the copyrighted work's title, ISBN number or copyright registration number; the copyright owner's name; the creator's name; and terms and conditions for use of the work."

Let's say it more plainly: all content has context, and that context can't be removed without explicit permission from the content authors/owners.

And, wait…are there algorithms that can remove this contextual information?

Yep. Dragnet, a machine learning-based content extraction method, is an open-source Python software library, available here.
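To make the mechanics concrete, here's a minimal sketch of what "content extraction" does. This is not Dragnet's or Newspaper's actual algorithm (Dragnet uses a trained machine learning model); it's a simplified, standard-library-only illustration of the end result: keep the article body, discard the surrounding page furniture, which is exactly where CMI such as copyright notices tends to live.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Toy content extractor: keeps text outside 'boilerplate' regions.

    Real extractors like Dragnet use learned features, not a fixed tag
    list; this hardcoded set is purely for illustration.
    """
    BOILERPLATE_TAGS = {"header", "footer", "nav", "aside"}

    def __init__(self):
        super().__init__()
        self.boilerplate_depth = 0  # >0 while inside header/footer/etc.
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BOILERPLATE_TAGS:
            self.boilerplate_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BOILERPLATE_TAGS and self.boilerplate_depth:
            self.boilerplate_depth -= 1

    def handle_data(self, data):
        # Only keep text that sits outside boilerplate regions.
        if self.boilerplate_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """
<html><body>
  <header>The Intercept</header>
  <article><p>Main story text goes here.</p></article>
  <footer>© 2024 The Intercept. All rights reserved.</footer>
</body></html>
"""

parser = MainTextExtractor()
parser.feed(html)
extracted = " ".join(parser.chunks)
print(extracted)  # the copyright notice in <footer> is gone
```

Notice that the extracted text carries no trace of the copyright line. That stripping is the crux of the DMCA claim: the CMI never makes it into the training data.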
Published in 2013 at the World Wide Web Conference, Dragnet was at the forefront of innovation in extracting the main body of an online article, along with user comments, from multi-panel webpage layouts mixing text, images and videos.

Newspaper3k is a more modern variation of Dragnet. It's also an open-source Python software library, available here. Its robust features include:

- a multi-threaded article download framework
- news URL identification
- text extraction from HTML
- top image extraction from HTML
- all image extraction from HTML
- keyword extraction from text
- summary extraction from text
- author extraction from text
- Google trending terms extraction
- support for 10+ languages (English, Chinese, German, Arabic, ...)

And yes, the intent of these algorithms may not have been to facilitate illegal action, but they've too often been applied to perpetuate algorithmic harms and digital rights infringements. AI (research) companies must then deal with lawsuits, bad PR, reputation and customer loyalty issues, sanctions and monetary penalties.

Ok, 🤯…why don't data thieves stop stealing then?

It's the extreme cost of training these AI models. The cost of training frontier AI models has grown 2-3x per year for the past eight years, suggesting that the largest models will cost over $1B by 2027. Check out this Statista chart for a snapshot of AI training costs for DALL-E, ChatGPT and Gemini in recent years. The cost of litigation is far less than the cost of AI model training. These companies also have too much fiscal investment in monetizing generative AI outputs. For example, Google spent nearly $200M training Gemini and reported generating $730M in revenue in 2024. Morality and ethics are placed on the curb when the math results in these numbers. But don't forget: data creators won this round and are likely to win more.
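The "over $1B by 2027" projection is just compound growth. Here's a back-of-the-envelope sketch; the $100M base figure for a 2023 frontier training run and the 2.5x midpoint growth rate are my illustrative assumptions, not figures from the article.

```python
# Assumptions (illustrative only): a ~$100M frontier training run in
# 2023, growing at the article's 2-3x per year, taking 2.5x as midpoint.
BASE_YEAR = 2023
BASE_COST = 100e6   # dollars
GROWTH = 2.5        # annual multiplier

def projected_cost(year: int) -> float:
    """Compound the assumed annual growth factor from the base year."""
    return BASE_COST * GROWTH ** (year - BASE_YEAR)

for year in range(2023, 2028):
    print(year, f"${projected_cost(year) / 1e9:.2f}B")
```

Under these assumptions the projection crosses $1B in 2026 and sits well above it by 2027, consistent with the trend the article cites. Against numbers like that, a legal budget is a rounding error, which is the whole incentive problem.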
😌 For those interested in knowing more about copyright lawsuits against AI (research) companies, a running list is available.
HAPPENINGS & APPEARANCES
LAUGHING IS GOOD FOR THE SOUL

Stay Rebel Techie,
The DataedX Team

Thanks for subscribing! If you like what you read or use it as a resource, please share the newsletter signup with three friends!