TLDR: The conflict between news publishers and AI firms over content usage and compensation has brought the digital information landscape to a critical juncture. This article, Part 2 of a series by Public Knowledge, explores the strategies publishers are employing (licensing, lawsuits, and legislative advocacy) alongside emerging technical and market solutions. It also proposes a middle ground through mutual signaling mechanisms, data cooperatives, self-identification standards for bots, and extended collective licensing, aiming to balance fair use with sustainable journalism in the age of AI.
The digital landscape is currently defined by a significant “tug of war” between online news publishers and AI developers, centered on the use of copyrighted content for AI model training and output generation. This conflict, as highlighted by Lisa Macpherson of Public Knowledge, threatens to create a more closed internet if not addressed with balanced solutions. Part 2 of this series delves into the strategies publishers are adopting and proposes policy solutions to foster a mutually beneficial ecosystem.
Publishers’ Defensive Strategies:
News publishers are deploying a multi-pronged approach involving legal, legislative, and technical measures to mitigate the impact of generative AI on their business models.
Licensing: A growing trend sees major and smaller news publishers entering direct voluntary licensing agreements with AI developers. These deals, which can grant AI firms access to content for training, output generation, or both, often include financial compensation, brand attribution, preferential placement, or access to AI technology.
Data: Over 100 confirmed deals exist between platforms and publishers, involving every major AI developer and more than 700 news brands. Notable examples include OpenAI’s $250 million, five-year deal with News Corp and Google’s $60 million annual agreement with Reddit. The licensing market is projected to reach $30 billion at its high end within the next decade. That is less than half a percent of the estimated $7 trillion in AI infrastructure costs over the same period.
Emerging Solutions: To scale licensing, Really Simple Licensing (RSL) has been launched. It defines specific licensing terms and involves a collective licensing organization for royalty negotiation and collection. Perplexity AI has also introduced a revenue-sharing program through its Comet browser, allocating 80% of user subscription revenue to participating publishers based on direct visits, content citations, and AI assistant usage.
Lawsuits: Approximately a dozen of the 48+ copyright infringement cases against AI companies in the U.S. originate from news publishers. These suits target firms like OpenAI, Microsoft, Perplexity, Cohere, and Stability AI, often alleging scraping of paywalled content, verbatim regurgitation, substitutional summaries, and brand damage from “damaging hallucinations” or misattributions.
Key Court Rulings: Early court decisions in cases like Bartz v. Anthropic and Kadrey v. Meta have affirmed AI model training as a “transformative fair use” of copyrighted content. Judges have stated that generative AI outputs are not inherently infringing, and publishers have no legal entitlement to a licensing market. However, content obtained through illegal means may still be subject to legal redress.
Penske Media’s Suit: A recent lawsuit by Penske Media against Google alleges illegal use of content for AI Overviews’ output, citing traffic declines and “anticompetitive practices” that threaten independent journalism.
DOJ vs. Google: Publishers were disappointed by a judge’s refusal to mandate an opt-out mechanism for Google’s AI training without affecting search presence, or to prohibit exclusive content agreements in the U.S. Department of Justice’s antitrust suit against Google.
Legislation: Publishers continue to advocate for legislative intervention, reminiscent of past efforts to protect against technological disruption from radio and television. They are still pushing the “Journalism Competition & Preservation Act” (JCPA), which now purportedly encompasses AI training, a scope that critics argue threatens fair use.
Congressional Hearings: Senate Judiciary Subcommittees have hosted hearings on “Oversight of A.I.: The Future of Journalism,” where witnesses from Condé Nast, National Association of Broadcasters, and News/Media Alliance argued that AI training and output are not fair use and called for new laws to clarify this, enabling a licensing market.
Proposed Bills: Several bills are under consideration:
- COPIED Act: Aims to attach content provenance information, but critics argue it would inhibit free expression by prohibiting AI use of copyrighted content without permission.
- TRAIN Act: Seeks transparency in AI training but could lead to numerous “nuisance lawsuits” through an administrative subpoena process.
- AI Accountability and Personal Data Protection Act: Proposes a sweeping opt-in mechanism for copyrighted content, which would make the U.S. the most restrictive jurisdiction for AI development, even more onerous than the EU’s opt-out system.
State-Level Action: At least four U.S. states have drafted legislation on AI and copyright.
Financial and Technical Barriers: Publishers are also taking proactive steps, including:
- Content Sequestration: Using paywalls, updated terms of service, and technical measures like robots.txt to block AI web crawlers. As of January, over 88% of top-ranked U.S. news outlets blocked AI crawlers. However, many AI firms disregard robots.txt or use stealth scrapers, leading to a “cat-and-mouse game” that makes the internet more closed.
- Takedowns: News/Media Alliance secured the takedown of “paywall bypasser” website 12ft.io. Academic publishers have used DMCA subpoenas to obtain user data from “shadow libraries.”
- Infrastructure Solutions: Cloudflare now blocks AI web crawlers by default for its clients, adopting a “permission-based approach.” DataDome offers professional bot management solutions.
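To make the robots.txt measure above concrete, here is a minimal Python sketch of how a well-behaved crawler would interpret a publisher's rules, using the standard library's `urllib.robotparser`. The file contents and the `news.example` URL are hypothetical; GPTBot and CCBot are used only as examples of AI crawler tokens. Note that this check runs on the crawler's side, which is why, as described above, a crawler that ignores robots.txt is not actually stopped by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site: block two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://news.example/politics/story.html"  # hypothetical article URL
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Run as a script, this reports that the two named crawlers are disallowed while an ordinary browser user agent is allowed. Compliance remains voluntary: nothing in the file enforces the answer.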
Tolling and Monetization Mechanisms: Intermediaries are developing products to monetize AI crawler traffic:
- Cloudflare’s “Pay Per Crawl”: A marketplace for publishers to request compensation from AI companies for each crawled page.
- TollBit: Enables publishers to control access, analyze traffic, and prepare for monetization.
- ProRata: Integrates advertising and attribution for revenue sharing based on LLM outputs within its “ethical” search product, Gist.
- Human Native: Connects premium data suppliers with reputable AI developers for secure data licensing.
- Created by Humans: A platform for authors with similar features.
- GoDigital Media Group’s “Ecosystem”: Proposes an ai.txt file, a public provenance database, industry collaboration, APIs to copyright offices, statutory licensing, and a collective management organization.
- IAB Tech Lab’s AI Content Monetization Protocols (CoMP): A technical framework for AI firms to compensate publishers based on content appearance in LLM queries, preferring a per-user-query model.
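The per-page tolling idea behind several of these products can be sketched in a few lines of Python. This is an illustration only and not any vendor's actual protocol: the flat price, the `CrawlRequest` type, and the `X-Crawl-*` header names are all assumptions made for the example. The only standard element is the HTTP 402 Payment Required status code.

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import Dict, Optional, Tuple

PRICE_PER_PAGE_USD = 0.01  # hypothetical flat price set by the publisher


@dataclass
class CrawlRequest:
    user_agent: str
    max_price_usd: Optional[float]  # highest price the crawler offers, if any


def decide(
    request: CrawlRequest, price: float = PRICE_PER_PAGE_USD
) -> Tuple[HTTPStatus, Dict[str, str]]:
    """Return the status and headers a toll gate might send for one page request."""
    quoted = {"X-Crawl-Price-USD": f"{price:.2f}"}  # hypothetical header name
    if request.max_price_usd is None or request.max_price_usd < price:
        # No offer, or an offer below the asking price: quote the price and refuse.
        return HTTPStatus.PAYMENT_REQUIRED, quoted
    # Offer covers the price: serve the page and record the charge.
    return HTTPStatus.OK, {**quoted, "X-Crawl-Charged-USD": f"{price:.2f}"}


if __name__ == "__main__":
    print(decide(CrawlRequest("ExampleBot/1.0", None)))
    print(decide(CrawlRequest("ExampleBot/1.0", 0.05)))
```

A crawler that sends no offer gets a 402 response with the quoted price, and one that offers at least the asking price gets the page. A per-user-query model such as the one CoMP prefers would meter at answer time instead of at crawl time.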
Early Impact of Generative AI on News Publishers:
Publishers are already reporting significant damage to their cost structures and revenues.
Declines in Traffic: While some AI tools initially increased search referrals, this is not offsetting a higher rate of “zero-click searches” from Google’s AI-powered overviews (now AI Mode). Google users who encounter an AI summary are 50% less likely to click on links and rarely (1% of the time) click on links within the summary. They are also more likely to end their browsing session. Digital Content Next reported that median year-over-year referral traffic from Google Search was down 10% across its 19 digital publishers, with news brands down 7%.
Overwhelmed by Crawlers and Bots: AI training data crawlers and bots are increasing publishers’ infrastructure costs. TollBit reported an 87% growth in total AI user agent traffic from Q4 2024 to Q1 2025, with retrieval augmented generation bots exceeding training bots. Referral traffic from AI bots remains minuscule (0.04% in Q1 2025), insufficient to offset declines from traditional search.
Lack of Control: Publishers struggle to control content access because AI firms do not separate AI user agents from search ranking crawlers (e.g., Google AI Overviews, Microsoft Copilot, Apple’s AI tools), which makes blocking risky: refusing the AI crawler can also mean disappearing from search. Many AI firms also ignore robots.txt or use third-party or stealth scrapers. This has led some publishers to block the Internet Archive, resulting in a loss of digital history.
Options for a Middle Ground:
Public Knowledge proposes several promising directions for policy and technical solutions:
- Mutual and Voluntary Signaling Mechanisms: Strengthening preferences for content use. Examples include the Internet Engineering Task Force’s (IETF) working group for standardizing content processing preferences, Creative Commons’ “CC signals project” for signaling reuse preferences with terms, and Spawning AI’s “Do Not Train Tool Suite” and “Have I Been Trained” registry. Policy could support these to give publishers more agency while preserving the open web.
- Data Cooperatives and Collectives: Models like Project Liberty’s vision for data cooperatives and RadicalXChange’s “data dignity” advocate for collective ownership and bargaining power over data. These could benefit small publishers lacking resources to negotiate with AI firms and offer AI firms competitive differentiation through “Fairly Trained” certification.
- Self-Identification Standards for Bots and Crawlers: Statutorily enforcing unique identifiers for bots would act as a “friction-creating gatekeeper and census-taker” for publishers, providing transparency without blocking access, which is more compatible with an Open Internet. This would negate the need for court subpoenas to identify content access.
- Extended Collective Licensing (ECL): The U.S. Copyright Office has discussed ECL, where a collective management organization (CMO) licenses works of members and non-members. While the Copyright Office’s report had flaws, ECL could extend licensing benefits to smaller news organizations. It could also offer “safe harbor assurances” for AI firms, insulating them from litigation if they work through the CMO.
- Statutory Safeguards for Public Interest Uses: Explicit legal protections are needed for academic researchers, non-profit AI auditors, open-source developers, and cultural heritage institutions. Without these, rising costs for copyrighted information could restrict access to only well-resourced private firms, leading to a “disastrous loss for research and the common good.” The EU’s text and data mining (TDM) exceptions offer a model, but the U.S. needs broadly inclusive and legally certain protections.
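The “census-taker” role described in the self-identification item above could look roughly like the following Python sketch. It assumes a world in which bots declare a unique token in their user agent string and a shared registry maps each token to a stated purpose. The registry, the bot names, and the purpose labels are hypothetical; no such mandatory registry exists today.

```python
from collections import Counter
from typing import Iterable

# Hypothetical registry mapping a declared bot token to its stated purpose.
REGISTRY = {
    "ExampleTrainBot": "training",
    "ExampleRAGBot": "retrieval",
    "ExampleSearchBot": "search-indexing",
}


def census(user_agents: Iterable[str]) -> Counter:
    """Count requests by declared purpose; unregistered agents are 'unidentified'."""
    counts: Counter = Counter()
    for ua in user_agents:
        purpose = next(
            (p for token, p in REGISTRY.items() if token.lower() in ua.lower()),
            "unidentified",
        )
        counts[purpose] += 1
    return counts


if __name__ == "__main__":
    log = [
        "ExampleTrainBot/1.0",
        "Mozilla/5.0 (compatible; ExampleRAGBot/2.1)",
        "ExampleRAGBot/2.1",
        "curl/8.0",
    ]
    print(census(log))
```

The point of the design is that the publisher only counts and labels traffic, and nothing is blocked. That matches the transparency-without-blocking goal, though it depends entirely on bots identifying themselves honestly, which is why the proposal calls for statutory enforcement.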
Conclusion:
Finding a middle ground is crucial for the future of both AI innovation and sustainable journalism. The proposed solutions aim to balance fair use principles with the economic realities of news publishing, ensuring an open internet and informed citizenry. Further assessment, technical examination, and policy analysis are required to develop a comprehensive approach.


