The $1.5 Billion Library: How AI Devoured Humanity's Knowledge
Prologue: Must We Destroy Books to Make AI Smarter?
Everyone marvels at AI's remarkable writing abilities.
But I have a different thought.
Millions of books had their spines sliced off by industrial cutting machines to pay for that ability.
In early 2024, an AI startup called Anthropic launched a classified project. According to internal documents: "Project Panama is our effort to destructively scan all the books in the world. We don't want it to be known that we are working on this."[^1]
In summer 2025, this secret exploded in court. The result? A $1.5 billion settlement, the largest copyright settlement in American history.[^2]
I – Project Panama: A Quiet Mass Destruction
Anthropic spent tens of millions of dollars acquiring millions of books, buying them in bulk, tens of thousands at a time, from used bookstores like Better World Books and World of Books.
What happened to those books?
"A hydraulic powered cutting machine 'neatly cut' the books, pages were scanned on high speed, high quality production level scanners, and finally a recycling company picked up the completed books."[^1]
Between 500,000 and 2 million books. In six months. That was Project Panama's scale.
Why go this far?
An Anthropic co-founder theorized in a 2023 document that training AI on books could teach models "how to write well" instead of mimicking "low quality internet speak."
Books are different from internet garbage: edited, vetted, refined knowledge. For AI companies, books were goldmines.
II – Shadow Libraries: The Temptation of Piracy
Before Project Panama, there was a darker history.
Anthropic co-founder Ben Mann downloaded fiction and nonfiction from a "shadow library" called LibGen over 11 days in June 2021.[^1] LibGen is an illegal database of pirated books.
A year later, when a new pirate site emerged, he sent colleagues a link with the message:
"Just in time!!!"
Meta was no different. Internal chat logs contained this exchange:
"Torrenting from a corporate laptop doesn't feel right..."
But they proceeded anyway. A December 2023 email revealed that LibGen usage had been "approved after escalation to MZ." MZ stands for Mark Zuckerberg.[^3]
```mermaid
graph TD
    subgraph IllegalPath ["🔴 Illegal Path (2021-2023)"]
        A[LibGen/Pirate Library] --> B[Free Download]
        B --> C[AI Model Training]
    end
    subgraph LegalPath ["🟢 Legal Path (Project Panama 2024)"]
        D[Bulk Purchase from Used Bookstores] --> E[Destructive Scanning]
        E --> F[Recycling Disposal]
        F --> G[AI Model Training]
    end
    C --> H[Copyright Lawsuit]
    G --> I[Fair Use Recognized]
    style A fill:#ffcccc
    style D fill:#ccffcc
```
III – The $70 Billion Crisis and $1.5 Billion Settlement
The numbers tell the whole story.
| Item | Figure |
|---|---|
| 📚 Books Downloaded | ~7 million |
| 📖 Books in Settlement | 482,460 |
| 💰 Potential Statutory Damages | $70B+ ($150,000/work) |
| 💵 Final Settlement | $1.5 billion |
| 👤 Average Author Compensation | ~$3,000/book |
| ⚖️ Copyright Lawsuits in 2025 | 70+ |
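The arithmetic behind the table is easy to verify. A quick sanity check (a sketch assuming the "$70B+" figure applies the $150,000 willful-infringement maximum to the 482,460 settlement-class works, which is how the per-book average of roughly $3,000 also falls out):

```python
# Back-of-the-envelope check of the settlement figures cited above.
WORKS_IN_SETTLEMENT = 482_460
MAX_STATUTORY_PER_WORK = 150_000      # statutory maximum for willful infringement
SETTLEMENT_TOTAL = 1_500_000_000      # the $1.5 billion settlement

potential_damages = WORKS_IN_SETTLEMENT * MAX_STATUTORY_PER_WORK
per_book_payout = SETTLEMENT_TOTAL / WORKS_IN_SETTLEMENT

print(f"Potential statutory exposure: ${potential_damages / 1e9:.1f}B")   # $72.4B
print(f"Average payout per book:      ${per_book_payout:,.0f}")           # $3,109
```

The ~$72 billion exposure is what makes the $1.5 billion figure look, from Anthropic's side, like a roughly 98% discount on the worst-case outcome.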
In June 2025, Judge William Alsup delivered a fascinating ruling:[^4]
Scanning itself is legal. AI training processes copyrighted works in a "transformative" manner, qualifying as fair use. He analogized it to "teachers training schoolchildren to write well."
But illegal downloading is a separate matter. Downloading books from shadow libraries before Project Panama could constitute copyright infringement.
"AI training is 'quintessentially transformative': Anthropic's AI models were trained on works not to 'replicate or supplant them — but to turn a hard corner and create something different.'"[^4]
IV – Silicon Valley's Fallacy: "We Can, Therefore We Should"
Cornell Tech law professor James Grimmelmann's analysis cuts to the core:
"AI companies talked themselves into a fallacy."[^5]
The breakthroughs behind ChatGPT began in academic research. In academia, using copyrighted material for training is broadly accepted. But researchers continued the practice even after AI models were commercialized.
"By the time the tension became clear, they had made huge investments in incorporating copyrighted data into their pipelines and were locked in a fast-paced, high-stakes competition to release newer and better models."
Meta's internal email reveals this dilemma starkly:
"If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues."[^3]
They knew the risks and proceeded anyway. Why? Because falling behind in the competition was more terrifying.
V – The Creator's Debt: Who Should Receive What?
Ed Newton-Rex, former AI executive and music composer, now runs a nonprofit advocating for creators' rights. His message is clear:
"We urgently need a reset across the AI industry, such that creatives start being paid fairly for the vital contributions they make."[^1]
$3,000. The compensation for a single book.
How many years did the author spend writing that book? What about the editors, proofreaders, designers? And that knowledge was used to teach AI "how to write well."
The Compensation Imbalance
| Party | Received | Contributed |
|---|---|---|
| 🖊️ Authors | ~$3,000/book | Years of research, writing, editing |
| 🏢 Anthropic | $183B valuation | AI model development |
| 🤖 AI Models | Millions of books' knowledge | Computing power |
💭 Questions to Consider After Reading
- How might AI companies' massive book acquisitions and shadow library utilization (like LibGen) influence legal frameworks surrounding copyright in digital content distribution?
- How can destructive book scanning and shadow library practices impact copyright settlements in AI training, particularly focusing on innovative legal frameworks to balance creators' rights with technological advancement demands?
- How do copyright lawsuits related to AI training data acquisition impact the doctrine of fair use, particularly in cases involving massive unauthorized book scanning by companies like Meta and OpenAI?
Share your thoughts in the comments.
Conclusion: The Price of Knowledge
AI writes well for a simple reason. It devoured the knowledge humanity accumulated over millennia.
The problem isn't the process. It's who pays the price.
Anthropic was credited with making a legally "smart call": switching from illegal downloads to legitimate purchases and scanning. But as Professor Grimmelmann noted, this came after they had already "made huge investments in incorporating copyrighted data into their pipelines."
2026 will bring more rulings. More lawsuits. And more creators discovering their work was consumed by AI.
The question is this:
"If AI has the right to train on humanity's collective knowledge, what should those who created that knowledge receive?"
Sources
If this post was helpful, please share it with one friend.