The $1.5 Billion Library: How AI Devoured Humanity's Knowledge
Prologue: Must We Destroy Books to Make AI Smarter?
Everyone marvels at AI's remarkable writing abilities.
But I have a different thought.
Millions of books had their spines sliced off by industrial cutting machines to pay for that ability.
In early 2024, an AI startup called Anthropic launched a classified project. According to internal documents: "Project Panama is our effort to destructively scan all the books in the world. We don't want it to be known that we are working on this."[^1]
In summer 2025, this secret exploded in court. The result? A $1.5 billion settlement, the largest copyright settlement in American history.[^2]
I – Project Panama: A Quiet Mass Destruction
Anthropic spent tens of millions of dollars acquiring millions of books, buying them in bulk, tens of thousands at a time, from used bookstores like Better World Books and World of Books.
What happened to those books?
"A hydraulic powered cutting machine 'neatly cut' the books, pages were scanned on high speed, high quality production level scanners, and finally a recycling company picked up the completed books."[^1]
Between 500,000 and 2 million books. In six months. That was Project Panama's scale.
Why go this far?
An Anthropic co-founder theorized in a 2023 document that training AI on books could teach models "how to write well" instead of mimicking "low quality internet speak."
Books are different from internet garbage: edited, vetted, refined knowledge. For AI companies, books were goldmines.
II – Shadow Libraries: The Temptation of Piracy
Before Project Panama, there was a darker history.
Anthropic co-founder Ben Mann downloaded fiction and nonfiction from a "shadow library" called LibGen over 11 days in June 2021.[^1] LibGen is an illegal database of pirated books.
A year later, when a new pirate site emerged, he sent colleagues a link with the message:
"Just in time!!!"
Meta was no different. Internal chat logs contained this exchange:
"Torrenting from a corporate laptop doesn't feel right..."
But they proceeded anyway. A December 2023 email revealed that LibGen usage had been "approved after escalation to MZ." MZ stands for Mark Zuckerberg.[^3]
```mermaid
graph TD
    subgraph IllegalPath ["🔴 Illegal Path (2021-2023)"]
        A[LibGen/Pirate Library] --> B[Free Download]
        B --> C[AI Model Training]
    end
    subgraph LegalPath ["🟢 Legal Path (Project Panama 2024)"]
        D[Bulk Purchase from Used Bookstores] --> E[Destructive Scanning]
        E --> F[Recycling Disposal]
        F --> G[AI Model Training]
    end
    C --> H[Copyright Lawsuit]
    G --> I[Fair Use Recognized]
    style A fill:#ffcccc
    style D fill:#ccffcc
```
III – The $70 Billion Crisis and $1.5 Billion Settlement
The numbers tell the whole story.
| Item | Figure |
|---|---|
| 📚 Books Downloaded | ~7 million |
| 📖 Books in Settlement | 482,460 |
| 💰 Potential Statutory Damages | $70B+ ($150,000/work) |
| 💵 Final Settlement | $1.5 billion |
| 👤 Average Author Compensation | ~$3,000/book |
| ⚖️ Copyright Lawsuits in 2025 | 70+ |
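The arithmetic behind the table is easy to verify. A quick sanity check (a sketch assuming the "$70B+" figure applies the $150,000 willful-infringement maximum to the 482,460 settlement-class works, which is how the per-book average of roughly $3,000 also falls out):

```python
# Back-of-the-envelope check of the settlement figures cited above.
WORKS_IN_SETTLEMENT = 482_460
MAX_STATUTORY_PER_WORK = 150_000      # statutory maximum for willful infringement
SETTLEMENT_TOTAL = 1_500_000_000      # the $1.5 billion settlement

potential_damages = WORKS_IN_SETTLEMENT * MAX_STATUTORY_PER_WORK
per_book_payout = SETTLEMENT_TOTAL / WORKS_IN_SETTLEMENT

print(f"Potential statutory exposure: ${potential_damages / 1e9:.1f}B")   # $72.4B
print(f"Average payout per book:      ${per_book_payout:,.0f}")           # $3,109
```

The ~$72 billion exposure is what makes the $1.5 billion figure look, from Anthropic's side, like a roughly 98% discount on the worst-case outcome.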
In June 2025, Judge William Alsup delivered a fascinating ruling:[^4]
Scanning itself is legal. AI training processes copyrighted works in a "transformative" manner, qualifying as fair use. He analogized it to "teachers training schoolchildren to write well."
But illegal downloading is a separate matter. Downloading books from shadow libraries before Project Panama could constitute copyright infringement.
"AI training is 'quintessentially transformative': Anthropic's AI models were trained on works not to 'replicate or supplant them — but to turn a hard corner and create something different.'"[^4]
IV – Silicon Valley's Fallacy: "We Can, Therefore We Should"
Cornell Tech law professor James Grimmelmann's analysis cuts to the core:
"AI companies talked themselves into a fallacy."[^5]
The breakthroughs behind ChatGPT began in academic research. In academia, using copyrighted material for training is broadly accepted. But researchers continued the practice even after AI models were commercialized.
"By the time the tension became clear, they had made huge investments in incorporating copyrighted data into their pipelines and were locked in a fast-paced, high-stakes competition to release newer and better models."
Meta's internal email reveals this dilemma starkly:
"If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues."[^3]
They knew the risks and proceeded anyway. Why? Because falling behind in the competition was more terrifying.
V – The Creator's Debt: Who Should Receive What?
Ed Newton-Rex, former AI executive and music composer, now runs a nonprofit advocating for creators' rights. His message is clear:
"We urgently need a reset across the AI industry, such that creatives start being paid fairly for the vital contributions they make."[^1]
$3,000. The compensation for a single book.
How many years did the author spend writing that book? What about the editors, proofreaders, designers? And that knowledge was used to teach AI "how to write well."
The Compensation Imbalance
| Party | Received | Contributed |
|---|---|---|
| 🖊️ Authors | ~$3,000/book | Years of research, writing, editing |
| 🏢 Anthropic | $183B valuation | AI model development |
| 🤖 AI Models | Millions of books' knowledge | Computing power |
💭 Questions to Consider After Reading
- How might AI companies' massive book acquisitions and shadow library utilization (like LibGen) influence legal frameworks surrounding copyright in digital content distribution?
- How can destructive book scanning and shadow library practices impact copyright settlements in AI training, particularly focusing on innovative legal frameworks to balance creators' rights with technological advancement demands?
- How do copyright lawsuits related to AI training data acquisition impact the doctrine of fair use, particularly in cases involving massive unauthorized book scanning by companies like Meta and OpenAI?
Share your thoughts in the comments.
Conclusion: The Price of Knowledge
AI writes well for a simple reason. It devoured the knowledge humanity accumulated over millennia.
The problem isn't the process. It's who pays the price.
Anthropic was credited with making a legally "smart call": switching from illegal downloads to legitimate purchases and scanning. But as Professor Grimmelmann noted, this came after they had already "made huge investments in incorporating copyrighted data into their pipelines."
2026 will bring more rulings. More lawsuits. And more creators discovering their work was consumed by AI.
The question is this:
"If AI has the right to train on humanity's collective knowledge, what should those who created that knowledge receive?"
Sources
If this post was helpful, please share it with one friend.