Alright, let's get one thing straight: this whole "AI revolution" smells like warmed-over garbage. And the Common Crawl Foundation? Turns out they're the garbage truck drivers, quietly shoveling our intellectual property into the maws of these tech behemoths.
The Great Data Heist
So, this "nonprofit" Common Crawl has been scraping the web for years, building a massive archive under the guise of "research." Except, surprise, surprise, it’s been a free buffet for OpenAI, Google, Nvidia, and the rest of the AI gang. They're hoovering up paywalled articles – the stuff journalists actually get paid to write – to train their language models.
And Common Crawl is straight-up lying about it. They claim they don't go behind paywalls, but The Atlantic's article makes it pretty clear they do. It’s like saying you're not robbing a bank because you’re just "borrowing" the money indefinitely. Rich Skrenta, Common Crawl's executive director, even has the audacity to say "the robots are people too" and should get to "read the books" for free. Give me a break. They ain't people; they're algorithms designed to make already obscenely wealthy corporations even wealthier.
Multiple news publishers have asked them to remove their content. Common Crawl says they comply. But they don't. Of course.
It's all about the data, people. As Stefan Baack, a researcher, put it, "Generative AI in its current form would probably not be possible without Common Crawl." Think about that. The entire AI boom is built on a foundation of stolen content. And according to the recent Atlantic article "The Company Quietly Funneling Paywalled Articles to AI Developers," Common Crawl's practices are under increased scrutiny.
The "Open Web" Scam
Skrenta claims publishers are "making a mistake" by excluding themselves from "Search 2.0." Translation: "We're going to steal your content anyway, so you might as well let us." He even had the nerve to say, "You shouldn’t have put your content on the internet if you didn’t want it to be on the internet." What kind of logic is that? It’s like saying you shouldn’t have bought a car if you didn’t want it to be stolen.

And the fact that they're taking donations from OpenAI and Anthropic raises some questions, don'tcha think?
Common Crawl is pulling the same old techno-libertarian crap about "information wants to be free." Stewart Brand's quote is being twisted again. It’s not about freedom; it’s about corporations not wanting to pay for anything.
I mean, The Atlantic isn't a crucial part of the internet? Seriously? This Skrenta guy sounds like he’s never cracked open a real newspaper in his life. Maybe that's the problem.
They even have the gall to say they want to put the archive on a "crystal cube" on the moon. What a load of self-serving bull.
The Illusion of Removal
Here's the kicker: Common Crawl can't even remove the content they've scraped. Skrenta admits the file format is "immutable." So, all those requests from publishers? Basically ignored. And the search function on their website? A complete lie. It shows "no captures" for domains that are clearly present in the archive. It's a magic trick, designed to make publishers think their content is safe.
But wait... are we really supposed to believe that it's impossible to remove data from an archive? I mean, come on. If they really wanted to, they could figure it out. They just don't want to.
