After Being Flooded by AI Crawlers, Wikipedia Waves the White Flag


The Knowledge Giant’s Unexpected Surrender

Wikipedia needs no introduction; pretty much everyone is familiar with it.

You’ll often see it cited as a source in our articles. Whenever Super writes something with historical or educational content, he starts with Wikipedia’s explanation and then digs deeper through the references at the bottom of the page.

**Wikipedia is one of the most convenient and authoritative ways for ordinary people to understand a concept.**

Wikipedia’s operating organization is a non-profit called **Wikimedia Foundation**. In addition to Wikipedia, the organization also runs Wikimedia Commons, Wiktionary, Wikibooks, and other projects.

These projects are all free for everyone to use because Wikimedia’s core value is to make knowledge freely accessible and shareable.

**But recently, Wikimedia has truly been frustrated by AI companies.**

**These companies deployed countless AI crawlers to continuously scrape data from Wikimedia platforms to train their large language models.**

But you might not believe it: **Wikimedia didn’t sue these AI companies, but instead chose to—**

**Voluntarily surrender.**

**“Gentlemen, I’ve organized all the materials for you, please stop crawling our site.”**

Strategic Retreat: Providing Structured Data

Recently, Wikimedia hosted English and French Wikipedia content on Kaggle, the data-science community platform, telling the AI companies they could help themselves to it.

**Providing resources wasn’t enough; Wikipedia also wanted to serve these companies well by specifically optimizing the materials for AI models.**

Machines aren’t like humans: a page that looks clear and intuitive to us takes extra processing before a machine can work out what each section means.

So Wikipedia reformatted the pages into structured JSON content, with titles, summaries, and explanations all arranged in a standardized format.

This makes it easier for AI to understand the content and data of each section, thus reducing costs for AI companies.
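
To get a feel for what “structured” means here, below is a minimal sketch of reading one such record in Python. The field names (name, abstract, sections) are illustrative guesses, not the dataset’s actual schema.

```python
import json

# One hypothetical structured-article record. Real records in the Kaggle
# dataset may use different field names; this is only for illustration.
record_line = """{
  "name": "Pitcher plant",
  "abstract": "Pitcher plants are carnivorous plants whose leaves form traps...",
  "sections": [
    {"title": "Description", "text": "The trap is formed by a modified leaf..."},
    {"title": "Distribution", "text": "Pitcher plants occur in..."}
  ]
}"""

record = json.loads(record_line)

# With the structure already spelled out, a training pipeline can grab the
# pieces it wants without parsing wiki markup or HTML at all.
print(record["name"])
print(record["abstract"])
for section in record["sections"]:
    print(section["title"], "->", section["text"])
```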

**This move is essentially Wikipedia setting out a platter of meat elsewhere to protect its home base from being overrun by the wolf pack.**

Super thinks this move by Wikipedia smacks of resignation.

**Back on April 1, they had already published a blog post complaining that, since the beginning of 2024, the bandwidth used to download multimedia content from their platforms had grown by 50%.**

They initially thought people had become more eager to learn, but upon investigation, they found it was all AI companies’ crawlers. These crawlers continuously scraped resources to train large language models.

The Technical Burden of Unchecked Crawling

The impact of crawlers on Wikipedia is quite significant.

Wikimedia has multiple regional data centers around the world (in Europe, Asia, South America, etc.) and one core data center (in Ashburn, Virginia, USA).

The core data center stores all the materials, while regional data centers temporarily cache some popular entries.

What’s the benefit of this approach?

For example, if many people in Asia are looking up the word “Speed,” then “Speed” would be cached in the Asian regional data center.

This way, Asian users looking up “Speed” later would get **local express delivery** from the Asian data center, without needing to go through **international logistics** from the American data center.

This method of routing high-frequency entries through cheaper channels and low-frequency entries through more expensive channels not only improves loading speeds for users in different regions but also reduces server pressure for Wikimedia.
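
Here’s a toy sketch of that caching idea in Python: check the nearby regional cache first and only fall back to the distant core data center on a miss. Everything here (names, functions) is illustrative, not Wikimedia’s actual code.

```python
# Entries cached close to the user, e.g. in the Asian data center.
regional_cache: dict[str, str] = {}

def fetch_from_core(title: str) -> str:
    # Stand-in for the expensive round trip to the core data center in the US.
    return f"<full article for {title}>"

def get_article(title: str) -> str:
    if title in regional_cache:           # "local express delivery"
        return regional_cache[title]
    article = fetch_from_core(title)      # "international logistics"
    regional_cache[title] = article       # popular entries now stay cached nearby
    return article

get_article("Speed")   # first lookup travels all the way to the core
get_article("Speed")   # later lookups are served from the regional cache
```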

But the problem is: AI crawlers don’t care about any of this. **They want to access all entries, and they do it in batches.**

**This leads to constant traffic through expensive channels.**

Recently, Wikimedia discovered that 65% of this expensive traffic, the requests that have to travel all the way back to the core data center in the US, was being eaten up by AI crawlers.

**Keep in mind that Wikipedia is free, but its servers aren’t—they cost $3 million annually to host.**

The Broader Issue of Web Crawling in the AI Era

**However, complaining probably didn’t help much, so a few weeks later, Wikimedia chose to organize its resources and host them on other platforms for AI companies to retrieve.**

Actually, not just Wikipedia, but content platforms, open-source projects, personal podcasts, and media websites have all encountered similar problems.

Last summer, iFixit’s CEO complained on Twitter that Claude’s crawler had hit their site a million times in a single day…

At this point you might ask: isn’t there the robots.txt protocol? Can’t a site just list the AI crawlers there and keep them out?

**Yes. After iFixit added Claude’s crawler to its robots.txt, the crawling did let up (dropping to about once every 30 minutes).**

In the earlier internet era, the robots protocol really did settle the matter once and for all, and companies could end up in court for ignoring it.

**But now, this gentleman’s agreement is merely a paper tiger.**

Today’s large language model companies crawl as much as they can.

After all, if everyone else is crawling and you’re not, your corpus will be less powerful, putting your large language model at a disadvantage from the start.

So what do they do—

**They simply change the crawler’s name (its user-agent string). You said Lu Xun can’t crawl, but you never said Zhou Shuren can’t. (Zhou Shuren was Lu Xun’s real name.)**
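
To see how easy that loophole is to exploit, here’s a minimal sketch using Python’s standard urllib.robotparser. The crawler names and rules below are made up for the sake of the joke above; the point is that robots.txt only ever matches the user-agent string a crawler chooses to report.

```python
from urllib import robotparser

# A hypothetical robots.txt that bans one crawler by name and says nothing
# about anyone else.
rules = """
User-agent: LuXunBot
Disallow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

url = "https://example.com/article/123"
print(parser.can_fetch("LuXunBot", url))       # False: banned by name
print(parser.can_fetch("ZhouShurenBot", url))  # True: same crawler, new name, walks right in
```

And of course, this check only happens at all if the crawler bothers to run it.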

Are there large language models this shameless? There are plenty.

Previously, a Reddit user reported that they had explicitly banned OpenAI’s crawler in their robots.txt, only for OpenAI to change the user-agent and keep crawling.

Similarly, the tech outlet WIRED caught Perplexity completely ignoring the robots protocol.

Countermeasures and Future Directions

Over the years, people have been trying various new methods.

Some have set traps: list a dead link in robots.txt that never appears anywhere a human would look; any visitor that requests it must be a crawler, since ordinary users never read the protocol, let alone follow links inside it (a toy sketch of this idea appears a little further down).

Others have chosen to use Web Application Firewalls (WAF) to identify malicious crawlers based on IP addresses, request patterns, and behavioral analysis.

**Some have decided to implement CAPTCHAs for their websites.**
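
As a toy illustration of the trap-link idea mentioned above: the server advertises a path in robots.txt that no human will ever visit, and flags any client that requests it anyway. The paths and responses here are invented for illustration, not taken from any real product.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# /trap/ is disallowed in robots.txt and never linked where a human would see
# it, so any client requesting it is almost certainly a misbehaving crawler.
ROBOTS_TXT = b"User-agent: *\nDisallow: /trap/\n"
suspected_crawlers = set()

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        elif self.path.startswith("/trap/"):
            # Only a bot that read robots.txt and ignored it should land here.
            suspected_crawlers.add(self.client_address[0])
            self.send_response(403)
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Normal content</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8000), TrapHandler).serve_forever()
```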

But generally, these methods tend to end up in an arms race: the harder you resist, the more aggressive the crawling tactics AI companies adopt.

That’s why Cloudflare, the internet’s self-styled guardian, recently developed a different technique: detect the malicious crawlers, then let them in anyway.

**Of course, letting them in doesn’t mean feeding them well, but rather serving them a “wrong meal”—**

It serves up a chain of pages that have nothing to do with the site’s real content and lets the AI wade through them at its leisure.

Cloudflare’s approach is still relatively restrained.

**In January this year, someone created a more aggressive tool called Nepenthes (pitcher plant).**

Like how pitcher plants kill insects, “Nepenthes” traps AI crawlers in an “infinite maze” of static files with no exit links, preventing them from grabbing real content.

Not only that, but “Nepenthes” continuously feeds crawlers “Markov gibberish” to pollute AI training data. Reportedly, only OpenAI’s crawler can escape this technology currently.
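
For a sense of how cheap that gibberish is to produce, here’s a rough sketch of a word-level Markov generator in Python. It is a simplified illustration of the general idea, not Nepenthes’ actual implementation.

```python
import random

# Learn word-to-word transitions from a small seed text, then generate
# endless plausible-looking nonsense to feed to a trapped crawler.
seed_text = (
    "the pitcher plant lures insects into its trap and the insects "
    "cannot escape the trap because the walls of the pitcher are slippery"
)

words = seed_text.split()
chain: dict[str, list[str]] = {}
for current, nxt in zip(words, words[1:]):
    chain.setdefault(current, []).append(nxt)

def gibberish(length: int = 40) -> str:
    word = random.choice(words)
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = random.choice(followers) if followers else random.choice(words)
        output.append(word)
    return " ".join(output)

# Each "page" of the maze gets a fresh helping of this, plus links that only
# lead to more generated pages, never back to the site's real content.
print(gibberish())
```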

Wow, it turns out the AI offense-defense battle has already begun at the source of large language model training.

**Of course, platforms can also reach agreements with AI companies.**

For instance, Reddit and Twitter have launched paid packages for AI companies: pay a monthly fee based on how many API calls you make and how many posts or tweets you pull.

**Some negotiations fail and end up in court. For example,** The New York Times sued OpenAI for scraping their articles after failed negotiations.

You might wonder: why doesn’t Wikipedia sue these AI crawlers?

Super speculates it might be related to Wikipedia’s nature.

**Wikipedia’s license is extremely open.**

Most of its content is released under a Creative Commons Attribution-ShareAlike license, which allows anyone (including AI companies) to freely use, copy, modify, and distribute it, provided they credit the source and share derivative works under the same license.

So from a legal perspective, AI companies scraping and using Wikipedia data for model training is likely legal.

Even if they took AI companies to court, there’s no clear legal boundary for AI infringement in the industry currently. This high-risk, high-cost, time-consuming option isn’t practical for Wikimedia.

Most importantly, Wikimedia’s mission is—to let every person on Earth freely access all knowledge.

Although server costs from AI crawlers are a problem, limiting others’ access to resources through legal means or commercial agreements might contradict their mission.

From this perspective, Wikimedia’s choice to organize data for AI companies to train with may be the most appropriate, albeit most helpless, approach.

Image sources and references:
– https://x.com/kwiens/status/1816128302542905620
– OpenAI not respecting robots.txt and being sneaky about user agents
– Perplexity Is a Bullshit Machine
– The New York Times Sues OpenAI and Microsoft for Copyright Infringement
– AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
– Trapping misbehaving bots in an AI Labyrinth
– Wikipedia is giving AI developers its data to fend off bot scrapers
– How crawlers impact the operations of the Wikimedia projects
– The journey to open our first data center in South America