Crawl with care: How to gather AI training data sustainably
As AI bots flood the internet for training data, website owners face server overload and rising costs. Learn how ethical, sustainable web crawling can help protect the open web.
The web is worried. In 2025, website owners began noticing something different in their server logs – a flood of traffic from a new generation of crawlers scouring the web for AI training data.
New traffic is usually welcome, and the new wave of chatbots and other AI products is genuinely ground-breaking. But for some site operators, automated requests from AI services are heaping unprecedented demand on web infrastructure.
One engineer reported that over 70% of their traffic now comes from AI-linked user agents, while search bots barely register.
Another site owner received 81,000 requests in just two and a half hours, 97% of which were reckoned to be AI bots.
The Wikimedia Foundation, one of the largest publishers in the world, is struggling under the weight. Between January 2024 and April 2025, the bandwidth it uses to serve images surged by 50% as automated bots ingested its openly licensed content.
No wonder some site owners are beginning to feel “under attack”. Sudden traffic increases can overwhelm servers, leading to downtime, inflated hosting bills, lower usage and lost business.
Ethical standards for conventional web data gathering are already in common use, but the web’s infrastructure has never seen anything on this scale – and that raises questions about how businesses can build the next generation of AI services while remaining good citizens of the web.
Understanding the new breed of crawlers
To answer those questions, we need to understand what scrapers for AI services are really doing.
Scraping of any kind involves three steps: crawling, fetching and extracting.
Crawling comes first. When you don’t have a defined list of target URLs, a crawler systematically explores one or more websites in search of relevant pages, following links and collecting target URLs as it goes.
Fetching is a highly targeted practice. Fetching tools like curl or wget download specific pages – whether pre-defined or discovered by a crawler – when you tell them to.
Extracting comes last. Once pages are obtained, extraction pulls the sought-after information out of them for storage in a downstream system.
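To make those steps concrete, here is a minimal sketch in Python. The seed URL is a placeholder, and the use of the requests and BeautifulSoup libraries is an illustrative assumption rather than a recommendation of any particular stack:

```python
import time
from urllib.parse import urljoin

import requests                    # assumed third-party dependency
from bs4 import BeautifulSoup      # assumed third-party dependency

SEED = "https://example.com/"      # placeholder starting point


def fetch(url: str) -> str:
    """Fetch: download one specific page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(seed: str, limit: int = 10) -> list:
    """Crawl: follow links from the seed page to discover target URLs."""
    soup = BeautifulSoup(fetch(seed), "html.parser")
    links = [urljoin(seed, a["href"]) for a in soup.find_all("a", href=True)]
    return links[:limit]


def extract(html: str) -> dict:
    """Extract: pull the sought-after information out of a fetched page."""
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.title.string if soup.title else None}


if __name__ == "__main__":
    for url in crawl(SEED):
        print(url, extract(fetch(url)))
        time.sleep(1)              # pause between requests
```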
Conventional scraping usually fetches a limited set of pages with the goal of extracting only what’s relevant – be it search, real estate, product, job or news data. Crawlers acting for Large Language Models (LLMs), by contrast, pursue exhaustiveness, fetching and extracting as much as possible.
The guiding assumption behind LLM training – the scaling hypothesis – holds that more data produces better models. And LLMs can make meaning from any kind of content they discover, so their crawlers are keen to collect everything – structured or unstructured, polished or raw.
We have seen reports of crawlers that hit the same low-value pages – log files, diffs, redundant URLs – at high frequency, without any prioritization or scheduling logic. Others run in loops, requesting the same pages over and over throughout the day.
The difference between scraping for AI services and standard web data acquisition, then, is one of scale.
These patterns may not amount to deliberate abuse, but they do point to a need for nuance: polite, efficient crawling requires engineering care and operational expertise.
The impact of overwhelm
Operating without that nuance could hurt the sustainability of the very web content we all know and depend upon.
Some website owners have started to ban entire ranges of IPs linked to AI services.
Some set honeypots, tarpits, and proof-of-work challenges to waste compute time and slow the crawlers down.
Some employ aggressive tactics like content poisoning, where junk data is served to corrupt large-scale training sets.
Micropayments, login walls, and CAPTCHAs are spreading, too – adding layers of friction that make public content harder to access.
Google itself has started shielding its search result pages behind JavaScript – a move some think reflects this new AI-driven pressure.
Even infrastructure providers are adapting. Cloudflare, for example, has started deploying AI traps that force the crawlers to ingest irrelevant AI-generated data – an AI arms race we call the “Turing tango”.
But these defenses don’t just hamper AI bots. They also hinder others: archivists, researchers, competitive analysts, journalists, academics, and real individual users.
The long-term risk is hard to ignore: the open web becomes less open.
All of us, then – AI developers, technology vendors, publishers, and the public – share an interest in keeping the web functional, fair, and open.
How Zyte crawls carefully
With the right technical measures, it is possible to build innovative, large-scale AI systems in a way that treats website owners gently and maximizes data sustainability.
This is a landscape Zyte knows well. Since 2014, Zyte has worked with organizations of all sizes – from startups to global enterprises – to gather web data for use cases including AI training in a careful, considered way.
We have explored and enabled those responsibilities at the technical level, long before the recent AI explosion put them in the spotlight. Here’s how we put this experience into action.
Principles of politeness
Step one in Zyte’s best practices for web scraping is: “Don’t be a burden.” Like the medical profession’s Hippocratic oath to “do no harm”, it urges restraint and care:
“Limit the number of concurrent requests to the same website from a single IP.”
“Respect the delay that crawlers should wait between requests by following the Crawl-delay directive outlined in the robots.txt file.”
“If possible, it is more respectful if you can schedule your crawls to take place at the website’s off-peak hours.”
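As a rough illustration of what those principles can look like in code, here is a minimal sketch using Python’s standard urllib.robotparser. The site and bot name are placeholders, and a production crawler would add error handling and site-specific tuning:

```python
import time
import urllib.robotparser

import requests                    # assumed third-party dependency

SITE = "https://example.com"       # placeholder target site
BOT_NAME = "examplebot"            # placeholder user agent token

robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()

# Respect the site's Crawl-delay directive if it declares one; otherwise
# fall back to a conservative one-second pause between requests.
delay = robots.crawl_delay(BOT_NAME) or 1.0


def polite_get(path: str):
    """Fetch a single path, honoring robots.txt and spacing out requests."""
    url = f"{SITE}{path}"
    if not robots.can_fetch(BOT_NAME, url):
        return None                # robots.txt disallows this path
    response = requests.get(url, headers={"User-Agent": BOT_NAME}, timeout=10)
    time.sleep(delay)              # single-threaded, so one request in flight at a time
    return response

# Off-peak scheduling happens outside the script, e.g. a cron entry that
# starts the crawl overnight in the target site's local time zone.
```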
But having the right principles is only the start. You also need tools with the power to turn those values into action, without having to build every safeguard yourself.
Politeness as a product
Careful crawling is baked into Zyte’s products and policies.
Zyte API, our full-stack web scraping API, limits the rate at which websites can be accessed, safeguarding them against being overwhelmed.
Every website has its own tolerance for traffic. Zyte has developed an intimate understanding of those tolerances over many years and adheres to website-specific rate limits, ensuring each site’s capacity is respected individually.
Each Zyte API account also has its own rate limit for each website, so that no single account can flood any given website.
As a back-stop, all Zyte API users are also limited to 500 requests per minute.
Behind the scenes, Zyte’s team monitors combined usage across all API entry points to ensure these limits are consistently enforced.
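To show the general idea from a caller’s point of view, here is a hypothetical client-side throttle – not Zyte’s implementation – that keeps outgoing requests under a fixed per-minute ceiling such as that 500-requests-per-minute backstop:

```python
import time
from collections import deque


class RequestBudget:
    """Sliding-window throttle that caps the number of requests sent per minute."""

    def __init__(self, max_per_minute: int = 500):
        self.max_per_minute = max_per_minute
        self.sent = deque()        # timestamps of recently sent requests

    def wait_for_slot(self) -> None:
        """Block until sending one more request stays within the budget."""
        while True:
            now = time.monotonic()
            # Discard timestamps that have fallen out of the 60-second window.
            while self.sent and now - self.sent[0] >= 60:
                self.sent.popleft()
            if len(self.sent) < self.max_per_minute:
                self.sent.append(now)
                return
            # Sleep until the oldest request leaves the window, then re-check.
            time.sleep(60 - (now - self.sent[0]))


budget = RequestBudget(max_per_minute=500)
# budget.wait_for_slot()  # call before each outgoing request
```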
While these practices lighten the load on websites, they don’t limit our customers’ data ambitions. Spreading the load across our network actually increases the chance that our more demanding users get the results they need, and users can also request an increase to their rate limit – a request that goes through a Zyte review process to assess the likely impact on target sites.
Together, these are expressions of fairness as a design principle – ensuring that no single player can dominate or exhaust the shared resource of public web data, while demanding users retain the bandwidth to gather the data they need.
Servicing with sensitivity
While our principles and products help our software customers crawl carefully themselves, the experts at Zyte Data, our done-for-you data extraction service for large-scale data buyers, also go to considerable lengths to tread carefully on clients’ behalf.
For example, Zyte has recently been helping one AI company operating at scale gather a large amount of training data. Having followed the project docs and chats as they unfolded, I have been impressed by the diligence team members have shown in assessing the impact of their crawling.
I have seen colleagues estimate target sites’ traffic levels in order to gauge the relative size of their own footprint. I have seen project managers and developers resist the temptation to exceed polite limits, deciding instead to reduce concurrency.
When a job demands greater frequency, the internal Zyte Data team must make its case for a higher rate limit through the same request process as external Zyte API customers – a clearly documented workflow that includes a fine-grained evaluation of the proposal and requires senior-level approval.
Regular cross-team reviews, proactive limit-setting, and a culture of responsible web use underpin these practices, ensuring enterprise customers get the data they need while websites, large and small, stay resilient.
Hands-on controls
Zyte’s responsible approach extends to the wider ecosystem.
Scrapy, the open-source web scraping framework kickstarted by Zyte’s founders – and another route through which many AI developers gather training data – builds care for website capacity into its core.
It offers a wide set of adjustable levers to help you scale data collection with care.
Disclosing crawler identity helps servers recognize and, if they choose, manage a crawler’s access. In Scrapy, this is easy with the USER_AGENT and ROBOTSTXT_* settings, which work hand-in-hand with a site’s robots.txt rules; the scrapy.downloadermiddlewares.robotstxt middleware reinforces this by automatically skipping disallowed URL paths.
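A settings sketch might look like this – the bot name and contact URL are placeholders, not values Scrapy ships with:

```python
# settings.py – disclose who is crawling and obey robots.txt
USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"  # placeholder identity

ROBOTSTXT_OBEY = True                # skip paths the site disallows
ROBOTSTXT_USER_AGENT = "examplebot"  # match robots.txt rules against our own token
```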
Controlling speed and concurrency is where Scrapy shines. With eight different crawl speed-related settings and an adaptive AutoThrottle extension, data gatherers can fine-tune how fast their crawlers move to strike a balance between efficiency and caution.
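Here is one conservative way those settings might be combined – the numbers are illustrative starting points rather than recommendations for any particular site:

```python
# settings.py – throttle how hard the crawler leans on each site
CONCURRENT_REQUESTS = 8                 # global ceiling across all sites
CONCURRENT_REQUESTS_PER_DOMAIN = 2      # per-site ceiling
DOWNLOAD_DELAY = 1.0                    # baseline pause between requests to a site
RANDOMIZE_DOWNLOAD_DELAY = True         # add jitter to avoid rhythmic bursts

AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for roughly one request in flight per site
```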
Cache management is one of the most effective yet often overlooked ways to reduce server strain: instead of repeatedly downloading unchanged data, refresh only when needed. Scrapy’s httpcache middleware stores and reuses HTTP responses, and with its RFC 2616 policy enabled it honors cache directives (no-store, no-cache) and calculates freshness from the Last-Modified, ETag, and Expires headers.
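Enabling that behavior is a matter of a few settings – note that the RFC 2616 policy has to be selected explicitly, since Scrapy’s default cache policy ignores HTTP cache headers:

```python
# settings.py – reuse unchanged responses instead of re-downloading them
HTTPCACHE_ENABLED = True
# RFC 2616 policy honors no-store/no-cache and validates freshness with
# Last-Modified, ETag and Expires, unlike the default DummyPolicy.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_DIR = "httpcache"             # where cached responses are kept on disk
```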
Rather than pushing everything to the limit, Scrapy helps businesses fine-tune their approach to AI training data, balancing speed and respect for the target servers, without building everything from the ground up.
A web worth sustaining
The web was designed to be open. That openness has made it resilient, generative, and surprising. It’s what made it possible for voices to be heard and systems to be built.
Letting that platform degrade – through neglect or overuse – means worse outputs, narrower perspectives, and less useful results. A resilient, open web is in everyone’s interest, especially those building on top of it.
The web deserves good citizens. We all have a part to play.