When AI Bots Break the Rules: Lessons from Perplexity’s Stealth Crawling

Artificial intelligence is reshaping how we access and use information, but with that power comes responsibility. Recent findings by Cloudflare and investigative reporting from CyberScoop have revealed troubling behavior by Perplexity, an AI-powered answer engine, that challenges the ethical foundation of AI data practices.

🕵️‍♂️ The Incident: Crawling Behind Closed Doors

Cloudflare reported that Perplexity’s crawlers were accessing content even when websites explicitly blocked them via robots.txt and firewall rules. To verify, Cloudflare created private “honeytrap” domains: newly registered, not publicly discoverable, and closed to bots via robots.txt. When Perplexity returned answers sourced directly from these restricted sites, the evidence was clear: the crawlers were bypassing protections.

Cloaked Crawls and Evasion Tactics

Rather than respecting access rules, Perplexity reportedly:

  • Impersonated regular browsers like Chrome to avoid detection
  • Rotated IP addresses and hosting networks to slip past filters
  • Ignored robots.txt and other site owner directives

These tactics suggest deliberate avoidance of web standards designed to foster trust between site owners and automated crawlers.
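
For site owners, one way to spot this pattern is a trap path in the spirit of Cloudflare’s test domains: a URL that robots.txt disallows and that no page links to, so any request for it is suspect. The following is a minimal Python sketch, not Cloudflare’s actual method. It assumes access logs have already been parsed into dictionaries; the trap path, field names, and sample records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical trap path: disallowed in robots.txt and never linked from any page.
TRAP_PREFIX = "/trap-do-not-crawl/"


def flag_trap_hits(records, trap_prefix=TRAP_PREFIX):
    """Group source IPs by claimed user agent for requests that hit the trap path."""
    hits = defaultdict(set)
    for rec in records:
        if rec["path"].startswith(trap_prefix):
            hits[rec["user_agent"]].add(rec["ip"])
    return dict(hits)


# Hypothetical, simplified log records (documentation-range IP addresses).
sample_records = [
    {"ip": "203.0.113.5", "user_agent": "Chrome-like browser", "path": "/trap-do-not-crawl/a"},
    {"ip": "198.51.100.7", "user_agent": "Chrome-like browser", "path": "/trap-do-not-crawl/b"},
    {"ip": "192.0.2.10", "user_agent": "Chrome-like browser", "path": "/blog/post-1"},
    {"ip": "192.0.2.44", "user_agent": "DeclaredBot", "path": "/robots.txt"},
]

trap_hits = flag_trap_hits(sample_records)
for agent, ips in trap_hits.items():
    print(f"{agent}: {len(ips)} distinct IPs hit the trap path")
```

A single browser-like user agent reaching the trap path from many unrelated IP addresses would be consistent with the impersonation and rotation described above, though a signal like this is a starting point for investigation rather than proof.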

⚠️ Why This Matters: Trust Is Fragile

The web relies on a shared understanding: crawlers identify themselves, respect boundaries, and play by the rules. When an AI company violates these norms, it doesn’t just break trust with site owners—it undermines the integrity of the entire ecosystem. Cloudflare’s response was decisive, blocking the offending bots and stripping Perplexity of its “verified” crawler status.

✅ A Contrast in Behavior: OpenAI’s Approach

Interestingly, Cloudflare highlighted that OpenAI’s bots adhered to site instructions, backing off when told not to crawl. This difference underscores an important point: compliance is not optional—it’s a baseline expectation.


My Take: Innovation Needs Boundaries

AI tools like Perplexity hold incredible potential to enhance our access to knowledge, but cutting-edge technology is not a license to bypass rules. Web standards exist to protect the rights of content creators, maintain trust, and ensure that innovation benefits everyone—not just the companies pushing boundaries.

Breaking these rules in the name of progress is shortsighted. True innovation respects the ecosystem it operates in. Ethical AI providers must prioritize transparency, consent, and respect for established norms. Anything less risks eroding the trust they depend on to thrive.


Lessons for Website Owners and AI Companies

  1. Website Owners:
    • Monitor crawler activity closely and use tools like Cloudflare’s WAF to enforce boundaries.
    • Consider new “pay-per-crawl” models that allow compensation when AI systems use your data.
  2. AI Companies:
    • Respect robots.txt and other site policies—these are not suggestions.
    • Be transparent about data collection practices to build long-term trust.
    • Remember: being on the cutting edge does not grant carte blanche to break the rules.
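
For AI companies, respecting robots.txt can be as simple as checking it before every fetch. Here is a minimal sketch using Python’s standard-library urllib.robotparser. The robots.txt content, the bot names ExampleAIBot and OtherBot, and the URLs are hypothetical, and a real crawler would download the live file from the site rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely, and blocks
# every other bot from the /private/ directory only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the site's rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for agent, url in [
    ("ExampleAIBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/private/data"),
]:
    print(agent, url, "allowed" if may_fetch(parser, agent, url) else "blocked")
```

The check only means something if the crawler announces its real user agent; a bot that passes a browser’s name to this function is checking the wrong rule.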

Moving Forward

The Perplexity case is a wake-up call. The future of AI must be built not only on technological advances but also on ethical conduct. The companies that will ultimately lead this space will be those that respect the boundaries of others while pushing the limits of what’s possible.


Paul Bergman