phil@bajsicki:~$


Ineptly defending against techbro greed

News from today: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives (Cloudflare)

In response to this, I finally broke down and started thinking about how I can protect my writing from LLM scrapers.

I checked my logs. I got a few hits. However, they’re inconclusive because, of course, a big techbro company doesn’t want to own up to its own scrapers. I didn’t get any hits from Perplexity’s ‘declared’ crawler user-agent. I assume that’s because I had already explicitly blocked it in my Caddy config.

So what then?

The IP addresses I was hit from:

IP (Shodan)        Provider          ASN       My suspicion
20.163.3.88        Azure (westus3)   AS8075    Port 22 open, obvious scraper.
64.226.124.70      DigitalOcean      AS14061   Port 22 open, obvious scraper.
91.96.30.45        EWE-TEL GmbH      AS9145    Ports 5060 and 8089 open. Possible scraper.
206.189.185.221    -                 -         Host seemingly offline.

As you can see, it’s a good range. Key issue: none of these are regular retail ISP lines. Now why would I get this funky user-agent from what looks to be a bunch of VMs?

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
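For anyone who wants to run the same check on their own logs, here is a minimal sketch in Python. It assumes Caddy's JSON access log format, where each line is one JSON object with the client address under request.remote_ip and the headers under request.headers; if your logs are shaped differently, the field names need adjusting. The function name count_hits_by_ip and the sample addresses (taken from the RFC 5737 documentation ranges) are made up for illustration.

```python
import json
from collections import Counter

# The user-agent string quoted above.
SUSPECT_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def count_hits_by_ip(log_lines, user_agent=SUSPECT_UA):
    """Count requests per remote IP whose User-Agent matches exactly.

    Assumes one JSON object per line, shaped like Caddy's access log
    (request.remote_ip, request.headers["User-Agent"] as a list).
    """
    hits = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        request = entry.get("request", {})
        agents = request.get("headers", {}).get("User-Agent", [])
        if user_agent in agents:
            hits[request.get("remote_ip", "unknown")] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample lines; replace with open("<your access log>") in practice.
    sample = [
        json.dumps({"request": {"remote_ip": "192.0.2.10",
                                "headers": {"User-Agent": [SUSPECT_UA]}}}),
        json.dumps({"request": {"remote_ip": "192.0.2.11",
                                "headers": {"User-Agent": ["curl/8.0"]}}}),
    ]
    for ip, count in count_hits_by_ip(sample).most_common():
        print(ip, count)
```

An exact user-agent match is deliberately strict. It only surfaces the addresses worth looking up by hand; it proves nothing about who is behind them.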

I set up Caddy Defender, and I’m hoping that’s sufficient. For now, I’m blocking most of the large infrastructure providers. I assume that since I can’t access my website from my work’s VPN, I should be okay… ish. Maybe.
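To make the idea concrete: blocking by provider boils down to checking whether the client address falls inside a list of known network ranges. The sketch below shows that check in Python. It is not Caddy Defender's implementation or configuration syntax, and the provider names and CIDR ranges are placeholders from the RFC 5737 documentation blocks, not real provider ranges. A real setup would load the ranges the providers publish.

```python
import ipaddress

# Placeholder data: RFC 5737 documentation ranges, NOT real provider ranges.
BLOCKED_RANGES = {
    "example-cloud-a": ["192.0.2.0/24"],
    "example-cloud-b": ["198.51.100.0/24", "203.0.113.0/25"],
}


def build_blocklist(ranges):
    """Parse the CIDR strings once into (provider, network) pairs."""
    return [
        (name, ipaddress.ip_network(cidr))
        for name, cidrs in ranges.items()
        for cidr in cidrs
    ]


def match_provider(ip, blocklist):
    """Return the provider whose range contains ip, or None if none does."""
    addr = ipaddress.ip_address(ip)
    for name, network in blocklist:
        if addr in network:
            return name
    return None


if __name__ == "__main__":
    blocklist = build_blocklist(BLOCKED_RANGES)
    for ip in ("192.0.2.77", "203.0.113.200"):
        provider = match_provider(ip, blocklist)
        print(ip, "blocked via " + provider if provider else "allowed")
```

The trade-off is the one I ran into with my work VPN: anyone who legitimately browses from inside one of those ranges gets blocked along with the scrapers.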

My writing is for people. If corporations wish to use it for any reason, they had better act in good faith and license it like they are supposed to. But this isn’t even the first time Perplexity has done this. Their CEO, Aravind Srinivas, outright admitted to doing just this a year ago.

Perplexity was not given (and will never be given) permission or license to process my writing, my personal data (which is directly tied to my writing), or anything else.

I don’t like this very much. Here’s hoping the law catches up, because I sure don’t like being abused like this.