this post was submitted on 27 Apr 2024

112 points (95.9% liked)


Blocking AI crawlers with Caddy (lemmy.world)

submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/[email protected]


I was reading the reddit thread on Claude AI crawlers effectively DDOSing Linux Mint forums https://libreddit.lunar.icu/r/linux/comments/1ceco4f/claude_ai_name_and_shame/

and I wanted to block all AI crawlers from my self-hosted stuff.

I don't trust crawlers to respect robots.txt, but you can get one here: https://darkvisitors.com/
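
For the crawlers that do honor robots.txt, Caddy can probably serve one directly. This is only a rough sketch: the three user agents are picked from the regex further down, the hostname and upstream are the same placeholders as in the usage example, and a real list would come from darkvisitors.

```
www.mywebsite.com {
  # Answer /robots.txt from Caddy itself instead of the upstream app.
  # The quoted body spans several lines on purpose.
  respond /robots.txt 200 {
    body "User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"
  }
  reverse_proxy * localhost:3000
}
```

This only asks politely; the User-Agent block is what actually enforces it.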

Since I use Caddy as a server, I generated a directive that blocks them based on their User-Agent header. The content of the regex basically comes from darkvisitors.

Sidenote - there is a module for blocking crawlers as well, but it seemed overkill for me https://github.com/Xumeiquer/nobots

For anybody who is interested, here is the block_ai_crawlers.conf I wrote.

(blockAiCrawlers) {
  @blockAiCrawlers {
    header_regexp User-Agent "(?i)(Bytespider|CCBot|Diffbot|FacebookBot|Google-Extended|GPTBot|omgili|anthropic-ai|Claude-Web|ClaudeBot|cohere-ai)"
  }
  handle @blockAiCrawlers {
    abort
  }
}

# Usage:
# 1. Place this file next to your Caddyfile
# 2. Edit your Caddyfile as in the example below
#
# ```
# import block_ai_crawlers.conf
#
# www.mywebsite.com {
#   import blockAiCrawlers
#   reverse_proxy * localhost:3000
# }
# ```


[–] [email protected] 64 points 2 months ago (7 children)

I got meaner with them :3c

[–] [email protected] 20 points 2 months ago (1 children)

I just want you to know that was an amazing read, was actually thinking "It gets worse? Oh it does. Oh, IT GETS EVEN WORSE?"

[–] [email protected] 4 points 2 months ago

lmao that means a lot, thanks <3

[–] [email protected] 10 points 2 months ago

The nobots module I've linked bombs them

[–] [email protected] 9 points 2 months ago

Suggestion at the end:

  <a class="boom" href="https://boom .arielaw.ar">hehe</a>

Wouldn't it destroy GoogleBot (and other search engine crawlers), getting your site delisted from search?

[–] [email protected] 8 points 2 months ago* (last edited 2 months ago) (1 children)

In dark mode, the anchor tags are difficult to read. They're dark blue on a dark background. Perhaps consider something with a much higher contrast?

[Image: a website with a dark purple background and dark blue links.]

Apart from that, nice idea - I'm going to deploy the zipbomb today!

[–] [email protected] 5 points 2 months ago

nice catch, thanks (i use light mode most of the time)

[–] [email protected] 7 points 2 months ago

This is one of the best things I've ever read.

I'd love to see a robots.txt do a couple safe listings, then a zip bomb, then a safe listing. It would be fun to see how many log entries from an IP look like get a, get b, get zip bomb.... no more requests.

[–] [email protected] 5 points 2 months ago (1 children)

I really like your site's color scheme, fonts, and overall aesthetics. Very nice!

[–] [email protected] 2 points 2 months ago

I agree, it's readable and very cute!

[–] [email protected] 4 points 2 months ago

That’s devilishly and deliciously devious.

[–] [email protected] 26 points 2 months ago (2 children)

I'm a fan of hellpotting them.

[–] [email protected] 3 points 2 months ago

From your recommendation, I found a related project pandoras_pot that I am able to run in a Docker container, and seems to run more efficiently on my Pi home server. I now use it in my Caddyfile to redirect a number of fake subdomains and paths that are likely to be found by a malicious bot (of course all are excluded in my robots.txt for bots that actually respect it). Thanks for the recommendation!
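
A setup like that might look roughly like the sketch below. The decoy paths, the hostname, and both ports are made up for illustration; in particular, localhost:8080 is just an assumed address for wherever the pandoras_pot container is published.

```
www.mywebsite.com {
  # Hypothetical decoy paths that the real site never serves
  # (also list them as Disallow in robots.txt for well-behaved bots)
  @decoys path /wp-login.php /admin/* /.env

  handle @decoys {
    # Assumption: the pandoras_pot container listens on localhost:8080
    reverse_proxy localhost:8080
  }

  # Everything else goes to the real application
  reverse_proxy localhost:3000
}
```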

[–] [email protected] 2 points 2 months ago

Ooh, didn't know about that one... thanks

[–] [email protected] 19 points 3 months ago (3 children)

We should do more than block them, they need to be teergrubed.

[–] [email protected] 12 points 3 months ago* (last edited 3 months ago)

That's an easy modification. Just redirect or reverse proxy to the tarpit instead of using abort.
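
Something along these lines should work as a sketch. The snippet name tarpitAiCrawlers and the tarpit address localhost:8080 are assumptions; point it at wherever your tarpit actually listens.

```
(tarpitAiCrawlers) {
  @aiCrawlers {
    header_regexp User-Agent "(?i)(Bytespider|CCBot|Diffbot|FacebookBot|Google-Extended|GPTBot|omgili|anthropic-ai|Claude-Web|ClaudeBot|cohere-ai)"
  }
  handle @aiCrawlers {
    # Assumption: a tarpit service is listening on localhost:8080
    reverse_proxy localhost:8080
  }
}
```

Import it in the site block the same way as blockAiCrawlers.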

I was even thinking about an infinitely linked data-poisoned html document, but there seemed to be no ready made project that can generate one at the moment. (No published data-poisoning techniques for plain text at all afaik. But there is one for images.)

Ultimately I decided to just abort the connection as I don't want my servers to waste traffic or CPU cycles.

[–] [email protected] 4 points 3 months ago

Such a cool person making the video available for download

[–] [email protected] 1 points 2 months ago

Your link has no article, and the video is inside a Flash file (swf) that isn't opening in 2024.

And I don't want to install Flash on my machine...

[–] [email protected] 4 points 2 months ago

Huh, looks like the post in r/linux got removed for not being relevant.
What a joke.