Blocking OpenAI from accessing your server, with Ansible
Today I found a super helpful nixCraft post on how to block OpenAI from your site, at both the robots.txt level and the firewall level.
So, I adapted their firewall bash script into an Ansible snippet, which I’ve now added to my default playbook for setting up new servers.
It installs the ufw firewall, blocks the ChatGPT-User IP range referenced in OpenAI’s docs, and accesses OpenAI’s /gptbot.json endpoint to fetch and block the current list of IP ranges for GPTBot.
---
- name: Block ChatGPT's documented IP ranges from connecting to the server
  hosts: all  # assumption: not in the original snippet; set this to your own host group
  become: yes
  become_user: root
  tasks:
    - name: Install ufw firewall
      apt:
        name: ufw
        update_cache: yes

    - name: Configure ufw firewall to deny access to ChatGPT-User's IP range
      community.general.ufw:
        rule: deny
        src: 23.98.142.176/28
        comment: ChatGPT-User (https://platform.openai.com/docs/plugins/bot)

    - name: Load GPTBot IP ranges
      uri:
        url: https://openai.com/gptbot.json
      register: gptbot_info

    # json_query needs the jmespath Python package on the control node
    - name: Configure ufw firewall to deny access to each of GPTBot's IP ranges
      community.general.ufw:
        rule: deny
        src: "{{ item }}"
        comment: GPTBot (https://platform.openai.com/docs/gptbot)
      loop: "{{ gptbot_info['json'] |
        community.general.json_query('prefixes[*].ipv4Prefix') }}"
This won’t keep your server up-to-date with GPTBot’s latest IP ranges automatically, but it should add any new ranges every time you re-run this playbook.
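If you'd rather not remember to re-run it by hand, one option is to schedule the re-run. Here's a rough sketch, assuming Ansible and a copy of the playbook live on the machine that runs the job. The inventory path, the playbook path, and the job name are all placeholders I made up for illustration, and the user running the cron job would need whatever privileges the playbook's `become` requires.

```yaml
---
# Sketch only: installs a daily cron entry on the machine running Ansible.
# Both paths below are hypothetical; point them at your own files.
- name: Schedule a daily re-run of the blocking playbook
  hosts: localhost
  tasks:
    - name: Re-run the playbook once a day
      ansible.builtin.cron:
        name: refresh-gptbot-blocklist
        special_time: daily
        job: ansible-playbook -i /path/to/inventory /path/to/block-openai.yml
```

Like the playbook itself, this only ever adds deny rules. It doesn't remove rules for ranges that later drop off OpenAI's list.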
It’s also worth noting that, strictly speaking, this script trusts OpenAI to accurately document its own IP ranges. That’s not because I trust OpenAI as a company, but because this is still better than not banning their documented ranges.
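Since the playbook relies entirely on what gptbot.json reports, it may help to see which fields it actually reads. Here's a small sketch of the same json_query expression run against made-up data. The `sample` variable and its two ranges (reserved documentation addresses, not real GPTBot ranges) are mine, and the shape is only what the expression `prefixes[*].ipv4Prefix` implies: a `prefixes` list whose entries each carry an `ipv4Prefix` string.

```yaml
---
# Sketch only: runs locally and touches no firewall rules.
# Requires the community.general collection and the jmespath Python package.
- name: Show what the json_query expression selects
  hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical stand-in for gptbot_info['json']
    sample:
      prefixes:
        - ipv4Prefix: 192.0.2.0/24
        - ipv4Prefix: 198.51.100.0/24
  tasks:
    - name: Print the list of ranges the loop would iterate over
      ansible.builtin.debug:
        msg: "{{ sample | community.general.json_query('prefixes[*].ipv4Prefix') }}"
```

The expression flattens the entries into a plain list of CIDR strings, and the ufw task then denies each one in turn. Anything outside that field is ignored, so if the endpoint ever listed ranges under a different key, this playbook would not pick them up.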
Bots can always ignore robots.txt or get new IP addresses, but OpenAI is in enough legal trouble that hopefully this can be one more filter to keep the biggest AI scraper off our content 🤞
Also, here’s the complete robots.txt file, to consider adding to your web root. This tells OpenAI, Google Bard, and Common Crawl that you do not authorize them to scrape your content.
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
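If your servers are already managed with Ansible, the robots.txt file can be shipped the same way. This is a minimal sketch, not something from the nixCraft post. The `/var/www/html` web root and the `hosts: all` target are assumptions, so swap in whatever your web server actually serves from.

```yaml
---
# Sketch only: writes the robots.txt rules above into an assumed web root.
- name: Publish robots.txt to the web root
  hosts: all  # assumption: narrow this to your web servers
  become: yes
  tasks:
    - name: Write robots.txt
      ansible.builtin.copy:
        dest: /var/www/html/robots.txt  # hypothetical path
        mode: "0644"
        content: |
          User-agent: GPTBot
          Disallow: /

          User-agent: ChatGPT-User
          Disallow: /

          User-agent: Google-Extended
          Disallow: /

          User-agent: CCBot
          Disallow: /
```

Note that this overwrites any robots.txt already at that path, so if your site has existing rules you'd want to merge them into the `content` block first.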