Today I found a super helpful nixCraft post on how to block OpenAI from your site, at both the robots.txt level and the firewall level.

So, I adapted their firewall bash script into an Ansible snippet, which I’ve now added to my default playbook for setting up new servers.

It installs the ufw firewall, blocks the ChatGPT-User IP range referenced in OpenAI’s docs, and fetches OpenAI’s /gptbot.json endpoint to block the current list of IP ranges for GPTBot.

---
- name: Block ChatGPT's documented IP ranges from connecting to the server
  hosts: all  # adjust to match your inventory
  become: yes
  tasks:
    - name: Install ufw firewall
      ansible.builtin.apt:
        name: ufw
        state: present
        update_cache: yes

    - name: Configure ufw firewall to deny access to ChatGPT-User's IP range
      community.general.ufw:
        rule: deny
        src: 23.98.142.176/28
        comment: ChatGPT-User (https://platform.openai.com/docs/plugins/bot)

    - name: Load GPTBot IP ranges
      ansible.builtin.uri:
        url: https://openai.com/gptbot.json
        return_content: yes
      register: gptbot_info

    - name: Configure ufw firewall to deny access to each of GPTBot's IP ranges
      community.general.ufw:
        rule: deny
        src: "{{ item }}"
        comment: GPTBot (https://platform.openai.com/docs/gptbot)
      # json_query requires the jmespath Python library on the control node
      loop: "{{ gptbot_info['json'] |
        community.general.json_query('prefixes[*].ipv4Prefix') }}"
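
For reference, the json_query expression above assumes a response shaped roughly like this (the CIDR block shown is a documentation placeholder, not one of OpenAI’s real ranges):

```json
{
  "prefixes": [
    { "ipv4Prefix": "198.51.100.0/24" }
  ]
}
```

Entries without an ipv4Prefix key are dropped by the JMESPath projection, so the loop only ever sees IPv4 CIDR strings.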

This won’t keep your server up to date with GPTBot’s latest IP ranges automatically, but re-running the playbook should pick up any new ranges.
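
If you want it to stay current without manual re-runs, one option is to schedule the playbook itself. Here’s a sketch using ansible.builtin.cron; it assumes Ansible is installed on the server (pull-style) and the playbook path is a placeholder you’d adjust:

```yaml
    - name: Re-apply the GPTBot blocklist weekly
      ansible.builtin.cron:
        name: refresh-gptbot-blocklist
        special_time: weekly
        user: root
        # placeholder path; point this at wherever the playbook actually lives
        job: ansible-playbook --connection=local /opt/playbooks/block-gptbot.yml
```

You could just as easily run this from a control machine’s crontab instead; the point is only that re-running the playbook is what refreshes the ranges.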

It’s also worth noting that, strictly speaking, this script trusts OpenAI to accurately document its own IP ranges. That’s not because I trust OpenAI as a company, but because even an honor-system blocklist is better than not banning their documented ranges at all.

Bots can always ignore robots.txt or get new IP addresses, but OpenAI is in enough legal trouble that hopefully this can be one more filter to keep the biggest AI scraper off our content 🤞

Also, here’s a complete robots.txt file to consider adding to your web root. It tells OpenAI (GPTBot and ChatGPT-User), Google Bard (via the Google-Extended token), and Common Crawl’s CCBot that you do not authorize them to scrape your content.

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
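
If the same playbook manages your web server, a task along these lines could deploy the file too. The destination path is an assumption for a stock Debian/Ubuntu setup; adjust it to your actual document root:

```yaml
    - name: Deploy robots.txt to the web root
      ansible.builtin.copy:
        src: robots.txt  # the file above, stored alongside the playbook
        dest: /var/www/html/robots.txt  # placeholder; match your web root
        owner: root
        group: root
        mode: "0644"
```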