Make it slightly harder for bots to steal your content

MkDocs Anti AI Scraper Plugin

This plugin tries to prevent AI scrapers from easily ingesting your website's contents. The implementation is rough, and by design anyone who invests a bit of time can bypass it, but it is probably better than nothing.

Installation

Install the package from PyPI:

pip install mkdocs-anti-ai-scraper-plugin

Then enable the plugin in the plugins section of mkdocs.yml.

Implemented Techniques

robots.txt

This technique is enabled by default, and can be disabled by setting the option robots_txt: False in mkdocs.yml. If enabled, it adds a robots.txt with the following contents to the output directory:

User-agent: *
Disallow: /

This hints to crawlers that they should not crawl your site.

This technique does not hinder normal users from using the site at all. However, robots.txt does not enforce anything: it only tells well-behaved bots how you would like them to behave, and many bots may simply ignore it ([Source](https://www.tomshardware.com/tech-industry/artificial-intelligence/several-ai-companies-said-to-be-ignoring-robots-dot-txt-exclusion-scraping-content-without-permission-report)).
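
To illustrate what these rules mean for a crawler that does respect them, here is a minimal sketch using Python's standard urllib.robotparser. It is not the plugin's actual code: the helper names write_robots_txt and is_allowed, the example.com URL, and the two user agent names are assumptions made up for this example.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# The exact rules described in this section.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def write_robots_txt(site_dir: Path) -> Path:
    """Hypothetical helper: place robots.txt in the root of the built site."""
    target = site_dir / "robots.txt"
    target.write_text(ROBOTS_TXT, encoding="utf-8")
    return target


def is_allowed(robots_file: Path, user_agent: str, url: str) -> bool:
    """Ask the question a well-behaved crawler asks before fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_file.read_text(encoding="utf-8").splitlines())
    return parser.can_fetch(user_agent, url)


with tempfile.TemporaryDirectory() as tmp:
    robots_file = write_robots_txt(Path(tmp))
    page = "https://example.com/guide/"  # assumed URL, for illustration only
    results = {
        agent: is_allowed(robots_file, agent, page)
        for agent in ("GPTBot", "Googlebot")
    }

print(results)  # {'GPTBot': False, 'Googlebot': False}
```

Both lookups return False because the wildcard rule applies to every user agent, search engines included. A crawler that never consults robots.txt is not affected in any way.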

Planned Techniques

  • Encoding the page contents and decoding them with JS: Will prevent basic HTML parsers from getting the contents, but anything using a browser (Selenium, Puppeteer, etc.) will still work.
  • Encrypting the page contents and adding a client-side "CAPTCHA" to generate the key: Should help against primitive browser-based bots. It would probably make sense to let the user solve the CAPTCHA once and cache the key in a cookie or in localStorage.
  • Bot detection JS: Will be a cat-and-mouse game, but should help against badly written crawlers.
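
The first planned technique could look roughly like the following sketch. Since it is not implemented yet, everything here is an assumption: Base64 as the encoding, the data-payload attribute, and the names encode_body, TextCollector and visible_text are invented for illustration. The sketch encodes a page body, embeds a small JS decoder, and then checks what a basic HTML parser would see.

```python
import base64
import re
from html.parser import HTMLParser

# Assumed client-side decoder: reads the Base64 payload and restores the HTML.
DECODER_JS = """<script>
const el = document.getElementById("content");
const bytes = Uint8Array.from(atob(el.dataset.payload), (c) => c.charCodeAt(0));
el.innerHTML = new TextDecoder().decode(bytes);
</script>"""


def encode_body(body_html: str) -> str:
    """Replace the readable body with a Base64 payload plus the JS decoder."""
    payload = base64.b64encode(body_html.encode("utf-8")).decode("ascii")
    return f'<div id="content" data-payload="{payload}"></div>\n{DECODER_JS}'


class TextCollector(HTMLParser):
    """Collects text nodes, like a scraper that does not execute JS."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_text(page_html: str) -> str:
    collector = TextCollector()
    collector.feed(page_html)
    collector.close()
    return "".join(collector.chunks)


original = "<p>Original documentation text</p>"
encoded = encode_body(original)
scraped = visible_text(encoded)

# A scraper that knows the scheme can still undo it without a browser.
payload = re.search(r'data-payload="([^"]+)"', encoded).group(1)
recovered = base64.b64decode(payload).decode("utf-8")

print("Original documentation text" in scraped)  # False
print(recovered == original)  # True
```

The last two lines show both sides of the trade-off: a plain HTML parser no longer finds the text, but this is encoding rather than encryption, so any scraper that runs JS or decodes the payload itself recovers the page.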

Suggestions welcome: If you know bot detection mechanisms that can be used with static websites, feel free to open an issue :D

Problems and Considerations

  • As with the encryption plugin, encrypting the search index is hard, so it is best to disable search to prevent anyone from accessing it.
  • Obviously, to protect your contents from scraping, you should not host their source files in public repos ;D
  • By blocking bots, you also prevent search engines like Google from properly indexing your site.

Development Commands

This repo is managed using poetry. You can install poetry with pip install poetry or pipx install poetry.

Clone repo:

git clone git@github.com:six-two/mkdocs-anti-ai-scraper-plugin.git

Install/update extension locally:

poetry install

Build test site:

poetry run mkdocs build

Serve test site:

poetry run mkdocs serve

Release

Build extension:

poetry build

Upload extension:

poetry publish
