Skip to main content

Pure-Python robots.txt parser with support for modern conventions

Project description

Supported Python Versions CI

Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

pip install protego

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

Protego

RobotFileParser

Reppy

Robotexclusionrulesparser

Implementation language

Python

Python

C++

Python

Reference specification

Google

Martijn Koster’s 1996 draft

Wildcard support

Length-based precedence

Performance

+40%

+1300%

-25%

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} A list of sitemaps specified in robots.txt.

  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.

  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.

  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.

  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.

  • visit_time(user_agent) Return the visit time specified for the user agent as a named tuple VisitTime(start_time, end_time). If nothing is specified, return None.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

protego-0.6.2.tar.gz (3.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

protego-0.6.2-py3-none-any.whl (10.3 kB view details)

Uploaded Python 3

File details

Details for the file protego-0.6.2.tar.gz.

File metadata

  • Download URL: protego-0.6.2.tar.gz
  • Upload date:
  • Size: 3.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for protego-0.6.2.tar.gz
Algorithm Hash digest
SHA256 88ff004544ce44e61269cc6f8735f7837d12e09bb77619fc94fbb26b72e5d137
MD5 6a8f6bba5f9a52185fb3c60b09fe79c6
BLAKE2b-256 7d1b6b0ee60bb1561843bfc62f5c7c3cb1ef6147b47e829a5fd6b7fcd2752471

See more details on using hashes here.

Provenance

The following attestation bundles were made for protego-0.6.2.tar.gz:

Publisher: publish.yml on scrapy/protego

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file protego-0.6.2-py3-none-any.whl.

File metadata

  • Download URL: protego-0.6.2-py3-none-any.whl
  • Upload date:
  • Size: 10.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for protego-0.6.2-py3-none-any.whl
Algorithm Hash digest
SHA256 714de21d82527c9be900066c3211b266985dd6a19b6e70c57e033fc1a589f3ff
MD5 2428f0d81ddfab728d0629ea9265fae2
BLAKE2b-256 6110cbd3c06603bb8b3e0e871df24ee793a6e3b53d8c4cc8763f1b8b936649aa

See more details on using hashes here.

Provenance

The following attestation bundles were made for protego-0.6.2-py3-none-any.whl:

Publisher: publish.yml on scrapy/protego

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page