Pure-Python robots.txt parser with support for modern conventions

Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

Install Protego with pip:

pip install protego

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
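The example above hard-codes the robots.txt location. When starting from an arbitrary page URL, the robots.txt URL can be derived first. The following is a minimal sketch using only the standard library; the helper name robots_txt_url is hypothetical and not part of Protego or Requests.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host (including any port)
    and replaces the path, query and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

The returned URL can then be passed to requests.get, and the response text to Protego.parse, as in the session above. Note that the results shown for google.com depend on the robots.txt file the site serves at the time of the request.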

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

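+-------------------------+---------+-----------------+--------+---------------------------+
|                         | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+=========================+=========+=================+========+===========================+
| Implementation language | Python  | Python          | C++    | Python                    |
+-------------------------+---------+-----------------+--------+---------------------------+
| Reference specification | Google  | Martijn Koster’s 1996 draft                          |
+-------------------------+---------+-----------------+--------+---------------------------+
| Wildcard support        |         |                 |        |                           |
+-------------------------+---------+-----------------+--------+---------------------------+
| Length-based precedence |         |                 |        |                           |
+-------------------------+---------+-----------------+--------+---------------------------+
| Performance             |         | +40%            | +1300% | -25%                      |
+-------------------------+---------+-----------------+--------+---------------------------+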
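To make "wildcard support" and "length-based precedence" concrete, here is a simplified sketch of Google-style rule matching. It is not Protego's implementation: the function names are hypothetical, it handles a single rule group, and it skips details such as percent-encoding normalization. The assumed rules are that "*" matches any run of characters, a trailing "$" anchors the end of the path, the longest matching pattern wins, and Allow wins a tie.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules):
    """Decide whether path may be fetched.

    rules is an iterable of (allow, pattern) pairs, where allow is a bool.
    The longest matching pattern wins; on equal length, Allow wins.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, allow) of the best match so far
    for allow, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), allow)
            # Tuple comparison: longer pattern first, then True over False.
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# The rule group from the Usage section, as (allow, pattern) pairs.
RULES = [
    (False, "/"),
    (True, "/about"),
    (True, "/account"),
    (False, "/account/contact$"),
    (False, "/account/*/profile"),
]
```

With these rules, is_allowed("/about", RULES) is True because "/about" is longer than "/", while is_allowed("/account/myuser/profile", RULES) is False because the wildcard pattern is longer than "/account". This matches the can_fetch results shown in the Usage section.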

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} An iterator over the sitemap URLs specified in robots.txt. Wrap it in list() to get a list.

  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Class method. Parse the content of a robots.txt file, passed as a string, and return a new instance of protego.Protego.

  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.

  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.

  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.

  • visit_time(user_agent) Return the visit time specified for the user agent as a named tuple VisitTime(start_time, end_time). If nothing is specified, return None.
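The methods above return raw values, and turning them into crawler behaviour is left to the caller. The sketch below shows one possible way to combine them. The helper names polite_delay and in_visit_window are hypothetical, the choice to take the stricter of crawl delay and request rate is an assumption, and the visit-time check assumes that start_time and end_time compare like datetime.time values expressed in UTC.

```python
from datetime import datetime, timezone


def polite_delay(parser, user_agent, default=0.0):
    """Return the number of seconds to wait between requests.

    Uses the larger of the crawl delay and the delay implied by the
    request rate; falls back to default when neither is specified.
    """
    candidates = []
    delay = parser.crawl_delay(user_agent)
    if delay is not None:
        candidates.append(delay)
    rate = parser.request_rate(user_agent)
    if rate is not None and rate.requests:
        # For example, 10 requests per 60 seconds means 6 seconds apart.
        candidates.append(rate.seconds / rate.requests)
    return max(candidates) if candidates else default


def in_visit_window(parser, user_agent, now=None):
    """Return True if now falls inside the visit time, or none is set.

    Assumption: the window boundaries behave like datetime.time values.
    """
    window = parser.visit_time(user_agent)
    if window is None:
        return True
    current = (now or datetime.now(timezone.utc)).time()
    start, end = window.start_time, window.end_time
    if start <= end:
        return start <= current <= end
    # The window wraps past midnight, e.g. 22:00 to 02:00.
    return current >= start or current <= end
```

Here parser is expected to be the object returned by Protego.parse. For the robots.txt file in the Usage section (Crawl-delay: 4 and Request-rate: 10/1m), polite_delay would return 6.0, since the request rate is the stricter of the two limits.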
