Pure-Python robots.txt parser with support for modern conventions
Project description
Protego is a pure-Python robots.txt parser with support for modern conventions.
Install
To install Protego, simply use pip:
pip install protego
Usage
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
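The can_fetch results above can be reasoned about with two conventions: `*` and `$` act as wildcards in path patterns, and when several rules match, the longest pattern wins. The sketch below illustrates that idea with the standard library only. It is not Protego's implementation; the names `pattern_to_regex`, `is_allowed`, and `RULES` are hypothetical, and the tie-breaking choice (Allow wins between equally long patterns) is an assumption.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters; a trailing `$` anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, url):
    """Return True if the longest matching rule allows the URL path.

    `rules` is a list of (directive, pattern) pairs. Assumptions: ties
    between equally long patterns go to Allow, and a path that matches
    no rule at all is allowed.
    """
    path = urlsplit(url).path or "/"
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern_to_regex(pattern).match(path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


# The same rules as in the robots.txt body of the usage example.
RULES = [
    ("Disallow", "/"),
    ("Allow", "/about"),
    ("Allow", "/account"),
    ("Disallow", "/account/contact$"),
    ("Disallow", "/account/*/profile"),
]

print(is_allowed(RULES, "http://example.com/account/myuser/profile"))  # False
```

For example, `/account/myuser/profile` matches `/`, `/account`, and `/account/*/profile`; the last pattern is the longest, so its Disallow decides the outcome.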
Using Protego with Requests:
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
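The Requests example hard-codes the robots.txt location. When starting from an arbitrary page URL, the location can be derived first. The helper below is a minimal sketch using only the standard library; `robots_url` is a hypothetical name, not part of Protego, and it assumes the file is served at `/robots.txt` on the same scheme and host as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that serves `page_url`.

    Keeps the scheme and host (including any port) and drops the
    original path, query string, and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://google.com/search/about?q=protego"))
# https://google.com/robots.txt
```

The returned string can be passed to `requests.get` as in the example above, and the response text handed to `Protego.parse`.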
Comparison
The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:
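| | Protego | RobotFileParser | Reppy | Robotexclusionrulesparser |
---|---|---|---|---|
| Implementation language | Python | Python | C++ | Python |
| Reference specification | Google | Martijn Koster's 1996 draft | Martijn Koster's 1996 draft | Martijn Koster's 1996 draft |
| Wildcard support | ✓ | | ✓ | ✓ |
| Length-based precedence | ✓ | | ✓ | |
| Performance | | +40% | +1300% | -25% |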
API Reference
Class protego.Protego:
Properties
sitemaps {list_iterator} An iterator over the sitemaps specified in robots.txt. Wrap it in list() to get a list.
preferred_host {string} Preferred host specified in robots.txt.
Methods
parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.
can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.
crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.
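The two throttling methods above can be combined into a single wait time between requests. The function below is a sketch under stated assumptions, not part of Protego: `polite_delay` and its `default` parameter are hypothetical, it takes the stricter of the two limits, and it ignores the `start_time` and `end_time` fields of RequestRate. It relies only on the method names and return values documented above, so it accepts any object that provides them.

```python
def polite_delay(rp, user_agent, default=1.0):
    """Return the number of seconds to wait between requests.

    `rp` is expected to offer crawl_delay(user_agent) and
    request_rate(user_agent) as described in the API reference.
    Assumption: when both are set, the longer wait applies; when
    neither is set, `default` is returned.
    """
    candidates = []

    delay = rp.crawl_delay(user_agent)
    if delay is not None:
        candidates.append(float(delay))

    rate = rp.request_rate(user_agent)
    if rate is not None and rate.requests:
        # e.g. 10 requests per 60 seconds means one request every 6 seconds
        candidates.append(rate.seconds / rate.requests)

    return max(candidates) if candidates else default
```

With the values from the usage example (a crawl delay of 4.0 and a request rate of 10 requests per 60 seconds), `polite_delay(rp, "mybot")` would return 6.0, because the request rate is the stricter limit.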
Hashes for Protego-0.2.1-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 04419b18f20e8909f1691c6b678392988271cc2a324a72f9663cb3af838b4bf7 |
MD5 | 7b31da10f4c46481b45d41a7639f124a |
BLAKE2b-256 | 814d3e01f10d6dd2d35793711c2e27a07e547c6aec0ab8d3199bb83e68956fdb |