Robots Exclusion Protocol File Parser

Project description

Robots Exclusion Standard Parser for Python.

The `robots` Python module implements a parser for `robots.txt` files. The recommended class to use is `robots.RobotsParser`. In addition, a thin facade, `robots.RobotFileParser`, can be used as a substitute for `urllib.robotparser.RobotFileParser`, available in the Python standard library. The facade `robots.RobotFileParser` exposes an API that is mostly compatible with `urllib.robotparser.RobotFileParser`.
The main reasons for this rewrite are the following:

- It was initially intended to experiment with parsing `robots.txt` for a link checker project (not implemented).
- It attempts to follow the latest internet draft of the Robots Exclusion Protocol.
- It does not try to be compliant with commonly accepted directives that are not in the current specs, such as `request-rate` and `crawl-delay`, but it does support `sitemaps`.
- It satisfies the same tests as the Google Robots.txt Parser, except for some custom behaviors specific to Google Robots.
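To illustrate the draft's key matching rule, here is a minimal pure-Python sketch of longest-match precedence. This is a simplified model for illustration only, not the actual `robotspy` implementation; the `can_fetch` helper and the tuple-based rule format are invented for this sketch:

```python
# Simplified sketch of the Robots Exclusion Protocol matching rule:
# among all Allow/Disallow patterns that match a path, the longest
# pattern wins, and ties are resolved in favor of Allow.
# Illustrative only -- not the actual robotspy implementation.

def can_fetch(rules, path):
    """rules: list of (directive, pattern) tuples, e.g. ('allow', '/brian/')."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if path.startswith(pattern):
            is_allow = (directive == 'allow')
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# Rules corresponding to the 'Nutch' group used in the examples below:
nutch_rules = [('disallow', '/'), ('allow', '/brian/')]
print(can_fetch(nutch_rules, '/brian/'))  # True: '/brian/' is the longest match
print(can_fetch(nutch_rules, '/brian'))   # False: only '/' matches
```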
Installation

Note: Python 3.8.x is required.

Preferably, install the `robots` package after creating a Python virtual environment, in a newly created directory, as follows:

```
$ mkdir project && cd project
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install --upgrade pip
(robotspy) $ python -m pip install --upgrade setuptools
(robotspy) $ python -m pip install robotspy
```
Usage

The `robots` package can be imported as a module and also exposes an executable, invokable with `python -m`.

Execute the Package

After installing `robotspy`, you can validate the installation by running the following command:
```
(robotspy) $ python -m robots --help
usage: robots (<robots_path>|<robots_url>) <user_agent> <URI>

Shows whether the given user agent and URI combination are allowed or
disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  uri            Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
```
Examples

The content of http://www.pythontest.net/elsewhere/robots.txt is the following:

```
# Used by NetworkTestCase in Lib/test/test_robotparser.py
User-agent: Nutch
Disallow: /
Allow: /brian/
User-agent: *
Disallow: /webstats/
```
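A `robots.txt` file like this one is organized into groups: one or more `User-agent` lines followed by `Allow`/`Disallow` rules. The sketch below parses such content into a dictionary of rules per agent. It is a simplified illustration only, not `robotspy`'s implementation; the `parse_robots` function and its output format are invented here, and it ignores directives such as `Sitemap`:

```python
# Minimal robots.txt group parser -- illustrative only, not robotspy's
# implementation. Maps each user-agent token to its list of rules.

def parse_robots(text):
    groups = {}          # user-agent token -> [(directive, pattern), ...]
    agents, in_rules = [], False
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, value = line.partition(':')
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            if in_rules:                     # a rule ended the previous group
                agents, in_rules = [], False
            agents.append(value)
            groups.setdefault(value, [])
        elif field in ('allow', 'disallow') and value:
            in_rules = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups

example = """\
User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/
"""
print(parse_robots(example))
```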
To check whether the user agent `Nutch` can fetch the path `/brian/`, you can execute:

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED
```
Or, you can also pass the full URL, http://www.pythontest.net/brian/:

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch http://www.pythontest.net/brian/
user-agent 'Nutch' with URI 'http://www.pythontest.net/brian/': ALLOWED
```
Can the user agent `Nutch` fetch the path `/brian`?

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with URI '/brian': DISALLOWED
```
Or the path `/`?

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /
user-agent 'Nutch' with URI '/': DISALLOWED
```
How about the user agent `Johnny`?

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with URI '/': ALLOWED
```
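The examples above also show group selection at work: `Nutch` matches its own group, while `Johnny`, which appears nowhere in the file, falls back to the `*` group. Here is a minimal sketch of that fallback; the `select_rules` helper and the rule format are invented for illustration and are not part of the `robotspy` API:

```python
# Sketch of user-agent group selection: pick the group whose token
# matches the user agent, otherwise fall back to the '*' group.
# Illustrative only -- not the actual robotspy implementation.

def select_rules(groups, useragent):
    ua = useragent.lower()
    for token, rules in groups.items():
        if token != '*' and token.lower() in ua:
            return rules
    return groups.get('*', [])

groups = {
    'Nutch': [('disallow', '/'), ('allow', '/brian/')],
    '*': [('disallow', '/webstats/')],
}
print(select_rules(groups, 'Nutch'))   # Nutch gets its dedicated group
print(select_rules(groups, 'Johnny'))  # Johnny falls back to the '*' group
```

Since the `*` group contains only `Disallow: /webstats/`, any path outside `/webstats/` is allowed for `Johnny`, which matches the `ALLOWED` result above.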
Use the Module in a Project

Here is an example with the same data as above, using the `robots` package from the Python shell:

```
(robotspy) $ python
>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f'Can {useragent} fetch {path}? {result}')
Can Nutch fetch /brian/? True
>>>
```
Bug in the Python standard library

There is a bug in `urllib.robotparser` from the Python standard library that causes the following test to differ from the example above with `robotspy`.

The example with `urllib.robotparser` is the following:

```
$ python
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> rp.read()
>>> rp.can_fetch('Nutch', '/brian/')
False
```

Notice that the result is `False`, whereas `robotspy` returns `True`.
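The divergence can be modeled as a difference in rule precedence: the standard library behaves as if the first matching rule in file order decides, while the draft specifies that the longest matching pattern decides. The following simplified contrast is an illustration of the two strategies under that assumption; both functions are invented here and are not actual library code:

```python
# Two precedence strategies applied to the same rule list:
# file-order first match vs. longest match (per the internet draft).

rules = [('disallow', '/'), ('allow', '/brian/')]  # file order from robots.txt

def first_match(rules, path):
    # First matching rule in file order decides (urllib-like behavior).
    for directive, pattern in rules:
        if path.startswith(pattern):
            return directive == 'allow'
    return True

def longest_match(rules, path):
    # The matching rule with the longest pattern decides (draft behavior).
    matches = [(len(p), d == 'allow') for d, p in rules if path.startswith(p)]
    return max(matches)[1] if matches else True

print(first_match(rules, '/brian/'))    # False: 'Disallow: /' matches first
print(longest_match(rules, '/brian/'))  # True: 'Allow: /brian/' is longer
```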
Bug bpo-39187 was opened to raise awareness of this issue, and PR https://github.com/python/cpython/pull/17794 was submitted as a possible fix. `robotspy` does not exhibit this problem.
Development

The main development dependency is `pytest`, used for executing the tests. It is automatically installed if you perform the following steps:

```
$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install -r requirements.txt
(robotspy) $ python -m pip install -e .
(robotspy) $ make test
(robotspy) $ deactivate
$
```
Other dependencies are intended for deployment to the Cheese Shop (PyPI). See the build file, `Makefile`, for the commands and parameters.

The `Makefile` also invokes three additional development tools. At this stage of the development (version 0.3.1), these tools are expected to be installed globally.
Dependency Tree

To display the dependency tree:

```
$ pipdeptree
```

or:

```
$ make tree
```

To display the reverse dependency tree of a particular package, `idna` in the example below:

```
$ pipdeptree --reverse --packages idna
```
Release History

- 0.3.1:
  - Updated the `idna` and `wcwidth` packages
  - Added the `pipdeptree` package to provide visibility on dependencies
  - Fixed `mypy` errors
  - Explicitly ignored `pylint` errors related to commonly used names like `f`, `m`, or `T`
- 0.3.0: Updated the `bleach` package to address CVE-2020-6802
- 0.2.0: Updated the documentation
- 0.1.0: Initial release
License
MIT License