
Robots Exclusion Protocol File Parser


Robots Exclusion Standard Parser for Python

The robots Python module implements a parser for robots.txt files. The recommended class to use is robots.RobotsParser. A thin facade, robots.RobotFileParser, is also available as a substitute for urllib.robotparser.RobotFileParser from the Python standard library; it exposes an API that is mostly compatible with urllib.robotparser.RobotFileParser.
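For orientation, here is a minimal sketch of both entry points. The RobotsParser calls (from_uri, can_fetch) are documented below; the facade calls (set_url, read, can_fetch) are assumed to mirror urllib.robotparser.RobotFileParser, per the compatibility claim above:

import robots

# Recommended API: build a parser directly from a robots.txt URL.
parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
print(parser.can_fetch('Nutch', '/brian/'))  # True

# Facade API, assumed to mirror urllib.robotparser.RobotFileParser.
rp = robots.RobotFileParser()
rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
rp.read()
print(rp.can_fetch('Nutch', '/brian/'))  # True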

The main reasons for this rewrite are the following:

  1. It was initially written to experiment with parsing robots.txt files for a link checker project (not implemented).
  2. It attempts to follow the latest internet draft of the Robots Exclusion Protocol.
  3. It does not try to support commonly accepted directives absent from the current specification, such as request-rate and crawl-delay, but it does support sitemaps (see the sketch after this list).
  4. It satisfies the same tests as the Google Robots.txt Parser, except for some custom behaviors specific to Google Robots.
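As an illustration of the sitemap support, a sketch follows; it assumes the facade mirrors urllib.robotparser.RobotFileParser.site_maps() (available since Python 3.8), which may not match the actual robotspy accessor:

import robots

# Sketch only: site_maps() is assumed from the urllib.robotparser
# compatibility claim; the actual robotspy accessor may differ.
rp = robots.RobotFileParser()
rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
rp.read()
print(rp.site_maps())  # list of sitemap URLs, or None if none are declared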

Installation

Note: Python 3.8.x is required

Preferably, install the robots package in a Python virtual environment created in a new directory, as follows:

$ mkdir project && cd project
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install --upgrade pip
(robotspy) $ python -m pip install --upgrade setuptools
(robotspy) $ python -m pip install robotspy
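Once the installation completes, a quick smoke test is to query the version with the -v/--version flag shown in the help output below (the exact output line may vary with the installed release):

(robotspy) $ python -m robots --version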

Usage

The robots package can be imported as a module and also exposes an executable that can be invoked with python -m.

Execute the Package

After installing robotspy, you can validate the installation by running the following command:

(robotspy) $ python -m robots --help
usage: robots (<robots_path>|<robots_url>) <user_agent> <URI>

Shows whether the given user agent and URI combination are allowed or
disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  uri            Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

Examples

The content of http://www.pythontest.net/elsewhere/robots.txt is the following:

# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/

To check if the user agent Nutch can fetch the path /brian/, you can execute:

(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED

Or, you can also pass the full URL, http://www.pythontest.net/brian/:

(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch http://www.pythontest.net/brian/
user-agent 'Nutch' with URI 'http://www.pythontest.net/brian/': ALLOWED

Can user agent Nutch fetch the path /brian (no trailing slash)?

(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with URI '/brian': DISALLOWED

The rule Allow: /brian/ matches only paths starting with /brian/, so /brian falls under Disallow: / instead.

Or, /?

(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /
user-agent 'Nutch' with URI '/': DISALLOWED

How about user agent Johnny?

(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with URI '/': ALLOWED
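The command accepts a local file path as well as a URL (see the usage line above), so the first check could also be run against a downloaded copy of the file; the robots.txt file name below is arbitrary:

(robotspy) $ curl -s http://www.pythontest.net/elsewhere/robots.txt -o robots.txt
(robotspy) $ python -m robots robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED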

Use the Module in a Project

Here is an example with the same data as above, using the robots package from the Python shell:

(robotspy) $ python
>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f'Can {useragent} fetch {path}? {result}')
Can Nutch fetch /brian/? True
>>>
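The same logic fits in a small script. Here is a minimal sketch using only the from_uri and can_fetch calls shown above; the file name check_robots.py is arbitrary:

# check_robots.py: check a few user agent / path combinations
import robots

ROBOTS_URL = 'http://www.pythontest.net/elsewhere/robots.txt'

def main() -> None:
    parser = robots.RobotsParser.from_uri(ROBOTS_URL)
    for useragent, path in [('Nutch', '/brian/'), ('Nutch', '/'), ('Johnny', '/')]:
        verdict = 'ALLOWED' if parser.can_fetch(useragent, path) else 'DISALLOWED'
        print(f'{useragent} -> {path}: {verdict}')

if __name__ == '__main__':
    main()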

Bug in the Python standard library

There is a bug in urllib.robotparser from the Python standard library that causes the following test to give a different result than the robotspy example above.

The example with urllib.robotparser is the following:

$ python
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> rp.read()
>>> rp.can_fetch('Nutch', '/brian/')
False

Notice that the result is False, whereas robotspy returns True.

Bug bpo-39187 was opened to raise awareness of this issue, and PR https://github.com/python/cpython/pull/17794 was submitted as a possible fix. robotspy does not exhibit this problem.
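A compact way to reproduce the discrepancy is to query both parsers side by side; a minimal sketch:

import urllib.robotparser
import robots

URL = 'http://www.pythontest.net/elsewhere/robots.txt'

# Standard library parser, affected by bpo-39187
std_parser = urllib.robotparser.RobotFileParser()
std_parser.set_url(URL)
std_parser.read()

# robotspy parser
spy_parser = robots.RobotsParser.from_uri(URL)

print('stdlib:  ', std_parser.can_fetch('Nutch', '/brian/'))  # False (buggy)
print('robotspy:', spy_parser.can_fetch('Nutch', '/brian/'))  # True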

Development

The main development dependency is pytest, used for executing the tests. It is installed automatically when you perform the following steps:

$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install -r requirements.txt
(robotspy) $ python -m pip install -e .
(robotspy) $ make test
(robotspy) $ deactivate
$

On Windows:

C:\> git clone https://github.com/andreburgaud/robotspy
C:\> cd robotspy
C:\> python -m venv .venv --prompt robotspy
C:\> .venv\scripts\activate
(robotspy) C:\> python -m pip install -r requirements.txt
(robotspy) C:\> python -m pip install -e .
(robotspy) C:\> make test
(robotspy) C:\> deactivate
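The make test target runs the test suite with pytest; if make is not available, the tests can presumably be invoked directly from the activated virtual environment:

(robotspy) $ python -m pytest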

Other dependencies are intended for deployment to the Cheese Shop (PyPI). See the build file, Makefile, for the commands and parameters.

Dependency Tree

To display the dependency tree:

$ pipdeptree

or

$ make tree

To display the reverse dependency tree of a particular package, idna in the example below:

$ pipdeptree --reverse --packages idna

Attributions

Although robotspy has no dependencies beyond the Python standard library, a few tools are used for testing, validating, packaging, and deploying this library.

The list of these tools, with their respective versions, licenses, and websites, is available in ATTRIBUTIONS.

Release History

  • 0.4.0:
    • Fixed issue with robots text pointed by relative paths
    • Integration of MyPy, Black, and PyLint as dependencies to ease cross-platform development
    • Limited make.bat build file for Windows
    • Git ignore vscode files, tmp directory, multiple virtual env (.venv*)
    • Fixed case insensitive issues on Windows
    • Tests successful on Windows
    • Added an ATTRIBUTIONS file and a build task to generate it
    • Upgraded pyparsing and certifi
  • 0.3.3:
    • Upgraded tqdm and cryptography packages
  • 0.3.2:
    • Upgraded bleach, tqdm, and setuptools packages
  • 0.3.1:
    • Updated idna and wcwidth packages
    • Added pipdeptree package to provide visibility on dependencies
    • Fixed mypy errors
    • Explicitly ignored pylint errors related to commonly used names like f, m, or T
  • 0.3.0: Updated bleach package to address CVE-2020-6802
  • 0.2.0: Updated the documentation
  • 0.1.0: Initial release

License

MIT License

Download files


Source Distribution

robotspy-0.4.0.tar.gz (13.5 kB)

Built Distribution

robotspy-0.4.0-py3-none-any.whl (12.4 kB)
