Robots Exclusion Protocol File Parser

Robots Exclusion Standard Parser for Python

The robots Python module implements a parser for robots.txt files. The recommended class to use is robots.RobotsParser. In addition, a thin facade, robots.RobotFileParser, can be used as a substitute for urllib.robotparser.RobotFileParser, available in the Python standard library. The facade robots.RobotFileParser exposes an API that is mostly compatible with urllib.robotparser.RobotFileParser.
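
Since the facade mirrors the standard library parser, switching should mostly be a matter of changing the import. Here is a minimal sketch from the Python shell, assuming the urllib.robotparser-style set_url(), read(), and can_fetch() methods are among the compatible parts of the API:

>>> import robots
>>> rp = robots.RobotFileParser()  # drop-in for urllib.robotparser.RobotFileParser
>>> rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> rp.read()
>>> rp.can_fetch('Nutch', '/brian/')
True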

The main reasons for this rewrite are the following:

  1. It was initially intended as an experiment in parsing robots.txt for a link checker project (not implemented).
  2. It attempts to follow the latest internet draft of the Robots Exclusion Protocol.
  3. It does not try to comply with commonly accepted directives that are absent from the current specs, such as request-rate and crawl-delay, but it does currently support sitemaps.
  4. It passes the same tests as the Google Robots.txt Parser, except for some custom behaviors specific to Google Robots.

Installation

Note: Python 3.8.x is required

You preferably want to install the robots package in a Python virtual environment, created in a new directory, as follows:

$ mkdir project && cd project
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv) $ python3 -m pip install robotspy

Usage

The robots package can be imported as a module and also exposes a command-line interface, invokable with python -m robots.

Execute the Package

After installing robotspy, you can validate the installation by running the following command:

$ python -m robots --help
usage: robots (<robots_path>|<robots_url>) <user_agent> <URI>

Shows whether the given user agent and URI combination are allowed or
disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  uri            Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

As a concrete example, to check against http://www.pythontest.net/elsewhere/robots.txt whether the user agent Nutch can fetch the path /brian/, run the following:

$ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED
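
The first positional argument can also be a local file path. For example, with a hypothetical robots.txt in the current directory that disallows /admin/ for all user agents, the check would look like this (the file content and output below are illustrative):

$ cat robots.txt
User-agent: *
Disallow: /admin/
$ python -m robots robots.txt Nutch /admin/
user-agent 'Nutch' with URI '/admin/': DISALLOWED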

Use the Module in a Project

Here is an example with the same data as above, using the robots package from the Python shell:

>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f"Can {useragent} fetch {path}? {result}")
Can Nutch fetch /brian/? True
>>>
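
Building on the same two calls, from_uri and can_fetch, here is a hypothetical sketch of how the parser could drive a simple link checker; the user agent and the list of candidate paths are made up for illustration:

import robots

# Fetch and parse the robots.txt file once, then reuse the parser for many paths
parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
useragent = 'Nutch'

# Hypothetical candidate paths collected by a crawler
for path in ['/brian/', '/webstats/', '/secret/']:
    verdict = 'fetch' if parser.can_fetch(useragent, path) else 'skip'
    print(f'{verdict}: {path}')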

Development

The main development dependency is pytest for executing the tests. It is automatically installed if you perform the following steps:

$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install -r requirements.txt
(robotspy) $ python -m pip install -e .
(robotspy) $ make test
(robotspy) $ deactivate
$
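
If make is not available on your system, invoking pytest directly from the activated virtual environment should be equivalent, assuming the Makefile's test target is a thin wrapper around pytest:

(robotspy) $ python -m pytest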

Other dependencies are intended for deployment to the Cheese Shop (PyPI):

  • wheel
  • twine

The Makefile also invokes additional development tools. At this stage of development, these tools are expected to be installed globally.

Release History

  • 0.3.0: TBD
  • 0.2.0: Updated the documentation
  • 0.1.0: Initial release

License

MIT License

