
Robots Exclusion Protocol file parser

Project description

Robots Module

The robots module can be used as a substitute for urllib.robotparser available in the Python standard library. The API is the same to allow for backward compatibility.

The main reasons for this rewrite are the following:

  1. This was initially intended to experiment with parsing robots.txt for a link checker project.
  2. The implementation attempts to follow the latest internet draft of the Robots Exclusion Protocol.
  3. It tries to be compliant with rules that are not in the specs but are commonly accepted and handled by urllib.robotparser, for example Sitemap, Request-rate, and Crawl-delay (see the sketch after this list).
  4. It also includes the same tests as the Google Robots.txt Parser, except for some behavior specific to Google.
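
A rough sketch of item 3, assuming robotspy mirrors the urllib.robotparser accessors for these directives (crawl_delay(), request_rate(), and site_maps(), the last one available since Python 3.8); the URL below is only illustrative:

import robots

parser = robots.RobotFileParser('https://example.com/robots.txt')  # hypothetical URL
parser.read()

delay = parser.crawl_delay('Nutch')       # Crawl-delay in seconds, or None if absent
rate = parser.request_rate('Nutch')       # Request-rate as (requests, seconds), or None
sitemaps = parser.site_maps()             # list of Sitemap URLs, or None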

Installation

Note: Python 3.8.x is required

Preferably, install the robots package in a Python virtual environment, in a newly created project directory, as follows:

$ mkdir project && cd project
$ python3 -m pip install robotspy 
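
The virtual environment step itself is not shown above; mirroring the Development section below, the full sequence could look like this (the prompt name is only an example):

$ mkdir project && cd project
$ python3 -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install robotspy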

Usage

The robots package can be imported as a module and can also be invoked as a command with python -m.

Execute the Package

After installing robotspy, you can validate the installation by running the following command:

$ python -m robots --help
usage: robots (<robots_path>|<robots_url>) <user_agent> <URI>

Shows whether the given user agent and URI combination is allowed or
disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  uri            Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

For example, to check against http://www.pythontest.net/elsewhere/robots.txt whether the user agent Nutch can fetch the path /brian/, run the following:

$ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED
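
The first argument can also be a local file path. As a sketch, assuming a local robots.txt containing the two lines 'User-agent: *' and 'Disallow: /private/', a disallowed result would be reported along these lines:

$ python -m robots ./robots.txt '*' /private/
user-agent '*' with URI '/private/': DISALLOWED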

Use the Module in a Project

Here is an example with the same data as above, using the robots package from the Python shell:

>>> import robots
>>> parser = robots.RobotFileParser('http://www.pythontest.net/elsewhere/robots.txt')
>>> parser.read()
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f"Can {useragent} fetch {path}? {result}")
Can Nutch fetch /brian/? True
>>>
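
Since the API mirrors urllib.robotparser, the parser can presumably also be fed robots.txt content directly through parse(). A minimal sketch with inline rules follows; 'ExampleBot' is just an arbitrary user agent name, and the results shown are what the Robots Exclusion Protocol semantics imply, assuming the compatible parse() method:

>>> parser = robots.RobotFileParser()
>>> parser.parse(['User-agent: *', 'Disallow: /private/', 'Allow: /'])
>>> parser.can_fetch('ExampleBot', '/private/data')
False
>>> parser.can_fetch('ExampleBot', '/public/page')
True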

Development

The main development dependency is pytest, used to execute the tests. It is installed automatically if you perform the following steps:

$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install -r requirements.txt
(robotspy) $ python -m pip install -e .
(robotspy) $ make test
(robotspy) $ deactivate
$

Other dependencies are intended for deploying releases to the Cheese Shop (PyPI); a typical invocation is sketched after this list:

  • wheel
  • twine
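
A typical release sequence with those tools, sketched here under the assumption of a standard setup.py-based build (the project's actual Makefile targets may differ):

(robotspy) $ python -m pip install wheel twine
(robotspy) $ python setup.py sdist bdist_wheel
(robotspy) $ twine check dist/*
(robotspy) $ twine upload dist/*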

Release History

  • 0.1.0: Initial release

License

MIT License

