Robots Exclusion Protocol File Parser
Robots Exclusion Standard Parser for Python
The robots Python module implements a parser for robots.txt files. The recommended class to use is robots.RobotsParser. In addition, a thin facade, robots.RobotFileParser, exists as a substitute for urllib.robotparser.RobotFileParser, available in the Python standard library. The facade robots.RobotFileParser exposes an API that is mostly compatible with urllib.robotparser.RobotFileParser.
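As an illustration of this compatibility, here is a minimal sketch of using the facade as a drop-in replacement; it assumes the core set_url, read, and can_fetch methods behave like their standard-library counterparts:

>>> import robots
>>> parser = robots.RobotFileParser()
>>> parser.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> parser.read()
>>> parser.can_fetch('Nutch', '/brian/')
True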
The main reasons for this rewrite are the following:
- It was initially intended to experiment with parsing robots.txt for a link checker project (not implemented).
- It attempts to follow the latest internet draft of the Robots Exclusion Protocol.
- It does not try to be compliant with commonly accepted directives that are not in the current specs, such as request-rate and crawl-delay, but it currently supports sitemaps.
- It satisfies the same tests as the Google Robots.txt Parser, except for some custom behaviors specific to Google Robots.
Installation
Note: Python 3.8.x is required
Preferably, install the robots package after creating a Python virtual environment in a newly created directory, as follows:
$ mkdir project && cd project
$ python3 -m pip install robotspy
Usage
The robots package can be imported as a module and also exposes an executable invokable with python -m.
Execute the Package
After installing robotspy, you can validate the installation by running the following command:
$ python -m robots --help
usage: robots (<robots_path>|<robots_url>) <user_agent> <URI>
Shows whether the given user agent and URI combination are allowed or
disallowed by the given robots.txt file.
positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  uri            Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
For example, to check against http://www.pythontest.net/elsewhere/robots.txt whether the user agent Nutch can fetch the path /brian/, run the following:
$ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED
Use the Module in a Project
Here is an example with the same data as above, using the robots package from the Python shell:
>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f"Can {useragent} fetch {path}? {result}")
Can Nutch fetch /brian/? True
>>>
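The same check can also be wrapped in a small script. The file name and helper function below are purely illustrative and rely only on the RobotsParser API shown above:

# check_robots.py: illustrative helper built on the RobotsParser API shown above
import robots

def can_fetch(robots_url, useragent, path):
    """Return True if useragent may fetch path according to the robots.txt at robots_url."""
    parser = robots.RobotsParser.from_uri(robots_url)
    return parser.can_fetch(useragent, path)

if __name__ == '__main__':
    print(can_fetch('http://www.pythontest.net/elsewhere/robots.txt', 'Nutch', '/brian/'))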
Development
The main development dependency is pytest for executing the tests. It is automatically installed if you perform the following steps:
$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install -r requirements.txt
(robotspy) $ python -m pip install -e .
(robotspy) $ make test
(robotspy) $ deactivate
$
Other dependencies are intended for deployment to the Cheese Shop (PyPI):
- wheel
- twine
The Makefile also invokes additional development tools. At this stage of development (version 0.3.0), these tools are expected to be installed globally.
Dependency Tree
To display the dependency tree:
$ pipdeptree
or
$ make tree
To display the reverse dependency tree of a particular package, idna in the example below:
$ pipdeptree --reverse --packages idna
Release History
- 0.3.0: Updated bleach package to address CVE-2020-6802
- 0.2.0: Updated the documentation
- 0.1.0: Initial release
License
MIT License