Robots Exclusion Protocol File Parser

These details have not been verified by PyPI

Project links

Homepage

Project description

Robots Exclusion Standard Parser for Python

The robotspy Python module implements a parser for robots.txt files. The recommended class to use is robots.RobotsParser.

A thin facade robots.RobotFileParser can also be used as a substitute for urllib.robotparser.RobotFileParser, available in the Python standard library. The class robots.RobotFileParser exposes an API that is mostly compatible with urllib.robotparser.RobotFileParser.

The main reasons for this rewrite are the following:

It was initially intended to experiment with parsing robots.txt files for a link checker project (not implemented yet).
It (mostly) follows the specs from the RFC 9309 - Robots Exclusion Protocol.
It does not try to be compliant with commonly accepted directives that are not in the current specs such as request-rate and crawl-delay, but it currently supports sitemaps.
It satisfies the same tests as the Google Robots.txt Parser, except for some custom behaviors specific to Google Robots.

To use the robots command line tool (CLI) in a Docker container, read the following section Docker Image.

To install robotspy globally as a tool on your system with pipx skip to the Global Installation section.

If you are interested in using robotspy in a local Python environment or as a library, skip to section Module Installation.

Docker Image

The Robotspy CLI, robots, is available as a Docker automated built image at https://hub.docker.com/r/andreburgaud/robotspy.

If you already have Docker installed on your machine, first pull the image from Docker Hub:

$ docker pull andreburgaud/robotspy

Then, you can exercise the tool against the following remote Python robots.txt test file located at http://www.pythontest.net/elsewhere/robots.txt:

# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/

The following examples demonstrate how to use the robots command line with the Docker container:

$ # Example 1: User agent "Johnny" is allowed to access path "/"
$ docker run --rm andreburgaud/robotspy http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with path '/': ALLOWED

$ # Example 2:  User agent "Nutch" is not allowed to access path "/brian"
$ docker run --rm andreburgaud/robotspy http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with path '/brian': DISALLOWED

$ # Example 3: User agent "Johnny" is not allowed to access path "/webstats/"
docker run --rm andreburgaud/robotspy http://www.pythontest.net/elsewhere/robots.txt Johnny /webstats/
user-agent 'Johnny' with path '/webstats/': DISALLOWED

The arguments are the following:

Location of the robots.txt file (http://www.pythontest.net/elsewhere/robots.txt)
User agent name (Johnny)
Path or URL (/)

Without any argument, robots displays the help:

docker run --rm andreburgaud/robotspy
usage: robots <robotstxt> <useragent> <path>

Shows whether the given user agent and path combination are allowed or disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  path           Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

To use the CLI robots as a global tools, continue to the following section. If you want to use robotspy as a Python module, skip to Module Installation.

Global Installation with pipx

If you only want to use the command line tool robots, you may want to use pipx to install it as a global tool on your system.

To install robotspy using pipx execute the following command:

$  pipx install robotspy

When robotspy is installed globally on your system, you can invoke it from any folder locations. For example, you can execute:

$  robots --version
robots 0.8.0

You can see more detailed usages in section Usage.

Module Installation

Note: Python 3.8.x or 3.9.x required

You preferably want to install the robotspy package after creating a Python virtual environment, in a newly created directory, as follows:

$ mkdir project && cd project
$ python -m venv .venv
$ . .venv/bin/activate
(.venv) $ python -m pip install --upgrade pip
(.venv) $ python -m pip install --upgrade setuptools
(.venv) $ python -m pip install robotspy
(.venv) $ python -m robots --help
...

On Windows:

C:/> mkdir project && cd project
C:/> python -m venv .venv
C:/> .venv\scripts\activate
(.venv) c:\> python -m pip install --upgrade pip
(.venv) c:\> python -m pip install --upgrade setuptools
(.venv) c:\> python -m pip install robotspy
(.venv) c:\> python -m robots --help
...

Usage

The robotspy package can be imported as a module and also exposes an executable, robots, invocable with python -m. If installed globally with pipx, the command robots can be invoked from any folders. The usage examples in the following section use the command robots, but you can also substitute it with python -m robots in a virtual environment.

Execute the Tool

After installing robotspy, you can validate the installation by running the following command:

$ robots --help
usage: robots <robotstxt> <useragent> <path>

Shows whether the given user agent and path combination are allowed or disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  path           Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

Examples

The content of http://www.pythontest.net/elsewhere/robots.txt is the following:

# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/

To check if the user agent Nutch can fetch the path /brian/ you can execute:

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with path '/brian/': ALLOWED

Or, you can also pass the full URL, http://www.pythontest.net/brian/:

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with url 'http://www.pythontest.net/brian/': ALLOWED

Can user agent Nutch fetch the path /brian?

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with path '/brian': DISALLOWED

Or, /?

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /
user-agent 'Nutch' with path '/': DISALLOWED

How about user agent Johnny?

$ robots http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with path '/': ALLOWED

Use the Module in a Project

If you have a virtual environment with the robotspy package installed, you can use the robots module from the Python shell:

(.venv) $ python
>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f'Can {useragent} fetch {path}? {result}')
Can Nutch fetch /brian/? True
>>>

Bug in the Python standard library

There is a bug in urllib.robotparser from the Python standard library that causes the following test to differ from the example above with robotspy.

The example with urllib.robotparser is the following:

$ python
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> rp.read()
>>> rp.can_fetch('Nutch', '/brian/')
False

Notice that the result is False whereas robotspy returns True.

Bug bpo-39187 was open to raise awareness on this issue and PR https://github.com/python/cpython/pull/17794 was submitted as a possible fix. robotspy does not exhibit this problem.

Development

The main development dependency is pytest for executing the tests. It is automatically installed if you perform the following steps:

$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robots
$ . .venv/bin/activate
(robots) $ python -m pip install -r requirements.txt
(robots) $ python -m pip install -e .
(robots) $ make test
(robots) $ deactivate
$

On Windows:

C:/> git clone https://github.com/andreburgaud/robotspy
C:/> cd robotspy
C:/> python -m venv .venv --prompt robotspy
C:/> .venv\scripts\activate
(robots) c:\> python -m pip install -r requirements.txt
(robots) c:\> python -m pip install -e .
(robots) c:\> make test
(robots) c:\> deactivate

Global Tools

The following tools were used during the development of robotspy:

See the build file, Makefile or make.bat on Windows, for the commands and parameters.

Release History

0.10.0:
- Fixed bugs in the URL path pattern matching ('?' is now handled correctly as the character '?' instead of matching any one character)
- Added tests 541230 and 541230 from Google project https://github.com/google/robotstxt-spec-test
- Contribution from https://github.com/kox-solid
0.9.0:
- Updated the parser to behave like the Google robots parser. It now handles the product token in the user-agent line up to the last correct character instead of discarding it. See issue #209 for more details.
- Contribution from https://github.com/kox-solid
0.8.0:
- Addressed an issue raised when a robots.txt file is not UTF-8 encoded
- Added a user agent to fetch the robots.txt, as some websites, such as pages hosted on Cloudflare, may return a 403 error
- Updated the documentation to link to RFC 9309, Robots Exclusion Protocol (REP)
- Added a GitHub action job to execute the tests against Python versions 3.8 to 3.12
- Contribution from https://github.com/tumma72
0.7.0:
- Fixed bug with the argument path when using the CLI
- Print 'url' when the argument is a URL, 'path' otherwise
0.6.0:
- Simplified dependencies by keeping only pytest in requirements.txt
0.5.0:
- Updated all libraries. Tested with Python 3.9.
0.4.0:
- Fixed issue with robots text pointed by relative paths
- Integration of Mypy, Black and Pylint as depencencies to ease cross-platform development
- Limited make.bat build file for Windows
- Git ignore vscode files, tmp directory, multiple virtual env (.venv*)
- Fixed case insensitive issues on Windows
- Tests successful on Windows
- Added an ATRIBUTIONS files and build task to generate it
- Upgraded pyparsing and certifi
0.3.3:
- Upgraded tqdm, and cryptography packages
- 0.3.2:
- Upgraded bleach, tqdm, and setuptools packages
0.3.1:
- Updated idna and wcwidth packages
- Added pipdeptree package to provide visibility on dependencies
- Fixed mypy errors
- Explicitly ignored pylint errors related to commonly used names like f, m, or T
0.3.0: Updated bleach package to address CVE-2020-6802
0.2.0: Updated the documentation
0.1.0: Initial release

License

MIT License

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.12.0

Oct 26, 2024

0.11.0

Oct 24, 2024

0.10.0

Aug 27, 2024

0.9.0

Aug 25, 2024

0.8.0

Jul 5, 2024

0.7.0

Mar 19, 2021

0.6.0

Mar 19, 2021

0.5.1

Dec 28, 2020

0.4.0

Apr 7, 2020

0.3.3

Apr 6, 2020

0.3.2

Mar 31, 2020

0.3.1

Mar 24, 2020

0.3.0

Mar 18, 2020

0.2.0

Feb 1, 2020

0.1.0

Jan 29, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

robotspy-0.12.0.tar.gz (32.6 kB view details)

Uploaded Oct 26, 2024 Source

Built Distribution

robotspy-0.12.0-py3-none-any.whl (14.0 kB view details)

Uploaded Oct 26, 2024 Python 3

File details

Details for the file robotspy-0.12.0.tar.gz.

File metadata

Download URL: robotspy-0.12.0.tar.gz
Upload date: Oct 26, 2024
Size: 32.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for robotspy-0.12.0.tar.gz
Algorithm	Hash digest
SHA256	`3537184bf64fa8e9b82ae444154154cdc3757a2540d1bb91c11d74838b60edf4`
MD5	`82a8366eba24ddb888394ff2f7eba25e`
BLAKE2b-256	`e2ef954dfa2ece7189bc0016254acd13ebe183f9201f0232ca422297f419f16f`

See more details on using hashes here.

File details

Details for the file robotspy-0.12.0-py3-none-any.whl.

File metadata

Download URL: robotspy-0.12.0-py3-none-any.whl
Upload date: Oct 26, 2024
Size: 14.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for robotspy-0.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4a2bd6fe753af31c5982d97c9763a0cde45ccf0403d4616c0c67e426b7eaf041`
MD5	`2788651a40d5240ceb5848841535ffe1`
BLAKE2b-256	`4e0f4ed448a208f9f322889e4d1e9544bc234004fb5995bf2dce3cdd65a1acbf`

See more details on using hashes here.

robotspy 0.12.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Robots Exclusion Standard Parser for Python

Docker Image

Global Installation with pipx

Module Installation

Usage

Execute the Tool

Examples

Use the Module in a Project

Bug in the Python standard library

Development

Global Tools

Release History

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes