
A pure Python port of Google's robots.txt parser and matcher


gpyrobotstxt


gpyrobotstxt is a native Python port of Google's robots.txt parser and matcher C++ library.

  • Preserves all behaviour of the original library
  • Ports 100% of the original test suite's functionality
  • Minor language-specific cleanups

As with Google's original library, we include a small command-line tool for webmasters that allows testing a single URL and user-agent against a robots.txt file. Ours is called robots_main.py, and its inputs and outputs are compatible with the original tool.

About

Quoting the README from Google's robots.txt parser and matcher repo:

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files. The library is released open-source to help developers build tools that better reflect Google's robots.txt parsing and matching.

The package gpyrobotstxt aims to be a faithful conversion, from C++ to Python, of Google's robots.txt parser and matcher.

Pre-requisites

  • Python version 3.9 or later. Older Python releases are likely NOT OK. Python versions above 3.11 should work fine, but only Python 3.9, 3.10 and 3.11 have been tested so far.

Installation

pip install gpyrobotstxt

Example Code (as a library)

from gpyrobotstxt.robots_cc import RobotsMatcher

if __name__ == "__main__":
    # Contents of robots.txt file.
    robotsTxt_content = b"""
        # robots.txt with restricted area

        User-agent: *
        Disallow: /members/*

        Sitemap: http://example.net/sitemap.xml
    """
    # Target URI.
    uri = "http://example.net/members/index.html"

    matcher = RobotsMatcher()
    allowed = matcher.allowed_by_robots(robotsTxt_content, "FooBot/1.0", uri)
    # Report the result of the check.
    print(f"{uri}: {'ALLOWED' if allowed else 'DISALLOWED'}")
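
As a follow-up, here is a hedged sketch of running the same check against a live robots.txt fetched over HTTP; the target site, user-agent string, and URIs below are placeholders for illustration, not part of the library:

import urllib.request

from gpyrobotstxt.robots_cc import RobotsMatcher

if __name__ == "__main__":
    # Hypothetical target site; replace with the site you want to check.
    robots_url = "https://www.example.com/robots.txt"
    with urllib.request.urlopen(robots_url) as resp:
        robots_body = resp.read()  # allowed_by_robots expects the raw bytes

    matcher = RobotsMatcher()
    for uri in (
        "https://www.example.com/",
        "https://www.example.com/private/page.html",
    ):
        allowed = matcher.allowed_by_robots(robots_body, "FooBot/1.0", uri)
        print(f"{uri}: {'ALLOWED' if allowed else 'DISALLOWED'}")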

Testing

To run the full test suite, execute python -m unittest discover -s test -p test_*.py. For a specific test, run python -m unittest discover -s test -p [TEST_NAME].py, for example python -m unittest discover -s test -p test_google_only_system.py.
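
For reference, a minimal test written against the documented API might look like the sketch below; the file name and test case are illustrative assumptions, not files from the repository's suite.

# test_illustrative.py -- illustrative sketch only, not part of the shipped suite
import unittest

from gpyrobotstxt.robots_cc import RobotsMatcher


class TestMembersAreaRules(unittest.TestCase):
    def test_members_area_is_restricted(self):
        robots_txt = b"User-agent: *\nDisallow: /members/\n"
        matcher = RobotsMatcher()
        # The restricted area should be disallowed for a generic user-agent...
        self.assertFalse(
            matcher.allowed_by_robots(
                robots_txt, "FooBot/1.0", "http://example.net/members/index.html"
            )
        )
        # ...while the rest of the site remains accessible.
        self.assertTrue(
            matcher.allowed_by_robots(
                robots_txt, "FooBot/1.0", "http://example.net/index.html"
            )
        )


if __name__ == "__main__":
    unittest.main()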

Use the tool

$ python robots_main.py /local/path/to/robots.txt TestBot https://example.com/url
user-agent 'TestBot' with URI 'https://example.com/url': ALLOWED

Additionally, one can pass multiple user-agent names to the tool, using comma-separated values, e.g.

$ python robots_main.py /local/path/to/robots.txt Googlebot,Googlebot-image https://example.com/url
user-agent 'Googlebot,Googlebot-image' with URI 'https://example.com/url': ALLOWED

Notes

The library requires that the URI passed to the AgentAllowed and AgentsAllowed functions, or to the URI parameter of the standalone tool, follow the encoding/escaping format specified by RFC 3986, because the library itself does not perform URI normalisation.
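
For instance, a URI containing spaces or non-ASCII characters should be percent-encoded before it is handed to the matcher. A minimal sketch using only the standard library (the URI and robots.txt rules below are assumptions for illustration):

from urllib.parse import quote, urlsplit, urlunsplit

from gpyrobotstxt.robots_cc import RobotsMatcher

if __name__ == "__main__":
    raw_uri = "http://example.net/members/café page.html"

    # Percent-encode the path and query per RFC 3986; the matcher does not do this itself.
    parts = urlsplit(raw_uri)
    escaped_uri = urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        parts.fragment,
    ))

    matcher = RobotsMatcher()
    allowed = matcher.allowed_by_robots(
        b"User-agent: *\nDisallow: /members/\n", "FooBot/1.0", escaped_uri
    )
    print(f"{escaped_uri}: {'ALLOWED' if allowed else 'DISALLOWED'}")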

License

Like the original library, gpyrobotstxt is licensed under the terms of the GNU General Public License v3.0 (GNU GPL V3).

See LICENSE for more information.
