A Python package to check URL paths against the robots directives in a robots.txt file.

Project description

Robots Text Processor

A Python package for processing and validating robots.txt files according to the Robots Exclusion Protocol (REP) RFC (RFC 9309).

Features

  • Parse and process robots.txt files
  • Extract user-agent rules, allow/disallow directives, and sitemaps
  • Test URLs against robots.txt rules
  • Validate robots.txt files for compliance with REP RFC
  • Generate hashes for content tracking
  • Comprehensive error and warning reporting

Installation

pip install robotstxt

Usage

Basic Usage

from robotstxt import robots_file

# Create a RobotsFile instance
robots = robots_file("""
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
""")

# Test a URL
result = robots.test_url("https://example.com/private/page", "*")
print(result)  # {'disallowed': True, 'matching_rule': '/private/'}

# Get sitemaps
for sitemap in robots.sitemaps:
    print(sitemap.url)
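
For a more end-to-end picture, the sketch below fetches a live robots.txt with the standard library and runs it through the same robots_file/test_url interface shown above. The URL and the "googlebot" user-agent token are placeholders, and the result keys are taken from the example output above.

from urllib.request import urlopen

from robotstxt import robots_file

# Fetch a live robots.txt (example.com is a placeholder).
with urlopen("https://example.com/robots.txt") as response:
    content = response.read().decode("utf-8")

robots = robots_file(content)

# Ask whether a specific crawler may fetch a given path.
result = robots.test_url("https://example.com/private/page", "googlebot")
if result["disallowed"]:
    print(f"Blocked by rule: {result['matching_rule']}")
else:
    print("Allowed")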

Validation

The package includes comprehensive validation of robots.txt files:

from robotstxt import robots_file

# Create a RobotsFile instance
robots = robots_file("""
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
""")

# Check for validation errors
if robots.has_errors():
    for error in robots.get_validation_errors():
        print(f"Error: {error.message} (Line {error.line_number})")

# Check for validation warnings
if robots.has_warnings():
    for warning in robots.get_validation_warnings():
        print(f"Warning: {warning.message} (Line {warning.line_number})")

The validation system checks for:

  • File size exceeding 500KB
  • UTF-8 Byte Order Mark (BOM) presence
  • Invalid characters in lines
  • Proper directive formatting (user-agent, allow, disallow, sitemap)
  • Rule block structure (user-agent directives before allow/disallow rules)
  • Valid sitemap URLs
  • Duplicate rules for the same user-agent
  • And more...

Content Tracking

The package provides hashing functionality for tracking changes:

from robotstxt import robots_file

robots = robots_file(content)

# Get content hash
print(robots.hash_raw)  # SHA-256 hash of raw content

# Get rules hash
print(robots.hash_material)  # SHA-256 hash of processed rules

# Get sitemaps hash
print(robots.hash_sitemaps)  # SHA-256 hash of sitemap URLs
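
As one way these hashes might be used for change tracking, the snippet below compares two snapshots of the same robots.txt. old_content and new_content are placeholders for fetched file contents; only the attributes documented above are used.

from robotstxt import robots_file

# old_content / new_content stand in for two fetched snapshots of the same file.
previous = robots_file(old_content)
current = robots_file(new_content)

if previous.hash_raw != current.hash_raw:
    print("Raw content changed (possibly only comments or whitespace)")
if previous.hash_material != current.hash_material:
    print("Crawl rules changed materially")
if previous.hash_sitemaps != current.hash_sitemaps:
    print("Sitemap URLs changed")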

Validation Rules

The package validates robots.txt files according to the following rules:

  1. File Size
    • Warning if file exceeds 500KB
    • Error if file exceeds 512KB
  2. Character Encoding
    • Error if file contains UTF-8 BOM
    • Error if file contains invalid characters
  3. Directive Format
    • Error for invalid directive format
    • Error for missing user-agent before allow/disallow rules
    • Warning for common typos in directives
  4. Rule Structure
    • Error for duplicate rule blocks
    • Warning for conflicting allow/disallow rules
    • Warning for overly broad rules
  5. Sitemaps
    • Error for invalid sitemap URLs
    • Warning for multiple sitemaps
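
To make the size and encoding thresholds concrete, here is a minimal standalone sketch that mirrors rules 1 and 2 above. It inspects raw bytes itself rather than calling the package's validator.

def precheck_robots(raw_bytes):
    """Flag size and encoding problems before parsing."""
    issues = []
    size_kb = len(raw_bytes) / 1024
    if size_kb > 512:
        issues.append(f"error: {size_kb:.0f}KB exceeds the 512KB limit")
    elif size_kb > 500:
        issues.append(f"warning: {size_kb:.0f}KB exceeds 500KB")
    if raw_bytes.startswith(b"\xef\xbb\xbf"):
        issues.append("error: file starts with a UTF-8 BOM")
    return issues

print(precheck_robots(b"\xef\xbb\xbfUser-agent: *\nDisallow: /private/\n"))
# ['error: file starts with a UTF-8 BOM']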

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

compare_robots_files(robots1, robots2)

Compares two robots.txt files and generates a structured diff showing:

  • Material differences between the files
  • Per-token changes showing added and removed rules
  • Sitemap changes

The function normalizes the rules before comparison to ensure accurate diffing regardless of:

  • Rule ordering within token groups
  • Whitespace differences
  • Case sensitivity in tokens
  • Duplicate rules

Returns a dictionary containing:

  • materially_different: Boolean indicating if the files have different rules
  • token_diffs: Dictionary of differences per token
  • sitemap_changes: Dictionary of sitemap differences

Example usage

robots1 = RobotsFile(old_content)
robots2 = RobotsFile(new_content)

Either way works:

diff = compare_robots_files(robots1, robots2)

or

diff = robots1.compare_with(robots2)

Example output structure:

{
    "materially_different": True,
    "token_diffs": {
        "googlebot": {
            "added": ["Allow: /new-path/", "Disallow: /private/"],
            "removed": ["Disallow: /old-path/"]
        },
        "bingbot": {
            "added": ["Disallow: /api/"],
            "removed": []
        }
    },
    "sitemap_changes": {
        "added": ["https://example.com/new-sitemap.xml"],
        "removed": ["https://example.com/old-sitemap.xml"]
    }
}
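
A small sketch of consuming that result, using only the keys shown in the structure above (diff is the dictionary returned by either call):

def summarize_diff(diff):
    # Keys follow the output structure shown above.
    if diff["materially_different"]:
        for token, changes in diff["token_diffs"].items():
            for rule in changes["added"]:
                print(f"{token}: + {rule}")
            for rule in changes["removed"]:
                print(f"{token}: - {rule}")
    for url in diff["sitemap_changes"]["added"]:
        print(f"sitemap added: {url}")
    for url in diff["sitemap_changes"]["removed"]:
        print(f"sitemap removed: {url}")

summarize_diff(diff)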

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

robotstxt-1.1.2.tar.gz (16.4 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

robotstxt-1.1.2-py3-none-any.whl (18.0 kB)


File details

Details for the file robotstxt-1.1.2.tar.gz.

File metadata

  • Download URL: robotstxt-1.1.2.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for robotstxt-1.1.2.tar.gz
Algorithm Hash digest
SHA256 ab406b1f7d4d348403a0df1542425637c954b623102e67805391cb5de5adad54
MD5 93ac0446a5a352a8a3b3b9032354f544
BLAKE2b-256 981c17ea5d9b8548826ed790aeebee65e78d2e4756e93c671c16570cf38f2312

See more details on using hashes here.

File details

Details for the file robotstxt-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: robotstxt-1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 18.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for robotstxt-1.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 38fcb6c6dea6566ceffeb1e44c88f1015a96585084d232fc9de151dbd25da1c6
MD5 bd13de33de95404441e46089f5e4727d
BLAKE2b-256 120eadc571886b6a4c41e1382972ee853895a14dd67a606867e24c12d482e40b

See more details on using hashes here.
