A Python package to check URL paths against the robots directives in a robots.txt file.
Project description
Robots Text Processor
A Python package for processing and validating robots.txt files according to the Robots Exclusion Protocol (REP) RFC.
Features
- Parse and process robots.txt files
- Extract user-agent rules, allow/disallow directives, and sitemaps
- Test URLs against robots.txt rules
- Validate robots.txt files for compliance with the REP RFC
- Generate hashes for content tracking
- Comprehensive error and warning reporting
Installation
pip install robotstxt-package
Usage
Basic Usage
from robotstxt import robots_file
# Create a RobotsFile instance
robots = robots_file("""
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
""")
# Test a URL
result = robots.test_url("https://example.com/private/page", "*")
print(result) # {'disallowed': True, 'matching_rule': '/private/'}
# Get sitemaps
for sitemap in robots.sitemaps:
    print(sitemap.url)
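The dictionary returned by test_url can drive a simple crawl decision. The sketch below uses only the keys shown above ('disallowed' and 'matching_rule'); "mybot" is an arbitrary example user-agent token:

url = "https://example.com/private/page"
result = robots.test_url(url, "mybot")  # "mybot" is an arbitrary example token
if result["disallowed"]:
    print(f"Skipping {url} (blocked by rule {result['matching_rule']})")
else:
    print(f"Fetching {url}")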
Validation
The package includes comprehensive validation of robots.txt files:
from robotstxt import robots_file
# Create a RobotsFile instance
robots = robots_file("""
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
""")
# Check for validation errors
if robots.has_errors():
    for error in robots.get_validation_errors():
        print(f"Error: {error.message} (Line {error.line_number})")

# Check for validation warnings
if robots.has_warnings():
    for warning in robots.get_validation_warnings():
        print(f"Warning: {warning.message} (Line {warning.line_number})")
The validation system checks for:
- File size exceeding 500KB
- UTF-8 Byte Order Mark (BOM) presence
- Invalid characters in lines
- Proper directive formatting (user-agent, allow, disallow, sitemap)
- Rule block structure (user-agent directives before allow/disallow rules)
- Valid sitemap URLs
- Duplicate rules for the same user-agent
- And more...
Content Tracking
The package provides hashing functionality for tracking changes:
from robotstxt import robots_file
robots = robots_file(content)  # content is the raw robots.txt text
# Get content hash
print(robots.hash_raw) # SHA-256 hash of raw content
# Get rules hash
print(robots.hash_material) # SHA-256 hash of processed rules
# Get sitemaps hash
print(robots.hash_sitemaps) # SHA-256 hash of sitemap URLs
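A common use of these attributes is change detection between two fetches of the same file. This sketch relies only on the documented hash attributes; old_content and new_content are placeholders for a previously stored copy and a freshly fetched one:

from robotstxt import robots_file

old = robots_file(old_content)
new = robots_file(new_content)

if old.hash_raw != new.hash_raw:
    print("Raw content changed")
    if old.hash_material != new.hash_material:
        print("Crawling rules changed")
    if old.hash_sitemaps != new.hash_sitemaps:
        print("Sitemap list changed")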
Validation Rules
The package validates robots.txt files according to the following rules:
- File Size (see the sketch after this list)
  - Warning if file exceeds 500KB
  - Error if file exceeds 512KB
- Character Encoding
  - Error if file contains UTF-8 BOM
  - Error if file contains invalid characters
- Directive Format
  - Error for invalid directive format
  - Error for missing user-agent before allow/disallow rules
  - Warning for common typos in directives
- Rule Structure
  - Error for duplicate rule blocks
  - Warning for conflicting allow/disallow rules
  - Warning for overly broad rules
- Sitemaps
  - Error for invalid sitemap URLs
  - Warning for multiple sitemaps
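As an illustration of the size and encoding rules above, here is a minimal standalone sketch that inspects the raw bytes of a fetched file. It is not the package's internal implementation; the precheck name and the interpretation of KB as 1024 bytes are assumptions:

def precheck(raw: bytes) -> list[str]:
    """Return human-readable findings for the raw bytes of a robots.txt file."""
    findings = []
    if len(raw) > 512 * 1024:            # error threshold from the rules above
        findings.append("Error: file exceeds 512KB")
    elif len(raw) > 500 * 1024:          # warning threshold from the rules above
        findings.append("Warning: file exceeds 500KB")
    if raw.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte order mark
        findings.append("Error: file starts with a UTF-8 BOM")
    return findings

for finding in precheck(b"\xef\xbb\xbfUser-agent: *\nDisallow:\n"):
    print(finding)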
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Comparing robots.txt Files
compare_robots_files(robots1, robots2)
Compares two robots.txt files and generates a structured diff showing:
- Material differences between the files
- Per-token changes showing added and removed rules
- Sitemap changes
The function normalizes the rules before comparison to ensure accurate diffing regardless of:
- Rule ordering within token groups
- Whitespace differences
- Case sensitivity in tokens
- Duplicate rules
Returns a dictionary containing:
- materially_different: Boolean indicating if the files have different rules
- token_diffs: Dictionary of differences per token
- sitemap_changes: Dictionary of sitemap differences
Example usage
robots1 = RobotsFile(old_content)
robots2 = RobotsFile(new_content)
Either way works:
diff = compare_robots_files(robots1, robots2)
or
diff = robots1.compare_with(robots2)
Example output structure:
{
    "materially_different": True,
    "token_diffs": {
        "googlebot": {
            "added": ["Allow: /new-path/", "Disallow: /private/"],
            "removed": ["Disallow: /old-path/"]
        },
        "bingbot": {
            "added": ["Disallow: /api/"],
            "removed": []
        }
    },
    "sitemap_changes": {
        "added": ["https://example.com/new-sitemap.xml"],
        "removed": ["https://example.com/old-sitemap.xml"]
    }
}
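The structure above can be reported directly. This sketch walks only the documented keys of the diff returned by compare_robots_files:

if diff["materially_different"]:
    for token, changes in diff["token_diffs"].items():
        for rule in changes["added"]:
            print(f"[{token}] + {rule}")
        for rule in changes["removed"]:
            print(f"[{token}] - {rule}")
    for url in diff["sitemap_changes"]["added"]:
        print(f"[sitemap] + {url}")
    for url in diff["sitemap_changes"]["removed"]:
        print(f"[sitemap] - {url}")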
File details
Details for the file robotstxt-1.1.2.tar.gz.
File metadata
- Download URL: robotstxt-1.1.2.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab406b1f7d4d348403a0df1542425637c954b623102e67805391cb5de5adad54 |
| MD5 | 93ac0446a5a352a8a3b3b9032354f544 |
| BLAKE2b-256 | 981c17ea5d9b8548826ed790aeebee65e78d2e4756e93c671c16570cf38f2312 |
File details
Details for the file robotstxt-1.1.2-py3-none-any.whl.
File metadata
- Download URL: robotstxt-1.1.2-py3-none-any.whl
- Upload date:
- Size: 18.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38fcb6c6dea6566ceffeb1e44c88f1015a96585084d232fc9de151dbd25da1c6 |
| MD5 | bd13de33de95404441e46089f5e4727d |
| BLAKE2b-256 | 120eadc571886b6a4c41e1382972ee853895a14dd67a606867e24c12d482e40b |