A Python package to check URL paths against the robots directives in a robots.txt file.
Project description
Robots Text Processor
A Python package for processing and validating robots.txt files according to the Robots Exclusion Protocol (REP) RFC.
Features
- Parse and process robots.txt files
- Extract user-agent rules, allow/disallow directives, and sitemaps
- Test URLs against robots.txt rules
- Validate robots.txt files for compliance with the REP RFC
- Generate hashes for content tracking
- Comprehensive error and warning reporting
Installation
pip install robotstxt-package
Usage
Basic Usage
from robotstxt import robots_file
# Create a RobotsFile instance
robots = robots_file("""
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
""")
# Test a URL
result = robots.test_url("https://example.com/private/page", "*")
print(result) # {'disallowed': True, 'matching_rule': '/private/'}
# Get sitemaps
for sitemap in robots.sitemaps:
    print(sitemap.url)
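The dictionary returned by test_url can drive a simple crawl decision. The sketch below uses only the keys shown above ('disallowed' and 'matching_rule'); "mybot" is an arbitrary example user-agent token:

url = "https://example.com/private/page"
result = robots.test_url(url, "mybot")  # "mybot" is an arbitrary example token
if result["disallowed"]:
    print(f"Skipping {url} (blocked by rule {result['matching_rule']})")
else:
    print(f"Fetching {url}")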
Validation
The package includes comprehensive validation of robots.txt files:
from robotstxt import robots_file
# Create a RobotsFile instance
robots = robots_file("""
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
""")
# Check for validation errors
if robots.has_errors():
    for error in robots.get_validation_errors():
        print(f"Error: {error.message} (Line {error.line_number})")

# Check for validation warnings
if robots.has_warnings():
    for warning in robots.get_validation_warnings():
        print(f"Warning: {warning.message} (Line {warning.line_number})")
The validation system checks for:
- File size exceeding 500KB
- UTF-8 Byte Order Mark (BOM) presence
- Invalid characters in lines
- Proper directive formatting (user-agent, allow, disallow, sitemap)
- Rule block structure (user-agent directives before allow/disallow rules)
- Valid sitemap URLs
- Duplicate rules for the same user-agent
- And more...
Content Tracking
The package provides hashing functionality for tracking changes:
from robotstxt import robots_file
robots = robots_file(content)  # content is the raw robots.txt text
# Get content hash
print(robots.hash_raw) # SHA-256 hash of raw content
# Get rules hash
print(robots.hash_material) # SHA-256 hash of processed rules
# Get sitemaps hash
print(robots.hash_sitemaps) # SHA-256 hash of sitemap URLs
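A common use of these attributes is change detection between two fetches of the same file. This sketch relies only on the documented hash attributes; old_content and new_content are placeholders for a previously stored copy and a freshly fetched one:

from robotstxt import robots_file

old = robots_file(old_content)
new = robots_file(new_content)

if old.hash_raw != new.hash_raw:
    print("Raw content changed")
    if old.hash_material != new.hash_material:
        print("Crawling rules changed")
    if old.hash_sitemaps != new.hash_sitemaps:
        print("Sitemap list changed")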
Validation Rules
The package validates robots.txt files according to the following rules:
- File Size (see the sketch after this list)
  - Warning if file exceeds 500KB
  - Error if file exceeds 512KB
- Character Encoding
  - Error if file contains UTF-8 BOM
  - Error if file contains invalid characters
- Directive Format
  - Error for invalid directive format
  - Error for missing user-agent before allow/disallow rules
  - Warning for common typos in directives
- Rule Structure
  - Error for duplicate rule blocks
  - Warning for conflicting allow/disallow rules
  - Warning for overly broad rules
- Sitemaps
  - Error for invalid sitemap URLs
  - Warning for multiple sitemaps
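As an illustration of the size and encoding rules above, here is a minimal standalone sketch that inspects the raw bytes of a fetched file. It is not the package's internal implementation; the precheck name and the interpretation of KB as 1024 bytes are assumptions:

def precheck(raw: bytes) -> list[str]:
    """Return human-readable findings for the raw bytes of a robots.txt file."""
    findings = []
    if len(raw) > 512 * 1024:            # error threshold from the rules above
        findings.append("Error: file exceeds 512KB")
    elif len(raw) > 500 * 1024:          # warning threshold from the rules above
        findings.append("Warning: file exceeds 500KB")
    if raw.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte order mark
        findings.append("Error: file starts with a UTF-8 BOM")
    return findings

for finding in precheck(b"\xef\xbb\xbfUser-agent: *\nDisallow:\n"):
    print(finding)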
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Comparing robots.txt Files
compare_robots_files(robots1, robots2)
Compares two robots.txt files and generates a structured diff showing:
- Material differences between the files
- Per-token changes showing added and removed rules
- Sitemap changes
The function normalizes the rules before comparison to ensure accurate diffing regardless of:
- Rule ordering within token groups
- Whitespace differences
- Case sensitivity in tokens
- Duplicate rules
Returns a dictionary containing:
- materially_different: Boolean indicating if the files have different rules
- token_diffs: Dictionary of differences per token
- sitemap_changes: Dictionary of sitemap differences
Example usage
robots1 = RobotsFile(old_content)
robots2 = RobotsFile(new_content)
Either way works:
diff = compare_robots_files(robots1, robots2)
or
diff = robots1.compare_with(robots2)
Example output structure:
{
    "materially_different": True,
    "token_diffs": {
        "googlebot": {
            "added": ["Allow: /new-path/", "Disallow: /private/"],
            "removed": ["Disallow: /old-path/"]
        },
        "bingbot": {
            "added": ["Disallow: /api/"],
            "removed": []
        }
    },
    "sitemap_changes": {
        "added": ["https://example.com/new-sitemap.xml"],
        "removed": ["https://example.com/old-sitemap.xml"]
    }
}
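The structure above can be reported directly. This sketch walks only the documented keys of the diff returned by compare_robots_files:

if diff["materially_different"]:
    for token, changes in diff["token_diffs"].items():
        for rule in changes["added"]:
            print(f"[{token}] + {rule}")
        for rule in changes["removed"]:
            print(f"[{token}] - {rule}")
    for url in diff["sitemap_changes"]["added"]:
        print(f"[sitemap] + {url}")
    for url in diff["sitemap_changes"]["removed"]:
        print(f"[sitemap] - {url}")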
File details
Details for the file robotstxt-1.1.2.tar.gz.
File metadata
- Download URL: robotstxt-1.1.2.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab406b1f7d4d348403a0df1542425637c954b623102e67805391cb5de5adad54 |
| MD5 | 93ac0446a5a352a8a3b3b9032354f544 |
| BLAKE2b-256 | 981c17ea5d9b8548826ed790aeebee65e78d2e4756e93c671c16570cf38f2312 |
File details
Details for the file robotstxt-1.1.2-py3-none-any.whl.
File metadata
- Download URL: robotstxt-1.1.2-py3-none-any.whl
- Upload date:
- Size: 18.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38fcb6c6dea6566ceffeb1e44c88f1015a96585084d232fc9de151dbd25da1c6 |
| MD5 | bd13de33de95404441e46089f5e4727d |
| BLAKE2b-256 | 120eadc571886b6a4c41e1382972ee853895a14dd67a606867e24c12d482e40b |