Parses historical robots.txt files from Wayback Machine
Historical Robots.txt Parser
This is a small Python package that parses historical robots.txt files from the Internet Archive's Wayback Machine and writes the data to a CSV file for tracking the addition and removal of Allow and Disallow rules. Each rule is recorded with its timestamp of addition, path, user-agent and, optionally, rule type. It's a fairly narrow use case, but it may be helpful to researchers or journalists.
It also includes a parser to coerce a robots.txt file into a dictionary.
Requirements
- Python 3.7 or later
Installation
Install with Python
pip3 install historical-robots-txt-parser
Install with Git
This package was developed using Poetry, which simplifies dependency management and packaging. Using Poetry is strongly recommended.
git clone https://github.com/alexlitel/historical-robots-txt-parser
cd historical-robots-txt-parser
poetry install
There is a requirements.txt file included here, so you can also use pip3 install -r requirements.txt if you don't want to use Poetry.
Usage
The package includes two functions: parse_robots_txt and historical_scraper. historical_scraper scrapes the historical robots.txt files for a domain from the Wayback Machine and exports the rules to a CSV file. parse_robots_txt requests a robots.txt file, parses it and coerces it to a dictionary. If you clone the repo, it also contains app.py, a script that takes the domains to scrape as command-line arguments.
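
To make the idea of tracking rule additions and removals concrete, here is a minimal sketch of how timestamped snapshots could be compared and written out as CSV. It is not the package's implementation: the function name `diff_snapshots`, the snapshot structure and the `change` column are all hypothetical and chosen only for illustration.

```python
import csv
import io


def diff_snapshots(snapshots):
    """Yield (timestamp, user_agent, path, change) rows for rule changes.

    `snapshots` is a list of (timestamp, rules) pairs sorted by timestamp,
    where `rules` maps a user-agent to a set of Disallow paths.
    This structure is an assumption made for the sketch.
    """
    previous = {}
    for timestamp, rules in snapshots:
        # Look at every user-agent seen in either the old or the new snapshot.
        for agent in sorted(set(previous) | set(rules)):
            old = previous.get(agent, set())
            new = rules.get(agent, set())
            for path in sorted(new - old):
                yield (timestamp, agent, path, "added")
            for path in sorted(old - new):
                yield (timestamp, agent, path, "removed")
        previous = rules


# Hypothetical snapshots keyed by Wayback-style timestamps.
snapshots = [
    ("20190101000000", {"*": {"/admin"}}),
    ("20200101000000", {"*": {"/admin", "/private"}}),
    ("20210101000000", {"*": {"/private"}, "Googlebot": {"/search"}}),
]

# Write the change rows to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["timestamp", "user_agent", "path", "change"])
writer.writerows(diff_snapshots(snapshots))
csv_text = buffer.getvalue()
```

Comparing each snapshot only with the one before it keeps the output small: a rule appears in the CSV when it changes, not once per capture.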
historical_scraper
Usage
from historical_robots import historical_scraper
historical_scraper('website.com', 'website.csv', <optional arguments>)
Parameters
| parameter | type | required | default value | description |
|---|---|---|---|---|
| domain | string | true | | The domain to scrape records from. Should be the hostname only, without www. |
| file_path | string | true | | Path of the CSV file to export to. |
| accept_allow | boolean | false | False | Whether to allow parser to parse Allow rules and include those in dataset. Adds a new column to CSV for Rule to note Disallow or Allow rule. By default, function only checks Disallow rules. |
| skip_scrape_interval | boolean | false | False | Whether to skip the default sleep interval between each historical robots.txt request. True value may cause errors. |
| sleep_interval | number | false | 0.05 | Number of seconds to sleep in between robots.txt requests. Ignored if skip_scrape_interval is True |
| params | dictionary | false | {} | Key value pairs representing valid URL params for the Wayback CDX API |
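
As an illustration of what the params dictionary can carry, the sketch below builds a Wayback CDX query URL from a domain plus extra parameters. How historical_scraper assembles its own requests is not shown here, so the helper `build_cdx_url` and the way it merges parameters are assumptions; `from`, `to` and `collapse` are standard CDX API parameters.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, params=None):
    """Return a CDX query URL for a domain's robots.txt captures.

    Hypothetical helper: extra `params` are merged over the base query.
    """
    query = {"url": f"{domain}/robots.txt", "output": "json"}
    query.update(params or {})
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


# Limit captures to 2015 through 2020 and collapse consecutive captures
# with identical content (same digest).
example_params = {"from": "2015", "to": "2020", "collapse": "digest"}
example_url = build_cdx_url("website.com", example_params)
```

A dictionary shaped like `example_params` is the kind of value the params argument is described as accepting, for example `historical_scraper('website.com', 'website.csv', params=example_params)`.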
parse_robots_txt
Usage
from historical_robots import parse_robots_txt
parse_robots_txt('https://www.website.com/robots.txt', False)
Parameters
| parameter | type | required | default value | description |
|---|---|---|---|---|
| URL | string | true | | The URL to request robots.txt file from. |
| accept_allow | boolean | false | False | Whether to parse Allow rules. By default, function only checks Disallow rules. |
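
To show what coercing a robots.txt file into a dictionary might look like, here is a rough sketch that works on already-downloaded text. The function `sketch_parse_robots` and the shape of the returned dictionary are hypothetical; the actual structure returned by parse_robots_txt may differ.

```python
def sketch_parse_robots(text, accept_allow=False):
    """Map each user-agent to its rules, e.g. {"*": {"Disallow": ["/admin"]}}.

    Hypothetical sketch; Allow rules are kept only if accept_allow is True.
    """
    rules = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines are read
    for raw_line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group of user-agents starts
            reading_agents = True
            current_agents.append(value)
            rules.setdefault(value, {})
        elif field in ("disallow", "allow"):
            reading_agents = False
            if field == "allow" and not accept_allow:
                continue
            if not value:
                continue  # an empty rule restricts nothing
            for agent in current_agents:
                rules[agent].setdefault(field.capitalize(), []).append(value)
    return rules


sample = """\
User-agent: *
Disallow: /admin  # keep crawlers out
Allow: /admin/public

User-agent: Googlebot
User-agent: Bingbot
Disallow: /search
"""

disallow_only = sketch_parse_robots(sample)
with_allow = sketch_parse_robots(sample, accept_allow=True)
```

With the default `accept_allow=False`, the Allow line in the sample is skipped, which mirrors the documented default of checking only Disallow rules.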
Download files
File details
Details for the file historical-robots-txt-parser-0.1.2.tar.gz.
File metadata
- Download URL: historical-robots-txt-parser-0.1.2.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.5 CPython/3.7.7 Linux/5.3.0-1020-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | db52bfb1a3a7fd42455956df2e63979e45d55a115f041e13a64b53537cc74c8d |
| MD5 | 60d5e4273400987890720dad394a0a79 |
| BLAKE2b-256 | 06a236bebf92b4b54b46f1b7e5dbf6444da46c3b66c92be1e6639f86e3d423e8 |
File details
Details for the file historical_robots_txt_parser-0.1.2-py3-none-any.whl.
File metadata
- Download URL: historical_robots_txt_parser-0.1.2-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.5 CPython/3.7.7 Linux/5.3.0-1020-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 492f60ae770dbbc9608b6ed87aa1b49b2ebe2e5cb84f2ff8d3befcf4101a1fdc |
| MD5 | c16479bd07badd0c845a75dfdd4b17c6 |
| BLAKE2b-256 | 5a6ef5d2f45718ba5a9cf19ebb93d43fbdd36287995c9fbbcbbf4bd018567d3c |