Parses historical robots.txt files from Wayback Machine

Historical Robots.txt Parser

This is a small Python package that parses historical robots.txt files from the Internet Archive's Wayback Machine and writes the results to a CSV file. The CSV tracks the addition and removal of Allow and Disallow rules, recording the timestamp, path, user-agent, and (optionally) rule type for each rule. It's a fairly narrow use case, but it may be helpful to researchers or journalists.

It also includes a parser to coerce a robots.txt file into a dictionary.

Requirements

Python 3.7 or later

Installation

Install with Python

pip3 install historical-robots-txt-parser

Install with Git

This package was developed using Poetry, which simplifies dependency management. Using Poetry is strongly recommended.

git clone https://github.com/alexlitel/historical-robots-txt-parser
cd historical-robots-txt-parser
poetry install

There is a requirements.txt file included here, so you can also use pip3 install -r requirements.txt if you don't want to use Poetry.

Usage

The package includes two functions: parse_robots_txt and historical_scraper. historical_scraper scrapes the historical robots.txt files for a domain from the Wayback Machine and exports the results to a CSV file. parse_robots_txt requests a robots.txt file, parses it, and coerces it to a dictionary. If you clone the repo, there is also a file, app.py, that takes command line arguments for the domains to scrape.
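
To make the CSV output more concrete, here is a minimal sketch of how rule additions and removals could be derived by comparing consecutive snapshots. It is not the package's own code: the names `diff_snapshots` and `rows_to_csv`, the snapshot layout, and the column names are all assumptions made for illustration.

```python
import csv
import io

FIELDS = ["user_agent", "path", "added", "removed"]  # hypothetical column names


def diff_snapshots(snapshots):
    """Return one row per rule with the timestamps it was added and removed.

    `snapshots` is a list of (timestamp, rules) pairs sorted by timestamp,
    where `rules` is a set of (user_agent, path) tuples.
    """
    active = {}  # (user_agent, path) -> timestamp at which the rule appeared
    rows = []
    for timestamp, rules in snapshots:
        # Rules that were active but are missing now were removed here.
        for rule in sorted(set(active) - rules):
            rows.append({"user_agent": rule[0], "path": rule[1],
                         "added": active.pop(rule), "removed": timestamp})
        # Rules that are not yet active were added here.
        for rule in sorted(rules - set(active)):
            active[rule] = timestamp
    # Rules still active after the last snapshot have no removal timestamp.
    for rule in sorted(active):
        rows.append({"user_agent": rule[0], "path": rule[1],
                     "added": active[rule], "removed": ""})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


snapshots = [
    ("20200101000000", {("*", "/admin"), ("*", "/tmp")}),
    ("20200201000000", {("*", "/admin"), ("Googlebot", "/private")}),
]
print(rows_to_csv(diff_snapshots(snapshots)))
```

Keeping the comparison separate from the CSV writing means the same rows could be inspected in memory before anything is written to disk.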

historical_scraper

Usage

from historical_robots import historical_scraper

# accept_allow is one of the optional parameters listed under Parameters
historical_scraper('website.com', 'website.csv', accept_allow=True)

Parameters

parameter	type	required	default value	description
domain	string	true		The domain to scrape records from. Should be the hostname only, without `www`.
file_path	string	true		Path of the CSV file to export to.
accept_allow	boolean	false	False	Whether to parse `Allow` rules and include them in the dataset. Adds a `Rule` column to the CSV to mark each rule as `Disallow` or `Allow`. By default, the function only checks `Disallow` rules.
skip_scrape_interval	boolean	false	False	Whether to skip the sleep interval between historical robots.txt requests. Setting this to `True` may cause errors.
sleep_interval	number	false	0.05	Number of seconds to sleep between robots.txt requests. Ignored if `skip_scrape_interval` is `True`.
params	dictionary	false	{}	Key-value pairs of valid URL parameters for the Wayback CDX API.
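
As a rough illustration of what the `params` dictionary is for, the sketch below builds a Wayback CDX API query URL and turns CDX JSON rows into raw-capture URLs. The helper names `build_cdx_url` and `snapshot_urls` are hypothetical, and the default query fields are an assumption; the package may build its requests differently. `from`, `to`, `filter`, `collapse`, and `limit` are examples of parameters the CDX API accepts.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, params=None):
    """Build a CDX query URL for a domain's robots.txt captures.

    Extra `params` are merged over the defaults (assumed here, not taken
    from the package).
    """
    query = {"url": f"{domain}/robots.txt", "output": "json"}
    query.update(params or {})
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


def snapshot_urls(cdx_rows):
    """Turn CDX JSON rows (field names in the first row) into capture URLs.

    The `id_` suffix after the timestamp asks the Wayback Machine for the
    original file content without the archive's page rewriting.
    """
    header, *records = cdx_rows
    ts = header.index("timestamp")
    original = header.index("original")
    return [
        f"http://web.archive.org/web/{record[ts]}id_/{record[original]}"
        for record in records
    ]


print(build_cdx_url("website.com", {"from": "2015", "filter": "statuscode:200"}))
```

Nothing in this sketch touches the network, so it can be run safely to check what a given `params` dictionary would add to the query string.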

parse_robots_txt

Usage

from historical_robots import parse_robots_txt

parse_robots_txt('https://www.website.com/robots.txt', False)

Parameters

parameter	type	required	default value	description
URL	string	true		The URL to request robots.txt file from.
accept_allow	boolean	false	False	Whether to parse `Allow` rules. By default, function only checks `Disallow` rules.
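
For readers curious what "coercing a robots.txt file into a dictionary" can look like, here is a minimal, self-contained sketch that works on text rather than a URL. The function name `robots_text_to_dict` and the shape of the returned dictionary are assumptions for illustration; the output of `parse_robots_txt` may be structured differently.

```python
def robots_text_to_dict(text, accept_allow=False):
    """Map each user-agent to its rule paths, grouped by rule type."""
    rules = {}
    agents = []       # user-agents the current group applies to
    in_rules = False  # True once the current group has seen a rule line
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # A User-agent line after rule lines starts a new group.
            if in_rules:
                agents, in_rules = [], False
            agents.append(value)
            empty = {"Disallow": [], "Allow": []} if accept_allow else {"Disallow": []}
            rules.setdefault(value, empty)
        elif field in ("disallow", "allow"):
            in_rules = True
            if field == "allow" and not accept_allow:
                continue
            if not value:
                continue  # an empty Disallow blocks nothing
            for agent in agents:
                rules[agent][field.capitalize()].append(value)
    return rules


SAMPLE = """
User-agent: *
Disallow: /admin
Allow: /admin/public  # exception
Disallow:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /private
Sitemap: https://example.com/sitemap.xml
"""

print(robots_text_to_dict(SAMPLE, accept_allow=True))
```

Consecutive `User-agent` lines share the rules that follow them, which is why the sketch tracks a list of agents rather than a single one.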

Release history

0.1.2 (this version): May 10, 2020
0.1.0: May 4, 2020

Download files

Source Distribution

historical-robots-txt-parser-0.1.2.tar.gz (5.2 kB)

Uploaded May 10, 2020 Source

Built Distribution

historical_robots_txt_parser-0.1.2-py3-none-any.whl (5.8 kB)

Uploaded May 10, 2020 Python 3

File details

Details for the file historical-robots-txt-parser-0.1.2.tar.gz.

File metadata

Download URL: historical-robots-txt-parser-0.1.2.tar.gz
Upload date: May 10, 2020
Size: 5.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.0.5 CPython/3.7.7 Linux/5.3.0-1020-azure

File hashes

Hashes for historical-robots-txt-parser-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`db52bfb1a3a7fd42455956df2e63979e45d55a115f041e13a64b53537cc74c8d`
MD5	`60d5e4273400987890720dad394a0a79`
BLAKE2b-256	`06a236bebf92b4b54b46f1b7e5dbf6444da46c3b66c92be1e6639f86e3d423e8`

File details

Details for the file historical_robots_txt_parser-0.1.2-py3-none-any.whl.

File metadata

Download URL: historical_robots_txt_parser-0.1.2-py3-none-any.whl
Upload date: May 10, 2020
Size: 5.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.0.5 CPython/3.7.7 Linux/5.3.0-1020-azure

File hashes

Hashes for historical_robots_txt_parser-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`492f60ae770dbbc9608b6ed87aa1b49b2ebe2e5cb84f2ff8d3befcf4101a1fdc`
MD5	`c16479bd07badd0c845a75dfdd4b17c6`
BLAKE2b-256	`5a6ef5d2f45718ba5a9cf19ebb93d43fbdd36287995c9fbbcbbf4bd018567d3c`
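
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The following is a minimal sketch; the helper name `sha256_matches` is hypothetical, and the file path in the usage note is only an example of where a download might live.

```python
import hashlib


def sha256_matches(path, expected_hex):
    """Return True if the file at `path` has the expected SHA256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, `sha256_matches("historical-robots-txt-parser-0.1.2.tar.gz", "db52bfb1a3a7fd42455956df2e63979e45d55a115f041e13a64b53537cc74c8d")` should return `True` for an intact copy of the source distribution saved in the current directory.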
