
Parses historical robots.txt files from Wayback Machine


License: MIT

Historical Robots.txt Parser

This is a small Python package that parses historical robots.txt files from the Internet Archive's Wayback Machine and writes the data to a CSV file. The CSV tracks the addition and removal of Allow and Disallow rules, recorded by timestamp of addition, path, user-agent, and (optionally) rule type. It's a fairly narrow use case, but it may be helpful to researchers or journalists.

It also includes a parser to coerce a robots.txt file into a dictionary.
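To make the idea of coercing a robots.txt file into a dictionary concrete, here is a minimal, standard-library-only sketch. It is not the package's implementation: the function name `robots_text_to_dict` and the output shape (user-agent mapped to rule type mapped to a list of paths) are assumptions chosen for illustration, and the package's actual dictionary may look different.

```python
from collections import defaultdict


def robots_text_to_dict(text, accept_allow=False):
    """Map each user-agent to its rules (hypothetical output shape)."""
    wanted = {"disallow", "allow"} if accept_allow else {"disallow"}
    rules = defaultdict(lambda: defaultdict(list))
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has seen a rule line

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field in wanted and value:
                for agent in agents:
                    rules[agent][field.capitalize()].append(value)
    return {agent: dict(by_type) for agent, by_type in rules.items()}


sample = """\
User-agent: *
Disallow: /private/
Allow: /private/public.html

User-agent: Googlebot
User-agent: Bingbot
Disallow: /tmp/  # scratch space
"""

parsed = robots_text_to_dict(sample, accept_allow=True)
print(parsed)
```

As in the package's description, Allow rules are ignored in this sketch unless `accept_allow` is set to True.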

Requirements

  • Python 3.7 or later

Installation

Install with pip

pip3 install historical-robots-txt-parser

Install with Git

This package was developed using Poetry, which greatly simplifies dependency management. Using Poetry is strongly recommended.

git clone https://github.com/alexlitel/historical-robots-txt-parser
cd historical-robots-txt-parser
poetry install

There is a requirements.txt file included here, so you can also use pip3 install -r requirements.txt if you don't want to use Poetry.

Usage

There are two functions included in the package: parse_robots_txt and historical_scraper. historical_scraper scrapes the historical robots.txt files for a domain from the Wayback Machine and exports the results to a CSV. parse_robots_txt requests a robots.txt file, parses it, and coerces it into a dictionary. If you clone the repo, it also includes app.py, which takes command-line arguments for the domains to scrape.
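The kind of change tracking that historical_scraper exports can be pictured as a diff between consecutive snapshots. The sketch below is an illustration only, not the package's code: the `rule_changes` helper, the snapshot shape, and the CSV column names are all hypothetical. It assumes each snapshot has already been reduced to a set of (user-agent, path) pairs keyed by a 14-digit Wayback timestamp.

```python
import csv
import io


def rule_changes(snapshots):
    """Yield (timestamp, user_agent, path, change) rows between snapshots.

    `snapshots` is a list of (timestamp, rules) pairs sorted by timestamp,
    where `rules` is a set of (user_agent, path) tuples (hypothetical shape).
    """
    previous = set()
    for timestamp, rules in snapshots:
        for agent, path in sorted(rules - previous):
            yield timestamp, agent, path, "added"
        for agent, path in sorted(previous - rules):
            yield timestamp, agent, path, "removed"
        previous = rules


snapshots = [
    ("20180101000000", {("*", "/admin/")}),
    ("20190101000000", {("*", "/admin/"), ("*", "/search")}),
    ("20200101000000", {("*", "/search")}),
]

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["timestamp", "user_agent", "path", "change"])
writer.writerows(rule_changes(snapshots))
csv_text = buffer.getvalue()
print(csv_text)
```

Comparing sets keeps the sketch short; it deliberately ignores rule order within a file, which may or may not matter for a given analysis.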

historical_scraper

Usage

from historical_robots import historical_scraper

# Optional arguments are described under Parameters.
historical_scraper('website.com', 'website.csv')

Parameters

| Parameter | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| domain | string | true | | The domain to scrape records from. Should be the hostname only, without www. |
| file_path | string | true | | Path of the CSV file to export to. |
| accept_allow | boolean | false | False | Whether to parse Allow rules and include them in the dataset. Adds a Rule column to the CSV to mark each row as a Disallow or Allow rule. By default, the function only checks Disallow rules. |
| skip_scrape_interval | boolean | false | False | Whether to skip the default sleep interval between historical robots.txt requests. Setting this to True may cause errors. |
| sleep_interval | number | false | 0.05 | Number of seconds to sleep between robots.txt requests. Ignored if skip_scrape_interval is True. |
| params | dictionary | false | {} | Key-value pairs representing valid URL params for the Wayback CDX API. |
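As an illustration of what the params dictionary might contain, the sketch below builds a Wayback CDX API query URL using only the standard library. The `build_cdx_url` helper and the base query (`url` and `output`) are hypothetical and do not describe how the package assembles its request; `from`, `to`, `filter`, and `collapse` are parameters documented for the CDX server.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, params=None):
    """Return a CDX query URL for a domain's robots.txt (hypothetical helper)."""
    # Assumed base query; the package's own defaults may differ.
    query = {"url": f"{domain}/robots.txt", "output": "json"}
    query.update(params or {})  # caller-supplied params override the defaults
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


# Example extras: limit to 2015-2020, successful captures only,
# and collapse consecutive captures with identical content.
extra = {
    "from": "2015",
    "to": "2020",
    "filter": "statuscode:200",
    "collapse": "digest",
}

url = build_cdx_url("website.com", extra)
print(url)
```

A dictionary like `extra` is the sort of value that would be passed as params, for example to narrow a scrape to a date range.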

parse_robots_txt

Usage

from historical_robots import parse_robots_txt

parse_robots_txt('https://www.website.com/robots.txt', False)

Parameters

| Parameter | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| URL | string | true | | The URL to request the robots.txt file from. |
| accept_allow | boolean | false | False | Whether to parse Allow rules. By default, the function only checks Disallow rules. |
