webcrystal is:

  1. An HTTP proxy and web service that saves every web page accessed through it to disk.
  2. An on-disk archival format for storing websites.

webcrystal is intended as a tool for archiving websites. It is also designed to be a convenient foundation for HTTP-based and browser-based web scrapers.

Features

  • Compact package: One .py file. Only one dependency (urllib3).
  • A simple documented archival format.
  • >95% code coverage, enforced by the test suite.
  • Friendly MIT license.
  • Excellent documentation.

Installation

  • Install Python 3.
  • From a command-line terminal (Terminal on OS X, Command Prompt on Windows), run the command:
pip3 install webcrystal

Quickstart

To start the proxy, run a command like:

webcrystal.py 9227 xkcd.wbcr http://xkcd.com/

Visiting http://localhost:9227/ then has the same effect as visiting http://xkcd.com/ directly, except that all requests are archived in xkcd.wbcr/.

When you access an HTTP resource through the webcrystal proxy for the first time, it will be fetched from the origin HTTP server and archived locally. All subsequent requests for the same resource will be returned from the archive.
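A scraper can get the same fetch-once, then-serve-from-archive behavior by rewriting each origin URL into its proxy form before requesting it. The following is a minimal sketch, not part of webcrystal itself. It assumes the proxy is running on localhost:9227 and that the __PATH__ part of the /_/http[s]/__PATH__ endpoint (documented under HTTP API) is the origin host followed by the path. The helper names to_proxy_url and fetch_via_proxy are hypothetical.

```python
import urllib.request
from urllib.parse import urlsplit


def to_proxy_url(origin_url, proxy_host="localhost", proxy_port=9227):
    """Map an origin URL to the proxy URL that serves (and archives) it.

    Assumption: __PATH__ is "<origin host><path>", as in
    /_/http/xkcd.com/1645/.
    """
    parts = urlsplit(origin_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http and https URLs can be proxied")
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return "http://{}:{}/_/{}/{}{}{}".format(
        proxy_host, proxy_port, parts.scheme, parts.netloc, path, query)


def fetch_via_proxy(origin_url):
    """Fetch a resource through a running proxy and return the body bytes."""
    with urllib.request.urlopen(to_proxy_url(origin_url)) as response:
        return response.read()


print(to_proxy_url("http://xkcd.com/1645/"))
# prints http://localhost:9227/_/http/xkcd.com/1645/
```

Calling fetch_via_proxy twice for the same URL should hit the origin server only on the first call, provided the proxy is in its default online mode.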

CLI

To start the webcrystal proxy:

webcrystal.py [--help] [--quiet] <port> <archive_dirpath> [<default_origin_domain>]

To stop the proxy, press ^C or send it a SIGINT signal.
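When the proxy is driven from a script rather than a terminal, the same start and stop steps can be wrapped with the standard subprocess and signal modules. This is a rough sketch under a few assumptions: webcrystal.py is on the PATH, the platform is POSIX (sending SIGINT this way does not work on Windows), and the helper names build_command, start_proxy, and stop_proxy are hypothetical.

```python
import signal
import subprocess


def build_command(port, archive_dirpath, default_origin_domain=None, quiet=False):
    """Build the argument list for the documented CLI syntax."""
    command = ["webcrystal.py"]
    if quiet:
        command.append("--quiet")
    command += [str(port), archive_dirpath]
    if default_origin_domain is not None:
        command.append(default_origin_domain)
    return command


def start_proxy(port, archive_dirpath, default_origin_domain=None, quiet=False):
    """Launch the proxy as a child process (assumes webcrystal.py is on PATH)."""
    return subprocess.Popen(
        build_command(port, archive_dirpath, default_origin_domain, quiet))


def stop_proxy(process, timeout=10):
    """Send SIGINT, the documented way to stop the proxy, and wait for exit."""
    process.send_signal(signal.SIGINT)
    return process.wait(timeout=timeout)
```

For example, start_proxy(9227, "xkcd.wbcr", "http://xkcd.com/") corresponds to the Quickstart command, and stop_proxy on the returned process is the scripted equivalent of pressing ^C.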

Full Syntax

webcrystal.py --help

This outputs:

usage: webcrystal.py [-h] [-q] port archive_dirpath [default_origin_domain]

An archiving HTTP proxy and web service.

positional arguments:
  port                  Port on which to run the HTTP proxy. Suggest 9227
                        (WBCR).
  archive_dirpath       Path to the archive directory. Usually has .wbcr
                        extension.
  default_origin_domain
                        Default HTTP domain which the HTTP proxy will redirect
                        to if no URL is specified.

optional arguments:
  -h, --help            Show this help message and exit.
  -q, --quiet           Suppresses all output.

HTTP API

The HTTP API is the primary API for interacting with the webcrystal proxy.

While the proxy is running, it responds to the following HTTP endpoints.

Note that GET is an accepted method for all endpoints, so they can be requested easily from a regular web browser. Browser accessibility is convenient for manual inspection and for browser-based website scrapers.

GET,HEAD /

Redirects to the home page of the default origin domain if it was specified at the CLI. Returns:

  • HTTP 404 (Not Found) if no default origin domain is specified (the default) or
  • HTTP 307 (Temporary Redirect) to the default origin domain if it is specified.

GET,HEAD /_/http[s]/__PATH__

If in online mode (the default):

  • The requested resource will be fetched from the origin server and added to the archive if:
      1. it is not already archived,
      2. a Cache-Control: no-cache header is specified, or
      3. a Pragma: no-cache header is specified.
  • The newly archived resource will be returned to the client, with all URLs in HTTP headers and content rewritten to point to the proxy.

If in offline mode:

  • If the resource is in the archive, it will be returned to the client, with all URLs in HTTP headers and content rewritten to point to the proxy.
  • If the resource is not in the archive, an HTTP 503 (Service Unavailable) response will be returned, with an HTML page that provides a link to the online version of the content.
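The online and offline rules above can be summarized as a small decision function. This is only a model of the documented behavior, not webcrystal's actual code. The function name plan_response and the three result strings are made up for illustration, and the case-insensitive header matching is an assumption.

```python
def plan_response(online, is_archived, request_headers):
    """Return what the proxy is documented to do for one resource request.

    Results (hypothetical labels):
      "fetch-and-archive"   fetch from the origin server, archive, then serve
      "serve-from-archive"  serve the archived copy
      "503"                 offline and not archived: Service Unavailable
    """
    # Normalize header names and values so matching is case-insensitive
    # (an assumption; the README does not say how headers are compared).
    headers = {name.lower(): value.lower()
               for name, value in request_headers.items()}
    wants_fresh = ("no-cache" in headers.get("cache-control", "")
                   or "no-cache" in headers.get("pragma", ""))

    if online:
        if not is_archived or wants_fresh:
            return "fetch-and-archive"
        return "serve-from-archive"
    if is_archived:
        return "serve-from-archive"
    return "503"


print(plan_response(True, True, {"Cache-Control": "no-cache"}))
# prints fetch-and-archive
```

In this model the no-cache headers only matter in online mode, which matches the text: offline mode never contacts the origin server.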

POST,GET /_online

Switches the proxy to online mode.

POST,GET /_offline

Switches the proxy to offline mode.

POST,GET /_refresh/http[s]/__PATH__

Refetches the specified URL from the origin server using the same request headers as the last time it was fetched. Returns:

  • HTTP 200 (OK) if successful or
  • HTTP 404 (Not Found) if the specified URL was not in the archive.

POST,GET /_delete/http[s]/__PATH__

Deletes the specified URL in the archive. Returns:

  • HTTP 200 (OK) if successful or
  • HTTP 404 (Not Found) if the specified URL was not in the archive.
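The management endpoints above can be called from a script as well as from a browser. The sketch below builds the endpoint URLs and issues a POST with the standard library. It assumes a proxy listening on http://localhost:9227 and, as before, that __PATH__ is the origin host followed by the path. The names management_url and call_endpoint are hypothetical.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

PROXY = "http://localhost:9227"  # assumed proxy address


def management_url(action, origin_url=None, proxy=PROXY):
    """Build the URL for /_online, /_offline, /_refresh/... or /_delete/..."""
    if action in ("online", "offline"):
        return "{}/_{}".format(proxy, action)
    if action in ("refresh", "delete"):
        if origin_url is None:
            raise ValueError("{} requires an origin URL".format(action))
        parts = urlsplit(origin_url)
        query = "?" + parts.query if parts.query else ""
        return "{}/_{}/{}/{}{}{}".format(
            proxy, action, parts.scheme, parts.netloc, parts.path or "/", query)
    raise ValueError("unknown action: {}".format(action))


def call_endpoint(action, origin_url=None, proxy=PROXY):
    """POST to a management endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        management_url(action, origin_url, proxy), data=b"", method="POST")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # For example, 404 when the URL was not in the archive.
        return error.code
```

With a proxy running, call_endpoint("delete", "http://xkcd.com/1645/") would be expected to return 200 if the page was archived and 404 otherwise.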

Archival Format

When the proxy is started with a command like:

webcrystal.py 9227 website.wbcr

it creates an archive in the directory website.wbcr/ with the following format:

website.wbcr/index.txt

  • Lists the URL of each archived HTTP resource, one per line.
  • UTF-8 encoded text file with Unix line endings (\n).

Example:

http://xkcd.com/
http://xkcd.com/s/b0dcca.css
http://xkcd.com/1645/

The preceding example archive contains 3 HTTP resources, numbered #1, #2, and #3.
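Because index.txt is plain UTF-8 text, a script can recover the numbering with a few lines of Python. The sketch below assumes that a resource's number is the 1-based line number of its URL in index.txt, which is what the example suggests. The function name read_index is hypothetical.

```python
import os


def read_index(archive_dirpath):
    """Return a dict mapping each archived URL to its resource number.

    Assumption: the resource number is the 1-based line number in index.txt.
    """
    index_path = os.path.join(archive_dirpath, "index.txt")
    numbers_by_url = {}
    with open(index_path, "r", encoding="utf-8", newline="\n") as index_file:
        for number, line in enumerate(index_file, start=1):
            url = line.rstrip("\n")
            if url:  # ignore empty lines but keep line-based numbering
                numbers_by_url[url] = number
    return numbers_by_url
```

For the example archive, read_index("website.wbcr") would map http://xkcd.com/ to 1 and http://xkcd.com/1645/ to 3.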

website.wbcr/1.request_headers.json

  • Contains the HTTP request headers sent to the origin HTTP server to obtain HTTP resource #1.
  • UTF-8 encoded JSON file.

Example:

{"Accept-Language": "en-us", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Host": "xkcd.com", "Accept-Encoding": "gzip, deflate", "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4"}

website.wbcr/1.response_headers.json

  • Contains the HTTP response headers received from the origin HTTP server when obtaining HTTP resource #1.
  • UTF-8 encoded JSON file.
  • Contains an internal “X-Status-Code” header that indicates the HTTP status code received from the origin HTTP server.

Example:

{"Cache-Control": "public", "Connection": "keep-alive", "Accept-Ranges": "bytes", "X-Cache-Hits": "0", "Date": "Tue, 15 Mar 2016 04:37:05 GMT", "Age": "0", "X-Served-By": "cache-sjc3628-SJC", "Content-Type": "text/html", "Server": "lighttpd/1.4.28", "X-Status-Code": "404", "X-Cache": "MISS", "Content-Length": "345", "X-Timer": "S1458016625.375814,VS0,VE148", "Via": "1.1 varnish"}

website.wbcr/1.response_body.dat

  • Contains the contents of the HTTP response body received from the origin HTTP server when obtaining HTTP resource #1.
  • Binary file.
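The three per-resource files can be read back together to reconstruct one archived exchange. This is a minimal sketch that relies only on the file names and encodings described above. The function name load_resource and the keys of the returned dict are hypothetical.

```python
import json
import os


def load_resource(archive_dirpath, number):
    """Load the request headers, response headers, and body of one resource."""
    prefix = os.path.join(archive_dirpath, str(number))
    with open(prefix + ".request_headers.json", encoding="utf-8") as f:
        request_headers = json.load(f)
    with open(prefix + ".response_headers.json", encoding="utf-8") as f:
        response_headers = json.load(f)
    with open(prefix + ".response_body.dat", "rb") as f:
        body = f.read()  # binary file, so keep it as bytes
    return {
        "request_headers": request_headers,
        "response_headers": response_headers,
        # X-Status-Code is the internal header holding the origin status code.
        "status_code": int(response_headers["X-Status-Code"]),
        "body": body,
    }
```

For the example response headers shown above, load_resource("website.wbcr", 1)["status_code"] would be 404.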

Contributing

Install Dev Requirements

pip3 install -r dev-requirements.txt

Run the Tests

make test

Gather Code Coverage Metrics

make coverage
open htmlcov/index.html

Upload a New Version to PyPI

  • Ensure the changelog is updated.
  • Bump the version number in setup.py.
  • Build and upload the distributions: python3 setup.py sdist bdist_wheel upload
  • Tag the release in Git.

Known Limitations

  • Sites that vary the content served at a particular URL depending on whether you are logged in can have only one version of the URL archived.

License

This code is provided under the MIT License. See LICENSE file for details.

Changelog

  • v1.0.1
    • More robust support for HTTPS URLs on OS X 10.11.
    • Validate HTTPS certificates.
  • v1.0 - Initial release

Download Files

  • webcrystal-1.0.1-py3-none-any.whl (23.6 kB): Wheel, version 3.4, uploaded Apr 18, 2016
  • webcrystal-1.0.1.tar.gz (13.2 kB): Source, uploaded Apr 18, 2016
