S3 Web Cache

This is a simple package for archiving web pages (HTML) to S3. It acts as a cache, returning the S3 version of a page if it exists. If not, it fetches the URL with Requests and archives it in S3.

Our use case: provide a reusable history of pages included in a web scrape, that is, an archived version of a particular URL at a moment in time. Because the web is always changing, archiving lets different research questions be asked at a later date without losing the original content. Please only use the package in this manner if you have obtained permission for the pages you are requesting.

Quickstart

Install

pip install s3webcache

Usage

from s3webcache import S3WebCache

# Replace the placeholder strings with your own bucket and credentials.
s3wc = S3WebCache(
    bucket_name="<BUCKET>",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_key="<AWS_SECRET_ACCESS_KEY>",
    aws_default_region="<AWS_DEFAULT_REGION>")

request = s3wc.get("https://en.wikipedia.org/wiki/Whole_Earth_Catalog")

# On success, .message holds the page HTML.
if request.success:
    html = request.message

If the required AWS credentials are not given, it will fall back to using environment variables.
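The fallback described above can be pictured with a small sketch. This is not the package's own code: the helper name resolve_credentials is hypothetical, and the environment variable names are an assumption based on the standard AWS names, which the README does not spell out.

```python
import os

# Assumed mapping from constructor argument to environment variable.
# These are the standard AWS names; the package may use different ones.
ENV_NAMES = {
    "aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "aws_secret_key": "AWS_SECRET_ACCESS_KEY",
    "aws_default_region": "AWS_DEFAULT_REGION",
}


def resolve_credentials(environ=None, **explicit):
    """Prefer explicitly passed values; fall back to environment variables."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for arg_name, env_name in ENV_NAMES.items():
        value = explicit.get(arg_name)
        if value is None:
            # Not passed explicitly, so look it up in the environment.
            value = environ.get(env_name)
        resolved[arg_name] = value
    return resolved
```

With this shape, a value passed to the constructor always wins, and anything left as None is looked up in the environment.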

The .get(url) operation returns a namedtuple Request: (success: bool, message: str).

For successful operations, .message contains the page content retrieved for the URL. For unsuccessful operations, .message contains error information.
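The cache-then-fetch behavior and the Request result can be sketched as follows. This is only an illustration of the pattern, not the package's implementation: a plain dict stands in for the S3 bucket, and cached_get and its fetch argument are hypothetical names.

```python
from collections import namedtuple

# Same shape as the result described above: (success: bool, message: str).
Request = namedtuple("Request", ["success", "message"])


def cached_get(url, store, fetch):
    """Return the archived copy of url if present; otherwise fetch and archive it.

    store: a dict standing in for the S3 bucket (hypothetical).
    fetch: a callable that takes a URL and returns HTML text,
           raising an exception on failure (hypothetical).
    """
    if url in store:
        # Cache hit: return the archived version without touching the network.
        return Request(success=True, message=store[url])
    try:
        html = fetch(url)
    except Exception as exc:
        # On failure, nothing is archived and .message carries the error text.
        return Request(success=False, message=str(exc))
    store[url] = html
    return Request(success=True, message=html)
```

Callers check .success before using .message, since the same field carries either the page content or the error text.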

Options

S3WebCache() takes the following arguments with these defaults:

  • bucket_name: str
  • path_prefix: str = None
    Subdirectories to store URLs. path_prefix='ht' will start archiving at path s3://BUCKETNAME/ht/
  • aws_access_key_id: str = None
  • aws_secret_key: str = None
  • aws_default_region: str = None
  • trim_website: bool = False
    Trim out the hostname. By default the hostname is stored with dots replaced by underscores: https://github.com/wharton/S3WebCache would be s3://BUCKETNAME/github_com/wharton/S3WebCache.
    Set this to True and it will be stored as s3://BUCKETNAME/wharton/S3WebCache.
  • allow_forwarding: bool = True
    Follow HTTP 3xx redirects.
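To make the path_prefix and trim_website options concrete, here is a minimal sketch of how a URL could map to an object key inside the bucket. The function url_to_key is hypothetical and only reproduces the examples given above; the package's actual key derivation (for instance, how it treats query strings) may differ.

```python
from urllib.parse import urlparse


def url_to_key(url, path_prefix=None, trim_website=False):
    """Build an S3 object key from a URL (illustrative only)."""
    parsed = urlparse(url)
    parts = []
    if path_prefix:
        parts.append(path_prefix.strip("/"))
    if not trim_website:
        # Keep the hostname, with dots replaced by underscores.
        parts.append(parsed.netloc.replace(".", "_"))
    path = parsed.path.strip("/")
    if path:
        parts.append(path)
    return "/".join(parts)


url = "https://github.com/wharton/S3WebCache"
print(url_to_key(url))                      # github_com/wharton/S3WebCache
print(url_to_key(url, trim_website=True))   # wharton/S3WebCache
print(url_to_key(url, path_prefix="ht"))    # ht/github_com/wharton/S3WebCache
```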

TODO

  • Add 'update s3 if file is older than...' behavior
  • Add transparent compression by default (gzip, lz4, etc)
  • Add rate limiting

Reference

AWS S3 API documentation

License

MIT

Tests

Run through Travis CI.
