S3 Web Cache

This is a simple package for archiving web pages (HTML) to S3. It acts as a cache, returning the S3 version of a page if one exists. If not, it fetches the URL through Requests and archives the result in S3.
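The cache behavior described above can be sketched roughly as follows. This is only an illustration of the pattern, not the package's actual code: `cached_get`, the in-memory `store` dict (standing in for the S3 bucket), and the injected `fetch` callable (standing in for Requests) are all hypothetical names.

```python
from typing import Callable, Dict


def cached_get(url: str, store: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Return the archived copy of `url` if present; otherwise fetch and archive it.

    `store` is a plain dict standing in for the S3 bucket, and `fetch`
    stands in for an HTTP client. Both are assumptions made for illustration.
    """
    if url in store:
        # Cache hit: serve the archived copy without touching the network.
        return store[url]
    # Cache miss: fetch the page once, archive it, then return it.
    body = fetch(url)
    store[url] = body
    return body
```

With this shape, a second call for the same URL returns the archived copy and never invokes `fetch` again, which is what preserves the page as it was at the time of the first request.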

Our use case: provide a reusable history of the pages included in a web scrape, that is, an archived version of a particular URL at a moment in time. Since the web is always changing, archiving lets different research questions be asked at a later date without losing the original content. Please only use it in this manner if you have obtained permission for the pages you are requesting.

Quickstart

Install

pip install s3webcache

Usage

from s3webcache import S3WebCache

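# Replace each placeholder string with your own value.
s3wc = S3WebCache(
    bucket_name="<BUCKET>",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_key="<AWS_SECRET_ACCESS_KEY>",
    aws_default_region="<AWS_DEFAULT_REGION>")

# Returns the archived copy from S3 if it exists; otherwise fetches and archives the page.
request = s3wc.get("https://en.wikipedia.org/wiki/Whole_Earth_Catalog")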

if request.success:
    html = request.message

If the required AWS credentials are not given, it will fall back to using environment variables.

The .get(url) operation returns a namedtuple Request: (success: bool, message: str).

For successful operations, .message contains the content retrieved for the URL. For unsuccessful operations, .message contains error information.
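Because .message carries either page content or error text, callers need to check .success before using it. A minimal sketch of one way to do that is below. The `Request` namedtuple here is a local stand-in with the same two fields described above, and `html_or_raise` is a hypothetical helper, not part of the package.

```python
from collections import namedtuple

# Local stand-in mirroring the documented shape: (success: bool, message: str).
Request = namedtuple("Request", ["success", "message"])


def html_or_raise(result):
    """Return the page content, or raise with the error text on failure."""
    if not result.success:
        # On failure, .message holds error information rather than HTML.
        raise RuntimeError(f"fetch failed: {result.message}")
    return result.message
```

Since the result is a namedtuple, it can also be unpacked directly, as in `success, message = s3wc.get(url)`.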

Options

S3WebCache() takes the following arguments with these defaults:

  • bucket_name: str
  • path_prefix: str = None
    Subdirectory prefix under which pages are stored. path_prefix='ht' will start archiving at path s3://BUCKETNAME/ht/
  • aws_access_key_id: str = None
  • aws_secret_key: str = None
  • aws_default_region: str = None
  • trim_website: bool = False
    Trim out the hostname. By default the hostname is stored with dots replaced by underscores: https://github.com/wharton/S3WebCache would be s3://BUCKETNAME/github_com/wharton/S3WebCache. Set this to True and it will be stored as s3://BUCKETNAME/wharton/S3WebCache.
  • allow_forwarding: bool = True
    Follow HTTP 3xx redirects.
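To make the effect of path_prefix and trim_website concrete, here is a rough sketch of the URL-to-key mapping as the options above describe it. `url_to_key` is a hypothetical function written for illustration. The package's real implementation may differ, for example in how it treats query strings, which this sketch ignores.

```python
from typing import Optional
from urllib.parse import urlparse


def url_to_key(url: str, path_prefix: Optional[str] = None,
               trim_website: bool = False) -> str:
    """Build an S3 object key for `url` following the documented options."""
    parsed = urlparse(url)
    parts = []
    if path_prefix:
        # Optional subdirectory prefix, e.g. 'ht' -> 'ht/...'
        parts.append(path_prefix.strip("/"))
    if not trim_website:
        # Keep the hostname, with dots replaced by underscores.
        parts.append(parsed.netloc.replace(".", "_"))
    parts.append(parsed.path.strip("/"))
    # Drop empty segments (such as an empty path) before joining.
    return "/".join(p for p in parts if p)
```

Under these assumptions, `url_to_key("https://github.com/wharton/S3WebCache")` gives `github_com/wharton/S3WebCache`, and passing `trim_website=True` gives `wharton/S3WebCache`, matching the examples in the option descriptions.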

TODO

  • Add 'update s3 if file is older than...' behavior
  • Add transparent compression by default (gzip, lz4, etc)
  • Add rate limiting

Reference

AWS S3 API documentation

License

MIT

Tests

Run through Travis CI.

