S3 Web Cache
This is a simple package for archiving web pages (HTML) to S3. It acts as a cache, returning the S3 version of the page if it exists. If not, it fetches the URL through Requests and archives it in S3.
Our use case: provide a reusable history of pages included in a web scrape, that is, an archived version of a particular URL at a moment in time. Because the web is always changing, keeping the original content lets different research questions be asked at a later date. Please only use it in this manner if you have obtained permission for the pages you are requesting.
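The lookup-then-fetch behavior described above can be sketched without any AWS access. This is only an illustration of the caching pattern, not the package's implementation: `cached_get`, the in-memory `archive` dict standing in for the S3 bucket, and `fake_fetch` standing in for Requests are all hypothetical names.

```python
from typing import Callable, Dict


def cached_get(url: str, store: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Return the archived copy of `url` if present, else fetch and archive it."""
    if url in store:
        # Cache hit: serve the archived version without touching the network.
        return store[url]
    body = fetch(url)   # cache miss: retrieve the live page
    store[url] = body   # archive it so later lookups are served from the store
    return body


# Usage with a fake fetcher that records how often it is called.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>hello</html>"


archive: Dict[str, str] = {}  # stands in for the S3 bucket (assumption)
first = cached_get("https://example.com/page", archive, fake_fetch)
second = cached_get("https://example.com/page", archive, fake_fetch)
```

The second call is answered from the archive, so the fetcher runs only once.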
Quickstart
Install
pip install s3webcache
Usage
from s3webcache import S3WebCache

# Replace each placeholder string with your own value.
s3wc = S3WebCache(
    bucket_name="<BUCKET>",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_key="<AWS_SECRET_ACCESS_KEY>",
    aws_default_region="<AWS_DEFAULT_REGION>",
)

# Returns the archived copy if it exists; otherwise fetches and archives the page.
request = s3wc.get("https://en.wikipedia.org/wiki/Whole_Earth_Catalog")
if request.success:
    html = request.message
If the required AWS credentials are not given, it will fall back to using environment variables.
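The fallback can be pictured as "explicit argument first, environment second". The sketch below is an assumption about the pattern, not the package's code: `resolve_setting` is a hypothetical helper, and the variable names are the standard ones used by the AWS CLI and boto3, which the placeholders in the usage example suggest but the README does not confirm.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the explicitly passed value; otherwise look it up in the environment."""
    if explicit is not None:
        return explicit
    return env.get(env_name)


# A fake environment keeps the example independent of the real one.
fake_env = {"AWS_ACCESS_KEY_ID": "from-env", "AWS_DEFAULT_REGION": "us-east-1"}

key_id = resolve_setting(None, "AWS_ACCESS_KEY_ID", fake_env)          # taken from env
region = resolve_setting("eu-west-1", "AWS_DEFAULT_REGION", fake_env)  # explicit wins
secret = resolve_setting(None, "AWS_SECRET_ACCESS_KEY", fake_env)      # unset in both
```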
The .get(url) operation returns a namedtuple Request: (success: bool, message: str).
For successful operations, .message contains the URL data.
For unsuccessful operations, .message contains error information.
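Because both outcomes come back in the same namedtuple, callers need to branch on .success before using .message. A minimal sketch of that handling follows; the `Request` defined here is a stand-in mirroring the documented shape, and `html_or_none` and the sample messages are hypothetical.

```python
from collections import namedtuple

# Stand-in mirroring the documented shape: (success: bool, message: str).
Request = namedtuple("Request", ["success", "message"])


def html_or_none(result):
    """Return the page body on success; otherwise report the error and return None."""
    if result.success:
        return result.message
    print(f"fetch failed: {result.message}")
    return None


ok = Request(success=True, message="<html>...</html>")
failed = Request(success=False, message="404 Not Found")  # made-up error text

page = html_or_none(ok)
missing = html_or_none(failed)

# A namedtuple also unpacks like a plain tuple.
success, message = ok
```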
Options
S3WebCache() takes the following arguments with these defaults:
- bucket_name: str
- path_prefix: str = None
  Subdirectories to store URLs. path_prefix='ht' will start archiving at path s3://BUCKETNAME/ht/
- aws_access_key_id: str = None
- aws_secret_key: str = None
- aws_default_region: str = None
- trim_website: bool = False
  Trim out the hostname. By default the hostname is stored with dots replaced by underscores: https://github.com/wharton/S3WebCache would be s3://BUCKETNAME/github_com/wharton/S3WebCache. Set this to True and it will be stored as s3://BUCKETNAME/wharton/S3WebCache.
- allow_forwarding: bool = True
  Will follow HTTP 3xx redirects.
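The path_prefix and trim_website options together determine where a URL lands in the bucket. The mapping described above could look roughly like this; `url_to_key` is a hypothetical helper written to reproduce the documented examples, not the package's own function, and it ignores query strings, which the README does not cover.

```python
from typing import Optional
from urllib.parse import urlparse


def url_to_key(url: str, path_prefix: Optional[str] = None, trim_website: bool = False) -> str:
    """Build an S3 object key from a URL, following the documented examples."""
    parsed = urlparse(url)
    parts = []
    if path_prefix:
        parts.append(path_prefix.strip("/"))
    if not trim_website:
        # Keep the hostname, with dots replaced by underscores.
        parts.append(parsed.netloc.replace(".", "_"))
    path = parsed.path.strip("/")
    if path:
        parts.append(path)
    return "/".join(parts)


url = "https://github.com/wharton/S3WebCache"
default_key = url_to_key(url)
trimmed_key = url_to_key(url, trim_website=True)
prefixed_key = url_to_key(url, path_prefix="ht")
```

With the defaults, the key is github_com/wharton/S3WebCache; trimming drops the hostname segment, and a prefix is prepended ahead of everything else.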
TODO
- Add 'update s3 if file is older than...' behavior
- Add transparent compression by default (gzip, lz4, etc)
- Add rate limiting
Reference
License
MIT
Tests
Run through Travis CI.
Source Distribution
File details
Details for the file S3WebCache-0.2.0.tar.gz.
File metadata
- Download URL: S3WebCache-0.2.0.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 54cb044a6930e1ba8cc7d7e73443f95fb546f8b1ee3497e5e44e0d58d818059f
MD5 | 125d97dfd37678fb14467cf0b52ffffd
BLAKE2b-256 | 54dc12c8f756629b7c89466fb76926aed3ca9745408677e47221c29e9166c214