S3 select utility package
S3 select
Motivation
Amazon S3 select is one of the coolest features AWS released in 2018. It lets you retrieve only a subset of a file's contents from S3 using a limited SQL select query. Because the data is filtered on the AWS side, where the S3 file resides, network data transfer can be reduced significantly, depending on the query issued. This also dramatically improves query speed. It is also very cheap, at $0.002 per GB scanned and $0.0007 per GB returned.
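As a rough illustration of the pricing quoted above, the cost of a query can be estimated from the number of bytes scanned and returned. This is only a sketch: the function name `estimate_cost` is hypothetical, the prices are the ones stated in this document and may have changed, and treating a GB as 1024^3 bytes is an assumption.

```python
# Prices quoted in this document (USD per GB); current AWS pricing may differ.
PRICE_PER_GB_SCANNED = 0.002
PRICE_PER_GB_RETURNED = 0.0007
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def estimate_cost(bytes_scanned, bytes_returned):
    """Estimate the S3 select data cost in USD for one query."""
    scanned_gb = bytes_scanned / BYTES_PER_GB
    returned_gb = bytes_returned / BYTES_PER_GB
    return scanned_gb * PRICE_PER_GB_SCANNED + returned_gb * PRICE_PER_GB_RETURNED


# Example: scan 1.1 TB of data and return 1 GB of matching records.
cost = estimate_cost(int(1.1 * 1024 ** 4), 1024 ** 3)
print(f"Estimated cost: ${cost:.2f}")
```

Under these assumptions, scanning 1.1 TB costs a little over two dollars, with the returned data adding a fraction of a cent.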
A great intro to S3 select is available here.
Unfortunately, an S3 select API call is limited to a single file on S3, and its syntax is quite cumbersome, which makes it impractical for daily use. The s3select command is intended to fix these and other flaws.
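To show what a raw single-object call involves, here is a minimal sketch built on boto3's `select_object_content` method. It is not how s3select is implemented; the helper name `select_from_object` is hypothetical, and the input is assumed to be JSON lines. The client is passed in as a parameter, so any object with a compatible `select_object_content` method will work.

```python
import json


def select_from_object(s3_client, bucket, key, expression, gzip_compressed=False):
    """Run one S3 select query against one JSON-lines object.

    Returns (records, stats), where stats holds the byte counters
    reported by the service for this request.
    """
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=expression,
        ExpressionType="SQL",
        InputSerialization={
            "CompressionType": "GZIP" if gzip_compressed else "NONE",
            "JSON": {"Type": "LINES"},
        },
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )

    chunks = []
    stats = {}
    # The payload is an event stream; a record may be split across events,
    # so the chunks are joined before decoding and parsing.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]["Details"]

    text = b"".join(chunks).decode("utf-8")
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return records, stats
```

With boto3 installed, a call might look like `select_from_object(boto3.client("s3"), "my-bucket", "logs/part-0001.json.gz", "SELECT * FROM s3object s", gzip_compressed=True)`, where the bucket and key are placeholders. All of this is needed for a single file, which is the inconvenience described above.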
Features at a glance
Most important features:
- Queries all files beneath a given S3 prefix
- The whole process is multi-threaded and fast (a scan of 1.1TB of data stored in 20,000 files takes 5 minutes). Threads don't slow down the client much, as the heavy lifting is done on AWS.
- The file format is inferred automatically, picking GZIP or plain text depending on the file extension
- Real time progress
- Exact cost of the query returned for each run
- Ability to only count the records matching the filter in a fast and efficient manner
- You can easily limit the number of results returned while still keeping multi-threaded execution
- Failed requests are properly handled and repeated if they are retriable (e.g. throttled calls)
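The features above (prefix-wide queries, threading, format inference, and retries) can be sketched roughly as follows. This is an illustration under stated assumptions, not s3select's actual code: every function name here is hypothetical, `query_one` stands in for whatever performs the single-object S3 select call, and the retry policy is a simple exponential backoff. Only the `list_objects_v2` paginator is a real boto3 API.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def infer_compression(key):
    """Pick GZIP or plain text based on the file extension."""
    return "GZIP" if key.lower().endswith(".gz") else "NONE"


def list_keys(s3_client, bucket, prefix):
    """Yield every object key beneath the given prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def with_retries(func, attempts=3, base_delay=0.5, is_retriable=lambda exc: True):
    """Call func(), retrying retriable failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on a non-retriable error.
            if attempt == attempts - 1 or not is_retriable(exc):
                raise
            time.sleep(base_delay * 2 ** attempt)


def query_prefix(s3_client, bucket, prefix, query_one, max_workers=20):
    """Run query_one(key, compression) for each key under prefix, in threads.

    query_one must return a list of records; results are merged in
    completion order, so their order is not deterministic.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            # k=key binds the current key to each lambda.
            pool.submit(with_retries, lambda k=key: query_one(k, infer_compression(k)))
            for key in list_keys(s3_client, bucket, prefix)
        ]
        for future in as_completed(futures):
            results.extend(future.result())
    return results
```

Threads fit this workload because each worker spends most of its time waiting on the network while the filtering itself runs on AWS.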
Installation
The easiest way to install s3select is to use pip:
$ pip install s3select
Authentication
s3select uses boto3, so authentication and endpoint configuration work the same way as in any other boto3-based tool. For more details, see the aws-cli documentation.
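For reference, this is roughly how a boto3 client picks up credentials and endpoint settings. It is a sketch of general boto3 usage, not of s3select's internals; the helper name `make_s3_client` and the profile name and endpoint URL in the usage note are placeholders.

```python
def make_s3_client(profile_name=None, region_name=None, endpoint_url=None):
    """Create an S3 client using boto3's standard credential resolution.

    With all arguments left as None, boto3 falls back to its default
    chain (environment variables, shared credentials file, and so on).
    """
    # Imported here so the module can be loaded without boto3 installed.
    import boto3

    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    return session.client("s3", endpoint_url=endpoint_url)
```

For example, `make_s3_client(profile_name="dev")` would use the `dev` profile from the shared credentials file, and passing `endpoint_url="http://localhost:9000"` would direct requests to an S3-compatible service at that address.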
Example usage
License
Distributed under the MIT license. See LICENSE for more information.