S3 select utility package
S3 select
Example query run on 10GB of GZIP compressed JSON data (>60GB uncompressed)
Motivation
Amazon S3 select is one of the coolest features AWS released in 2018. Its benefits are:
- Very fast and light on network usage, because a limited SQL select query lets you return only a subset of a file's contents from S3. Since the data is filtered on the AWS machine where the S3 file resides, network transfer can be reduced significantly, depending on the query issued.
- Lightweight on the client side, because all filtering is done on the machine where the S3 data is located
- Cheap, at $0.002 per GB scanned and $0.0007 per GB returned
For more details about S3 select see this presentation.
Unfortunately, an S3 select API query call is limited to a single file on S3, and its syntax is quite cumbersome, making it impractical for daily use. The s3select command is intended to fix these and other flaws.
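To illustrate the single-file limitation, here is a minimal sketch of querying one object through boto3's `select_object_content` call. The helper names `build_select_request` and `run_select` are hypothetical, and the sketch assumes JSON Lines input where a `.gz` suffix means GZIP compression. It is not taken from the s3select source.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for boto3's select_object_content call."""
    # Assumption: objects ending in .gz are GZIP-compressed JSON Lines,
    # everything else is uncompressed JSON Lines.
    compression = "GZIP" if key.lower().endswith(".gz") else "NONE"
    return {
        "Bucket": bucket,
        "Key": key,  # exactly one object per call
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CompressionType": compression,
            "JSON": {"Type": "LINES"},
        },
        "OutputSerialization": {"JSON": {}},
    }


def run_select(bucket, key, expression):
    """Query a single object and yield the raw result chunks as bytes."""
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("s3")
    response = client.select_object_content(
        **build_select_request(bucket, key, expression)
    )
    # The response payload is an event stream; only Records events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"]
```

Every object under a prefix would need its own call like this, which is the repetition a wrapper command can hide.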
Features at a glance
Most important features:
- Queries all files beneath a given S3 prefix
- The whole process is multi-threaded and fast (a scan of 1.1TB of data stored in 20,000 files takes 5 minutes). Threads don't slow down the client much, as the heavy lifting is done on AWS.
- The file format is inferred automatically, picking GZIP or plain text depending on the file extension
- Real time progress
- Exact cost of the query returned for each run
- Ability to count only the records matching the filter, in a fast and efficient manner
- You can easily limit the number of results returned while still keeping multi-threaded execution
- Failed requests are properly handled and retried if they are retriable (e.g. throttled calls)
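The multi-threaded fan-out and per-run cost reporting listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the s3select implementation: `select_all`, `query_cost`, and the `query_one` callable are hypothetical names, and the per-GB prices quoted above are assumed to apply to binary gigabytes.

```python
from concurrent.futures import ThreadPoolExecutor

GB = 1024 ** 3  # assumption: prices are per binary gigabyte


def query_cost(bytes_scanned, bytes_returned,
               price_scanned=0.002, price_returned=0.0007):
    """Estimate the cost in USD from bytes scanned and bytes returned."""
    return (bytes_scanned * price_scanned + bytes_returned * price_returned) / GB


def select_all(keys, query_one, max_workers=8):
    """Run query_one(key) for every key in parallel and aggregate the results.

    query_one is a hypothetical callable that queries a single object and
    returns a tuple of (records, bytes_scanned, bytes_returned).
    """
    records, scanned, returned = [], 0, 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of keys in its results.
        for recs, s, r in pool.map(query_one, keys):
            records.extend(recs)
            scanned += s
            returned += r
    return records, query_cost(scanned, returned)
```

Threads suit this workload because each worker mostly waits on the network while the filtering runs on the AWS side.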
Installation
s3select is built in Python and uses pip. Here is how to install and update it:
$ pip install -U s3select
Authentication
s3select uses the same authentication and endpoint configuration as aws-cli. If the aws command works on your machine, no additional configuration is needed.
Example usage
License
Distributed under the MIT license. See LICENSE for more information.