s3select

S3 select utility package

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 2.7
- Python :: 3.7

Project description

S3 select

alt text Example query run on 10GB of GZIP compressed JSON data (>60GB uncompressed)

Motivation

Amazon S3 select is one of the coolest features AWS released in 2018. It's benefits are:

Very fast and low on network utilization as it allows you to return only a subset of the file contents from S3 using limited SQL select query. Since filtering of the data takes place on AWS machine where S3 file resides, network data transfer can be significantly limited depending on query issued.
Is lightweight on the client side because all filtering is done on a machine where the S3 data is located
It's cheap at $0.002 per GB scanned and $0.0007 per GB returned
For more details about S3 select see this presentation.

Unfortunately, S3 select API query call is limited to only one file on S3 and syntax is quite cumbersome, making it very impractical for daily usage. These are and more flaws are intended to be fixed with this s3select command.

Features at a glance

Most important features:

Queries all files beneath given S3 prefix(es)
The whole process is multi-threaded and fast. A scan of 1.1TB of data in stored in 20,000 files takes 5 minutes). Threads don't slow down client much as heavy lifting is done on AWS.
The compression of the file is automatically inferred for you by picking GZIP or plain text depending on file extension.
Real-time execution progress display.
The exact cost of the query returned for each run.
Ability to only count records matching the filter in a fast and efficient manner.
You can easily limit the number of results returned while still keeping multi-threaded execution.
Failed requests are properly handled and repeated if they are retriable (e.g. throttled calls).

Installation and Upgrade

s3select is developed in Python and uses pip.

The easiest way to install/upgrade s3select is to use pip in a virtualenv:

$ pip install -U s3select

or, if you are not installing in a virtualenv, to install/upgrade globally:

$ sudo pip install -U s3select

or for your user:

$ pip install --user -U s3select

Authentication

s3select uses the same authentication and endpoint configuration as aws-cli. If aws command is working on your machine, there is no need for any additional configuration.

Example usage

First get some help:

$ s3select -h
usage: s3select [-h] [-w WHERE] [-d FIELD_DELIMITER] [-D RECORD_DELIMITER]
                [-l LIMIT] [-v] [-c] [-H] [-o OUTPUT_FIELDS] [-t THREAD_COUNT]
                [--profile PROFILE] [-M MAX_RETRIES]
                prefixes [prefixes ...]

s3select makes s3 select querying API much easier and faster

positional arguments:
  prefixes              S3 prefix (or more) beneath which all files are
                        queried

optional arguments:
  -h, --help            show this help message and exit
  -w WHERE, --where WHERE
                        WHERE part of the SQL query
  -d FIELD_DELIMITER, --field_delimiter FIELD_DELIMITER
                        Field delimiter to be used for CSV files. If specified
                        CSV parsing will be used. By default we expect JSON
                        input
  -D RECORD_DELIMITER, --record_delimiter RECORD_DELIMITER
                        Record delimiter to be used for CSV files. If
                        specified CSV parsing will be used. By default we
                        expect JSON input
  -l LIMIT, --limit LIMIT
                        Maximum number of results to return
  -v, --verbose         Be more verbose
  -c, --count           Only count records without printing them to stdout
  -H, --with_filename   Output s3 path of a filename that contained the match
  -o OUTPUT_FIELDS, --output_fields OUTPUT_FIELDS
                        What fields or columns to output
  -t THREAD_COUNT, --thread_count THREAD_COUNT
                        How many threads to use when executing s3_select api
                        requests. Default of 150 seems to be on safe side. If
                        you increase this there is a chance you'll need also
                        to increase nr of open files on your OS
  --profile PROFILE     Use a specific AWS profile from your credential file.
  -M MAX_RETRIES, --max_retries MAX_RETRIES
                        Maximum number of retries per queried S3 object in
                        case API request fails

It's always useful to peek at first few lines of input files to figure out contents:

$ s3select -l 3 s3://testing.bucket/json_example/
{"name":"Gilbert","wins":[["straight","7♣"],["one pair","10♥"]]}
{"name":"Alexa","wins":[["two pair","4♠"],["two pair","9♠"]]}
{"name":"May","wins":[]}

It's JSON. Great - that's s3select default format. Let's get a subset of its data

$ s3select -l 3 -w "s.name LIKE '%Gil%'" -o "s.wins" s3://testing.bucket/json_example
{"wins":[["straight","7♣"],["one pair","10♥"]]}

What if the input is not in JSON:

$ s3select -l 3 s3://testing.bucket/csv_example
Exception caught when querying csv_example/example.csv: An error occurred (JSONParsingError) when calling the SelectObjectContent operation: Error parsing JSON file. Please check the file and try again.

Exception means input isn't parsable JSON. Let's switch to CSV file delimited with , but you can specify any other delimiter char. Often used is TAB specified with \\t

$ s3select -l 3 -d , s3://testing.bucket/csv_example
Gilbert,straight,7♣,one pair,10♥
Alexa,two pair,4♠,two pair,9♠
May,,,,

Since utilising the first line of CSV as a header isn't yet supported we'll select a subset of data using column enumeration:

$ s3select -l 3 -d , -w "s._1 LIKE '%i%'" -o "s._2" s3://testing.bucket/csv_example
straight
three of a kind

If you are interested in pricing for your requests, add -v to increase verbosity which will include pricing information at the end:

$ s3select -v -c s3://testing.bucket/10G_sample
Files processed: 77/77  Records matched: 5696395  Bytes scanned: 21 GB
Cost for data scanned: $0.02
Cost for data returned: $0.00
Cost for SELECT requests: $0.00
Total cost: $0.02

License

Distributed under the MIT license. See LICENSE for more information.

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 2.7
- Python :: 3.7

Release history Release notifications | RSS feed

This version

0.0.14

Jul 22, 2019

0.0.13

Feb 12, 2019

0.0.12

Feb 6, 2019

0.0.11

Feb 1, 2019

0.0.10

Feb 1, 2019

0.0.9

Jan 30, 2019

0.0.8

Jan 3, 2019

0.0.7

Nov 22, 2018

0.0.5

Nov 19, 2018

0.0.4

Nov 18, 2018

0.0.2

Nov 18, 2018

0.0.1

Nov 18, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

s3select-0.0.14.tar.gz (8.5 kB view details)

Uploaded Jul 22, 2019 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

s3select-0.0.14-py3-none-any.whl (9.0 kB view details)

Uploaded Jul 22, 2019 Python 3

File details

Details for the file s3select-0.0.14.tar.gz.

File metadata

Download URL: s3select-0.0.14.tar.gz
Upload date: Jul 22, 2019
Size: 8.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/18.5 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/2.7.10

File hashes

Hashes for s3select-0.0.14.tar.gz
Algorithm	Hash digest
SHA256	`03747b87b44454a59548424ddddf936bbb2ed585a0d0a465c813e5364e014530`
MD5	`ce1752536c4cbc47d7a5741cab1059e4`
BLAKE2b-256	`3c470b9794b4612faa8b54c264f0bbbd6483ce2aa3cf96ae296a48a7feccc000`

See more details on using hashes here.

File details

Details for the file s3select-0.0.14-py3-none-any.whl.

File metadata

Download URL: s3select-0.0.14-py3-none-any.whl
Upload date: Jul 22, 2019
Size: 9.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/18.5 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/2.7.10

File hashes

Hashes for s3select-0.0.14-py3-none-any.whl
Algorithm	Hash digest
SHA256	`52e6bfb46832ea4eae86d0f5caf24feefdacdd46ac5c1cfd6aef7f26d42e3425`
MD5	`39b82e39acad73d0943b43a331ecc780`
BLAKE2b-256	`98a6bb6ecc46d8d9e771085aa3cc4f296d64f3f8e0f1b0f261359fb6162b7de2`

See more details on using hashes here.

s3select 0.0.14

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

S3 select

Motivation

Features at a glance

Installation and Upgrade

Authentication

Example usage

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes