
Write and read/query s3 parquet data using Athena/Spectrum/Hive style partitioning.


s3_parq

Parquet file management in S3 for hive-style partitioned data

What is this?

In many ways, parquet standards are still the wild west of data. Depending on your partitioning style, metadata store strategy, etc., you can tackle the big data beast in a multitude of different ways. This is an AWS-specific solution intended to serve as an interface between Python programs and the many tools used to access this data. s3_parq is an end-to-end solution for:

  1. writing data from pandas dataframes to s3 as partitioned parquet.
  2. reading data from s3 partitioned parquet that was created by s3_parq to pandas dataframes.
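As a rough illustration of what "hive-style partitioned" means for object keys, here is a minimal sketch in plain Python. It is not s3_parq's implementation: the helper names `partition_key` and `parse_partitions` are hypothetical, and the sketch assumes partition values contain no "/" or "=" characters.

```python
def partition_key(dataset, partitions, filename):
    """Build a hive-style key: dataset/col=value/.../filename (hypothetical helper)."""
    segments = [f"{col}={val}" for col, val in partitions.items()]
    return "/".join([dataset, *segments, filename])


def parse_partitions(key):
    """Recover {column: value} from a hive-style key; values come back as strings."""
    found = {}
    for segment in key.split("/"):
        if "=" in segment:
            col, _, val = segment.partition("=")
            found[col] = val
    return found


key = partition_key("my_dataset", {"id": 150, "region": "us-east"}, "part-0.parquet")
print(key)                    # my_dataset/id=150/region=us-east/part-0.parquet
print(parse_partitions(key))  # {'id': '150', 'region': 'us-east'}
```

Because the partition values live in the key itself, tools such as Athena, Spectrum and Hive can tell which objects belong to which partition without opening any file.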

NOTE: s3_parq writes metadata into the s3 objects and reads it back to filter records before any file i/o. This makes selecting datasets faster, but it also means data must have been written with s3_parq to be read with s3_parq.

TLDR - to read with s3_parq, you need to have written with s3_parq
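The idea of filtering before any file i/o can be sketched as follows. This is an assumption-laden illustration, not how s3_parq works internally: s3_parq uses metadata stored on the s3 objects, whereas this sketch reads partition values straight from hive-style key names. The function `prune_keys`, the `COMPARISONS` table and the `cast` parameter are all hypothetical; only the filter dictionary shape is taken from the Basic Usage example.

```python
import operator

# Hypothetical mapping from comparison strings to Python operators.
COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def prune_keys(keys, partition_filter, cast=int):
    """Keep only the keys whose partition value satisfies the filter.

    No object is downloaded; the decision uses the key names alone.
    """
    name = partition_filter["partition"]
    target = partition_filter["values"]
    compare = COMPARISONS[partition_filter["comparison"]]
    kept = []
    for key in keys:
        # Collect the col=value segments of this key into a dict.
        values = dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)
        if name in values and compare(cast(values[name]), target):
            kept.append(key)
    return kept


keys = [
    "my_dataset/id=100/a.parquet",
    "my_dataset/id=150/b.parquet",
    "my_dataset/id=200/c.parquet",
]
selected = prune_keys(keys, {"partition": "id", "values": 150, "comparison": ">="})
print(selected)  # ['my_dataset/id=150/b.parquet', 'my_dataset/id=200/c.parquet']
```

Only the selected keys would then need to be fetched and parsed, which is where the speedup comes from.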

Basic Usage

We get data by dataset name.

from s3_parq import S3Parq

## writing to s3
parq = S3Parq(bucket='mybucket', dataset='my_dataset', dataframe=pandas_dataframe_to_write)
parq.publish()

## reading from s3, getting only records with an id >= 150
parq = S3Parq(bucket='mybucket', dataset='my_dataset', filter={"partition": "id", "values": 150, "comparison": ">="})
retrieved_dataframe = parq.fetch()

Gotchas

  • filters can only be applied to partitions; this is because we do not actually

Contribution

We welcome pull requests! Some basic guidelines:

  • test yo' code. code coverage is important!
  • be respectful in pr comments, code comments, etc.



Download files


Files for s3parq, version 0.0.1
Filename                        Size      File type  Python version
s3parq-0.0.1-py3.7.egg          46.1 kB   Egg        3.7
s3parq-0.0.1-py3-none-any.whl   21.5 kB   Wheel      py3
