
An object-oriented interface for defining parquet datasets for AWS built on top of awswrangler and pandera

Project description

aws-parquet



aws-parquet is a toolkit that enables working with parquet datasets on AWS. It handles AWS S3 reads and writes, AWS Glue catalog updates, and AWS Athena queries through a simple and intuitive interface.

Motivation

The goal is to provide a single, object-oriented interface for creating and managing parquet datasets on AWS.

aws-parquet makes use of the following tools:

  • awswrangler for the underlying S3, Glue, and Athena operations
  • pandera for schema validation and type casting

Features

aws-parquet provides a ParquetDataset class that enables the following operations:

  • create a parquet dataset that will get registered in AWS Glue
  • append new data to the dataset and update the AWS Glue catalog
  • read a partition of the dataset and perform proper schema validation and type casting
  • overwrite data in the dataset after performing proper schema validation and type casting
  • delete a partition of the dataset and update the AWS Glue catalog
  • query the dataset using AWS Athena

How to set up

Using pip:

pip install aws_parquet

How to use

Create a parquet dataset that will get registered in AWS Glue

import os

from aws_parquet import ParquetDataset
import pandas as pd
import pandera as pa
from pandera.typing import Series

# define your pandera schema model
class MyDatasetSchemaModel(pa.SchemaModel):
    col1: Series[int] = pa.Field(nullable=False, ge=0, lt=10)
    col2: Series[pa.DateTime]
    col3: Series[float]

# configuration
database = "default"
bucket_name = os.environ["AWS_S3_BUCKET"]
table_name = "foo_bar"
path = f"s3://{bucket_name}/{table_name}/"
partition_cols = ["col1", "col2"]
schema = MyDatasetSchemaModel.to_schema()

# create the dataset
dataset = ParquetDataset(
    database=database,
    table=table_name,
    partition_cols=partition_cols,
    path=path,
    pandera_schema=schema,
)

dataset.create()
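
At this point the table should be registered in the AWS Glue catalog. As a quick sanity check, a minimal sketch using awswrangler directly (the library aws-parquet builds on, not part of the aws-parquet API):

import awswrangler as wr

# list the tables registered in the Glue database and confirm the new one is there
glue_tables = wr.catalog.tables(database=database)
assert table_name in glue_tables["Table"].values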

Append new data to the dataset

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [1.0, 2.0, 3.0]
})

dataset.update(df)
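
Updates are validated against the pandera schema before being written, so a frame that violates the declared constraints is rejected. A hedged sketch of a failing update (the exact exception raised depends on how pandera reports the failure):

from pandera.errors import SchemaError, SchemaErrors

bad_df = pd.DataFrame({
    "col1": [42],  # violates the lt=10 constraint declared in the schema model
    "col2": ["2021-01-04"],
    "col3": [7.0],
})

try:
    dataset.update(bad_df)
except (SchemaError, SchemaErrors) as err:
    print(f"update rejected: {err}")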

Read a partition of the dataset

df = dataset.read({"col2": "2021-01-01"})
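
Reads go through the same schema validation and type casting, so the returned frame should already carry the dtypes declared in the schema model. A minimal check using pandas only:

# col2 comes back as a datetime column and col3 as a float column, per the schema
print(df.dtypes)
assert pd.api.types.is_datetime64_any_dtype(df["col2"])
assert pd.api.types.is_float_dtype(df["col3"])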

Overwrite data in the dataset

df_overwrite = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [4.0, 5.0, 6.0]
})
dataset.update(df_overwrite, overwrite=True)
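
Reading the same partition back (as in the read example above) should now reflect the overwritten values:

# col3 for the 2021-01-01 partition is now 4.0 instead of 1.0
df_check = dataset.read({"col2": "2021-01-01"})
print(df_check)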

Query the dataset using AWS Athena

df = dataset.query("SELECT col1 FROM foo_bar")
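
Any SQL that Athena accepts for the registered table can be passed. For example, a hedged sketch of a simple aggregation over the same table:

# hypothetical aggregation; runs as a regular Athena query against foo_bar
df_counts = dataset.query("SELECT col1, COUNT(*) AS n FROM foo_bar GROUP BY col1")
print(df_counts)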

Delete a partition of the dataset

dataset.delete({"col1": 1, "col2": "2021-01-01"})

Delete the dataset in its entirety

dataset.delete()
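
If the full delete also removes the table from the AWS Glue catalog (as the partition-level delete behavior suggests), the same awswrangler check as above can confirm it:

import awswrangler as wr

# the table should no longer appear in the Glue catalog after a full delete
assert table_name not in wr.catalog.tables(database=database)["Table"].values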


