aws-parquet
An object-oriented interface for defining parquet datasets on AWS, built on top of awswrangler and pandera.
aws-parquet is a toolkit that enables working with parquet datasets on AWS. It handles AWS S3 reads and writes, AWS Glue catalog updates, and AWS Athena queries through a simple and intuitive interface.
Motivation
The goal is to provide a simple and intuitive interface to create and manage parquet datasets on AWS.
aws-parquet makes use of the following tools:
- awswrangler as the AWS SDK for pandas
- pandera for pandas-based data validation
- typeguard and pydantic for runtime type checking
Features
aws-parquet provides a ParquetDataset class that enables the following operations:
- create a parquet dataset that will get registered in AWS Glue
- append new data to the dataset and update the AWS Glue catalog
- read a partition of the dataset and perform proper schema validation and type casting
- overwrite data in the dataset after performing proper schema validation and type casting
- delete a partition of the dataset and update the AWS Glue catalog
- query the dataset using AWS Athena
How to set up
Using pip:
pip install aws_parquet
How to use
Create a parquet dataset that will get registered in AWS Glue
import os
from aws_parquet import ParquetDataset
import pandas as pd
import pandera as pa
from pandera.typing import Series
# define your pandera schema model
class MyDatasetSchemaModel(pa.SchemaModel):
    col1: Series[int] = pa.Field(nullable=False, ge=0, lt=10)
    col2: Series[pa.DateTime]
    col3: Series[float]
# configuration
database = "default"
bucket_name = os.environ["AWS_S3_BUCKET"]
table_name = "foo_bar"
path = f"s3://{bucket_name}/{table_name}/"
partition_cols = ["col1", "col2"]
schema = MyDatasetSchemaModel.to_schema()
# create the dataset
dataset = ParquetDataset(
    database=database,
    table=table_name,
    partition_cols=partition_cols,
    path=path,
    pandera_schema=schema,
)
dataset.create()
Append new data to the dataset
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [1.0, 2.0, 3.0],
})
dataset.update(df)
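Note that col2 is passed as strings and coerced to timestamps against the pandera schema. As a hedged illustration (assuming update applies the same schema validation that the read and overwrite operations perform), a frame violating the col1 < 10 constraint defined above should be rejected:
from pandera.errors import SchemaError

bad_df = pd.DataFrame({
    "col1": [42],  # violates the ge=0, lt=10 constraint on col1
    "col2": ["2021-01-04"],
    "col3": [7.0],
})

try:
    dataset.update(bad_df)
except SchemaError as err:
    print(f"rejected by schema validation: {err}")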
Read a partition of the dataset
df = dataset.read({"col2": "2021-01-01"})
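The returned frame is validated and cast to the schema types, so col2 comes back as a datetime column rather than the strings that were written. A quick sanity check using only the pandera schema defined above:
# col2 is read back as datetime64 after type casting
print(df.dtypes)

# re-validating against the schema should pass without modification
schema.validate(df)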
Overwrite data in the dataset
df_overwrite = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "col3": [4.0, 5.0, 6.0],
})
dataset.update(df_overwrite, overwrite=True)
Query the dataset using AWS Athena
df = dataset.query("SELECT col1 FROM foo_bar")
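Any Athena SQL statement against the registered table can be passed to query. As a sketch, an aggregate over the first partition column (the result depends on the data appended above):
# count rows per value of the first partition column
counts = dataset.query("SELECT col1, COUNT(*) AS n FROM foo_bar GROUP BY col1")
print(counts)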
Delete a partition of the dataset
dataset.delete({"col1": 1, "col2": "2021-01-01"})
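To verify the deletion, one hedged sketch (assuming that reading a partition that no longer exists returns an empty frame rather than raising):
# the deleted partition should now come back empty
remaining = dataset.read({"col1": 1, "col2": "2021-01-01"})
assert remaining.empty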
Delete the dataset in its entirety
dataset.delete()