Utilities for Pandas and Apache Spark on AWS.

AWS Data Wrangler (BETA)

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift, S3) and the most popular Python data libraries (Pandas, Apache Spark).


Contents: Use Cases | Installation | Examples | License


Use Cases

  • Pandas Dataframe -> Parquet (S3)
  • Pandas Dataframe -> CSV (S3)
  • Pandas Dataframe -> Glue Catalog
  • Pandas Dataframe -> Redshift
  • Pandas Dataframe -> Athena
  • CSV (S3) -> Pandas Dataframe
  • Athena -> Pandas Dataframe
  • Spark Dataframe -> Redshift

Installation

pip install awswrangler

AWS Data Wrangler runs only on Python 3.6 and later. It can run on AWS Lambda, AWS Glue, EC2, on-premises servers, local machines, etc.
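Because of the Python 3.6 requirement, a script may want to fail early with a clear message on older interpreters. The following is a minimal sketch using only the standard library; the helper name `require_python` is hypothetical and is not part of awswrangler.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the installation notes


def require_python(minimum=MIN_PYTHON, current=None):
    """Raise RuntimeError if the interpreter is older than `minimum`.

    `current` defaults to the running interpreter's (major, minor) version;
    it is a parameter only so the check can be exercised in isolation.
    """
    minimum = tuple(minimum)
    current = tuple(current or sys.version_info[:2])
    if current < minimum:
        raise RuntimeError(
            "Python %d.%d or newer is required, found %d.%d"
            % (minimum + current)
        )
    return current


# Check the running interpreter before importing awswrangler.
require_python()
```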

Examples

Writing Pandas Dataframe to Data Lake

session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

If a Glue database name is passed, all the metadata will be created in the Glue Catalog. Otherwise, only the data is written to S3.
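The two behaviours described above can be sketched as a small wrapper that forwards `database` only when one is given. This is an illustration under assumptions: `write_dataset` is a hypothetical helper, and it assumes that leaving out the `database` argument is how "no Glue database" is expressed. A stand-in object replaces `awswrangler.Session()` so the sketch runs without AWS credentials.

```python
from types import SimpleNamespace


def write_dataset(session, dataframe, path, partition_cols, database=None):
    """Write `dataframe` to `path`; pass `database` on only if it was given."""
    kwargs = {
        "dataframe": dataframe,
        "path": path,
        "partition_cols": partition_cols,
    }
    if database is not None:
        # With a Glue database name, metadata is also created in the Glue Catalog.
        kwargs["database"] = database
    return session.pandas.to_parquet(**kwargs)


# Stand-in for awswrangler.Session(): it only records the keyword arguments
# that would have been passed to session.pandas.to_parquet.
calls = []
fake_session = SimpleNamespace(
    pandas=SimpleNamespace(to_parquet=lambda **kw: calls.append(kw))
)

# S3 write only.
write_dataset(fake_session, "df", "s3://...", ["col_name"])
# S3 write plus Glue Catalog metadata.
write_dataset(fake_session, "df", "s3://...", ["col_name"], database="database")
```

With a real session, replace `fake_session` with `awswrangler.Session()` and `"df"` with a Pandas Dataframe.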

Reading from Data Lake to Pandas Dataframe

session = awswrangler.Session()
dataframe = session.pandas.read_sql_athena(
    sql="select * from table",
    database="database"
)

Reading from S3 file to Pandas Dataframe

session = awswrangler.Session()
dataframe = session.pandas.read_csv(path="s3://...")

Typical Pandas ETL

import pandas
import awswrangler

dataframe = pandas.read_...  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

session = awswrangler.Session()
session.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)
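As an illustration of the transformation step in the ETL above, the sketch below cleans column types and derives the column used for partitioning. The input data and the `event_time` and `amount` column names are made up for this example; only `col_name` comes from the snippet above.

```python
import pandas

# Hypothetical input; in practice this would come from pandas.read_...
dataframe = pandas.DataFrame(
    {
        "event_time": [
            "2019-07-01 10:00:00",
            "2019-07-01 18:30:00",
            "2019-07-02 09:15:00",
        ],
        "amount": ["10.5", "3", "7.25"],
    }
)

# Typical Pandas transformations: fix dtypes, then derive the partition column.
dataframe["amount"] = dataframe["amount"].astype(float)
dataframe["event_time"] = pandas.to_datetime(dataframe["event_time"])
dataframe["col_name"] = dataframe["event_time"].dt.strftime("%Y-%m-%d")
```

The resulting `dataframe` can then be handed to `session.pandas.to_parquet` with `partition_cols=["col_name"]`, as in the call above.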

Loading Spark Dataframe to Redshift

session = awswrangler.Session(spark_session=spark)
session.spark.to_redshift(
    dataframe=df,
    path="s3://...",
    connection=conn,
    schema="public",
    table="table",
    iam_role="IAM_ROLE_ARN",
    mode="append",
)

License

This library is licensed under the Apache 2.0 License.

Download files

Source Distribution

awswrangler-0.0b15.tar.gz (17.3 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3
  • SHA256: ea8051e13846b924232bb295ed4a57397dce7de478627807a3718ec358ca5dfc
  • MD5: 7c58d4524ed11b8846ddfcfcf32a481a
  • BLAKE2b-256: 681af70337512f19ef5f8b9ebfb19c153ad758e1aae8ace7654812da52444083

Built Distribution

awswrangler-0.0b15-py36,py37-none-any.whl (20.4 kB)

  • Tags: Python 3.6, py37
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3
  • SHA256: 3dc74b6a341603e048427faa0d15e6d0677a99d1450ddb55e4f657fd50b34437
  • MD5: df9e8e2b488ea2a5ae32f0df4acd4f67
  • BLAKE2b-256: 5b92dff40a6d8226fed881838f9ab42c5990154df1c27a81a62795921da68eed
