Utilities for Pandas and Apache Spark on AWS.

AWS Data Wrangler (BETA)

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift, S3) and the most popular Python data libraries (Pandas, Apache Spark).


Contents: Use Cases | Installation | Examples | License


Use Cases

  • Pandas Dataframe -> Parquet (S3)
  • Pandas Dataframe -> CSV (S3)
  • Pandas Dataframe -> Glue Catalog
  • Pandas Dataframe -> Redshift
  • Pandas Dataframe -> Athena
  • CSV (S3) -> Pandas Dataframe
  • Athena -> Pandas Dataframe
  • Spark Dataframe -> Redshift

Installation

pip install awswrangler

AWS Data Wrangler runs only on Python 3.6 and later. It runs on AWS Lambda, AWS Glue, EC2, on-premises servers, local machines, and so on.
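The Python version requirement can be checked up front, before the library is imported. The following is a minimal sketch using only the standard library; the helper name `python_is_supported` and the error message are illustrative assumptions, not part of AWS Data Wrangler.

```python
import sys

# Minimum interpreter version stated in this README.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum.

    Hypothetical helper for illustration; not part of awswrangler.
    """
    return tuple(version_info[:2]) >= minimum


# Fail early with a clear message instead of a confusing import error later.
if not python_is_supported():
    raise RuntimeError("AWS Data Wrangler requires Python 3.6 or later")
```

Running this guard at the top of a Lambda handler or Glue job script makes a version mismatch visible immediately.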

Examples

Writing Pandas Dataframe to Data Lake

import awswrangler

# `dataframe` is an existing Pandas Dataframe
session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    database="database",  # Glue database name
    path="s3://...",
    partition_cols=["col_name"],
)

If a Glue database name is passed, the table metadata is also created in the Glue Catalog. Otherwise, only the data is written to S3.
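The two write modes differ only in whether `database` is supplied. As a rough sketch, the keyword arguments for both calls can be assembled as shown below. The helper `build_to_parquet_kwargs` is hypothetical and exists only for illustration; the argument names are the ones used in the example above, and `dataframe=None` is a placeholder for a real Pandas Dataframe.

```python
def build_to_parquet_kwargs(dataframe, path, database=None, partition_cols=None):
    """Assemble keyword arguments for session.pandas.to_parquet.

    Hypothetical helper for illustration; not part of awswrangler.
    """
    kwargs = {"dataframe": dataframe, "path": path}
    if partition_cols:
        kwargs["partition_cols"] = list(partition_cols)
    if database is not None:
        # A Glue database name asks for the metadata to be created
        # in the Glue Catalog as well.
        kwargs["database"] = database
    return kwargs


# Data-only write: no "database" key, so only the S3 write is done.
s3_only = build_to_parquet_kwargs(dataframe=None, path="s3://...")

# Data plus Glue Catalog metadata.
with_catalog = build_to_parquet_kwargs(
    dataframe=None, path="s3://...", database="database"
)

# Usage, assuming an existing awswrangler session:
# session.pandas.to_parquet(**with_catalog)
```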

Reading from Data Lake to Pandas Dataframe

import awswrangler

session = awswrangler.Session()
dataframe = session.pandas.read_sql_athena(
    sql="select * from table",
    database="database",
)

Reading from S3 file to Pandas Dataframe

import awswrangler

session = awswrangler.Session()
dataframe = session.pandas.read_csv(path="s3://...")

Typical Pandas ETL

import pandas
import awswrangler

df = pandas.read_...  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

session = awswrangler.Session()
session.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=df,  # the Dataframe read and transformed above
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)
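The `partition_cols` argument in the call above splits the dataset by column value. This README does not describe the resulting object layout, so the sketch below assumes the common Hive-style `column=value` prefix convention. The function `partition_prefixes`, the bucket name, and the sample rows are all hypothetical, and the code uses plain dictionaries rather than a Dataframe to stay dependency-free.

```python
from collections import defaultdict


def partition_prefixes(rows, base_path, partition_cols):
    """Group rows by partition values and return {prefix: rows}.

    Assumes Hive-style "column=value" prefixes; illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        parts = "/".join(f"{col}={row[col]}" for col in partition_cols)
        groups[f"{base_path.rstrip('/')}/{parts}/"].append(row)
    return dict(groups)


# Hypothetical sample data and bucket name.
rows = [
    {"id": 1, "col_name": "a"},
    {"id": 2, "col_name": "b"},
    {"id": 3, "col_name": "a"},
]
layout = partition_prefixes(rows, "s3://bucket/dataset", ["col_name"])

for prefix in sorted(layout):
    print(prefix, len(layout[prefix]))
```

Each distinct value of `col_name` maps to its own prefix, which is what lets query engines skip partitions that a filter excludes.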

Loading Spark Dataframe to Redshift

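import awswrangler

# `spark` is an existing SparkSession, `df` a Spark Dataframe,
# and `conn` is assumed to be an open Redshift connection
session = awswrangler.Session(spark_session=spark)
session.spark.to_redshift(
    dataframe=df,
    path="s3://...",
    connection=conn,
    schema="public",
    table="table",
    iam_role="IAM_ROLE_ARN",
    mode="append",
)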

License

This library is licensed under the Apache 2.0 License.


