
AWS Data Wrangler (BETA)


Utilities for Pandas and Apache Spark on AWS

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift, S3) and the most popular Python data libraries (Pandas, Apache Spark).


Contents: Use Cases | Installation | Usage | License


Use Cases

  • Pandas DataFrame -> Parquet (S3)
  • Pandas DataFrame -> CSV (S3)
  • Pandas DataFrame -> Glue Catalog
  • Pandas DataFrame -> Redshift
  • Pandas DataFrame -> Athena
  • CSV (S3) -> Pandas DataFrame
  • Athena -> Pandas DataFrame
  • Spark DataFrame -> Redshift

Installation

pip install awswrangler

AWS Data Wrangler requires Python 3.6 or later and runs on AWS Lambda, AWS Glue, EC2, on-premises servers, local machines, etc.
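Because the library only supports Python 3.6+, a small guard at the top of a script or Lambda handler can fail fast on older interpreters. This is a generic sketch in plain Python, not part of the awswrangler API:

```python
import sys

def check_python_version(minimum=(3, 6)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum

if not check_python_version():
    raise RuntimeError("awswrangler requires Python 3.6 or later")
```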

Usage

Writing a Pandas DataFrame to the Data Lake

import awswrangler

session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

If a Glue database name is passed, all the metadata will be created in the Glue Catalog; if not, only the S3 data write is performed.
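Behind the scenes, partition_cols produces a Hive-style key layout under the S3 path, which is what allows Glue and Athena to prune partitions. A rough sketch of that layout (the helper, bucket, and column names are illustrative, not awswrangler internals):

```python
def partition_prefix(base_path, partition_values):
    """Build a Hive-style S3 prefix from ordered (column, value) pairs."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([base_path.rstrip("/")] + parts)

# Files for one partition end up under a prefix such as:
print(partition_prefix("s3://bucket/table/", [("year", 2019), ("month", 7)]))
# s3://bucket/table/year=2019/month=7
```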

Reading from the Data Lake to a Pandas DataFrame

import awswrangler

session = awswrangler.Session()
dataframe = session.pandas.read_sql_athena(
    sql="select * from table",
    database="database"
)

Reading an S3 file into a Pandas DataFrame

import awswrangler

session = awswrangler.Session()
dataframe = session.pandas.read_csv(path="s3://...")

Typical Pandas ETL

import pandas
import awswrangler

df = pandas.read_...  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

session = awswrangler.Session()
session.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=df,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)
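The "transformation HERE" step is ordinary Pandas. As one hypothetical example (the column names are made up for illustration), a cleanup pass might normalize a text column and derive a year column to partition on:

```python
import pandas

def transform(df):
    """Example cleanup: trim/lowercase a text column, derive a year column."""
    out = df.copy()
    out["name"] = out["name"].str.strip().str.lower()
    out["year"] = pandas.to_datetime(out["created_at"]).dt.year
    return out

df = pandas.DataFrame({
    "name": ["  Alice ", "BOB"],
    "created_at": ["2019-07-01", "2018-01-15"],
})
df = transform(df)
```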

Loading a Spark DataFrame to Redshift

import awswrangler

session = awswrangler.Session(spark_session=spark)
session.spark.to_redshift(
    dataframe=df,
    path="s3://...",
    connection=conn,
    schema="public",
    table="table",
    iam_role="IAM_ROLE_ARN",
    mode="append",
)
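The load goes through S3: the DataFrame is staged as files under path, and Redshift then ingests them with a COPY command authorized by the given IAM role. A sketch of the kind of COPY statement involved (the helper and the exact SQL shape are illustrative, not the library's actual internals):

```python
def build_copy_statement(schema, table, s3_prefix, iam_role):
    """Compose a Redshift COPY command that loads staged files from S3."""
    return (
        f"COPY {schema}.{table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )

print(build_copy_statement(
    "public", "table", "s3://bucket/stage/", "IAM_ROLE_ARN"
))
```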

License

This library is licensed under the Apache 2.0 License.


Download files


Source Distribution

awswrangler-0.0b13.tar.gz (17.3 kB)

Uploaded Source

Built Distribution


awswrangler-0.0b13-py36.py37-none-any.whl (20.3 kB)

Uploaded Python 3.6, Python 3.7

File details

Details for the file awswrangler-0.0b13.tar.gz.

File metadata

  • Download URL: awswrangler-0.0b13.tar.gz
  • Upload date:
  • Size: 17.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for awswrangler-0.0b13.tar.gz

  • SHA256: b52991ca77a922e0b7185f13b7fea09a586a68cf44438e54bf92270ff24bfa1b
  • MD5: 23cd614c325180bee3d96a0c2f5bb771
  • BLAKE2b-256: cc4b703a0b0fed671d1535e2ee5e259f139eba195678a66d48a7cb8f7a175c62


File details

Details for the file awswrangler-0.0b13-py36.py37-none-any.whl.

File metadata

  • Download URL: awswrangler-0.0b13-py36.py37-none-any.whl
  • Upload date:
  • Size: 20.3 kB
  • Tags: Python 3.6, Python 3.7
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for awswrangler-0.0b13-py36.py37-none-any.whl

  • SHA256: e0b4b1cb336d28f666feffa643292130a2e522da345069a6ade9bc5e8b32f972
  • MD5: 8fa2c31d53576c89cfd5b25ef792f243
  • BLAKE2b-256: 3a8c60b1c0ae74fed0550c6ffaf65913cd79e92dc0a37708bfb9724b7b6602c1

