
The missing link between AWS services and the most popular Python data libraries

Project description

AWS Data Wrangler (BETA)


Utilities for Pandas and Apache Spark on AWS

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift) and the most popular Python data libraries (Pandas, Apache Spark).


Contents: Use Cases | Installation | Usage | Rationale | Dependencies | Known Limitations | Contributing | License


Use Cases

  • Pandas DataFrame -> Parquet (S3)
  • Pandas DataFrame -> CSV (S3)
  • Pandas DataFrame -> Glue Catalog
  • Pandas DataFrame -> Redshift
  • Pandas DataFrame -> Athena
  • CSV (S3) -> Pandas DataFrame
  • Athena -> Pandas DataFrame
  • Spark DataFrame -> Redshift

Installation

pip install awswrangler

AWS Data Wrangler requires Python 3.6 or later, and it runs on AWS Lambda, AWS Glue, EC2, on-premises machines, local environments, etc.

P.S. The Lambda Layer bundle and the Glue egg are available for download. Just upload one to your account and run! :rocket:

Usage

Writing a Pandas DataFrame to the Data Lake:

import awswrangler

session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

If a Glue database name is passed, all the metadata will be created in the Glue Catalog as well. If not, only the S3 data write will be done.
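For intuition, `partition_cols` produces the usual Hive-style `column=value` key layout on S3. A minimal sketch of that layout (the helper function, bucket, and values below are illustrative assumptions, not the library's API):

```python
def partition_key(base_path, partition_cols, row):
    """Build an S3 key prefix like base/col=value/ for one row."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base_path.rstrip("/")] + parts) + "/"

# A row partitioned by "col_name" lands under its own prefix:
row = {"col_name": "2019-08-01", "value": 42}
print(partition_key("s3://bucket/dataset", ["col_name"], row))
# -> s3://bucket/dataset/col_name=2019-08-01/
```

Athena and Glue can then prune partitions by filtering on `col_name`.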

Reading from the Data Lake into a Pandas DataFrame:

session = awswrangler.Session()
dataframe = session.pandas.read_sql_athena(
    sql="select * from table",
    database="database"
)

Reading an S3 file into a Pandas DataFrame:

session = awswrangler.Session()
dataframe = session.pandas.read_csv(path="s3://...")

Typical Pandas ETL:

import pandas
import awswrangler

df = pandas.read_...  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

session = awswrangler.Session()
session.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=df,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

Loading a Spark DataFrame into Redshift:

session = awswrangler.Session(spark_session=spark)
session.spark.to_redshift(
    dataframe=df,
    path="s3://...",
    connection=conn,
    schema="public",
    table="table",
    iam_role="IAM_ROLE_ARN",
    mode="append",
)
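The `path` and `iam_role` parameters suggest the usual load pattern: stage the data on S3, then issue a Redshift COPY that assumes the given IAM role. A sketch of what such a COPY statement looks like (the helper below is an assumption for illustration, not the library's internals):

```python
def build_copy_statement(schema, table, path, iam_role):
    """Assemble a Redshift COPY statement for Parquet data staged on S3."""
    return (
        f"COPY {schema}.{table} "
        f"FROM '{path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )

print(build_copy_statement(
    "public", "table", "s3://bucket/stage/",
    "arn:aws:iam::123456789012:role/redshift-load",  # placeholder ARN
))
```

The IAM role must be attached to the Redshift cluster and allowed to read the staging path.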

Rationale

The rationale behind AWS Data Wrangler is to use the right tool for each job. The project was developed to support two kinds of challenges: small data (Pandas) and Big Data (Apache Spark). Choosing the right tool to wrangle your data is never clear-cut and depends on many factors, but a good rule of thumb we discovered during testing is: if your workload is around 5 GB of plain text or less, go with Pandas; otherwise, go with Apache Spark.
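The rule of thumb above is simple enough to encode directly (the threshold is the approximate figure from our tests, not a hard limit):

```python
def choose_engine(plain_text_gb):
    """Rule of thumb: ~5 GB of plain text or less -> Pandas,
    otherwise -> Apache Spark."""
    return "pandas" if plain_text_gb <= 5 else "spark"

print(choose_engine(2))   # pandas
print(choose_engine(50))  # spark
```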

For example, in AWS Glue you can choose between two different job types: distributed with Apache Spark, or single node with Python Shell.
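The two Glue job types differ mainly in the `Command` name passed when the job is created. A hedged sketch of the request arguments, using the boto3 Glue `create_job` parameter names (the job name, role ARN, and script path are placeholders):

```python
def glue_job_args(name, role_arn, script_s3_path, spark=True):
    """Build create_job kwargs for either Glue job type:
    'glueetl' = distributed Apache Spark, 'pythonshell' = single node."""
    command = {
        "Name": "glueetl" if spark else "pythonshell",
        "ScriptLocation": script_s3_path,
    }
    if not spark:
        command["PythonVersion"] = "3"
    return {"Name": name, "Role": role_arn, "Command": command}

args = glue_job_args(
    "my-pandas-job",                            # placeholder job name
    "arn:aws:iam::123456789012:role/glue-job",  # placeholder role ARN
    "s3://bucket/scripts/job.py",               # placeholder script path
    spark=False,
)
# boto3.client("glue").create_job(**args)  # uncomment to actually create it
```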

Below is an illustration showing how you can go faster and cheaper even with the simpler solution.

Rationale Image

Dependencies

The AWS Data Wrangler project relies on other great initiatives:

Known Limitations

  • Currently writes only Parquet and CSV file formats
  • No compression support yet
  • No nested type support yet

Contributing

Almost all features rely on AWS services (AWS Glue, AWS Athena) that do not yet have community mocking tools, so we focus on integration tests instead of unit tests.

License

This library is licensed under the Apache 2.0 License.
