AWS Data Wrangler (BETA)

The missing link between AWS services and the most popular Python data libraries.

CAUTION: This project is in beta and has not been battle-tested yet.

Read the docs!

Check how AWS Data Wrangler can process small data more than 10x cheaper and 10x faster than Spark!

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift) and the most popular Python libraries for lightweight workloads.

The rationale behind AWS Data Wrangler is to use the right tool for each job, and this project was developed with lightweight jobs in mind. The dividing line is never entirely clear and depends on many factors, but a good rule of thumb we found during testing is that if your workload is around 5 GB of plain text or less, you should go with AWS Data Wrangler instead of the big data tools.

There are usually two types of use cases when dealing with data: heavy workloads, which are better handled by distributed tools and services such as EMR/Spark, and lightweight workloads, which can be handled more efficiently with simpler tools. The latter is where AWS Data Wrangler comes into action.

For example, in AWS Glue you can choose between two types of job: distributed with Apache Spark, or single node with Python Shell. In this case AWS Data Wrangler would use the single-node Python Shell job option (or even AWS Lambda), resulting in lower cost and more speed.
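
The rule of thumb above can be sketched as a small helper. This is only an illustration: the function name, the return labels and the exact 5 GB cut-off are assumptions made for this example and are not part of AWS Data Wrangler.

```python
GB = 10 ** 9  # decimal gigabyte, used only for this illustration


def suggest_job_type(plain_text_bytes, threshold_gb=5):
    """Suggest a job type from the approximate plain-text size of the workload.

    Hypothetical helper: the 5 GB default mirrors the rule of thumb in the text,
    and should be tuned to your own tests.
    """
    if plain_text_bytes <= threshold_gb * GB:
        return "python_shell"  # single node (Glue Python Shell or even Lambda)
    return "spark"  # distributed (Glue Spark job or EMR)


print(suggest_job_type(2 * GB))   # python_shell
print(suggest_job_type(50 * GB))  # spark
```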


Contents: Installation | Usage | Known Limitations | Contributing | Dependencies | License


Installation

pip install awswrangler

AWS Data Wrangler runs only on Python 3.6 and later. It runs on AWS Lambda, AWS Glue, EC2, on-premises and locally.

P.S. The Lambda Layer bundle and the Glue egg are available to download. Just upload them to your account and run!

Usage

Writing Pandas Dataframe to Data Lake:

import awswrangler

session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

If a Glue database name is passed, all the metadata will be created in the Glue Catalog. If not, only the data is written to S3.
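
To make the effect of partition_cols more concrete, here is a minimal sketch of the Hive-style "column=value" prefixes that Glue and Athena commonly use for partitioned data. It assumes the write follows that convention; the helper name and the example bucket path are hypothetical and are not part of the library.

```python
def partition_prefix(base_path, row, partition_cols):
    """Build the Hive-style prefix a row would fall under (illustrative only)."""
    parts = ["{}={}".format(col, row[col]) for col in partition_cols]
    return "/".join([base_path.rstrip("/")] + parts) + "/"


# Hypothetical path and rows, used only to show the resulting layout.
rows = [
    {"col_name": "a", "value": 1},
    {"col_name": "b", "value": 2},
]
for row in rows:
    print(partition_prefix("s3://example-bucket/table", row, ["col_name"]))
```

Each distinct value of the partition column ends up under its own prefix, which is what lets a query engine skip partitions that a filter rules out.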

Reading from Data Lake to Pandas Dataframe:

session = awswrangler.Session()
dataframe = session.pandas.read_sql_athena(
    sql="select * from table",
    database="database"
)

Reading from S3 file to Pandas Dataframe:

session = awswrangler.Session()
dataframe = session.pandas.read_csv(path="s3://...")

Typical ETL:

import pandas
import awswrangler

dataframe = pandas.read_...  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

session = awswrangler.Session()
session.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

Dependencies

The AWS Data Wrangler project relies on other great initiatives:

Known Limitations

  • For now, it only writes the Parquet and CSV file formats
  • For now, there is no compression support
  • For now, there is no nested type support
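
One possible way to live with the missing nested type support, assuming your records arrive as nested dictionaries, is to flatten them into plain columns before building the DataFrame. The helper below is a hypothetical sketch in pure Python, not something the library provides.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the joined name as prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


record = {"id": 1, "user": {"name": "a", "geo": {"lat": 1.5}}}
print(flatten(record))
# {'id': 1, 'user_name': 'a', 'user_geo_lat': 1.5}
```

Lists and other nested containers are left untouched by this sketch and would need their own handling.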

Contributing

Almost all features rely on AWS services that do not have mock tools in the community yet (AWS Glue, AWS Athena), so we are focusing on integration tests instead of unit tests.

You will therefore need to provide an S3 bucket and a Glue/Athena database through environment variables.

export AWSWRANGLER_TEST_BUCKET=...

export AWSWRANGLER_TEST_DATABASE=...

CAUTION: This may incur costs in your AWS account.
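
As a minimal sketch of how a test setup could pick up these two variables and fail early when one is missing: the function name and the returned keys below are assumptions for illustration and do not describe how the project's test suite actually reads them.

```python
import os

REQUIRED_VARS = ("AWSWRANGLER_TEST_BUCKET", "AWSWRANGLER_TEST_DATABASE")


def load_test_config(environ=None):
    """Read the test bucket and database, raising if either is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "bucket": environ["AWSWRANGLER_TEST_BUCKET"],
        "database": environ["AWSWRANGLER_TEST_DATABASE"],
    }


# Usage with an explicit mapping (hypothetical values) instead of the real environment.
config = load_test_config(
    {"AWSWRANGLER_TEST_BUCKET": "my-test-bucket", "AWSWRANGLER_TEST_DATABASE": "my_test_db"}
)
print(config["bucket"], config["database"])
```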

make init

Make your changes...

make format

make lint

make test

License

This library is licensed under the Apache 2.0 License.

