

AWS Data Wrangler (BETA)


The missing link between AWS services and the most popular Python data libraries.

The right tool for each job.

CAUTION: This project is in BETA and has not been battle-tested yet.

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift) and the most popular Python libraries for lightweight workloads.

The rationale behind AWS Data Wrangler is to use the right tool for each job. The right choice is rarely obvious and depends on many factors, but a good rule of thumb we discovered during testing is that if your workload is around 5 GB of plain text or less, you should use AWS Data Wrangler instead of the established big data tools.

AWS Glue is a good illustration of the rationale. It offers two types of job: distributed with Apache Spark, or single node with Python Shell.
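The rule of thumb can be sketched as a small helper. This is only an illustration: the function name, the returned labels, and the reading of "5 GB" as 5 * 10**9 bytes are assumptions, not part of AWS Data Wrangler.

```python
# Hypothetical helper: not part of AWS Data Wrangler.
# Assumption: "5 GB" is read as 5 * 10**9 bytes of plain-text input.
PLAIN_TEXT_THRESHOLD_BYTES = 5 * 10**9


def suggest_glue_job_type(plain_text_bytes, threshold=PLAIN_TEXT_THRESHOLD_BYTES):
    """Return a rough suggestion for which kind of Glue job fits a workload."""
    if plain_text_bytes < 0:
        raise ValueError("plain_text_bytes must be non-negative")
    if plain_text_bytes <= threshold:
        # Small enough for a single node: Python Shell with AWS Data Wrangler.
        return "Python Shell (single node)"
    # Larger workloads are better served by a distributed engine.
    return "Apache Spark (distributed)"


print(suggest_glue_job_type(800 * 10**6))  # an 800 MB workload
print(suggest_glue_job_type(50 * 10**9))   # a 50 GB workload
```

The threshold is a parameter because, as noted above, the right cut-off depends on many factors and 5 GB is only a starting point.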

Rationale Image


Contents: Installation | Usage | Known Limitations | Contributing | Dependencies | License


Installation

pip install awswrangler

AWS Data Wrangler runs on Python 2 and 3.

Usage

Writing a Pandas DataFrame to the Data Lake:

awswrangler.s3.write(
    df=df,
    database="database",
    path="s3://...",
    file_format="parquet",
    preserve_index=True,
    mode="overwrite",
    partition_cols=["col"],
)

If a Glue database name is passed, all the metadata will be created in the Glue Catalog. If not, only the data will be written to S3.
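As a sketch of these two modes, the hypothetical helper below builds the keyword arguments for `awswrangler.s3.write` and includes `database` only when a Glue database name is given. The helper name is made up for illustration, and it is an assumption that `database` can simply be omitted from the call; the text above only says that without a database name just the S3 write happens.

```python
def build_write_kwargs(df, path, database=None, file_format="parquet"):
    """Collect keyword arguments for awswrangler.s3.write (hypothetical helper)."""
    kwargs = {"df": df, "path": path, "file_format": file_format}
    if database is not None:
        # With a Glue database name, metadata is also created in the Glue Catalog.
        kwargs["database"] = database
    return kwargs


# Placeholder standing in for a real pandas DataFrame.
df = object()

data_only = build_write_kwargs(df, "s3://...")
with_catalog = build_write_kwargs(df, "s3://...", database="database")

print(sorted(data_only))
print(sorted(with_catalog))

# With awswrangler installed and AWS credentials configured, the call would be:
# awswrangler.s3.write(**with_catalog)
```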

Reading from the Data Lake into a Pandas DataFrame:

df = awswrangler.athena.read("database", "select * from table")

Typical ETL:

import pandas
import awswrangler

df = pandas.read_csv("s3://your_bucket/your_object.csv")  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

awswrangler.s3.write(  # Storing the data and metadata to Data Lake
    df=df,
    database="database",
    path="s3://...",
    file_format="parquet",
    preserve_index=True,
    mode="overwrite",
    partition_cols=["col"],
)
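To make the effect of `partition_cols` more concrete, the sketch below groups rows by a partition column and derives one prefix per distinct value. It assumes the Hive-style `column=value/` layout conventionally used for partitioned tables in Glue and Athena; the text above does not state the exact layout AWS Data Wrangler writes, and the function name and bucket path are made up for illustration.

```python
from collections import defaultdict


def partition_prefixes(rows, base_path, partition_col):
    """Group rows by partition_col and map each Hive-style prefix to its rows."""
    groups = defaultdict(list)
    base = base_path.rstrip("/")
    for row in rows:
        prefix = "{}/{}={}/".format(base, partition_col, row[partition_col])
        # The partition value lives in the path, so drop it from the stored row.
        groups[prefix].append({k: v for k, v in row.items() if k != partition_col})
    return dict(groups)


rows = [
    {"id": 1, "col": "a"},
    {"id": 2, "col": "b"},
    {"id": 3, "col": "a"},
]
layout = partition_prefixes(rows, "s3://example-bucket/table", "col")
for prefix in sorted(layout):
    print(prefix, layout[prefix])
```

A layout like this lets a query engine skip whole prefixes when a query filters on the partition column.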

Dependencies

The AWS Data Wrangler project relies on other great initiatives:

Known Limitations

  • Currently writes only the Parquet and CSV file formats
  • Currently reads only through AWS Athena
  • Currently there is no compression support
  • Currently there is no nested type support
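These limitations can be checked before attempting a write. The following is a minimal sketch with a hypothetical helper; it assumes the CSV format is selected with the string `"csv"` (only `"parquet"` appears in the usage examples) and treats `dict` and `list` values as nested types.

```python
SUPPORTED_WRITE_FORMATS = ("parquet", "csv")  # the "csv" spelling is an assumption


def check_write_request(file_format, sample_row):
    """Return a list of problems that conflict with the known limitations."""
    problems = []
    if file_format not in SUPPORTED_WRITE_FORMATS:
        problems.append("unsupported file format: {}".format(file_format))
    for name, value in sorted(sample_row.items()):
        # Nested values (structs, arrays) are not supported yet.
        if isinstance(value, (dict, list)):
            problems.append("nested type in column: {}".format(name))
    return problems


print(check_write_request("parquet", {"id": 1, "name": "x"}))
print(check_write_request("json", {"id": 1, "tags": ["a", "b"]}))
```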

Contributing

Almost all features rely on AWS services that do not yet have mock tools in the community (AWS Glue, AWS Athena), so we are focusing on integration tests instead of unit tests.

You will therefore need to provide an S3 bucket and a Glue/Athena database through environment variables.

export AWSWRANGLER_TEST_BUCKET=...

export AWSWRANGLER_TEST_DATABASE=...

CAUTION: This may incur costs in your AWS account.
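A test setup might read these two variables and fail early when either is missing. The sketch below is an assumption about how that could look, not the project's actual test code; the function name and the example values are hypothetical.

```python
import os

REQUIRED_VARS = ("AWSWRANGLER_TEST_BUCKET", "AWSWRANGLER_TEST_DATABASE")


def load_test_settings(environ=None):
    """Read the integration-test settings, failing early if any are missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Example with an explicit mapping instead of the real environment.
settings = load_test_settings({
    "AWSWRANGLER_TEST_BUCKET": "example-bucket",
    "AWSWRANGLER_TEST_DATABASE": "example_database",
})
print(settings["AWSWRANGLER_TEST_BUCKET"])
```

Calling `load_test_settings()` with no argument reads from the real environment, so a missing export shows up as a clear error before any AWS call is made.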

make init

Make your changes...

make format

make lint

make test

License

This library is licensed under the Apache 2.0 License.

