
Project description

AWS Data Wrangler (BETA)


The missing link between AWS services and the most popular Python data libraries.

CAUTION: This project is in BETA and has not been battle-tested yet.

Read the docs!

AWS Data Wrangler aims to fill a gap between AWS Analytics Services (Glue, Athena, EMR, Redshift) and the most popular Python libraries for lightweight workloads.

The rationale behind AWS Data Wrangler is to use the right tool for each job, and this project was developed with lightweight jobs in mind. The dividing line is never clear-cut and depends on many factors, but a good rule of thumb we discovered during testing is: if your workload is around 5 GB of plain text or less, go with AWS Data Wrangler instead of the established big data tools.

Usually there are two types of use cases when dealing with data: heavy workloads, which are better handled by distributed tools and services like EMR/AWS Glue Spark jobs, and lightweight workloads, which can be processed more efficiently with simpler tools. The latter is where AWS Data Wrangler comes into action.

For example, in AWS Glue you can choose between two different job types: distributed with Apache Spark, or single node with Python Shell. For lightweight cases, AWS Data Wrangler would use the single node Python Shell job (or even AWS Lambda), resulting in lower cost and less warm-up time.

Rationale Image


Contents: Installation | Usage | Known Limitations | Contributing | Dependencies | License


Installation

pip install awswrangler

AWS Data Wrangler runs on Python 2 and 3, and it runs on AWS Lambda, AWS Glue, EC2, on-premises servers, and local machines.

P.S. The Lambda Layer bundle and the Glue egg are available for download. Just upload them to your account and run! :rocket:

Usage

Writing Pandas Dataframe to Data Lake:

awswrangler.s3.write(
    df=df,
    database="database",
    path="s3://...",
    file_format="parquet",
    preserve_index=True,
    mode="overwrite",
    partition_cols=["col"],
)

If a Glue database name is passed, all the metadata will be created in the Glue Catalog; if not, only the S3 data write will be done.
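
For example, omitting the database argument writes only the data to S3, with no entry in the Glue Catalog (a minimal sketch of the same call; whether database is omitted or passed as None may differ):

awswrangler.s3.write(
    df=df,
    path="s3://...",
    file_format="parquet",
    mode="overwrite",
    partition_cols=["col"],
)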

Reading from Data Lake to Pandas Dataframe:

df = awswrangler.athena.read("database", "select * from table")

Reading from an "infinite" S3 source into Pandas Dataframes through generators. You can set a maximum chunk size in bytes so that each chunk fits in whatever memory you have:

for df in awswrangler.s3.read(path="s3://...", max_size=500):
    print(df)
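
For instance, you can aggregate across chunks without ever holding the whole dataset in memory (a small sketch reusing the call above; max_size is in bytes):

total_rows = 0
for df in awswrangler.s3.read(path="s3://...", max_size=500):
    total_rows += len(df)  # only the current chunk is held in memory
print(total_rows)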

Typical ETL:

import pandas
import awswrangler

df = pandas.read_csv("s3://your_bucket/your_object.csv")  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!
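# For illustration only, a hypothetical transformation step could look like:
df = df.drop_duplicates()          # drop exact duplicate rows
df["col"] = df["col"].astype(str)  # normalize the column used for partitioning below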

awswrangler.s3.write(  # Storing the data and metadata to the Data Lake
    df=df,
    database="database",
    path="s3://...",
    file_format="parquet",
    preserve_index=True,
    mode="overwrite",
    partition_cols=["col"],
)

Dependencies

The AWS Data Wrangler project relies on other great initiatives:

Known Limitations

  • For now, only the Parquet and CSV file formats are supported (see the CSV sketch after this list)
  • No compression support yet
  • No nested types support yet
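
Within these limits you can still pick the output format explicitly; a CSV write would look roughly like this (a sketch mirroring the Parquet call above):

awswrangler.s3.write(
    df=df,
    database="database",
    path="s3://...",
    file_format="csv",  # "parquet" and "csv" are the only formats supported for now
    mode="overwrite",
)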

Contributing

Almost all features rely on AWS services that don't yet have community mock tools (AWS Glue, AWS Athena), so we are focusing on integration tests instead of unit tests.

You will therefore need to provide an S3 bucket and a Glue/Athena database through environment variables:

export AWSWRANGLER_TEST_BUCKET=...

export AWSWRANGLER_TEST_DATABASE=...

CAUTION: This may incur costs in your AWS account.

make init

Make your changes...

make format

make lint

make test
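
For reference, an integration test could consume those environment variables roughly like this (a hypothetical sketch; the actual test suite may be organized differently):

import os

import pandas
import awswrangler

BUCKET = os.environ["AWSWRANGLER_TEST_BUCKET"]
DATABASE = os.environ["AWSWRANGLER_TEST_DATABASE"]

def test_round_trip():
    df = pandas.DataFrame({"col": [1, 2, 3]})
    awswrangler.s3.write(
        df=df,
        database=DATABASE,
        path="s3://{}/test_table/".format(BUCKET),
        file_format="parquet",
        mode="overwrite",
    )
    # Assumes the catalog table is named after the path suffix ("test_table").
    df2 = awswrangler.athena.read(DATABASE, "SELECT * FROM test_table")
    assert len(df2) == len(df)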

License

This library is licensed under the Apache 2.0 License.

Project details


Release history

Download files

Download the file for your platform.

Source Distribution

awswrangler-0.0b2.tar.gz (15.5 kB)

Uploaded Source

Built Distribution


awswrangler-0.0b2-py27.py36.py37-none-any.whl (19.2 kB)

Uploaded Python 2.7, py36, py37

File details

Details for the file awswrangler-0.0b2.tar.gz.

File metadata

  • Download URL: awswrangler-0.0b2.tar.gz
  • Upload date:
  • Size: 15.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.1

File hashes

Hashes for awswrangler-0.0b2.tar.gz
Algorithm Hash digest
SHA256 0fee0197b42e91886732ceebbb9ed142bac41df5c564fcbd00f53fa956b7170f
MD5 3de4d7dbfc8c357ab86abf76d8ec085f
BLAKE2b-256 03e951ffc8f4e29a16d94062bbea995378dc7a50bb61b97d401171bfffdef662


File details

Details for the file awswrangler-0.0b2-py27.py36.py37-none-any.whl.

File metadata

  • Download URL: awswrangler-0.0b2-py27.py36.py37-none-any.whl
  • Upload date:
  • Size: 19.2 kB
  • Tags: Python 2.7, py36, py37
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.1

File hashes

Hashes for awswrangler-0.0b2-py27.py36.py37-none-any.whl
Algorithm Hash digest
SHA256 47ed59a0881f6c8c899954de1ce97c5ec43f6f735ea9177b96e8be6ab13bf642
MD5 f2271becdb4f1f6f1d8ddb36ae663798
BLAKE2b-256 597335e38cfb9ef9c9833ea7ce2e5042b3200cff9dc888e01dbfa63552de1f7c

