Pandas on AWS.
AWS Data Wrangler
An AWS Professional Service open source initiative | aws-proserve-opensource@amazon.com
Source | Installation Command |
---|---|
PyPI | pip install awswrangler |
Conda | conda install -c conda-forge awswrangler |
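After running either command, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its installed version. This is only a sketch: the helper name `installed_version` is hypothetical and not part of awswrangler.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical helper, shown here only to check the installation.
version = installed_version("awswrangler")
print(version if version is not None else "awswrangler is not installed")
```

`importlib.metadata` requires Python 3.8 or later.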
Table of contents
- Quick Start
- Read The Docs
- Community Resources
- Logging
- Who uses AWS Data Wrangler?
- Amazon SageMaker Data Wrangler?
Quick Start
Installation command: pip install awswrangler
import awswrangler as wr
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
# Getting a Redshift connection from the Glue Catalog and retrieving data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()
# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)
# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")
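The Timestream write above returns `rejected_records`, which is easy to ignore. A minimal sketch of checking it before continuing is shown below. It assumes the return value is a list of dictionaries, one per rejected record; the helper name `report_rejected` and the sample keys are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, List


def report_rejected(rejected_records: List[Dict[str, Any]]) -> str:
    """Build a short, human-readable summary of rejected records."""
    if not rejected_records:
        return "all records written"
    lines = [f"{len(rejected_records)} record(s) rejected:"]
    for record in rejected_records:
        # Print whatever keys are present instead of assuming a fixed schema.
        details = ", ".join(f"{key}={value}" for key, value in sorted(record.items()))
        lines.append("  " + details)
    return "\n".join(lines)


# Hypothetical sample data standing in for the value returned by wr.timestream.write.
sample = [{"RecordIndex": 1, "Reason": "example reason"}]
print(report_rejected(sample))
print(report_rejected([]))
```

In a real pipeline the summary could be logged or turned into an exception, depending on how strict the ingestion needs to be.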
Read The Docs
- What is AWS Data Wrangler?
- Install
- Tutorials
  - 001 - Introduction
  - 002 - Sessions
  - 003 - Amazon S3
  - 004 - Parquet Datasets
  - 005 - Glue Catalog
  - 006 - Amazon Athena
  - 007 - Databases (Redshift, MySQL and PostgreSQL)
  - 008 - Redshift - Copy & Unload
  - 009 - Redshift - Append, Overwrite and Upsert
  - 010 - Parquet Crawler
  - 011 - CSV Datasets
  - 012 - CSV Crawler
  - 013 - Merging Datasets on S3
  - 014 - Schema Evolution
  - 015 - EMR
  - 016 - EMR & Docker
  - 017 - Partition Projection
  - 018 - QuickSight
  - 019 - Athena Cache
  - 020 - Spark Table Interoperability
  - 021 - Global Configurations
  - 022 - Writing Partitions Concurrently
  - 023 - Flexible Partitions Filter
  - 024 - Athena Query Metadata
  - 025 - Redshift - Loading Parquet files with Spectrum
  - 026 - Amazon Timestream
  - 027 - Amazon Timestream 2
- API Reference
- License
- Contributing
- Legacy Docs (pre-1.0.0)
Community Resources
Please send a Pull Request with your resource reference and @githubhandle.
- Optimize Python ETL by extending Pandas with AWS Data Wrangler [@igorborgest]
- Reading Parquet Files With AWS Lambda [@anand086]
- Transform AWS CloudTrail data using AWS Data Wrangler [@anand086]
- Rename Glue Tables using AWS Data Wrangler [@anand086]
- Getting started on AWS Data Wrangler and Athena [@dheerajsharma21]
- Simplifying Pandas integration with AWS data related services [@bvsubhash]
Logging
Examples of enabling internal logging:
import logging
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)
In AWS Lambda:
import logging
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
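The snippets above rely on Python's logger hierarchy: a level set on the `awswrangler` logger also applies to any logger whose dotted name starts with `awswrangler.`, unless that child sets its own level. The following standard-library-only sketch illustrates this; the child logger name is hypothetical and chosen purely for illustration.

```python
import logging

# Same level settings as in the logging examples above.
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)

# Hypothetical child logger name, used only to show level inheritance.
s3_logger = logging.getLogger("awswrangler.s3._write")
cred_logger = logging.getLogger("botocore.credentials")


def effective_level_name(logger: logging.Logger) -> str:
    """Return the name of the level the logger actually applies."""
    return logging.getLevelName(logger.getEffectiveLevel())


s3_level = effective_level_name(s3_logger)
cred_level = effective_level_name(cred_logger)
print(s3_level)    # level inherited from the "awswrangler" parent logger
print(cred_level)  # level set directly on "botocore.credentials"
```

Setting levels alone does not attach a handler, which is why the first example also calls `logging.basicConfig`.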
Who uses AWS Data Wrangler?
Knowing which companies are using this library is important to help prioritize the project internally.
If you are willing, please send a Pull Request with your company name and @githubhandle.
- Amazon
- AWS
- Cepsa [@alvaropc]
- Cognitivo [@msantino]
- Digio [@afonsomy]
- DNX [@DNXLabs]
- Funcional Health Tech [@webysther]
- Informa Markets [@mateusmorato]
- LINE TV [@bryanyang0528]
- M4U [@Thiago-Dantas]
- nrd.io [@mrtns]
- OKRA Technologies [@JPFrancoia, @schot]
- Pier [@flaviomax]
- Pismo [@msantino]
- ringDNA [@msropp]
- Serasa Experian [@andre-marcos-perez]
- Shipwell [@zacharycarter]
- strongDM [@mrtns]
- Thinkbumblebee [@dheerajsharma21]
- Zillow [@nicholas-miles]
Amazon SageMaker Data Wrangler?
Amazon SageMaker Data Wrangler is a new SageMaker Studio feature with a similar name but a different purpose from the AWS Data Wrangler open source project.
- AWS Data Wrangler is open source, runs anywhere, and is focused on code.
- Amazon SageMaker Data Wrangler is specific to the SageMaker Studio environment and is focused on a visual interface.
Hashes for awswrangler-2.2.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 63a4bc53229234c765112f7929d1aa30cb90979d6ae9289470e5fb3172741e30 |
MD5 | b324f3af4e9b7489190bf33e5ffdc01a |
BLAKE2b-256 | 319c38155099ea1041f20f812a0f50a8ac7603d27448b81b00f01bfc5d5f9f4c |