# DataFrames on AWS
## Use Cases
### PySpark

FROM | TO | Features |
---|---|---|
PySpark DataFrame | Amazon Redshift | Blazing fast using parallel parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrames on the Glue Catalog |
Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays into child tables |
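The nested-to-flat conversion can be sketched in plain Python (hypothetical helpers, not the library's actual API): struct fields become prefixed columns on the parent table, and array fields are broken out into child tables keyed back to the parent row.

```python
def flatten_record(rec, prefix=""):
    """Flatten nested structs into prefixed columns; collect arrays separately."""
    flat, arrays = {}, {}
    for key, value in rec.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):       # struct -> prefixed columns
            child_flat, child_arrays = flatten_record(value, prefix=f"{name}_")
            flat.update(child_flat)
            arrays.update(child_arrays)
        elif isinstance(value, list):     # array -> rows of a child table
            arrays[name] = value
        else:
            flat[name] = value
    return flat, arrays


def flatten_table(records):
    """Return a flat parent table plus one child table per array column."""
    parent, children = [], {}
    for i, rec in enumerate(records):
        flat, arrays = flatten_record(rec)
        flat["id"] = i
        parent.append(flat)
        for col, items in arrays.items():
            children.setdefault(col, []).extend(
                {"parent_id": i, "value": item} for item in items
            )
    return parent, children


rows = [{"name": "a", "address": {"city": "NY", "zip": "1"}, "phones": ["1", "2"]}]
parent, children = flatten_table(rows)
# parent: [{'name': 'a', 'address_city': 'NY', 'address_zip': '1', 'id': 0}]
# children: {'phones': [{'parent_id': 0, 'value': '1'},
#                       {'parent_id': 0, 'value': '2'}]}
```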
### Pandas

FROM | TO | Features |
---|---|---|
Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes, KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
Amazon S3 | Pandas DataFrame | Parquet (pushdown filters), CSV, fixed-width formatted, Partitions, Parallelism, KMS Encryption, multiple files |
Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines: `ctas_approach=False` -> batching, suited to restricted-memory environments; `ctas_approach=True` -> blazing fast, with parallelism and enhanced data types |
Pandas DataFrame | Amazon Redshift | Blazing fast using parallel parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
Amazon Redshift | Pandas DataFrame | Blazing fast using parallel parquet on S3 behind the scenes |
Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL. Blazing fast using parallel CSV on S3 behind the scenes. Append/Overwrite modes |
Amazon Aurora | Pandas DataFrame | Supported engines: MySQL. Blazing fast using parallel CSV on S3 behind the scenes |
CloudWatch Logs Insights | Pandas DataFrame | Query results |
Glue Catalog | Pandas DataFrame | List and get table details. Good fit with Jupyter Notebooks |
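The write modes above differ in how existing data is treated. A minimal local sketch of the semantics, with a plain dict standing in for the partitioned S3 layout (none of these names are awswrangler APIs):

```python
def write(store, new_partitions, mode):
    """Simulate the three write modes over a {partition_key: rows} store.

    overwrite            -> drop everything, keep only the new data
    append               -> add new rows, keep all existing data
    overwrite_partitions -> (partitions-upsert) replace only the partitions
                            present in the new data, keep the rest untouched
    """
    if mode == "overwrite":
        return dict(new_partitions)
    if mode == "append":
        merged = {k: list(v) for k, v in store.items()}
        for part, rows in new_partitions.items():
            merged.setdefault(part, []).extend(rows)
        return merged
    if mode == "overwrite_partitions":
        merged = {k: list(v) for k, v in store.items()}
        merged.update(new_partitions)
        return merged
    raise ValueError(f"unknown mode: {mode}")


existing = {"date=2020-01-01": [1, 2], "date=2020-01-02": [3]}
incoming = {"date=2020-01-02": [99]}

assert write(existing, incoming, "overwrite") == {"date=2020-01-02": [99]}
assert write(existing, incoming, "append") == {
    "date=2020-01-01": [1, 2],
    "date=2020-01-02": [3, 99],
}
assert write(existing, incoming, "overwrite_partitions") == {
    "date=2020-01-01": [1, 2],
    "date=2020-01-02": [99],
}
```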
### General

Feature | Details |
---|---|
List S3 objects | e.g. `wr.s3.list_objects("s3://...")` |
Delete S3 objects | Parallel |
Delete listed S3 objects | Parallel |
Delete NOT listed S3 objects | Parallel |
Copy listed S3 objects | Parallel |
Get the size of S3 objects | Parallel |
Get CloudWatch Logs Insights query results | |
Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
Create EMR cluster | "For humans" |
Terminate EMR cluster | "For humans" |
Get EMR cluster state | "For humans" |
Submit EMR step(s) | "For humans" |
Get EMR step state | "For humans" |
Query Athena to receive python primitives | Returns `Iterable[Dict[str, Any]]` |
Load and unzip SageMaker job outputs | |
Dump Amazon Redshift as Parquet files on S3 | |
Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine |
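"Delete NOT listed S3 objects" boils down to a set difference between everything under a prefix and an allow-list of keys. A local sketch of that logic in pure Python (the real operation would list and delete via S3, in parallel):

```python
def keys_to_delete(all_keys, keep_keys):
    """Return the object keys that are NOT on the keep list (sorted)."""
    return sorted(set(all_keys) - set(keep_keys))


bucket_contents = ["data/a.parquet", "data/b.parquet", "data/c.parquet"]
keep = ["data/a.parquet"]

# Everything under the prefix except the listed keys gets deleted.
doomed = keys_to_delete(bucket_contents, keep)
# doomed -> ['data/b.parquet', 'data/c.parquet']
```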