
Productivity for your AWS Data Lake

Project description

DEPRECATION: This project was created only as a proof of concept. The production-ready version of this project is AWS Data Wrangler (pip install awswrangler).

Please consider moving forward to AWS Data Wrangler.



PandasGlue

A Python library for creating lightweight ETLs with the widely used Pandas library and the power of the AWS Glue Catalog.

With PandasGlue you can write to and read from an AWS Data Lake with a single line of code. True to its minimalist nature, PandasGlue exposes an interface with only two functions:

Function       From              To
write_glue()   Pandas DataFrame  AWS Glue Table
read_glue()    AWS Glue Table    Pandas DataFrame

Once your data is mapped into the AWS Glue Catalog, it becomes accessible to many other tools, such as Amazon Redshift Spectrum, Amazon Athena, AWS Glue Jobs, and Amazon EMR (Spark, Hive, PrestoDB).

AWS Glue is a simple, flexible, and cost-effective ETL service from AWS, and Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools.

The goal of this package is to help data engineers use cost-efficient serverless compute services (Lambda, Glue, Athena) by providing an easy way to integrate Pandas with the AWS Glue Catalog. The write function loads the contents of a DataFrame directly into a Glue Data Catalog table (in Parquet or CSV format), either appending, overwriting, or overwriting only the partitions that contain new data; the read function executes Athena queries and returns the result directly as a Pandas DataFrame.
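
For instance, a minimal sketch of the three write modes (the mode values come from the parameter list further below; the database, table, and bucket names are illustrative):

import pandas as pd
import pandasglue as pg

df = pd.DataFrame({"city": ["Seattle", "Lima"], "sales": [10, 20]})

# Append rows to the existing Glue table
pg.write_glue(df, "my_db", "sales", "s3://my-bucket/sales/", mode="append")

# Replace the table's contents entirely
pg.write_glue(df, "my_db", "sales", "s3://my-bucket/sales/", mode="overwrite")

# Replace only the partitions present in this DataFrame
pg.write_glue(df, "my_db", "sales", "s3://my-bucket/sales/",
              partition_cols=["city"], mode="overwrite_partitions")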

Use cases

This package is recommended for ETL jobs that load and transform small to medium-sized datasets without needing to create Spark jobs, helping to reduce infrastructure costs.

It can be used within Lambda functions, Glue scripts, EC2 instances, or any other infrastructure resource, as in the Lambda sketch below.
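
For example, a minimal sketch of an AWS Lambda handler (assuming pandasglue is available in the function's environment, e.g. via a Lambda Layer; the query, database, and bucket names are illustrative):

import pandasglue as pg

def handler(event, context):
    # Run an Athena query and return the rows as plain records.
    df = pg.read_glue(
        "SELECT city, COUNT(*) AS total FROM events GROUP BY city",
        "analytics_db",
        "s3://my-athena-results/",
    )
    return df.to_dict(orient="records")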

Prerequisites

pip install pandas
pip install boto3
pip install pyarrow 

Installing the package...

pip install pandasglue

Or you can download the artifacts for an AWS Lambda Layer / AWS Glue Job directly from our releases page; then you only need to upload them to your AWS account.

Usage

Read method:

read_glue()

Retrieves the result of an Athena query as a Pandas DataFrame.

Quick example:

import pandas as pd
import pandasglue as pg

# Parameters
sql_query = "SELECT * FROM table_name LIMIT 20"
db_name = "DB_NAME"
s3_output_bucket = "s3://bucket-url/"

df = pg.read_glue(sql_query, db_name, s3_output_bucket)

print(df)

Parameters list (usage sketch below):

  • query: the SQL statement to run on Athena.
  • db: database name.
  • s3_output: path of the S3 output folder (optional).
  • region: ID of the AWS region, e.g. us-west-1 (optional).
  • key: AWS access key (optional).
  • secret: AWS secret key (optional).
  • profile_name: AWS IAM profile (optional).
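
A hedged sketch using the optional parameters (assuming they are passed as keyword arguments with exactly the names listed above; the region and profile name are illustrative):

df = pg.read_glue(
    "SELECT * FROM table_name LIMIT 20",
    "DB_NAME",
    s3_output="s3://bucket-url/",
    region="us-west-1",                   # optional: target AWS region
    profile_name="my-data-lake-profile",  # optional: hypothetical IAM profile
)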


Write method:

write_glue()

Converts a given Pandas DataFrame into a Glue Parquet table.

Quick example:

import pandas as pd
import pandasglue as pg

# Parameters
database = "DB_NAME"
table_name = "TB_NAME"
s3_path = "s3://bucket-url/"

# Sample DataFrame
source_data = {
    'name': ['Sarah', 'Renata', 'Erika', 'Fernanda', 'Diana'],
    'city': ['Seattle', 'Sao Paulo', 'Seattle', 'Santiago', 'Lima'],
    'test_score': [82, 52, 56, 234, 254],
}

df = pd.DataFrame(source_data, columns=['name', 'city', 'test_score'])

pg.write_glue(df, database, table_name, s3_path, partition_cols=['city'])

Parameters list (usage sketch below):

  • df: the Pandas DataFrame to write.
  • database: database name.
  • path: path of the target S3 bucket.
  • table: table name (optional).
  • partition_cols: list of columns to partition by (optional).
  • preserve_index: boolean, whether to preserve the DataFrame index in the table (optional).
  • file_format: parquet | csv (optional).
  • mode: append | overwrite | overwrite_partitions (optional).
  • region: ID of the AWS region (optional).
  • key: AWS access key (optional).
  • secret: AWS secret key (optional).
  • profile_name: AWS IAM profile (optional).
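
And a hedged sketch of the optional write parameters (again assuming keyword arguments matching the names listed above, and that Parquet is the default format, as the description suggests):

pg.write_glue(
    df,
    "DB_NAME",
    table="TB_NAME",
    path="s3://bucket-url/",
    file_format="csv",             # write CSV instead of the assumed Parquet default
    mode="overwrite_partitions",   # replace only the partitions present in df
    partition_cols=['city'],
    preserve_index=False,          # drop the DataFrame index from the table
)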

Built With

  • Boto3 - the AWS SDK for Python, which allows Python developers to write software that makes use of Amazon services like S3 and EC2.
  • PyArrow - a Python package for interoperating between Arrow and Python, which among other things can convert text file formats to Parquet.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests to us.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandasglue-0.0.3.tar.gz (15.9 kB)


Built Distribution

pandasglue-0.0.3-py27,py36,py37-none-any.whl (17.3 kB)


File details

Details for the file pandasglue-0.0.3.tar.gz.

File metadata

  • Download URL: pandasglue-0.0.3.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.7.4

File hashes

Hashes for pandasglue-0.0.3.tar.gz
Algorithm    Hash digest
SHA256       eeafd88b193e87db8f68a4c9a7a06afa7ef8aac0b20875ec2dc34064d760a668
MD5          77183eb067a035b81c0062cb173921d8
BLAKE2b-256  1e8c41880fa896557d59e09dd1756c09e6e2794e7050e2b00176ab660fd54784

See more details on using hashes here.

File details

Details for the file pandasglue-0.0.3-py27,py36,py37-none-any.whl.

File metadata

  • Download URL: pandasglue-0.0.3-py27,py36,py37-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 2.7,py36,py37
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.7.4

File hashes

Hashes for pandasglue-0.0.3-py27,py36,py37-none-any.whl
Algorithm    Hash digest
SHA256       bfbbfeaa1b1d4e423f85ba44a6d79d40e0e160177d2cd12c08a7a64a8c2b37d0
MD5          b8b692882e6edbd94bc143d6872d70ff
BLAKE2b-256  ae47ace32b6ec156965d3b70ff0bc8b4137146b04460bc1ae6f769e2e414641f

See more details on using hashes here.
