redshift_tool - Elegant data load from Pandas to Redshift

1.Overview

redshift_tool is a Python package for loading a pandas DataFrame into a Redshift table. It simplifies bulk uploads: the usual procedure of generating CSV files, uploading them to an S3 bucket, and then running a COPY command on the server is encapsulated in a single function. There are two methods of data copy.

a). Append:- Simply adds the data to the end of the existing data in a Redshift table.
b). Upsert:- Updates existing records that match the provided upsert key(s) and inserts the new records into the table (see the sketch below).

redshift_tool is purely implemented in Python.
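To illustrate the difference between the two methods, here is a small pandas-only sketch of their semantics (purely illustrative; the package itself performs this work on the Redshift side via S3 and COPY, and the column name 'id' is just an example):

>>> import pandas
>>> existing = pandas.DataFrame({'id': [1, 2], 'value': ['a', 'b']})   # rows already in the table
>>> incoming = pandas.DataFrame({'id': [2, 3], 'value': ['B', 'c']})   # rows being loaded
>>> # append: the incoming rows are simply added after the existing ones
>>> pandas.concat([existing, incoming], ignore_index=True)
>>> # upsert on 'id': the row with id 2 is replaced, the row with id 3 is inserted
>>> pandas.concat([existing[~existing['id'].isin(incoming['id'])], incoming], ignore_index=True)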

2.Installation

To install the library, use the command below:
$ pip install redshift_tool

3.Usage Guidelines

Usage
>>> import redshift_tool
>>>  redshift_tool.query(data,method,redshift_auth=None,s3_auth=None,schema=None,table=None,
                         primarykey=None,sortkey=None,distkey=None,upsertkey=None)

a). data:- Any pandas DataFrame.

>>> data= df
b). method:- The write method, as described above: either 'append' or 'upsert'.
>>> method='append' or
>>> method='upsert'
c). redshift_auth:- Credentials for the Redshift connection, which is required to write the data into Redshift.
>>> redshift_auth= {'db':'database_name','port':port,'user':'user','pswd':'password','host':'host'}
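If needed, the same credentials can be checked independently before loading, for example with psycopg2 (psycopg2 is only an assumption for this sketch, not a documented requirement of this package):

>>> import psycopg2
>>> conn = psycopg2.connect(dbname='database_name', user='user', password='password',
                            host='host', port=port)   # same values as in redshift_auth
>>> conn.close()   # the connection works, so the credentials can be passed to redshift_tool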
d). s3_auth:- AWS S3 credentials and the S3 bucket name. S3 is used to stage the data for the copy operation: files are placed in the bucket temporarily and then copied into the Redshift table.
>>> s3_auth = {'accesskey':'aws_access_key','secretkey':'aws_secret_key','bucket':'s3_bucket_name'}
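Similarly, bucket access can be verified up front with boto3 (again, boto3 is an assumption for this sketch, not part of this package's documented interface):

>>> import boto3
>>> s3 = boto3.client('s3', aws_access_key_id='aws_access_key',
                      aws_secret_access_key='aws_secret_key')
>>> s3.head_bucket(Bucket='s3_bucket_name')   # raises an error if the bucket is not reachable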
e). schema:- The schema in which the table will be created, if your database uses one.
>>> schema='Schema_name'
f). table:- Name of the target table to write the DataFrame to.
  1. If the target table already exists, the function copies/upserts the data into the existing table.

  2. If the target table does not exist, the user can proceed in one of two ways:

2.1. Use a SQL CREATE statement to create the target table manually.

2.2. Let this library create the target table from the columns and datatypes of the input pandas DataFrame; in that case, make sure all the column names and datatypes of the DataFrame are set properly beforehand (see the sketch after this parameter's example).

>>> table='table_name'
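A minimal sketch of preparing a DataFrame before letting the library create the table (the column names and datatypes below are hypothetical, purely for illustration):

>>> import pandas
>>> df = pandas.DataFrame({'order_id': ['1', '2'],
                           'amount': ['10.5', '20.0'],
                           'placed_at': ['2019-01-01', '2019-01-02']})
>>> df['order_id'] = df['order_id'].astype('int64')         # integer column
>>> df['amount'] = df['amount'].astype('float64')           # floating-point column
>>> df['placed_at'] = pandas.to_datetime(df['placed_at'])   # timestamp column
>>> df.dtypes   # verify the types before calling redshift_tool.query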
g). primarykey:- A primary key is a special relational database table column (or combination of columns) designated to uniquely identify all table records.

If the library creates the table and one or more columns should be defined as the primary key, pass the column name(s) as a tuple in this parameter.

>>> primarykey=('Primary_Key',) or
>>> primarykey=('Primary_Key1','Primary_Key2')
h). sortkey:- A sort key is a column (or set of columns) used to sort the data. It can be a single key or multiple keys.

If the library creates the table and a sort key should be defined, pass the column name(s) as a tuple in this parameter.

>>> sortkey=('sort_key',) or
>>> sortkey=('sort_key1','sort_key2')
i). distkey (default: EVEN):- A distribution key is a column used to distribute the rows across all available Redshift slices for parallel processing.
>>> distkey=('distribution_key')
j). upsertkey:- Required for the upsert method: the key (or keys) used to match existing records, which are updated, while unmatched records are inserted. Pass the column name(s) as a tuple.
>>> upsertkey=('upsertkey',) or
>>> upsertkey=('upsertkey1','upsertkey2')

4.Examples

Append or Copy data without primarykey, sortkey, distributionkey

Eg.
>>> import redshift_tool
>>> import pandas
>>> df = pandas.DataFrame()
>>> redshift_tool.query(data=df,method='append',
    redshift_auth={'db':'database_name','port':port,'user':'user','pswd':'password','host':'host'},
    s3_auth={'accesskey':'aws_access_key','secretkey':'aws_secret_key','bucket':'s3_bucket_name'},
    schema='schema_name',table='redshift_table_name')

Append or Copy data with primarykey, sortkey, distributionkey

Eg.
>>> import redshift_tool
>>> import pandas
>>> df = pandas.DataFrame()
>>> redshift_tool.query(data=df,method='append',
    redshift_auth={'db':'database_name','port':port,'user':'user','pswd':'password','host':'host'},
    s3_auth={'accesskey':'aws_access_key','secretkey':'aws_secret_key','bucket':'s3_bucket_name'},
    schema='schema_name',table='redshift_table_name',primarykey=('primarykey',),
    sortkey=('sortkey',),distkey=('distributionkey'))

Upsert data without primarykey, sortkey, distributionkey

Eg.
>>> import redshift_tool
>>> import pandas
>>> df = pandas.DataFrame()
>>> redshift_tool.query(data=df,method='upsert',
    redshift_auth={'db':'database_name','port':port,'user':'user','pswd':'password','host':'host'},
    s3_auth={'accesskey':'aws_access_key','secretkey':'aws_secret_key','bucket':'s3_bucket_name'},
    schema='schema_name',table='redshift_table_name',upsertkey=('upsertkey',))

Upsert data with primarykey, sortkey, distributionkey

Eg.
>>> import redshift_tool
>>> import pandas
>>> df = pandas.DataFrame()
>>> redshift_tool.query(data=df,method='upsert',
    redshift_auth={'db':'database_name','port':port,'user':'user','pswd':'password','host':'host'},
    s3_auth={'accesskey':'aws_access_key','secretkey':'aws_secret_key','bucket':'s3_bucket_name'},
    schema='schema_name',table='redshift_table_name',primarykey=('primarykey',),
    sortkey=('sortkey',),distkey=('distributionkey'),upsertkey=('upsertkey',))

5.Support

Operating System

Linux/OSX/Windows

Python Version

2.7 / 3.2 / 3.3 / 3.4 / 3.5 / 3.6 / 3.7

6.References

Many thanks to the developers of the packages this library depends on.
