dattasa

Python wrapper for connecting to postgres/greenplum, mysql, mongodb, kafka, redis, mixpanel and salesforce. Also included are modules for performing secure file transfer and sourcing environment variables.

Project description

A Python package that helps data engineers and data scientists accelerate data-pipeline development.

The goal of this python project is to build a set of reusable wrappers for building data pipelines from:

- Relational databases: postgres, mysql, greenplum, redshift, etc.
- NoSQL databases: hive, mongo, etc.
- Messaging sources and caches: kafka, redis, rabbitmq, etc.
- Cloud service providers: salesforce, mixpanel, jira, google-drive, delighted, wootric, etc.

Installation

There are 3 ways to install the dattasa package:

The easiest way is to install from PyPI using pip

pip install dattasa

Download from github and build from scratch

git clone git@github.com:kartikra/dattasa.git
cd dattasa
python setup.py build
python setup.py clean
python setup.py install

Download from github and install using pip

git clone git@github.com:kartikra/dattasa.git
cd dattasa
pip install -e .
pip install -U -e .  # if upgrading
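
Whichever route is used, it can be handy to confirm that the package is visible to the interpreter you plan to run pipelines with. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is made up for this illustration and is not part of dattasa.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("dattasa")
    if version is None:
        print("dattasa is not installed in this environment")
    else:
        print(f"dattasa {version} is installed")
```

For an editable install (`pip install -e .`), the reported version is whatever `setup.py` declares at install time.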

Config Files

By default, dattasa expects the config files to be in the user's home directory. This location can be overridden; see the links to sample code in the README file below to find out more. There are 2 yaml config files:

- database.yaml: Consists of the database credentials and api keys needed for making connections. See sample database config.
- ftpsites.yaml: Needed for performing sftp transfers. See sample ftpsites config.
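
Before running a pipeline, it may help to check that both files are where you expect them. The sketch below only illustrates the "home directory by default, optional override" lookup described above; the function names and the `config_dir` parameter are assumptions made for this example and do not reflect how dattasa itself resolves the files.

```python
from pathlib import Path
from typing import Dict, List, Optional

# File names taken from the section above.
CONFIG_FILES = ("database.yaml", "ftpsites.yaml")


def config_paths(config_dir: Optional[str] = None) -> Dict[str, Path]:
    """Map each config file name to its expected path.

    Falls back to the user's home directory when no override is given.
    """
    base = Path(config_dir).expanduser() if config_dir else Path.home()
    return {name: base / name for name in CONFIG_FILES}


def missing_configs(config_dir: Optional[str] = None) -> List[str]:
    """Return the names of config files that do not exist on disk."""
    return [
        name
        for name, path in config_paths(config_dir).items()
        if not path.is_file()
    ]


if __name__ == "__main__":
    for name in missing_configs():
        print(f"missing config file: {name}")
```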

Environment Variables

The dattasa package relies on the following environment variables. Make sure to set these in your bash profile:

- GPLOAD_HOME: Path to gpload package (needed only if using gpload utilities for greenplum or redshift)
- PROJECT_HOME: Path to python project directory
- PROJECT_HOME/python_bash_scripts: python scripts to invoke gpload (needed only if using gpload utilities for greenplum or redshift)
- SQL_DIR: Place to keep all sql scripts
- TEMP_DIR: All temp files are created in this folder
- LOG_DIR: All log files are created in this folder
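
A quick pre-flight check can catch an unset variable before a pipeline fails halfway through. This is a minimal sketch, not part of dattasa: the split into required and gpload-only variables is an assumption based on the "needed only if using gpload" notes above, and `missing_env_vars` is a hypothetical helper name.

```python
import os
from typing import List, Mapping, Optional

# Assumption: these four are always needed; GPLOAD_HOME only for gpload use.
REQUIRED_VARS = ("PROJECT_HOME", "SQL_DIR", "TEMP_DIR", "LOG_DIR")
GPLOAD_ONLY_VARS = ("GPLOAD_HOME",)


def missing_env_vars(
    env: Optional[Mapping[str, str]] = None,
    using_gpload: bool = False,
) -> List[str]:
    """Return the names of expected variables that are unset or empty."""
    env = os.environ if env is None else env
    names = REQUIRED_VARS + (GPLOAD_ONLY_VARS if using_gpload else ())
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these variables in your bash profile: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.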

Description of classes

v1.0 of the package comprises the following classes. Please see the links to sample code for details on how to use each of them.

Class	Description	Sample Code
environment	Lets you source all the os environment variables	see first row in mongo example
postgres_client	Lets you use psql and gpload utilities provided by pivotal greenplum. Make connections to postgres / greenplum database using psycopg2 or sqlalchemy. Use the connections to interact with the database in an interactive program or run queries from a sql file using the connection	sample postgres code
greenplum_client (inherits postgres_client)	Lets you use psql and gpload utilities provided by pivotal greenplum. Make connections to postgres / greenplum database using psycopg2 or sqlalchemy. Use the connections to interact with the database in an interactive program or run queries from a sql file using the connection	sample greenplum code
mysql_client	Lets you use mysql and other methods provided by the PyMySQL package	sample mysql code
file_processor	Create sftp connection using the paramiko package. Other file manipulations like row_count, encryption, archive (File class)	see file processing example
notification	Send email notifications
mongo_client	Load data to mongodb using bulk load. Run JavaScript queries	see mongo example
redis_client	Read data from a redis cache or load a redis cache	see redis example
kafka_system	Currently allows Publisher and Consumer to use kafka in batch mode	see kafka example
rabbitmq_system	Currently has Publisher to publish messages in rabbitmq
mixpanel_client	Connect to mixpanel api and fetch data using jql or export raw events data. See mixpanel api documentation	see mixpanel section in api example
salesforce_client	Create a connection to salesforce using the simple_salesforce package	see salesforce section in api example
delighted_client	Gets nps scores and survey responses from delighted. See api documentation	see delighted section in api example
wootric_client	Gets nps scores and survey responses from wootric. See api documentation	see wootric section in api example
dag_controller	Functions needed to integrate this package within an airflow dag. See airflow documentation and github project

data_pipeline class

This is the main class that’s accessible to other projects. The data pipeline consists of data from components and APIs. Each data-processor object can use individual data streams and process them. data_pipeline decides which modules to call based on the type of database (as defined in the config file). data_pipeline comprises 3 classes:

- DataComponent: Each database connection is considered to be a data-component object. See examples for postgres, mysql, greenplum, etc. above.
- APICall: Each api call is an apicall object. See examples for mixpanel, delighted, salesforce and wootric above.
- DataProcessor: Transfers and loads data between data components. See examples.
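
The idea of choosing a module from the database type in the config can be pictured with a small lookup table. The sketch below is purely illustrative and does not use or reproduce dattasa's API: the `DataComponent` dataclass here is a stand-in with hypothetical fields, the `CLIENT_FOR_TYPE` mapping and `select_client` function are invented for the example, and only the client names come from the class table above.

```python
from dataclasses import dataclass

# Hypothetical mapping from a config "type" value to the client class name
# listed in the table above. The real package may key on different values.
CLIENT_FOR_TYPE = {
    "postgres": "postgres_client",
    "greenplum": "greenplum_client",
    "mysql": "mysql_client",
    "mongo": "mongo_client",
    "redis": "redis_client",
}


@dataclass
class DataComponent:
    """Stand-in for one database connection described in the config."""
    name: str
    db_type: str


def select_client(component: DataComponent) -> str:
    """Pick the client name for a component based on its database type."""
    try:
        return CLIENT_FOR_TYPE[component.db_type]
    except KeyError:
        raise ValueError(
            f"unsupported database type: {component.db_type!r}"
        ) from None


if __name__ == "__main__":
    orders = DataComponent(name="orders_db", db_type="mysql")
    print(select_client(orders))
```

A dictionary lookup keeps the supported types in one place, so adding a new backend means adding one entry rather than another branch.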

Adding ipython notebook files to github

Use git lfs. See documentation.

If using mac, install git-lfs using brew: brew install git-lfs
Install lfs: git lfs install
Track ipynb files in your project. Go to the project folder and run: git lfs track "*.ipynb"
Add .*ipynb_checkpoints/ to the .gitignore file
Finally, add the .gitattributes file: git add .gitattributes

Deploying code in pypi

build the code: python setup.py build && python setup.py clean && python setup.py install
push to pypitest : python setup.py sdist upload -r pypitest
push to pypi prod : python setup.py sdist upload -r pypi

Release history

Version 1.1, released Feb 11, 2018

Download files

Source Distribution

dattasa-1.1.tar.gz (29.5 kB)

Uploaded Feb 11, 2018 Source

File details

Details for the file dattasa-1.1.tar.gz.

File metadata

Download URL: dattasa-1.1.tar.gz
Upload date: Feb 11, 2018
Size: 29.5 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for dattasa-1.1.tar.gz
Algorithm	Hash digest
SHA256	`2ca8d0b5c3156c3d46b249eaefb18024e8da958c452f51497bdbe719a2efb9c9`
MD5	`cea52ebbddd232bda6bb7024660764b0`
BLAKE2b-256	`718273224de19e0397d973eec1ab20c3346b95783ffe41e2d8a4c74a88e94970`
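
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard `hashlib` module; the helper names and the local file name in the usage block are assumptions for illustration.

```python
import hashlib

# SHA256 digest for dattasa-1.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "2ca8d0b5c3156c3d46b249eaefb18024e8da958c452f51497bdbe719a2efb9c9"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumes the archive was saved to the current directory.
    archive = "dattasa-1.1.tar.gz"
    print("hash OK" if matches_expected(archive) else "hash MISMATCH")
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 29.5 kB archive but makes the helper reusable for larger files.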
