
A simple service-oriented ETL framework for integrations

Project description

judah

She (Leah) said, “This time I will praise the LORD”; so she named him Judah - Genesis 29:35

judah is a service-oriented Python package to handle ETL (extract-transform-load) tasks easily.

It follows a service-oriented architectural (SOA) design.

Under the hood, it uses the nice little ETL framework called Bonobo.

This project is still under heavy development.

Purpose

The judah framework was created to standardize the integration or ETL (extract-transform-load) applications that collect energy data from multiple external sources and save it in a warehouse.

Links

Here are a few important links:

Languages Used

Dependencies

Getting Started

  • Install the package
pip install judah
  • Copy the .example.env file to .env and edit it as appropriate
cp .example.env .env
  • Import the source, destination and transformer classes, as well as any utility functions you need, and use them accordingly
from judah.sources.export_site.date_based import DateBasedExportSiteSource
# ...  

Expected App System Design and Architecture

The judah framework expects all applications that use it to follow a service-oriented architecture as described below.

  • The app should have a services folder (a package, in Python terms) to contain the separate ETL services, each corresponding to a given third-party data source e.g. CNN, BBC
  • Subsequently, each ETL service should be divided up into child services. Each child service should represent a unique data flow path e.g. REST-API-to-database, REST-API-to-cache, REST-API-to-queue, file-download-site-to-database, file-download-site-to-queue etc.
  • Each child service should be divided up into a number of microservices. Each microservice should correspond to a single dataset, e.g. 'available_capacity', 'installed_capacity' etc.
  • Each microservice is expected to have a destination folder, a source.py file, a controller.py file and a transformers.py file.
    • The destination folder contains the database model file to which the data is to be saved. It contains a child class of the DatabaseBaseModel class of the judah framework
    • The source.py file contains a child class of the BaseSource class of the judah framework. This is the class responsible for connecting to the data source (e.g. the REST API) and downloading the data from there.
    • The transformers.py file contains child classes of the BaseTransformer class of the judah framework. They are responsible for transforming the source data into data that can be saved. This may involve changing field names and types, exploding the data etc.
    • The controller.py file contains a child class of the BaseController class of the judah framework. This class is responsible for controlling the data flow from the source class, through the transformers, to the destination model.
  • Each child service folder should contain a registry of these microservices in its __init__.py file. The registry is just a list of the controllers of the microservices.
  • The app should have a main.py file as the entry point of the app where the Bonobo graph is instantiated and the microservice registries mentioned in the point above are added to the graph. Look at the example_main.py file for inspiration.
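The layering described above can be sketched in plain Python. The block below does not import judah: BaseSource, BaseTransformer and BaseController here are simplified stand-ins written for illustration, and their method names (get, run, execute) are assumptions, so the real judah base classes may expose a different interface. The sample data is made up. The sketch only shows how a source, its transformers, a destination and a registry of controllers relate to one another.

```python
from typing import Any, Dict, Iterator, List, Type

Record = Dict[str, Any]


class BaseSource:
    """Stand-in: connects to the data source and yields raw records."""

    def get(self) -> Iterator[Record]:
        raise NotImplementedError


class BaseTransformer:
    """Stand-in: turns one raw record into one record that can be saved."""

    @classmethod
    def run(cls, record: Record) -> Record:
        raise NotImplementedError


class BaseController:
    """Stand-in: moves data from the source, through the transformers, to the destination."""

    source_class: Type[BaseSource]
    transformer_classes: List[Type[BaseTransformer]] = []
    destination: List[Record]  # a plain list stands in for the database model

    @classmethod
    def execute(cls) -> List[Record]:
        saved = []
        for record in cls.source_class().get():
            for transformer in cls.transformer_classes:
                record = transformer.run(record)
            cls.destination.append(record)
            saved.append(record)
        return saved


# One microservice for the 'available_capacity' dataset (hypothetical data).
class AvailableCapacitySource(BaseSource):
    def get(self) -> Iterator[Record]:
        yield {"Date": "2020-01-01", "Capacity (MW)": "120.5"}
        yield {"Date": "2020-01-02", "Capacity (MW)": "98"}


class RenameFields(BaseTransformer):
    @classmethod
    def run(cls, record: Record) -> Record:
        return {"date": record["Date"], "capacity_mw": record["Capacity (MW)"]}


class CastTypes(BaseTransformer):
    @classmethod
    def run(cls, record: Record) -> Record:
        return {**record, "capacity_mw": float(record["capacity_mw"])}


class AvailableCapacityController(BaseController):
    source_class = AvailableCapacitySource
    transformer_classes = [RenameFields, CastTypes]
    destination = []


# The child service's __init__.py would hold a registry like this one.
REGISTRY = [AvailableCapacityController]


def run_registry(registry: List[Type[BaseController]]) -> Dict[str, List[Record]]:
    """Run every controller in the registry and return what each one saved."""
    return {controller.__name__: controller.execute() for controller in registry}


if __name__ == "__main__":
    for name, records in run_registry(REGISTRY).items():
        print(name, records)
```

In a real app, main.py would add the registries to the Bonobo graph instead of looping over them directly; the loop in run_registry is only a placeholder for that step.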

Why service-oriented architectural (SOA) design

Service-oriented architecture makes it easy to connect feature requests with the code that implements them. Software requirements are often structured in a service-oriented manner. For example:

  • User can see realtime data about bitcoin
  • User can see realtime data about Ethereum
  • User can view historical data about bitcoin

When the source code is organized the same way these requirements are laid out, it is easy for anyone to comprehend.

In the example above, each requirement would have a single pipeline, and each pipeline would have its own independent folder.
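As a rough illustration, the three example requirements could map to folders as in the sketch below. The provider name (crypto_api), the child service name (rest_api_to_db) and the microservice names are all hypothetical, chosen only to mirror the service / child service / microservice layout described earlier. The script creates empty placeholder files in whatever root directory it is given.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical mapping: requirement -> (service, child service, microservice).
PIPELINES: Dict[str, Tuple[str, str, str]] = {
    "User can see realtime data about bitcoin": ("crypto_api", "rest_api_to_db", "bitcoin_realtime"),
    "User can see realtime data about Ethereum": ("crypto_api", "rest_api_to_db", "ethereum_realtime"),
    "User can view historical data about bitcoin": ("crypto_api", "rest_api_to_db", "bitcoin_historical"),
}

# Files each microservice folder is expected to contain.
MICROSERVICE_FILES = ("source.py", "transformers.py", "controller.py")


def scaffold(root: Path) -> List[Path]:
    """Create one independent folder per requirement and return the folders."""
    created = []
    for service, child_service, microservice in PIPELINES.values():
        folder = root / "services" / service / child_service / microservice
        # The destination folder holds the database model file.
        (folder / "destination").mkdir(parents=True, exist_ok=True)
        for file_name in MICROSERVICE_FILES:
            (folder / file_name).touch()
        created.append(folder)
    return created


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold(Path(tmp)):
            print(folder.relative_to(tmp))
```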

It is even easy to transfer that architecture into a stable microservice architecture if there is ever a need to do so.

Watch this talk by Alexandra Noonan and this other one by Simon Brown

How to set up Debian server for Selenium Chrome driver

  • Install an in-memory display server (xvfb)
sudo apt-get update
sudo apt-get install -y curl unzip xvfb libxi6 libgconf-2-4
  • Install Google Chrome
curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
# the redirect in "sudo echo ... >> file" runs without root, so append through sudo tee instead
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
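After the installation, a quick check that the binaries are on the PATH can save some debugging later. The sketch below uses only the standard library. The binary names it looks for (google-chrome-stable or google-chrome, and Xvfb) are assumptions based on the packages installed above, and may differ on other distributions.

```python
import shutil
from typing import Dict, Optional, Tuple

# Assumed binary names for the packages installed above.
REQUIRED_BINARIES: Dict[str, Tuple[str, ...]] = {
    "chrome": ("google-chrome-stable", "google-chrome"),
    "xvfb": ("Xvfb",),
}


def find_binaries(
    required: Dict[str, Tuple[str, ...]] = REQUIRED_BINARIES,
) -> Dict[str, Optional[str]]:
    """Return the first path found on PATH for each tool, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for label, candidates in required.items():
        paths = (shutil.which(name) for name in candidates)
        found[label] = next((path for path in paths if path), None)
    return found


if __name__ == "__main__":
    for label, path in find_binaries().items():
        print(f"{label}: {path or 'NOT FOUND'}")
```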

How to test

git clone https://github.com/sopherapps/judah.git && cd judah
  • Copy the .example.env file to .env and edit it as appropriate
cp .example.env .env
  • Create the test database: 'test_judah' in this case
sudo -su postgres
createdb test_judah
  • Update the TEST_POSTGRES_DB_URI variable in the .env file to that test database's connection details

  • Create a virtual environment and activate it

virtualenv -p /usr/bin/python3.6 env && source env/bin/activate
  • Install the dependencies
pip install -r requirements.txt
  • Run the test command
python -m unittest
  • To run the tests under coverage and then print the coverage report
coverage run -m unittest && coverage report -m
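If the tests fail to connect, it may help to check that TEST_POSTGRES_DB_URI is well formed before anything else. The sketch below assumes the variable holds a standard connection URI of the form postgresql://user:password@host:port/dbname, which is an assumption and not something this README states. It also assumes the variable has been exported into the environment, since the script does not read the .env file itself. The helper name describe_db_uri is made up for this example.

```python
import os
from typing import Any, Dict
from urllib.parse import urlparse


def describe_db_uri(uri: str) -> Dict[str, Any]:
    """Split a PostgreSQL connection URI into its parts, or raise ValueError."""
    parsed = urlparse(uri)
    if not parsed.scheme.startswith("postgres"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    database = parsed.path.lstrip("/")
    if not database:
        raise ValueError("the URI does not name a database")
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "user": parsed.username,
        "database": database,
    }


if __name__ == "__main__":
    uri = os.environ.get("TEST_POSTGRES_DB_URI")
    if uri is None:
        print("TEST_POSTGRES_DB_URI is not set in the environment")
    else:
        # The password is deliberately left out of the printed summary.
        print(describe_db_uri(uri))
```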

How to Use (Example commands for Linux)

  • Ensure you have Google Chrome installed. For Debian servers, see the instructions under the title "How to set up Debian server for Selenium Chrome driver"

Maintainers

Folder Structure

The judah package holds the framework components that are basically base classes to be overridden.

The folder structure, as generated by the command tree -d --matchdirs -I 'env|__pycache__', is shown below

.
├── judah
│   ├── controllers
│   │   ├── base
│   │   ├── db_to_db
│   │   ├── export_site_to_db
│   │   └── rest_api_to_db
│   ├── destinations
│   │   └── database
│   ├── sources
│   │   ├── base
│   │   ├── database
│   │   ├── export_site
│   │   └── rest_api
│   ├── transformers
│   └── utils
└── test
    ├── assets
    ├── test_controllers
    ├── test_destinations
    │   └── test_database
    ├── test_sources
    │   ├── test_database
    │   ├── test_exports_site
    │   └── test_rest_api
    ├── test_transformers
    └── test_utils

Acknowledgements

