A simple service-oriented ETL framework for integrations
judah
She (Leah) said, “This time I will praise the LORD”; so she named him Judah - Genesis 29: 35
judah is a service-oriented Python package to handle ETL (extract-transform-load) tasks easily.
It follows a service-oriented architectural (SOA) design.
Under the hood, it uses the small ETL framework called Bonobo.
This project is still under heavy development
Purpose
The judah framework was created to standardize the integration, or ETL (extract-transform-load), applications that collect energy data from multiple external sources and save it in a warehouse.
Links
Here are a few important links:
Languages Used
Dependencies
- Python 3.6 (versions above 3.6 may cause unexpected errors)
- Bonobo ETL
- SQLAlchemy
- Selenium
- requests
- xml-stream
- webdriver-manager
- xlrd
- python-dotenv
- pydantic
- email-notifier
Getting Started
- Install the package
pip install judah
- Copy the .example.env file to .env and make appropriate edits to it
cp .example.env .env
- Import the source, destination and transformer classes, as well as any utility functions you may like and use accordingly
from judah.sources.export_site.date_based import DateBasedExportSiteSource
# ...
Expected App System Design and Architecture
The judah framework expects all applications that use it to follow a service-oriented-architecture as shown below.
- The app should have a services folder (or, in Python terms, a package) to contain the separate ETL services, each corresponding to a given third-party data source e.g. CNN, BBC
- Each ETL service should in turn be divided into child services. Each child service should represent a unique data flow path e.g. REST-API-to-database, REST-API-to-cache, REST-API-to-queue, file-download-site-to-database, file-download-site-to-queue etc.
- Each child service should be divided up into a number of microservices. Each microservice should correspond to a single dataset, e.g. 'available_capacity', 'installed_capacity' etc.
- Each microservice is expected to have a destination folder, a source.py file, a controller.py file and a transformers.py file.
  - The destination folder contains the database model file to which the data is to be saved. It contains a child class of the DatabaseBaseModel class of the judah framework
  - The source.py file contains a child class of the BaseSource class of the judah framework. This is the class responsible for connecting to the data source (e.g. the REST API) and downloading the data from there.
  - The transformers.py file contains child classes of the BaseTransformer class of the judah framework. They are responsible for transforming the source data into data that can be saved. This may involve changing field names and types, exploding the data etc.
  - The controller.py file contains a child class of the BaseController class of the judah framework. This class is responsible for controlling the data flow from the source class, through the transformers, to the destination model.
- Each child service folder should contain a registry of these microservices in its __init__.py file. The registry is just a list of the controllers of the microservices.
- The app should have a main.py file as the entry point of the app where the Bonobo graph is instantiated and the microservice registries mentioned in the point above are added to the graph. Look at the example_main.py file for inspiration.
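The data flow described above can be sketched in plain Python. The classes and functions below are hypothetical stand-ins, not judah's actual BaseSource, BaseTransformer, BaseController or DatabaseBaseModel (their signatures are not shown in this document), and no Bonobo graph is built. The sketch only illustrates how a controller might move records from a source, through transformers, to a destination, and how an entry point might consume a registry of controllers.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


class InMemorySource:
    """Hypothetical stand-in for a source.py class: yields raw records."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows = list(rows)

    def get(self) -> Iterator[Record]:
        yield from self._rows


def rename_fields(record: Record) -> Record:
    """Hypothetical transformer: change field names."""
    return {"country": record["Country"], "capacity_mw": record["Capacity"]}


def cast_types(record: Record) -> Record:
    """Hypothetical transformer: change field types."""
    return {**record, "capacity_mw": float(record["capacity_mw"])}


class InMemoryDestination:
    """Hypothetical stand-in for a destination model: collects saved rows."""

    def __init__(self) -> None:
        self.saved: List[Record] = []

    def save(self, record: Record) -> None:
        self.saved.append(record)


class Controller:
    """Hypothetical controller: source -> transformers -> destination."""

    def __init__(
        self,
        source: InMemorySource,
        transformers: List[Callable[[Record], Record]],
        destination: InMemoryDestination,
    ) -> None:
        self.source = source
        self.transformers = transformers
        self.destination = destination

    def run(self) -> int:
        """Push every source record through the transformers and save it."""
        count = 0
        for record in self.source.get():
            for transform in self.transformers:
                record = transform(record)
            self.destination.save(record)
            count += 1
        return count


def build_registry(destination: InMemoryDestination) -> List[Controller]:
    """What a child service's __init__.py registry might hold: a list of controllers."""
    source = InMemorySource([{"Country": "UG", "Capacity": "12.5"}])
    return [Controller(source, [rename_fields, cast_types], destination)]


def run_all(registry: List[Controller]) -> int:
    """What main.py might do with the registries, minus the Bonobo graph."""
    return sum(controller.run() for controller in registry)
```

Calling `run_all(build_registry(destination))` with a fresh `InMemoryDestination` saves one transformed record per source row. In a real judah app the controllers would instead be added to the Bonobo graph in main.py, as example_main.py is said to show.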
Why service-oriented architectural (SOA) design
Service-oriented architecture makes it easy to connect feature requests with the code that is written. Software requirements are often structured in a service-oriented manner. For example:
- User can see realtime data about bitcoin
- User can see realtime data about Ethereum
- User can view historical data about bitcoin
When the source code follows the same structure as these requirements, it is easy for anyone to comprehend.
In the example above, each of those requirements will have a single pipeline, each with its own independent folder.
It is also easy to turn that architecture into a stable microservice architecture if there is ever a need to do so.
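As a rough illustration of that mapping, the sketch below scaffolds one folder per requirement under a services package. The service, child-service and dataset names are invented for the example; only the destination folder and the source.py, controller.py and transformers.py files come from the layout described earlier in this document.

```python
import tempfile
from pathlib import Path
from typing import List

# Hypothetical names, in the form: service/child_service/microservice
MICROSERVICES = [
    "crypto_exchange/rest_api_to_db/bitcoin_realtime",
    "crypto_exchange/rest_api_to_db/ethereum_realtime",
    "crypto_exchange/rest_api_to_db/bitcoin_historical",
]
MICROSERVICE_FILES = ["source.py", "controller.py", "transformers.py"]


def scaffold(root: Path) -> List[str]:
    """Create one independent folder per requirement and return their relative paths."""
    created = []
    for microservice in MICROSERVICES:
        folder = root / "services" / microservice
        # Each microservice gets a destination folder plus its three modules.
        (folder / "destination").mkdir(parents=True, exist_ok=True)
        for file_name in MICROSERVICE_FILES:
            (folder / file_name).touch()
        created.append(folder.relative_to(root).as_posix())
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp)):
            print(path)
```

Each of the three requirements ends up with its own folder, so a change to one pipeline touches only that folder.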
Watch this talk by Alexandra Noonan and this other one by Simon Brown
How to set up Debian server for Selenium Chrome driver
- Install an in-memory display server (xvfb)
sudo apt-get update
sudo apt-get install -y curl unzip xvfb libxi6 libgconf-2-4
- Install Google Chrome
curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
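After running the steps above, a quick check from Python can confirm that the expected executables are on the PATH. This is a minimal sketch using only the standard library; it assumes the xvfb package provides an `Xvfb` executable and the Chrome package provides `google-chrome-stable`, and it does not start a browser or touch Selenium.

```python
import shutil
from typing import Dict, Iterable, Optional

# Executables the setup steps are expected to put on PATH (an assumption of this sketch).
REQUIRED_TOOLS = ("Xvfb", "google-chrome-stable")


def locate_tools(names: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```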
How to test
- Install PostgreSQL 9.5+ server if you haven't already. Here are the instructions
- Clone the repo and enter its root folder
git clone https://github.com/sopherapps/judah.git && cd judah
- Copy the .example.env file to .env and make appropriate edits to it
cp .example.env .env
- Create the test database: 'test_judah' in this case
sudo -su postgres
createdb test_judah
- Update the TEST_POSTGRES_DB_URI variable in the .env file to that test database's connection details
- Create a virtual environment and activate it
virtualenv -p /usr/bin/python3.6 env && source env/bin/activate
- Install the dependencies
pip install -r requirements.txt
- Run the test command
python -m unittest
- To measure test coverage and then report the results
coverage run -m unittest && coverage report -m
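If the tests fail to connect, it can help to check what the TEST_POSTGRES_DB_URI value actually contains. The sketch below is a standard-library illustration that assumes the variable holds a URL of the form `postgresql://user:password@host:port/database`; the exact format judah expects is not stated in this document. It reads from the process environment only, so the variable must already be exported (the sketch does not load the .env file itself).

```python
import os
from typing import Any, Dict
from urllib.parse import urlparse


def describe_db_uri(uri: str) -> Dict[str, Any]:
    """Split a database URL into the parts worth checking, leaving out the password."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
    }


if __name__ == "__main__":
    uri = os.environ.get("TEST_POSTGRES_DB_URI", "")
    print(describe_db_uri(uri) if uri else "TEST_POSTGRES_DB_URI is not set")
```

For the test database created above, the `database` entry should read `test_judah`.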
How to Use (Example commands for Linux)
- Ensure you have Google Chrome installed. For Debian servers, see the instructions under the title "How to set up Debian server for Selenium Chrome driver"
Maintainers
Folder Structure
The judah package holds the framework components, which are basically base classes to be overridden.
The folder structure, as generated by the command tree -d --matchdirs -I 'env|__pycache__', is shown below
.
├── judah
│   ├── controllers
│   │   ├── base
│   │   ├── db_to_db
│   │   ├── export_site_to_db
│   │   └── rest_api_to_db
│   ├── destinations
│   │   └── database
│   ├── sources
│   │   ├── base
│   │   ├── database
│   │   ├── export_site
│   │   └── rest_api
│   ├── transformers
│   └── utils
└── test
    ├── assets
    ├── test_controllers
    ├── test_destinations
    │   └── test_database
    ├── test_sources
    │   ├── test_database
    │   ├── test_exports_site
    │   └── test_rest_api
    ├── test_transformers
    └── test_utils
Acknowledgements
- The tutorial How to Setup Selenium with ChromeDriver on Debian 10/9/8 was very useful when deploying the app on a Debian server
- The RealPython tutorial on publishing python packages was very helpful.
- The Stackoverflow question about a wheel error was very helpful.