Skip to main content

Data modeling managing and transforming data

Project description

PyPI version Docs badge Repository License

Ding Dong data preparation ETL

dingDong is a metadata management infrastructure for data preparation, ETL, and machine learning models with a complete version control support for all data preparation, loading, and ML experiments.

using dingDONG propagation mechanism help to design, develop, maintain and scale complex batch and real-time data projects in a fraction of time

See Ding-Dong documentation for install and developer documentation. More information can be found at http://www.biSkilled.com

Why Ding Dong?

Simple, because it is fast and easy to implement !

Steps for creating at data project:

  • Source to target modeling - mostly done by excel or external ERD mapping tools

  • Implementing - done mainly by external ETL tools

  • Improvements, extensions, maintenance and deploy between environments - usually an iterative process which consumes time and effort. meta-data is constantly changing. if business logic involves ML/DL models - versioning must be set for trucking overall results accuracy

dingDONG developed to maintain and speed the overall data project implementation

  • Source to target modeling - dingDONG truck meta-data changes such as adding, removing, updating DB columns. converting data-type from diverse connections or adding key sequence is done automatically. Modeling can be managed in a simple excel file, dindDONG will truck excel updates and will update all object by using a propagation mechanism.

  • Implementing E(T)L (extract, transform, load) - dingDONG multi-threading and data fetching optimization give a powerful tool for massive bulk data loading

  • Transform - It is possible to add calculated columns or manipulate existing column by adding python functions. A function such as scaling or normalization can be added as easy as adding the current date or concatenate functions

  • Business logic - Maintaining business logic in dingDONG is used directly by using SQL files that can be executed directly at the development IDE. Using dingDONG comments syntax is used to extract and execute complex queries from SQL files which allows managing all business logic in separate files.

  • Version control - dingDONG version control and management keeps track of all metadata changes and stored it in a proprietary dingDONG DB (MongoDB). dingDONG supports GIT version-control mechanism for managing ML/DL projects, versioning and tracking experiments history results. There is full support in GITHUB for distributed versioning.

dingDONG modules:

  • DING - create and manage overall meta-data structure for all listed object in the work-flow
    • creating new objects

    • modify an existing object by using a back-propagation mechanism

    • update data into new object

    • save old structures

    • CI source control

  • DONG - extract and load data from diverse connectors
    • extract data - support multi-threading for extracting massive data volume

    • transfer - add a custom function on existing columns or new custom calculated fields

    • merge - merging source with the target object (supported only for objects at the same connector)

    • exec - execute PL/SQL/Stored procedure commands

  • VERSION CONTROL - track meta-data and code updates
    • Versions numbers are managed automatically (sample: <major-release>.<minor-release>.<fixId>)

    • Any change at object metadata is automatically stored at dingDONG repository

    • Support managed GIT versioning of ML / DL experiments.
      • Code update - a new version is created

      • Experiment results stored in dingDong repository for future analytics

      • Roll revisions history by version number

      • Define measures such as counting total input or total output rows

      • Store executions result for compare experiments performance

dingDONG architecture

Installation

download from GitHub or install by using pip:

pip install dingDong

Adding MongoDB for pip installation is now cooked. Installation instruction will come soon Docker support - as well, still at the oven .. will come soon

Samples

Samples can be found under Ding-Dong sample documentation

The sample below loads 3 CSV files into SQLite. (we are using SQL to merge this CSV and extract data into a report table and CSV file).

Download ZIP file with 3 samples CSV files DATAELEMENTDESCRIPTION.csv, DEMOGRAPHICS.csv, MEASURESOFBIRTHANDDEATH.csv located at samples/sampleHealthCare/csvData.zip folder. In this sample, we use C:\dingDONG as our main folder for all source CSV files and dingDong logs.

Full code sample extractCsvToSqlLite.py located at samples/sampleHealthCare/ folder

The sample demonstrates how to load 3 CSV files into SqlLite, create a simple query-based on those tables and send the result to a new CSV file.

configuration properties can be found at dingDONG documentation

import logging
from dingDong import DingDong
from dingDong import Config

""" Main configuration """

""" set log level: logging.INFO, logging.DEBUG, logging.ERROR """
Config.LOGS_DEBUG = logging.DEBUG

"""
    set connection URL into all connectors
    key   -> general connection name or connection type (sql, oracle, file .. )
    value -> can be string or dictionary
        String     --> Connection string URL (key defined connection type: sql, oracle, mySql....)
        Dictionary --> must have 'conn' (connection type) and 'url' (connection string).
    available connection can be found at dingDong.misc.enumsJson.eConn

"""
Config.CONNECTIONS = {
    'sampleSql': {'conn': 'sql',"url": "<Sql server connection string>;UID=USER;PWD=PWD;"},
    'file': "C:\\dingDong\\",
    'sqlite': "C:\\dingDong\\sqlLiteDB.db"}
  1. Creating workflow can be done as JSON format or python dictionaries For the sake of this example, we will use a python dictionary. The sample work-flow will contain:

  • mapping and loading CSV file named DATAELEMENTDESCRIPTION into SQLLite table named dateElements_Desc

  • mapping and loading CSV file named DEMOGRAPHICS into SqlLite table named demographics

  • mapping and loading CSV file named MEASURESOFBIRTHANDDEATH into SQLLite table named birthDate

  • create a new query based on demographics and birthDate into new table named Final

  • Update sample fields at Final table by using direct PL/SQL query

  • Extract Final data into a CSV file. We use VARCHAR(200) as default CSV column data type. configuration can be found at DEFAULTS under dingDong.conn.baseBatch

nodesToLoad = [
    {"source": ["file", "DATAELEMENTDESCRIPTION.csv"],
     "target": ["sqlite", "dateElements_Desc"]},

    {"source": ["file", "DEMOGRAPHICS.csv"],
     "target": ["sqlite", "demographics"]},

    {"source": ["file", "MEASURESOFBIRTHANDDEATH.csv"],
     "target": ["sqlite", "birthDate"]},

    {"query": ["sqlite", """   Select d.[State_FIPS_Code] AS A, d.[County_FIPS_Code] AS B, d.[County_FIPS_Code] AS G,d.[County_FIPS_Code], d.[CHSI_County_Name], d.[CHSI_State_Name],[Population_Size],[Total_Births],[Total_Deaths]
                                    From demographics d INNER JOIN birthDate b ON d.[County_FIPS_Code] = b.[County_FIPS_Code] AND d.[State_FIPS_Code] = b.[State_FIPS_Code]"""],
     "target": ["sqlite", "Finall", -1]},

    {"myexec": ["sqlite", "Update dateElements_Desc Set [Data_Type] = 'dingDong';"]},

    {"source": ["sqlite", "Finall"],
     "target": ["file", "finall.csv"]}
]
  1. Init class dingDong

dicObj:

Loading dictionary as a work flow

dirData:

Loading JSON files in this folder

includeFiles:

Include files to load from directory folder (dirData)

notIncldeFiles:

Exclude files to load from directory folder (dirData)

connDict:

Equal to Config.CONNECTIONS, set connection URLs string

processes:

Max number of parallel threading to load data (DONG module)

m = DingDong(dicObj=nodesToLoad,
             filePath=None,
             dirData=None,
             includeFiles=None,
             notIncludeFiles=None,
             connDict=None,
             processes=1)
  1. DING

  • creating dateElements_Desc, demographics and birthDate tables based on CSV files

  • extracting query structure and creating Final table

Flag -1 - default flag, indicate that on changed structure- old structure is stored with all data. object is udated to new strucutre

m.ding()
  1. DONG - Extracting and loading data from CSV files into SQLite table, using default truncate-> insert method Extract data from a query into Final table

  • if object structure changed and mode 1 (default mode) * history table will is created * new object structure is created. new object is populated with data from history table (only identical column name)

m.dong()

Full sample code:

import logging
from dingDong import DingDong
from dingDong import Config

Config.CONNECTIONS = {
    'x1'    : {'conn':'sql',"url":"DRIVER={SQL Server};SERVER=CPX-VLQ5GA42TW2\SQLEXPRESS;DATABASE=ContosoRetailDW;UID=bpmk;PWD=bpmk;"},
    'x2'    : {'conn':'sql',"url":"DRIVER={SQL Server};SERVER=CPX-VLQ5GA42TW2\SQLEXPRESS;DATABASE=ContosoRetailDW;UID=bpmk;PWD=bpmk;"},
    'file'  : "C:\\dingDong\\",
    'sqlite': "C:\\dingDong\\sqlLiteDB.db"}
Config.LOGS_DEBUG = logging.DEBUG
Config.LOGS_DIR = "C:\\dingDong"

nodesToLoad = [
        {   "source":["file","DATAELEMENTDESCRIPTION.csv"],
            "target":["sqlite","dateElements_Desc"]},

        {   "source":["file","DEMOGRAPHICS.csv"],
            "target":["sqlite","demographics"]},

        {   "source":["file","MEASURESOFBIRTHANDDEATH.csv"],
            "target":["sqlite","birthDate"]},

        {   "query":["sqlite","""   Select d.[State_FIPS_Code] AS A, d.[County_FIPS_Code] AS B, d.[County_FIPS_Code] AS G,d.[County_FIPS_Code], d.[CHSI_County_Name], d.[CHSI_State_Name],[Population_Size],[Total_Births],[Total_Deaths]
                                    From demographics d INNER JOIN birthDate b ON d.[County_FIPS_Code] = b.[County_FIPS_Code] AND d.[State_FIPS_Code] = b.[State_FIPS_Code]"""],
            "target":["sqlite","Final", 2]},

        {   "myexec":["sqlite","Update dateElements_Desc Set [Data_Type] = 'dingDong';"]},

        {   "source":["sqlite","Final"],
            "target":["file","final.csv"]}
      ]

m = DingDong(dicObj=nodesToLoad,
             filePath=None,
             dirData=None,
             includeFiles=None,
             notIncludeFiles=None,
             connDict=None,
             processes=1)
m.ding()
m.dong()

Road map

We would like to create a platform that will enable to design, implement and maintain data integration project such as:

  • Any REST API connectivity from any API to any API using simple JSON mapping

  • Any Relational database connectivity using JSON mapping

  • Any Non-relational storage

  • Main platform for any middle-ware business logic - from sample if-than-else up to statistics algorithms using ML and DL algorithms

  • Enable Real-time and scheduled integration

  • Single point of truth - maintain all changes by using git source control and enable to compare version and rollback as needed

We will extend our connectors and Meta-data manager accordingly.

BATCH supported connectors

connectors Type

python module

checked version

dev status

notes

sql

pyOdbc

4.0.23

tested, prod

slow to extract, massive data volume preferred using ceODBC

sql

ceODBC

2.0.1

tested, prod

sql server conn for massive data loading installed manually from 3rdPart folder

access

pyOdbc

4.0.23

tested, prod

oracle

cx-oracle

6.1

tested, prod

CSV / text files

CSV / CSV23

0.1.5

tested, prod

mysql

pyMySql

0.6.3rc1

dev

vertica

vertica-python

0.9.1

dev

sqllite

sqllite3

6.1

tested, prod

mongoDb

pyMongo

3.7.2

dev

salesforce

simple_salesforce

3.7.2

dev

haddop/Hive

.

.

dev

Authors

dingDONG was created by Tal Shany (tal@biSkilled.com)

We are looking for contributions !!!

License

GNU General Public License v3.0

See COPYING to see the full text.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dingDONG-1.0.18.tar.gz (329.3 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page