Using DuckDB and Python as high speed ETL tool
Project description
Ducktape
Let's build a data integration tool based on Python for flexibility and DuckDB for performance.
Scope
- Python and DuckDB can be used to read data into DuckDB. Within DuckDB SQL based transformations are applied. The result is written into a Target, again either using DuckDB's native features or Python.
Source --> DuckDB --> SQL --> Target - Delta handling as a first class citizen, meaning the structure supports before and after image, change indicators, can create the delta by comparing with the target and can be connected to CDC readers
- Certain transformations are hard to do in SQL, either because the SQL statement is complicated, or it can be expressed in 3GL code more easily or it needs multiple SQL statements. For these Python methods are implemented to achieve these transformations with a single function call.
- User interaction levels are
- Via code - The user writes the Python code and thanks to the APIs, can be very quick.
- Via UI - The user does drag and drop the transformations and with that configures the SQL statements and the sequence of transformations. Important: Code changes the UI representation and UI represeentation changes the code. The only thing stored is the code, the UI is just a visualization of the code in a different format.
- Via intend - The user tells the source and the target properties and the transformations required to achieve that target, are obvious. Example: Read the customer master and create a Slow-Changing-Dimension table with its data. It's obvious what has to be done.
- Via prompt - The user feeds the functional spec and the code is generated.
Transforms
- Table Comparison: Take an input dataset (normal or CDC), compare with the target and create a delta dataset. This tells which records must be inserted, updated, optionally deleted and does remove all records that have not been changed.
- SCD2 Generation: Take a CDC input and provide the changes to be applied in the target, e.g. Change the current version to an end-date of today and create create a new version with start-date today.
- Pivot/Unpivot: Turn a dataset with multiple columns into a dataset with multiple rows and vice versa, including multiple sets. Example: Source has 12 columns for REVENUE_JAN, REVENUE_FEB,... and 12 columns for QTY_JAN, QTY_FEB,... Result should be 12 rows for each input row.
- Hierarchy handling: Multiple transforms to deal with parent_child hierarchy tables, hierarchy column (City, region, country) and convert between these representations. Also validate hierarchies to ensure there are no loops in a parent child based hierarchy.
- Temporal Join: Two datasets must be joined and each as a valid-from/valid-to date. Create a list of dates when something changed and join the data from both tables.
- CDC operations: A transform capable of changing the CDC information and do calculations on the change data. Example: If the ACCOUNT_BALANCE changed from 100USD to 500USD, what is the increase? The after image value minus the before image value.
- Address validation: Pass an address to an external service like Google Maps API to validate, correct and standardize the address.
- Data Validation: Apply rules to the data and check if the record does as PASS/FAIL/WARN for each rule and overall.
- lookup: Find the matching record in another table and handle cases where the lookup table returns multiple candidates
Qualities
- Schema evolution
- automatic create table
- Logical validation
Operational Metadata and Validation
- Each dataflow creates operational statistics, rows read, rows outputted, execution time, number of PASS/FAIL/WARN records.
- The dataflow can have safeguard based on the statistics, e.g. if less than 100'000 rows are processed, consider the dataflow as failed.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
rtdi_ducktape-0.1.1.tar.gz
(16.4 MB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rtdi_ducktape-0.1.1.tar.gz.
File metadata
- Download URL: rtdi_ducktape-0.1.1.tar.gz
- Upload date:
- Size: 16.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8d6e78d423f1539a5eb9e5946238603e667c5b953bfbf5e7e0ddfa6de0f124bf
|
|
| MD5 |
b9bcd4850cfeb3657c0dbbf6bf253fea
|
|
| BLAKE2b-256 |
851ffe78c97b33c3a051db5433bcace1552e9697360cb9a1226b14c83b90d011
|
File details
Details for the file rtdi_ducktape-0.1.1-py3-none-any.whl.
File metadata
- Download URL: rtdi_ducktape-0.1.1-py3-none-any.whl
- Upload date:
- Size: 18.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c7253f2e0621acd1028130f0523ad303a5b9808196524afa57b8456888930986
|
|
| MD5 |
247de0b30357c4bc9875ec3613f3d565
|
|
| BLAKE2b-256 |
e73829804d3bea52e9baa6b3181c9530a74158d9c19c57cd04e2b56dd6ac0326
|