
Using DuckDB and Python as a high-speed ETL tool

Project description

Ducktape

Let's build a data integration tool based on Python for flexibility and DuckDB for performance.

Scope

  1. Python and DuckDB are used to read data into DuckDB. Within DuckDB, SQL-based transformations are applied. The result is written to a target, again using either DuckDB's native features or Python. Source --> DuckDB --> SQL --> Target
  2. Delta handling as a first-class citizen: the structure supports before and after images and change indicators, can create the delta by comparing with the target, and can be connected to CDC readers.
  3. Certain transformations are hard to do in SQL, either because the SQL statement is complicated, because the logic is easier to express in 3GL code, or because it needs multiple SQL statements. For these, Python methods are implemented so each transformation is a single function call.
  4. User interaction levels are
    • Via code - The user writes the Python code and, thanks to the APIs, can work very quickly.
    • Via UI - The user drags and drops transformations and thereby configures the SQL statements and their sequence. Important: code changes update the UI representation and UI changes update the code. Only the code is stored; the UI is just a visualization of the code in a different format.
    • Via intent - The user specifies the source and the target properties, and the transformations required to reach that target follow automatically. Example: read the customer master and create a Slowly-Changing-Dimension table from its data. What has to be done is obvious.
    • Via prompt - The user feeds in the functional spec and the code is generated.

Transforms

  • Table Comparison: Take an input dataset (normal or CDC), compare it with the target and create a delta dataset. The delta tells which records must be inserted, updated and optionally deleted, and removes all records that have not changed.
  • SCD2 Generation: Take a CDC input and provide the changes to be applied to the target, e.g. set the current version's end-date to today and create a new version with start-date today.
  • Pivot/Unpivot: Turn a dataset with multiple columns into a dataset with multiple rows and vice versa, including multiple sets. Example: the source has 12 columns REVENUE_JAN, REVENUE_FEB, ... and 12 columns QTY_JAN, QTY_FEB, ...; the result should be 12 rows for each input row.
  • Hierarchy handling: Multiple transforms to deal with parent-child hierarchy tables and hierarchy columns (city, region, country) and convert between these representations. Also validate hierarchies to ensure a parent-child hierarchy contains no loops.
  • Temporal Join: Two datasets must be joined and each has a valid-from/valid-to date. Create the list of dates on which something changed and join the data from both tables.
  • CDC operations: A transform capable of changing the CDC information and doing calculations on the change data. Example: if ACCOUNT_BALANCE changed from 100 USD to 500 USD, what is the increase? The after-image value minus the before-image value.
  • Address validation: Pass an address to an external service like the Google Maps API to validate, correct and standardize the address.
  • Data Validation: Apply rules to the data and mark each record as PASS/FAIL/WARN per rule and overall.
  • Lookup: Find the matching record in another table and handle cases where the lookup table returns multiple candidates.

Qualities

  • Schema evolution
  • Automatic table creation
  • Logical validation

Operational Metadata and Validation

  • Each dataflow creates operational statistics: rows read, rows written, execution time, and the number of PASS/FAIL/WARN records.
  • A dataflow can have safeguards based on these statistics, e.g. if fewer than 100'000 rows are processed, consider the dataflow failed.
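A minimal sketch of such a statistics-based safeguard in plain Python (the class, field names and threshold are hypothetical, not Ducktape's API):

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    """Operational statistics collected while a dataflow runs."""
    rows_read: int
    rows_written: int
    passed: int
    failed: int
    warned: int

def check_safeguards(stats: FlowStats, min_rows: int = 100_000) -> bool:
    """Return True if the dataflow should be considered successful."""
    if stats.rows_read < min_rows:
        return False   # suspiciously small load: treat the flow as failed
    if stats.failed > 0:
        return False   # validation failures present
    return True

stats = FlowStats(rows_read=50_000, rows_written=50_000,
                  passed=49_800, failed=0, warned=200)
ok = check_safeguards(stats)   # False: fewer than 100'000 rows processed
```

In practice the threshold would come from the dataflow's configuration rather than a hard-coded default.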

Download files

Download the file for your platform.

Source Distribution

rtdi_ducktape-0.1.1.tar.gz (16.4 MB)

Built Distribution


rtdi_ducktape-0.1.1-py3-none-any.whl (18.8 kB)

File details

Details for the file rtdi_ducktape-0.1.1.tar.gz.

File metadata

  • Download URL: rtdi_ducktape-0.1.1.tar.gz
  • Upload date:
  • Size: 16.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for rtdi_ducktape-0.1.1.tar.gz
  • SHA256: 8d6e78d423f1539a5eb9e5946238603e667c5b953bfbf5e7e0ddfa6de0f124bf
  • MD5: b9bcd4850cfeb3657c0dbbf6bf253fea
  • BLAKE2b-256: 851ffe78c97b33c3a051db5433bcace1552e9697360cb9a1226b14c83b90d011


File details

Details for the file rtdi_ducktape-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: rtdi_ducktape-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 18.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for rtdi_ducktape-0.1.1-py3-none-any.whl
  • SHA256: c7253f2e0621acd1028130f0523ad303a5b9808196524afa57b8456888930986
  • MD5: 247de0b30357c4bc9875ec3613f3d565
  • BLAKE2b-256: e73829804d3bea52e9baa6b3181c9530a74158d9c19c57cd04e2b56dd6ac0326

