Duckstring is a Python-native data-mesh orchestration framework designed to be as naturally extensible as installing packages.

Duckstring

Duckstring is a data pipeline framework built around modular, versioned nodes called Ponds. Each Pond specifies its immediate parents (with version), allowing for the formation of a DAG much like one would install packages.

Pond execution is orchestrated within an environment - a Catchment - that controls storage and other global settings. It uses a pull-based system modelled after Kanban, with Outlets (terminal Ponds) sending demand upstream. This allows each Pond to be modified and deployed independently, with any paths in the DAG that are not attached to any Outlet automatically skipped.
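The pull-based behaviour described above can be sketched as a walk upstream from the Outlets. This is only an illustration, not Duckstring's implementation: the `PARENTS` mapping, the Pond names and the `demanded_ponds` helper are all hypothetical, and only `daily_report` is treated as an Outlet emitting demand.

```python
from collections import deque

# Hypothetical DAG: each Pond maps to its immediate parents.
PARENTS = {
    "raw_orders": [],
    "raw_clicks": [],
    "orders_clean": ["raw_orders"],
    "clicks_clean": ["raw_clicks"],  # not upstream of the demanding Outlet
    "daily_report": ["orders_clean"],
}


def demanded_ponds(parents, outlets):
    """Walk upstream from each Outlet and collect every Pond that receives demand."""
    demanded = set()
    queue = deque(outlets)
    while queue:
        pond = queue.popleft()
        if pond in demanded:
            continue
        demanded.add(pond)
        queue.extend(parents[pond])  # demand travels to the immediate parents
    return demanded


active = demanded_ponds(PARENTS, outlets=["daily_report"])
skipped = set(PARENTS) - active
print(sorted(active))   # ['daily_report', 'orders_clean', 'raw_orders']
print(sorted(skipped))  # ['clicks_clean', 'raw_clicks']
```

Ponds that never receive demand (here the clicks path) are simply never scheduled, which is the skipping behaviour the paragraph describes.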

Duckstring is built on the philosophy that most data pipelines are not truly "big data" and with good design can execute on a single compute node. It is primarily designed for batch and incremental workloads for tables on the order of tens of millions of rows (e.g. <50M).

The default engine is DuckDB, though this is configurable. However, Duckstring is an independent project and is not affiliated with, endorsed by, or maintained by the DuckDB project.

-- Note: As the project is in development, most of the notes below should be read as intended functionality; most features are not yet implemented.

Core Concepts

  • Catchment: Control environment - a FastAPI application
  • Pond: Versioned transformation unit with declared upstream dependencies - the main element of version control
  • Inlet: Pond with external dependencies and no upstream Ponds
  • Outlet: Pond with no downstream Ponds (e.g. outputs final data products)
  • Ripple: Unit operation within a Pond (e.g. a single transformation producing a table)

Installation

pip install duckstring

Quickstart

1) Connect to a Catchment

A Catchment is the execution environment, receiving Ponds and managing runs. It runs either as a local daemon or as a remote server, allowing you to start locally and seamlessly upgrade to a hosted/cloud server if you need to later.

Start a Catchment Server

To run a Catchment locally, run:

duckstring catchment start --name dev --port 5000 --root ~/.duckstring/dev

This will start a server with name 'dev' at port 5000 (the default, if none specified) and store Catchment details at ~/.duckstring/dev (default is ~/.duckstring/{name}). If any of these options are omitted you will be prompted on start.

Connect to a Remote Server

Alternatively, you can connect to a server running a Catchment:

duckstring catchment connect --name dev --path https://path.to.catchment

This will prompt for any necessary auth, and will add the Catchment under the specified name.

Connect to duckstring.com

There are future plans for a dedicated Catchment service at https://duckstring.com. If you're interested, please contact me.

2) Define Pond(s)

Demo Ponds

If you want to see an example sequence of Ponds in action immediately, create three project directories and run one of these commands in each:

duckstring pond demo inlet
duckstring pond demo pond
duckstring pond demo outlet

It's recommended to do this before attempting to make your own so that you can get a feel for the structure.

Custom Pond

Create a project directory and run:

duckstring pond init example_pond

This will create a duckstring pond structure:

root/
|-- src/
|   |-- pond.py
|-- pond.toml
|-- __main__.py
|-- .gitignore
|-- README.md

Here pond.py contains the code for a single Ripple operation (currently blank), and pond.toml specifies the Pond name "example_pond" and version (defaulting to "0.1.0").

3) Deploy to Catchment

From Local

From a Pond's project root run:

duckstring deploy dev

This will read the pond name, version and type (Inlet, Pond, Outlet) from pond.toml and deploy the project contents to the Catchment specified by name (here dev).

Alternatively, you can import the Pond using the Catchment UI.

From Git

If you are using git with a remote, you can deploy with:

duckstring deploy dev --git {branch|commit|tag}

This will use the specified branch, commit or tag to define the Pond. On each execution the Catchment will clone the repository and run it.

This can also be specified using the Catchment UI.

4) Execute

Ponds are executed by sending a Demand signal from an Outlet. This propagates backwards through the DAG until it reaches each upstream Inlet, causing them to execute, with children beginning upon completion of all of their parents.
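The rule that a child begins only once all of its parents have completed is a topological ordering of the DAG. The sketch below uses the standard-library `graphlib.TopologicalSorter` to group Ponds into batches that could run together. The `PARENTS` mapping and the `execution_batches` helper are hypothetical, and the scheduler inside a Catchment may work differently.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each Pond maps to the set of its immediate parents.
PARENTS = {
    "inlet_a": set(),
    "inlet_b": set(),
    "pond": {"inlet_a", "inlet_b"},
    "outlet": {"pond"},
}


def execution_batches(parents):
    """Group Ponds into batches; a Pond appears only after all its parents are done."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # Ponds whose parents have all completed
        batches.append(ready)
        sorter.done(*ready)                 # mark the whole batch as completed
    return batches


batches = execution_batches(PARENTS)
print(batches)  # [['inlet_a', 'inlet_b'], ['pond'], ['outlet']]
```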

These examples will use the Pond outlet, version 1.0.0, as the execution reference. All examples may alternatively be executed using the Catchment UI.

Pulse

To initiate a single run:

duckstring pulse dev outlet

The pulse mode emits a Demand signal from outlet, and when it begins execution, sends a Stop signal. This causes it to execute exactly once.

This will automatically run against the highest version available for that Pond. To use a specific version:

duckstring pulse dev outlet --version 1
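One way to picture the version selection is the sketch below. It assumes dotted numeric versions and assumes that a partial value such as `1` matches any `1.x.y` release; both are guesses for illustration, and `resolve_version` is a hypothetical helper rather than part of Duckstring.

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def resolve_version(available, requested=None):
    """Pick the highest available version, optionally limited to a prefix.

    Assumption: '--version 1' matches any 1.x.y release; the real rule may differ.
    """
    candidates = [parse_version(v) for v in available]
    if requested is not None:
        prefix = parse_version(requested)
        candidates = [v for v in candidates if v[: len(prefix)] == prefix]
    if not candidates:
        raise LookupError(f"no version matching {requested!r}")
    return ".".join(str(part) for part in max(candidates))


AVAILABLE = ["0.9.0", "1.0.0", "1.10.0", "1.2.0", "2.0.0"]
print(resolve_version(AVAILABLE))       # 2.0.0
print(resolve_version(AVAILABLE, "1"))  # 1.10.0
```

Comparing tuples of integers rather than raw strings is what makes 1.10.0 rank above 1.2.0.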

Wave

To continuously run:

duckstring wave dev outlet

The wave mode emits a Demand signal from outlet, and when it begins execution, sends another Demand signal. This causes it to execute continuously, as frequently as the DAG allows (i.e. at a period equal to the execution time of the slowest Ripple in any Pond).
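As a small worked example of that period, the sketch below takes the maximum Ripple execution time across all Ponds. The timings and names are invented for illustration, and scheduling overhead is ignored.

```python
# Hypothetical Ripple execution times in seconds, grouped by Pond.
RIPPLE_SECONDS = {
    "inlet": {"load": 12.0},
    "pond": {"clean": 30.0, "join": 45.0},
    "outlet": {"daily": 20.0},
}


def wave_period(ripple_seconds):
    """Return the steady-state period: the slowest Ripple in any Pond."""
    return max(
        seconds
        for ripples in ripple_seconds.values()
        for seconds in ripples.values()
    )


period = wave_period(RIPPLE_SECONDS)
runs_per_hour = 3600 / period
print(period)         # 45.0
print(runs_per_hour)  # 80.0
```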

Tide

To run at a scheduled frequency:

duckstring tide dev outlet 15 2 * * * --local

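This would run at 2:15am every day local time, using cron syntax. Omitting the --local flag defaults to UTC. Depending on your shell, the asterisks may need to be quoted or escaped so that they are not expanded as file globs.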
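To make the difference between UTC and local scheduling concrete, the sketch below computes the next 02:15 occurrence for a given clock. It handles only this daily case and is not a cron parser. `next_daily_run` is a hypothetical helper, and a fixed UTC+10 offset stands in for the local timezone.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=2, minute=15):
    """Return the next occurrence of hour:minute at or after `now`, in now's timezone."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate < now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


# Without --local: the schedule is evaluated in UTC.
now_utc = datetime(2025, 1, 10, 3, 0, tzinfo=timezone.utc)
next_utc = next_daily_run(now_utc)
print(next_utc.isoformat())  # 2025-01-11T02:15:00+00:00

# With --local: the same expression is evaluated in the local zone
# (a fixed UTC+10 offset is assumed here for illustration).
local_zone = timezone(timedelta(hours=10))
next_local = next_daily_run(now_utc.astimezone(local_zone))
print(next_local.isoformat())                           # 2025-01-11T02:15:00+10:00
print(next_local.astimezone(timezone.utc).isoformat())  # 2025-01-10T16:15:00+00:00
```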

5) Monitor

To print out a summary of current processes in the Catchment:

duckstring status dev

This will print to CLI a summary for each Pond that is either currently executing or has Demand.

To include all Ponds:

duckstring status dev --all

6) Retrieve Data

Get

The simplest way to retrieve data is to load by the Ripple name. This returns the entire contents of the directory, and does not require that the data be in a tabular format (e.g. SQL-compatible).

duckstring get dev outlet daily

This writes a directory ./ponds/outlet/daily with the 'daily' Ripple's contents. You may also override the default location:

duckstring get dev outlet daily --path ./daily_output

SQL Query

If the target is an SQL-compatible table (e.g. DuckDB or Parquet), an SQL statement may be sent directly, outputting the result to the command line:

duckstring query dev outlet --sql "SELECT * FROM daily WHERE id=1;"

Alternatively, include a file path:

duckstring query dev outlet --sql @path/to/query.sql

Omitting the --sql option runs a default SELECT * with LIMIT 10 against the specified table:

duckstring query dev outlet daily
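The three query forms above (inline statement, `@` file reference, and the default) could be resolved along these lines. This is a hypothetical sketch of the argument handling only: `build_query` is not a Duckstring function, and the exact default statement is assumed from the description.

```python
from pathlib import Path


def build_query(table=None, sql=None):
    """Resolve the SQL text to run, mirroring the three forms shown above."""
    if sql is None:
        if table is None:
            raise ValueError("a table name is required when --sql is omitted")
        # Assumed default: first ten rows of the named table
        return f"SELECT * FROM {table} LIMIT 10"
    if sql.startswith("@"):
        # '@path/to/query.sql' means: read the statement from that file
        return Path(sql[1:]).read_text(encoding="utf-8").strip()
    return sql


print(build_query(table="daily"))                          # SELECT * FROM daily LIMIT 10
print(build_query(sql="SELECT * FROM daily WHERE id=1;"))  # SELECT * FROM daily WHERE id=1;
```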

Write to file

To output to a file, include a flag for the file format, followed by the file name:

  • --csv: Comma-separated values
  • --json: JSON records
  • --parquet: Parquet file

This writes by default to ./ponds/outlet/daily/{filename}. To override the default location you may use the --path flag.

For example, to execute an SQL statement from the file query.sql and write the result to CSV in the current directory:

duckstring query dev outlet --sql @query.sql --csv daily.csv --path .
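The default and overridden output locations can be pictured with the sketch below. `output_path` is a hypothetical helper, and the `./ponds/{pond}/{ripple}/{filename}` pattern is generalised from the single example path given above, so treat it as an assumption.

```python
from pathlib import Path


def output_path(pond, ripple, filename, path=None):
    """Return where a query result is written.

    Assumed default: ./ponds/{pond}/{ripple}/{filename}; --path replaces the directory.
    """
    directory = Path(path) if path is not None else Path("ponds") / pond / ripple
    return directory / filename


default_target = output_path("outlet", "daily", "daily.csv")
custom_target = output_path("outlet", "daily", "daily.csv", path=".")
print(default_target.as_posix())  # ponds/outlet/daily/daily.csv
print(custom_target.as_posix())   # daily.csv
```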

Further Reading

For more detail on each component, please read the corresponding documentation in docs/.
