Spark-based modular ETL pipeline framework.
Project description
mkpipe
mkpipe is a Spark-based, modular ETL framework. It provides a Python-first API for building data pipelines with pluggable extractors, loaders, and inline transformations.
Key Features
- PySpark engine — parallel, partitioned JDBC reads/writes for high-throughput data movement
- Modular plugins — pip install mkpipe-extractor-postgres to add a source, pip install mkpipe-loader-postgres to add a destination
- Single YAML config — define connections, pipelines, and tables in one file
- Tag-based execution — assign tags to tables and run only what you need across all pipelines
- Orchestrator-agnostic — use with Dagster, Airflow, cron, or any Python scheduler
- Inline transformations — reference a Python function (df → df) directly in YAML
- Dependency Injection — pass your own SparkSession (Glue, EMR, Dataproc) or let mkpipe create one
- Incremental & full replication — append-only incremental with mkpipe_id for idempotent deduplication
Quick Start
pip install mkpipe mkpipe-extractor-postgres mkpipe-loader-postgres
Create mkpipe_project.yaml:
version: 2
default_environment: prod

prod:
  settings:
    timezone: UTC
    log_dir: /path/to/logs
    backend:
      variant: sqlite

  connections:
    source_pg:
      variant: postgres
      host: localhost
      port: 5432
      database: source_db
      user: ${PG_USER}
      password: ${PG_PASSWORD}
      schema: public
    target_pg:
      variant: postgres
      host: localhost
      port: 5432
      database: dwh_db
      user: ${PG_USER}
      password: ${PG_PASSWORD}
      schema: staging

  pipelines:
    - name: my_pipeline
      source: source_pg
      destination: target_pg
      tables:
        - name: public.users
          target_name: stg_users
          tags: [api, user-domain]
          replication_method: incremental
          iterate_column: updated_at
          iterate_column_type: datetime
          write_strategy: upsert
          write_key: [id]
        - name: public.orders
          target_name: stg_orders
          tags: [api, order-domain]
          replication_method: full
Run it:
# Run all tables
mkpipe run
# Run only tables tagged "api"
mkpipe run --tags api
Python API
import mkpipe
# Run all pipelines
mkpipe.run(config="mkpipe_project.yaml")
# Run a specific pipeline
mkpipe.run(config="mkpipe_project.yaml", pipeline="my_pipeline")
# Run a specific table
mkpipe.run(config="mkpipe_project.yaml", table="stg_users")
# Run by tags — runs matching tables across ALL pipelines
mkpipe.run(config="mkpipe_project.yaml", tags=["api"])
mkpipe.run(config="mkpipe_project.yaml", tags=["api", "order-domain"])
# Combine filters: pipeline + tags
mkpipe.run(config="mkpipe_project.yaml", pipeline="my_pipeline", tags=["api"])
# Pass a custom SparkSession (e.g. AWS Glue, EMR, Dataproc)
mkpipe.run(config="mkpipe_project.yaml", spark=my_spark_session)
# Extract only — returns ExtractResult with a Spark DataFrame
result = mkpipe.extract(config="mkpipe_project.yaml", table="stg_users")
df = result.df
# Load only — pass your own DataFrame
mkpipe.load(config="mkpipe_project.yaml", table="stg_users", df=my_df)
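These calls compose. As a sketch, an extract, transform, load round trip in your own code could look like the following (only mkpipe.extract and mkpipe.load above are mkpipe APIs; the filter and column are illustrative):

import mkpipe
from pyspark.sql import functions as F

# Extract the table, apply an ad-hoc PySpark transformation, then load the result back.
result = mkpipe.extract(config="mkpipe_project.yaml", table="stg_users")
cleaned = result.df.filter(F.col("updated_at").isNotNull())  # illustrative filter
mkpipe.load(config="mkpipe_project.yaml", table="stg_users", df=cleaned)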
CLI Reference
mkpipe run [OPTIONS]
mkpipe install-jars
| Command / Option | Short | Description |
|---|---|---|
| mkpipe run | | Run pipelines from config file |
| --config | -c | Path to config file. Default: mkpipe_project.yaml in current dir |
| --pipeline | -p | Run only the named pipeline |
| --table | -t | Run only the named table (source name or target name) |
| --tags | | Comma-separated tags to filter tables, e.g. --tags api,ingestion |
| mkpipe install-jars | | Download Maven JARs for all installed plugins (offline/Docker use) |
Examples:
# Run everything
mkpipe run
# Specific config file
mkpipe run --config /path/to/config.yaml
# Single pipeline
mkpipe run -p my_pipeline
# Single table
mkpipe run -t stg_users
# By tags (OR logic: any matching tag)
mkpipe run --tags api
mkpipe run --tags api,ingestion
# Combine: pipeline + tags
mkpipe run -p my_pipeline --tags api
Tags
Tags let you group tables by business domain, team, priority, or any criteria. When you pass tags, mkpipe runs all matching tables across all pipelines (OR logic).
pipelines:
  - name: pg_to_pg
    source: source_pg
    destination: target_pg
    tables:
      - name: public.users
        target_name: stg_users
        tags: [api, user-domain, critical]
      - name: public.sessions
        target_name: stg_sessions
        tags: [api, user-domain]
  - name: mysql_to_pg
    source: source_mysql
    destination: target_pg
    tables:
      - name: orders
        target_name: stg_orders
        tags: [api, order-domain, critical]
# Runs stg_users + stg_sessions + stg_orders (all have "api")
mkpipe.run(config="config.yaml", tags=["api"])
# Runs stg_users + stg_orders (both have "critical")
mkpipe.run(config="config.yaml", tags=["critical"])
# Runs stg_users + stg_sessions (both have "user-domain")
mkpipe.run(config="config.yaml", tags=["user-domain"])
# OR logic: runs anything tagged "critical" OR "order-domain"
mkpipe.run(config="config.yaml", tags=["critical", "order-domain"])
YAML Configuration Reference
Top-level Structure
version: 2
default_environment: prod   # which environment block to use

prod:                        # environment name
  settings: ...
  connections: ...
  pipelines: ...

staging:                     # you can define multiple environments
  settings: ...
  connections: ...
  pipelines: ...
Settings
settings:
  timezone: UTC                      # Spark session timezone (default: UTC)
  log_dir: ./logs                    # Log file directory (optional, logs to console if not set)
  ingested_at_column: _ingested_at   # Column name for ingestion timestamp (default: _ingested_at)
  ingestion_id_column: mkpipe_id     # Column name for dedup hash ID (default: mkpipe_id)

  spark:
    master: "local[*]"               # Spark master URL (default: local[*])
    driver_memory: "4g"              # default: auto-detected from system
    executor_memory: "4g"            # default: auto-detected from system
    extra_config:                    # any additional Spark config
      spark.sql.shuffle.partitions: "200"
      spark.dynamicAllocation.enabled: "true"

  backend:
    variant: sqlite                  # sqlite (default), postgres, duckdb, clickhouse
    host: localhost
    port: 5432
    database: mkpipe_db
    user: mkpipe
    password: ${BACKEND_PASSWORD}
Connections
connections:
  my_postgres:
    variant: postgres
    host: ${PG_HOST}
    port: 5432
    database: mydb
    user: ${PG_USER}
    password: ${PG_PASSWORD}
    schema: public

  my_mongodb:
    variant: mongodb
    mongo_uri: ${MONGO_URI}
    database: mydb

  my_s3:
    variant: file
    extra:
      storage: s3
      format: parquet
      path: s3a://my-bucket/data
      aws_access_key: ${AWS_ACCESS_KEY}
      aws_secret_key: ${AWS_SECRET_KEY}
      region: eu-west-1
Environment variables are referenced with ${VAR_NAME} syntax and resolved at load time.
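The substitution itself is plain string interpolation against the process environment. A minimal sketch of how ${VAR_NAME} resolution can be implemented (illustrative only, not mkpipe's actual loader code):

import os
import re

def resolve_env_vars(text):
    # Replace each ${VAR_NAME} with its value from the environment; fail fast if unset.
    def repl(match):
        name = match.group(1)
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"Environment variable {name} is not set")
        return value
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, text)

raw = open("mkpipe_project.yaml").read()
resolved = resolve_env_vars(raw)   # ready to be parsed as YAML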
Connection Parameters
| Parameter | Description |
|---|---|
| variant | Required. Plugin type: postgres, mysql, mongodb, file, etc. |
| host | Database host |
| port | Database port |
| database | Database name |
| user | Username |
| password | Password |
| schema | Schema name |
| warehouse | Warehouse (Snowflake) |
| private_key_file | Path to private key file (RSA auth) |
| private_key_file_pwd | Private key passphrase |
| mongo_uri | Full MongoDB connection URI |
| bucket_name | S3/GCS bucket name |
| s3_prefix | S3 key prefix |
| aws_access_key | AWS access key |
| aws_secret_key | AWS secret key |
| region | Cloud region |
| credentials_file | Path to credentials file (GCS service account) |
| api_key | API key |
| oauth_token | OAuth token |
| client_id | OAuth client ID |
| client_secret | OAuth client secret |
| extra | Dict of additional options (storage, format, path, etc.) |
Pipelines & Tables
pipelines:
  - name: my_pipeline         # unique pipeline name
    source: source_pg         # connection name for extraction
    destination: target_pg    # connection name for loading
    pass_on_error: false      # if true, continue on table failure
    tables:
      - name: public.users
        target_name: stg_users
        tags: [api, user-domain]
        replication_method: incremental
        iterate_column: updated_at
        iterate_column_type: datetime
        partitions_column: id
        partitions_count: 10
        fetchsize: 100000
        batchsize: 10000
        write_partitions: 4
        dedup_columns: [id, updated_at]
        write_strategy: upsert
        write_key: [id]
        custom_query: "(SELECT id, name, updated_at FROM users {query_filter}) q"
        transform: transforms/clean_users.py::transform
        pass_on_error: false
Table Parameters
| Parameter | Default | Description |
|---|---|---|
| name | required | Source table/collection name |
| target_name | required | Destination table name |
| tags | [] | List of tags for filtering (--tags api,ingestion) |
| replication_method | full | full or incremental |
| iterate_column | None | Column for incremental tracking (required if incremental) |
| iterate_column_type | None | datetime or int |
| partitions_column | iterate_column | Column for Spark JDBC partitioning |
| partitions_column_type | auto | Type of partition column: int or datetime. Defaults to int if partitions_column is specified, otherwise inherits iterate_column_type |
| partitions_count | 10 | Number of JDBC read partitions |
| fetchsize | 100000 | JDBC fetch size (rows per network round trip) |
| batchsize | 10000 | JDBC write batch size |
| write_partitions | None | Number of write partitions (coalesce before writing) |
| dedup_columns | None | Columns for dedup hash generation (xxhash64). Column name configurable via settings.ingestion_id_column |
| custom_query | None | Custom SQL query with {query_filter} placeholder |
| custom_query_file | None | Path to .sql file (relative to config directory) |
| transform | None | Transform function reference: path/to/file.py::function |
| write_strategy | None | Write strategy: append, replace, upsert, merge (see below) |
| write_key | None | Key columns for upsert/merge (required when strategy is upsert or merge) |
| pass_on_error | false | Continue pipeline on this table's failure |
Write Strategy
write_strategy controls how data is written to the destination. If not set, it is inferred automatically: overwrite → replace, append → append.
| Strategy | Behavior |
|---|---|
| append | Insert new rows. No deduplication. |
| replace | Drop/overwrite all existing data, then insert. |
| upsert | Insert new rows, update existing rows by write_key. Uses MERGE/ON CONFLICT for SQL databases. |
| merge | Full MERGE with matched update + not-matched insert (JDBC loaders only, same as upsert for most targets). |
Usage
tables:
  # Upsert: update existing rows by primary key, insert new ones
  - name: public.users
    target_name: stg_users
    replication_method: incremental
    iterate_column: updated_at
    write_strategy: upsert
    write_key: [id]

  # Replace: full overwrite every run
  - name: public.orders
    target_name: stg_orders
    replication_method: full
    write_strategy: replace

  # Append (default for incremental): just insert
  - name: public.events
    target_name: stg_events
    replication_method: incremental
    iterate_column: created_at
    write_strategy: append
Supported Strategies per Loader
| Loader | append | replace | upsert | merge |
|---|---|---|---|---|
| PostgreSQL, MySQL, MariaDB, SQL Server, Oracle, Redshift, SQLite, TimescaleDB | Y | Y | Y | Y |
| Snowflake | Y | Y | Y | Y |
| BigQuery | Y | Y | Y | Y |
| MongoDB | Y | Y | Y | — |
| ClickHouse | Y | Y | Y | — |
| Elasticsearch | Y | Y | Y | — |
| DynamoDB | Y | Y | Y | — |
| Cassandra | Y | Y | Y | — |
| InfluxDB | Y | Y | Y | — |
| Redis | — | Y | Y | — |
| File (Parquet/CSV/Iceberg/Delta) | Y | Y | — | — |
Validation Rules
- write_strategy: upsert or merge requires write_key — raises ConfigError if missing.
- write_key is ignored when strategy is append or replace.
- If write_strategy is not set, the strategy is inferred from the extractor's write mode.
Incremental Replication
mkpipe uses an append-only strategy for incremental replication:
- Extract: reads rows where iterate_column >= last_point (inclusive, no boundary loss)
- Load: appends to the target table (never overwrites or deletes)
- Dedup: if dedup_columns is set, a mkpipe_id (xxhash64 hash) is generated for downstream deduplication
An ingestion timestamp column is always added to every row. The column name defaults to _ingested_at and can be customized via settings.ingested_at_column.
- name: public.users
  target_name: stg_users
  replication_method: incremental
  iterate_column: updated_at
  iterate_column_type: datetime
  dedup_columns: [id, updated_at]   # mkpipe_id = xxhash64(id, updated_at)
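The hash value corresponds to Spark's built-in xxhash64 over the dedup columns. mkpipe adds the column for you; the sketch below only illustrates what it is equivalent to, assuming df is the extracted DataFrame:

from pyspark.sql import functions as F

# Equivalent of dedup_columns: [id, updated_at]; mkpipe_id is the xxhash64 of those columns.
df_with_id = df.withColumn("mkpipe_id", F.xxhash64("id", "updated_at"))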
Partition Column Type Behavior
The partitions_column_type parameter controls how partition bounds are converted for Spark JDBC partitioning:
Scenario 1: No partition column specified (default)
- name: public.orders
  replication_method: incremental
  iterate_column: created_at
  iterate_column_type: datetime
  # partitions_column defaults to created_at
  # partitions_column_type inherits datetime from iterate_column_type
Scenario 2: Integer partition column specified
- name: public.customers
  replication_method: incremental
  iterate_column: updated_at
  iterate_column_type: datetime
  partitions_column: customer_id
  # partitions_column_type defaults to 'int' (most partition keys are integers)
  # Handles PostgreSQL NUMERIC/DECIMAL types correctly
Scenario 3: Explicit datetime partition column
- name: public.events
  replication_method: incremental
  iterate_column: event_id
  iterate_column_type: int
  partitions_column: event_timestamp
  partitions_column_type: datetime   # Must be explicit if different from iterate_column_type
Downstream dedup query example:
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY mkpipe_id ORDER BY _ingested_at DESC) AS rn
    FROM stg_users
) AS deduped
WHERE rn = 1
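The same deduplication can be done in PySpark for downstream jobs. A sketch, assuming the default mkpipe_id and _ingested_at column names and an active SparkSession named spark:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the most recently ingested row per mkpipe_id.
w = Window.partitionBy("mkpipe_id").orderBy(F.col("_ingested_at").desc())
deduped = (
    spark.table("stg_users")
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)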
Inline Transformations
Add a transform field to any table:
tables:
  - name: public.products
    target_name: stg_products
    replication_method: full
    transform: transforms/clean_products.py::transform
The transform function receives and returns a PySpark DataFrame:
# transforms/clean_products.py
def transform(df):
    df = df.filter(df.status != "deleted")
    return df
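The function can use the full PySpark API. A slightly richer, hypothetical variant (the status and name columns are assumptions; the df-in/df-out contract is the same):

# transforms/clean_products_extended.py (hypothetical example)
from pyspark.sql import functions as F

def transform(df):
    df = df.filter(df.status != "deleted")                      # drop soft-deleted rows
    df = df.withColumn("name", F.trim(F.lower(F.col("name"))))  # normalize the name column
    df = df.withColumn("load_date", F.current_date())           # stamp the load date
    return df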
Orchestrator Integration
Dagster
from dagster import asset, Definitions
import mkpipe

@asset
def api_tables():
    mkpipe.run(config="mkpipe_project.yaml", tags=["api"])

@asset
def critical_tables():
    mkpipe.run(config="mkpipe_project.yaml", tags=["critical"])

defs = Definitions(assets=[api_tables, critical_tables])
Airflow
from airflow.decorators import task

@task
def sync_api_tables():
    import mkpipe
    mkpipe.run(config="/path/to/mkpipe_project.yaml", tags=["api"])

@task
def sync_user_domain():
    import mkpipe
    mkpipe.run(config="/path/to/mkpipe_project.yaml", tags=["user-domain"])
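Cron or a plain Python script works the same way, since a run is just a function call. A minimal stdlib-only sketch:

import time
import mkpipe

# Run the "api" tables once an hour from a long-running process.
while True:
    mkpipe.run(config="/path/to/mkpipe_project.yaml", tags=["api"])
    time.sleep(3600)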
Backend (State Tracking)
mkpipe tracks pipeline state (last sync point, status) in a manifest database. Default is SQLite (zero-config). PostgreSQL, DuckDB, and ClickHouse are also supported:
settings:
  backend:
    variant: postgres
    host: localhost
    port: 5432
    database: mkpipe_db
    user: mkpipe
    password: ${BACKEND_PASSWORD}
Install optional backend dependencies:
pip install mkpipe[postgres-backend]
pip install mkpipe[duckdb-backend]
pip install mkpipe[clickhouse-backend]
pip install mkpipe[all-backends]
Custom Exceptions
mkpipe provides specific exception classes for clean error handling:
from mkpipe import (
    MkpipeError,          # base exception
    ConfigError,          # YAML or configuration issues
    ExtractionError,      # data extraction failures
    LoadError,            # data loading failures
    TransformError,       # transformation failures
    PluginNotFoundError,  # missing plugin
    BackendError,         # backend manifest failures
)
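A retry-oriented wrapper might treat configuration errors as fatal and extraction/load errors as retryable. A sketch (only the exception classes and mkpipe.run come from mkpipe; the logging setup is illustrative):

import logging
import mkpipe
from mkpipe import ConfigError, ExtractionError, LoadError, MkpipeError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

try:
    mkpipe.run(config="mkpipe_project.yaml", tags=["critical"])
except ConfigError as e:
    log.error("Bad configuration, fix the YAML before retrying: %s", e)
    raise
except (ExtractionError, LoadError) as e:
    log.warning("Data movement failed, safe to retry: %s", e)
    raise
except MkpipeError as e:
    log.error("Unexpected mkpipe failure: %s", e)
    raise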
JAR Management
mkpipe plugins that depend on JDBC drivers or Spark connectors need JAR files. mkpipe handles this automatically — no manual steps required for most users.
Online (Default) — Lazy Download
When mkpipe starts, it detects which plugins are installed and resolves their Maven dependencies via spark.jars.packages. JARs are downloaded on first run and cached by Spark's Ivy resolver.
# Nothing to do — just run your pipeline
mkpipe run
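Under the hood this relies on Spark's standard mechanism of listing Maven coordinates in spark.jars.packages, which Ivy downloads and caches on first use. A minimal sketch of that mechanism (the coordinate is an example, not mkpipe's actual resolution code):

from pyspark.sql import SparkSession

# Ivy resolves and caches any coordinates listed in spark.jars.packages on first use.
spark = (
    SparkSession.builder
    .appName("mkpipe")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")  # example coordinate
    .getOrCreate()
)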
Offline / Docker — Pre-download JARs
For air-gapped or on-premise environments without internet access, pre-download all JARs during the Docker build:
mkpipe install-jars
This command:
- Discovers all installed plugins and their Maven dependencies
- Downloads JARs via Spark's Ivy resolver into a temp directory
- Copies them into each plugin's jars/ directory
- Cleans up the temp Ivy cache
Dockerfile example:
FROM python:3.11-slim
# Install Java (required for PySpark)
RUN apt-get update && apt-get install -y default-jdk && rm -rf /var/lib/apt/lists/*
# Install mkpipe and plugins
RUN pip install mkpipe mkpipe-extractor-postgres mkpipe-loader-clickhouse
# Pre-download JARs (no internet needed at runtime)
RUN mkpipe install-jars
COPY mkpipe_project.yaml .
CMD ["mkpipe", "run"]
How It Works
| Scenario | Local JARs in jars/ | Maven resolution |
|---|---|---|
| Fresh install, online | No | Yes — spark.jars.packages |
| After mkpipe install-jars | Yes | No — local JARs used |
| Plugin with custom JAR (e.g. MongoDB mkpipe-tls-helper.jar) | Yes (custom) | Yes — only for missing Maven deps |
CLI Reference
mkpipe install-jars # Download all Maven JARs for installed plugins
Available Plugins
For the full list, visit the mkpipe-hub.
Extractors
- PostgreSQL, MySQL, MariaDB, SQL Server, Oracle, SQLite, Redshift, ClickHouse, MongoDB, Snowflake, BigQuery, Cassandra, TimescaleDB, DynamoDB, Elasticsearch, InfluxDB, Redis, File (S3/GCS/local/Iceberg/Delta)
Loaders
- PostgreSQL, MySQL, MariaDB, SQL Server, Oracle, SQLite, Redshift, ClickHouse, MongoDB, Snowflake, BigQuery, Cassandra, TimescaleDB, DynamoDB, Elasticsearch, InfluxDB, Redis, File (S3/GCS/local/Iceberg/Delta)
License
Apache 2.0 — see LICENSE.
Download files
File details
Details for the file mkpipe-0.10.0.tar.gz.
File metadata
- Download URL: mkpipe-0.10.0.tar.gz
- Upload date:
- Size: 39.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c196c1dbc9556709ce2ad34bb195b8737e8e59d81117f43f590a68dce841822 |
| MD5 | 2d33c045110fb563765da47c5802059d |
| BLAKE2b-256 | 0fcea17d68f8eeaf6316f5467648857a195617e6531764dee1c6b5d6ebfa1ca7 |
|
File details
Details for the file mkpipe-0.10.0-py3-none-any.whl.
File metadata
- Download URL: mkpipe-0.10.0-py3-none-any.whl
- Upload date:
- Size: 39.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 69ab1cd1bb4127025269b46ec957ea034ed04aa28801c2fe446f260d6b9daec9 |
| MD5 | 1b4b77da9dcd8de887899fb8e40b2fe7 |
| BLAKE2b-256 | 8a8bfb36a44ac66d04d01cdb8e64e703d45c64b07606d4d8e2e434fb5f7dbcc6 |