# mkpipe-loader-clickhouse

ClickHouse loader plugin for MkPipe. Writes Spark DataFrames into ClickHouse tables using the native clickhouse-spark connector, which uses ClickHouse's binary HTTP protocol for efficient columnar inserts.
## Documentation

For more detailed documentation, please visit the GitHub repository.

## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
## Connection Configuration

```yaml
connections:
  clickhouse_target:
    variant: clickhouse
    host: localhost
    port: 8123
    database: target_db
    user: default
    password: mypassword
```
## Table Configuration

```yaml
pipelines:
  - name: pg_to_clickhouse
    source: pg_source
    destination: clickhouse_target
    tables:
      - name: public.events
        target_name: stg_events
        replication_method: full
        batchsize: 50000
```
## Write Strategy

Control how data is written to ClickHouse:

```yaml
- name: public.events
  target_name: stg_events
  write_strategy: upsert   # append | replace | upsert
  write_key: [id]          # required for upsert
```
| Strategy | ClickHouse Behavior |
|---|---|
| `append` | Insert via ClickHouse Spark connector (default for incremental) |
| `replace` | Drop and recreate table, then insert (default for full) |
| `upsert` | Creates table with `ReplacingMergeTree` engine using `write_key` as `ORDER BY`. ClickHouse deduplicates rows with the same key on background merges. |
Note: ClickHouse does not support SQL `MERGE` statements. Upsert semantics are achieved via `ReplacingMergeTree`, which deduplicates asynchronously during compaction. Use `FINAL` in queries to get deduplicated results.
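As an illustration, the table the loader creates for `write_strategy: upsert` with `write_key: [id]` is conceptually equivalent to the DDL below. The column list (`id`, `payload`) is hypothetical — the actual schema is derived from the source DataFrame, and the generated DDL may differ in detail:

```sql
-- Illustrative sketch: upsert target with write_key [id].
-- (id and payload are made-up columns for this example.)
CREATE TABLE target_db.stg_events
(
    id UInt64,
    payload String
)
ENGINE = ReplacingMergeTree
ORDER BY id;

-- Rows sharing the same ORDER BY key are collapsed on background merges.
-- FINAL forces deduplication at query time, at some query cost:
SELECT *
FROM target_db.stg_events FINAL
WHERE id = 42;
```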
## Write Parallelism & Throughput

The ClickHouse loader inherits from `JdbcLoader`. Two parameters control write performance:

```yaml
- name: public.events
  target_name: stg_events
  replication_method: full
  batchsize: 50000     # rows per JDBC batch insert (default: 10000)
  write_partitions: 4  # coalesce DataFrame to N partitions before writing
```
### How they work

- `batchsize`: number of rows buffered before sending a single `INSERT` to ClickHouse. ClickHouse benefits greatly from large batches — use 50,000–500,000 for best throughput.
- `write_partitions`: calls `coalesce(N)` on the DataFrame before writing, reducing the number of concurrent JDBC connections. Useful when you have many Spark partitions and want to limit load on ClickHouse.
## Performance Notes

- ClickHouse is optimized for large bulk inserts. `batchsize` is the most impactful parameter; increase it as much as your driver memory allows.
- Avoid setting `write_partitions` too low (e.g. 1), as it reduces parallelism. A value of 4–8 balances load and throughput.
- ClickHouse's MergeTree engine merges parts in the background; very frequent small inserts create many parts and degrade query performance. Prefer fewer large batches.
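One way to spot merge pressure after a load is to count the active data parts for the target table via ClickHouse's `system.parts` system table. A sketch, using the example database and table names from the configs above (`target_db`, `stg_events`):

```sql
-- Count active data parts for the target table; a steadily growing
-- count under frequent small inserts signals merge pressure.
SELECT count() AS active_parts
FROM system.parts
WHERE database = 'target_db'
  AND table = 'stg_events'
  AND active;
```

If the count keeps climbing, increase `batchsize` or reduce insert frequency rather than relying on background merges to catch up.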
## All Table Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Source table name |
| `target_name` | string | required | ClickHouse destination table name |
| `replication_method` | `full` / `incremental` | `full` | Replication strategy |
| `batchsize` | int | `10000` | Rows per JDBC batch insert |
| `write_partitions` | int | — | Coalesce DataFrame to N partitions before writing |
| `write_strategy` | string | — | `append`, `replace`, `upsert` |
| `write_key` | list | — | Key columns for upsert (used as ReplacingMergeTree `ORDER BY`) |
| `dedup_columns` | list | — | Columns used for `mkpipe_id` hash deduplication |
| `tags` | list | `[]` | Tags for selective pipeline execution |
| `pass_on_error` | bool | `false` | Skip table on error instead of failing |