ClickHouse loader for mkpipe.
Project description
mkpipe-loader-clickhouse
ClickHouse loader plugin for MkPipe. Writes Spark DataFrames into ClickHouse tables using the native clickhouse-spark connector, which uses ClickHouse's binary HTTP protocol for efficient columnar inserts.
Documentation
For more detailed documentation, please visit the GitHub repository.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Connection Configuration
```yaml
connections:
  clickhouse_target:
    variant: clickhouse
    host: localhost
    port: 8123          # ClickHouse HTTP interface port
    database: target_db
    user: default
    password: mypassword
```
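For orientation, a loader built on Spark's JDBC writer would typically turn a connection block like this into a JDBC URL. The sketch below is illustrative only — the helper name is hypothetical and mkpipe's internal URL construction may differ:

```python
def clickhouse_jdbc_url(conn: dict) -> str:
    """Build a JDBC URL for the ClickHouse driver from a connection block.

    Illustrative helper, not part of mkpipe's public API.
    """
    return f"jdbc:clickhouse://{conn['host']}:{conn['port']}/{conn['database']}"

conn = {
    "host": "localhost",
    "port": 8123,
    "database": "target_db",
    "user": "default",
    "password": "mypassword",
}
print(clickhouse_jdbc_url(conn))
# → jdbc:clickhouse://localhost:8123/target_db
```

Credentials (`user`, `password`) are normally passed to the writer as separate connection properties rather than embedded in the URL.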
Table Configuration
```yaml
pipelines:
  - name: pg_to_clickhouse
    source: pg_source
    destination: clickhouse_target
    tables:
      - name: public.events
        target_name: stg_events
        replication_method: full
        batchsize: 50000
```
Write Parallelism & Throughput
The ClickHouse loader inherits from JdbcLoader. Two parameters control write performance:
```yaml
- name: public.events
  target_name: stg_events
  replication_method: full
  batchsize: 50000       # rows per JDBC batch insert (default: 10000)
  write_partitions: 4    # coalesce DataFrame to N partitions before writing
```
How they work

- `batchsize`: the number of rows buffered before a single `INSERT` is sent to ClickHouse. ClickHouse benefits greatly from large batches — use 50,000–500,000 for best throughput.
- `write_partitions`: calls `coalesce(N)` on the DataFrame before writing, reducing the number of concurrent JDBC connections. Useful when you have many Spark partitions and want to limit load on ClickHouse.
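To see why these two knobs matter, the rough arithmetic below (plain Python, no Spark required) counts the `INSERT` statements a full write would issue under different settings, assuming rows are spread evenly across partitions:

```python
import math

def insert_count(rows: int, batchsize: int, write_partitions: int) -> int:
    """Approximate number of INSERT statements for a full write.

    Each partition buffers up to `batchsize` rows per INSERT, so it
    issues ceil(rows_per_partition / batchsize) INSERTs.
    """
    rows_per_partition = math.ceil(rows / write_partitions)
    return write_partitions * math.ceil(rows_per_partition / batchsize)

rows = 10_000_000
print(insert_count(rows, batchsize=10_000, write_partitions=4))   # 1000 small INSERTs
print(insert_count(rows, batchsize=250_000, write_partitions=4))  # 40 large INSERTs
```

Raising `batchsize` from 10,000 to 250,000 cuts the number of round trips (and the number of MergeTree parts created) by 25x for the same data volume.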
Performance Notes
- ClickHouse is optimized for large bulk inserts. `batchsize` is the most impactful parameter — increase it as far as your driver memory allows.
- Avoid setting `write_partitions` too low (e.g. 1), as that reduces parallelism. A value of 4–8 balances load and throughput.
- ClickHouse's MergeTree engine merges parts in the background; very frequent small inserts create many parts and degrade query performance. Prefer fewer, larger batches.
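Putting the notes above together, a throughput-oriented table entry might look like the following. The values are starting points to tune against your driver memory and ClickHouse cluster, not universal recommendations:

```yaml
- name: public.events
  target_name: stg_events
  replication_method: full
  batchsize: 250000     # large batches -> fewer MergeTree parts
  write_partitions: 4   # 4-8 concurrent JDBC connections is a reasonable range
```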
All Table Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Source table name |
| `target_name` | string | required | ClickHouse destination table name |
| `replication_method` | `full` / `incremental` | `full` | Replication strategy |
| `batchsize` | int | `10000` | Rows per JDBC batch insert |
| `write_partitions` | int | — | Coalesce DataFrame to N partitions before writing |
| `dedup_columns` | list | — | Columns used for `mkpipe_id` hash deduplication |
| `tags` | list | `[]` | Tags for selective pipeline execution |
| `pass_on_error` | bool | `false` | Skip table on error instead of failing |
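For completeness, here is a table entry exercising most of the parameters above. The table and column names are illustrative:

```yaml
tables:
  - name: public.orders
    target_name: stg_orders
    replication_method: full
    batchsize: 100000
    write_partitions: 4
    dedup_columns: [order_id, updated_at]
    tags: [nightly]
    pass_on_error: true
```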
File details
Details for the file mkpipe_loader_clickhouse-0.5.0.tar.gz.
File metadata
- Download URL: mkpipe_loader_clickhouse-0.5.0.tar.gz
- Upload date:
- Size: 8.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6058fab7298bdf91cf1ec11417d7672ff0ac7236d630799660f0b4f85a5c75c3` |
| MD5 | `c2169d5095bec78c96db1a46d0a7e2ff` |
| BLAKE2b-256 | `83ee6cef339cc79e9571e87a830061d247a20b1d8743fff91ae42e522e59aca9` |
File details
Details for the file mkpipe_loader_clickhouse-0.5.0-py3-none-any.whl.
File metadata
- Download URL: mkpipe_loader_clickhouse-0.5.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `417765024b0f99517caea6a86d4b020fbd9740805edd634c47d1a6def9785182` |
| MD5 | `a2d4ed4f4b248e51dd82764295032e29` |
| BLAKE2b-256 | `d3446756f46a4395d50b9b013b9a4110486736b39b7255f8f94a3bfe7953bec9` |