
GlassFlow ClickHouse ETL Python SDK: Create GlassFlow pipelines between Kafka and ClickHouse


ClickHouse ETL Python SDK


A Python SDK for creating and managing data pipelines between Kafka and ClickHouse.

Features

  • Create and manage data pipelines between Kafka and ClickHouse
  • Deduplication of events within a configurable time window, keyed on a chosen field
  • Temporal joins between two topics on a common key within a given time window
  • Schema validation and configuration management
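The windowed deduplication idea can be sketched in plain Python. This is a simplified, client-side illustration only — the pipeline performs deduplication server-side on the Kafka stream, and the exact semantics (e.g. whether a dropped duplicate refreshes the window) follow the pipeline configuration, not this sketch:

```python
from datetime import datetime, timedelta

def deduplicate(events, id_field, window):
    """Keep the first event per id_field value within each time window.

    `events` is a list of (timestamp, event_dict) pairs.
    """
    last_seen = {}  # id value -> timestamp of the last event kept
    kept = []
    for ts, event in events:
        key = event[id_field]
        prev = last_seen.get(key)
        if prev is None or ts - prev > window:
            kept.append(event)
            last_seen[key] = ts  # only kept events refresh the window here
    return kept

events = [
    (datetime(2024, 1, 1, 10, 0),  {"event_id": "e1", "user_id": "u1"}),
    (datetime(2024, 1, 1, 10, 30), {"event_id": "e1", "user_id": "u1"}),  # duplicate inside the 1h window
    (datetime(2024, 1, 1, 11, 30), {"event_id": "e1", "user_id": "u1"}),  # same id, but outside the window
]
print(len(deduplicate(events, "event_id", timedelta(hours=1))))  # 2
```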

Installation

pip install glassflow-clickhouse-etl

Quick Start

Initialize client

from glassflow_clickhouse_etl import Client

# Initialize GlassFlow client
client = Client(host="your-glassflow-etl-url")

Create a pipeline

pipeline_config = {
    "pipeline_id": "deduplication-demo-pipeline",
    "source": {
      "type": "kafka",
      "provider": "confluent",
      "connection_params": {
        "brokers": [
          "kafka:9093"
        ],
        "protocol": "PLAINTEXT",
        "skip_auth": True
      },
      "topics": [
        {
          "consumer_group_initial_offset": "latest",
          "name": "users",
          "schema": {
            "type": "json",
            "fields": [
              {
                "name": "event_id",
                "type": "string"
              },
              {
                "name": "user_id",
                "type": "string"
              },
              {
                "name": "name",
                "type": "string"
              },
              {
                "name": "email",
                "type": "string"
              },
              {
                "name": "created_at",
                "type": "string"
              }
            ]
          },
          "deduplication": {
            "enabled": True,
            "id_field": "event_id",
            "id_field_type": "string",
            "time_window": "1h"
          }
        }
      ]
    },
    "join": {
      "enabled": False
    },
    "sink": {
      "type": "clickhouse",
      "provider": "localhost",
      "host": "clickhouse",
      "port": "9000",
      "database": "default",
      "username": "default",
      "password": "c2VjcmV0",
      "secure": False,
      "max_batch_size": 1000,
      "max_delay_time": "30s",
      "table": "users_dedup",
      "table_mapping": [
        {
          "source_id": "users",
          "field_name": "event_id",
          "column_name": "event_id",
          "column_type": "UUID"
        },
        {
          "source_id": "users",
          "field_name": "user_id",
          "column_name": "user_id",
          "column_type": "UUID"
        },
        {
          "source_id": "users",
          "field_name": "created_at",
          "column_name": "created_at",
          "column_type": "DateTime"
        },
        {
          "source_id": "users",
          "field_name": "name",
          "column_name": "name",
          "column_type": "String"
        },
        {
          "source_id": "users",
          "field_name": "email",
          "column_name": "email",
          "column_type": "String"
        }
      ]
    }
}

# Create a pipeline
pipeline = client.create_pipeline(pipeline_config)
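The Quick Start above disables joins. For the temporal-join feature, the `join` section of the configuration might look roughly like the sketch below; the topic names, join key, and exact schema keys here are illustrative assumptions — consult the GlassFlow docs for the authoritative join schema:

```python
# Hypothetical join section joining a "users" and an "orders" topic on
# "user_id" within a 1-hour window (field names are illustrative).
join_config = {
    "enabled": True,
    "type": "temporal",
    "sources": [
        {
            "source_id": "users",
            "join_key": "user_id",
            "time_window": "1h",
            "orientation": "left",
        },
        {
            "source_id": "orders",
            "join_key": "user_id",
            "time_window": "1h",
            "orientation": "right",
        },
    ],
}
```

This fragment would replace the `"join": {"enabled": False}` section of the pipeline configuration; both topics would also need entries under `source.topics`.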

Get pipeline

# Get a pipeline by ID
pipeline = client.get_pipeline("my-pipeline-id")

List pipelines

pipelines = client.list_pipelines()
for pipeline in pipelines:
    print(f"Pipeline ID: {pipeline['pipeline_id']}")
    print(f"Name: {pipeline['name']}")
    print(f"Transformation Type: {pipeline['transformation_type']}")
    print(f"Created At: {pipeline['created_at']}")
    print(f"State: {pipeline['state']}")
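Since `list_pipelines()` returns plain dicts with the fields printed above, they can be filtered with ordinary Python. The helper below is a small illustration; the `"Running"`/`"Paused"` state values in the sample data are assumptions, not documented constants:

```python
def pipelines_in_state(pipelines, state):
    """Filter pipeline summaries (as returned by list_pipelines) by state."""
    return [p for p in pipelines if p["state"] == state]

# Sample data mimicking the assumed shape of list_pipelines() output
sample = [
    {"pipeline_id": "p1", "state": "Running"},
    {"pipeline_id": "p2", "state": "Paused"},
]
print([p["pipeline_id"] for p in pipelines_in_state(sample, "Running")])  # ['p1']
```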

Delete pipeline

# Delete a pipeline
client.delete_pipeline("my-pipeline-id")

# Or delete via pipeline instance
pipeline.delete()

Update pipeline

# Update the sink table name
config_patch = {
  "sink": {
    "table": "new_table_name"
  }
}

pipeline = client.get_pipeline("my-pipeline-id")
pipeline.update(config_patch)

Pause / Resume pipeline

# Stops ingesting new messages and finishes processing the messages already inside the pipeline
pipeline.pause()

# Resumes ingestion
pipeline.resume()

Pipeline Configuration

For detailed information about pipeline configuration options, see the GlassFlow docs.

Tracking

The SDK includes anonymous usage tracking to help improve the product. Tracking is enabled by default but can be disabled in two ways:

  1. Using an environment variable:

     export GF_TRACKING_ENABLED=false

  2. Programmatically, using the disable_tracking method:

     from glassflow_clickhouse_etl import Client

     client = Client(host="my-glassflow-host")
     client.disable_tracking()

The tracking collects anonymous information about:

  • SDK version
  • Platform (operating system)
  • Python version
  • Pipeline ID
  • Whether joins or deduplication are enabled
  • Kafka security protocol, the auth mechanism used, and whether authentication is disabled
  • Errors during pipeline creation and deletion

Development

Setup

  1. Clone the repository
  2. Create and activate a virtual environment:

     uv venv
     source .venv/bin/activate

  3. Install dependencies:

     uv pip install -e .[dev]

Testing

pytest

Download files

Download the file for your platform.

Source Distribution

glassflow_clickhouse_etl-1.0.0.tar.gz (78.5 kB)

Built Distribution


glassflow_clickhouse_etl-1.0.0-py3-none-any.whl (19.9 kB)

File details

Details for the file glassflow_clickhouse_etl-1.0.0.tar.gz.


Hashes for glassflow_clickhouse_etl-1.0.0.tar.gz:

  • SHA256: 713a7ddb713d86e5eb33a71fcfe6ceaea9c0ead73f26616b88bbe3e97ea17c6e
  • MD5: 93b47a3b1285ebd24e506ad7d49ee05e
  • BLAKE2b-256: dae30b323f0bfad46d9ca00ac7c428f5ac42665d268ac6ba1d30ff5265d78a1f


File details

Details for the file glassflow_clickhouse_etl-1.0.0-py3-none-any.whl.


Hashes for glassflow_clickhouse_etl-1.0.0-py3-none-any.whl:

  • SHA256: 9b20e359b923fb10aa9bc09099d882014054d987e58d829962ee38cdca7929ad
  • MD5: 11cf0468f9a806143ab662c19f91a0ad
  • BLAKE2b-256: eefbf972ad8f2160ff99219c1364a8596087fd7f6412783c3a682570ce016a6a

