
GlassFlow ClickHouse ETL Python SDK: Create GlassFlow pipelines between Kafka and ClickHouse


ClickHouse ETL Python SDK


A Python SDK for creating and managing data pipelines between Kafka and ClickHouse.

Features

  • Create and manage data pipelines between Kafka and ClickHouse
  • Deduplication of events within a time window, based on a key
  • Temporal joins between topics on a common key within a given time window
  • Schema validation and configuration management

Installation

pip install glassflow-clickhouse-etl

Quick Start

from glassflow_clickhouse_etl import Pipeline


pipeline_config = {
  "pipeline_id": "test-pipeline",
  "source": {
    "type": "kafka",
    "provider": "aiven",
    "connection_params": {
      "brokers": ["localhoust:9092"],
      "protocol": "SASL_SSL",
      "mechanism": "SCRAM-SHA-256",
      "username": "user",
      "password": "pass"
    },
    "topics": [
      {
        "consumer_group_initial_offset": "earliest",
        "id": "test-topic",
        "name": "test-topic",
        "schema": {
          "type": "json",
          "fields": [
            {"name": "id", "type": "string" },
            {"name": "email", "type": "string"}
          ]
        },
        "deduplication": {
          "id_field": "id",
          "id_field_type": "string",
          "time_window": "1h",
          "enabled": True
        }
      }
    ],
  },
  "sink": {
    "type": "clickhouse",
    "host": "localhost:8443",
    "port": 8443,
    "database": "test",
    "username": "default",
    "password": "pass",
    "table_mapping": [
      {
        "source_id": "test_table",
        "field_name": "id",
        "column_name": "user_id",
        "column_type": "UUID"
      },
      {
        "source_id": "test_table",
        "field_name": "email",
        "column_name": "email",
        "column_type": "String"
      }
    ]
  }
}

# Instantiate a pipeline from the configuration above
pipeline = Pipeline(pipeline_config)

# Create the pipeline
pipeline.create()
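
The configuration can also be kept in a JSON file and loaded at runtime; the sketch below assumes a file named pipeline.json containing the same structure as the dictionary above (with JSON booleans, i.e. true instead of True):

import json

from glassflow_clickhouse_etl import Pipeline

# Load the pipeline configuration from a file (the path "pipeline.json"
# is only an example) and create the pipeline from it.
with open("pipeline.json") as f:
    pipeline_config = json.load(f)

pipeline = Pipeline(pipeline_config)
pipeline.create()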

Configuration

For detailed information about the pipeline configuration, see CONFIGURATION.
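
The quick start above only enables deduplication. For the temporal-join feature, the pipeline configuration also accepts a join section; the sketch below is illustrative only — the field names, topic ids, and keys are assumptions, and CONFIGURATION is the authoritative reference:

# Illustrative join section, assuming a schema similar in spirit to the
# source/sink blocks above: two topics ("orders" and "users") joined on a
# shared key within a one-hour window. All field names are assumptions.
join_config = {
    "enabled": True,
    "type": "temporal",
    "sources": [
        {"source_id": "orders", "join_key": "user_id", "time_window": "1h", "orientation": "left"},
        {"source_id": "users", "join_key": "id", "time_window": "1h", "orientation": "right"}
    ]
}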

Tracking

The SDK includes anonymous usage tracking to help improve the product. Tracking is enabled by default but can be disabled in two ways:

  1. Using an environment variable:

export GF_TRACKING_ENABLED=false

  2. Programmatically, using the disable_tracking method:

pipeline = Pipeline(pipeline_config)
pipeline.disable_tracking()
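
To keep tracking off for an entire process without touching shell configuration, the same environment variable can be set from Python before any pipeline is constructed (a minimal sketch; it assumes the SDK reads GF_TRACKING_ENABLED at runtime):

import os

# Assumption: the SDK checks GF_TRACKING_ENABLED when a Pipeline is created,
# so setting it before construction keeps tracking disabled for this process.
os.environ["GF_TRACKING_ENABLED"] = "false"

from glassflow_clickhouse_etl import Pipeline

pipeline = Pipeline(pipeline_config)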

The tracking collects anonymous information about:

  • SDK version
  • Platform (operating system)
  • Python version
  • Pipeline ID
  • Whether joins or deduplication are enabled

Development

Setup

  1. Clone the repository
  2. Create a virtual environment
  3. Install dependencies:

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Testing

pytest
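
To also produce a coverage report (assuming pytest-cov is installed as part of the dev dependencies):

pytest --cov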

