Skip to main content

GlassFlow Python SDK: Create GlassFlow pipelines between Kafka and ClickHouse

Project description

GlassFlow Python SDK


A Python SDK for creating and managing data pipelines between Kafka and ClickHouse.

Features

  • Create and manage data pipelines between Kafka and ClickHouse
  • Deduplication of events during a time window based on a key
  • Temporal joins between topics based on a common key with a given time window
  • Schema validation and configuration management

Installation

pip install glassflow

Quick Start

Initialize client

from glassflow.etl import Client

# Initialize GlassFlow client
client = Client(host="your-glassflow-etl-url")

Create a pipeline

pipeline_config = {
    "version": "v2",
    "pipeline_id": "my-pipeline-id",
    "source": {
      "type": "kafka",
      "connection_params": {
        "brokers": [
          "http://my.kafka.broker:9093"
        ],
        "protocol": "PLAINTEXT",
        "mechanism": "NO_AUTH"
      },
      "topics": [
        {
          "consumer_group_initial_offset": "latest",
          "name": "users",
          "deduplication": {
            "enabled": True,
            "id_field": "event_id",
            "id_field_type": "string",
            "time_window": "1h"
          }
        }
      ]
    },
    "join": {
      "enabled": False
    },
    "sink": {
      "type": "clickhouse",
      "host": "http://my.clickhouse.server",
      "port": "9000",
      "database": "default",
      "username": "default",
      "password": "c2VjcmV0",
      "secure": False,
      "max_batch_size": 1000,
      "max_delay_time": "30s",
      "table": "users_dedup"
    },
    "schema": {
      "fields": [
        {
          "source_id": "users",
          "name": "event_id",
          "type": "string",
          "column_name": "event_id",
          "column_type": "UUID"
        },
        {
          "source_id": "users",
          "field_name": "user_id",
          "column_name": "user_id",
          "column_type": "UUID"
        },
        {
          "source_id": "users",
          "name": "created_at",
          "type": "string",
          "column_name": "created_at",
          "column_type": "DateTime"
        },
        {
          "source_id": "users",
          "name": "name",
          "type": "string",
          "column_name": "name",
          "column_type": "String"
        },
        {
          "source_id": "users",
          "name": "email",
          "type": "string",
          "column_name": "email",
          "column_type": "String"
        }
      ]
    }
}

# Create a pipeline
pipeline = client.create_pipeline(pipeline_config)

Get pipeline

# Get a pipeline by ID
pipeline = client.get_pipeline("my-pipeline-id")

List pipelines

pipelines = client.list_pipelines()
for pipeline in pipelines:
    print(f"Pipeline ID: {pipeline['pipeline_id']}")
    print(f"Name: {pipeline['name']}")
    print(f"Transformation Type: {pipeline['transformation_type']}")
    print(f"Created At: {pipeline['created_at']}")
    print(f"State: {pipeline['state']}")

Stop / Terminate / Resume Pipeline

pipeline = client.get_pipeline("my-pipeline-id")
pipeline.stop()
print(pipeline.status)
STOPPING
# Stop a pipeline ungracefully (terminate)
client.stop_pipeline("my-pipeline-id", terminate=True)
print(pipeline.status)
TERMINATING
pipeline = client.get_pipeline("my-pipeline-id")
pipeline.resume()
print(pipeline.status)
RESUMING

Delete pipeline

Only stopped or terminated pipelines can be deleted.

# Delete a pipeline
client.delete_pipeline("my-pipeline-id")

# Or delete via pipeline instance
pipeline.delete()

Pipeline Configuration

For detailed information about the pipeline configuration, see GlassFlow docs.

Tracking

The SDK includes anonymous usage tracking to help improve the product. Tracking is enabled by default but can be disabled in two ways:

  1. Using an environment variable:
export GF_TRACKING_ENABLED=false
  1. Programmatically using the disable_tracking method:
from glassflow.etl import Client

client = Client(host="my-glassflow-host")
client.disable_tracking()

The tracking collects anonymous information about:

  • SDK version
  • Platform (operating system)
  • Python version
  • Pipeline ID
  • Whether joins or deduplication are enabled
  • Kafka security protocol, auth mechanism used and whether authentication is disabled
  • Errors during pipeline creation and deletion

Development

Setup

  1. Clone the repository
  2. Create a virtual environment
  3. Install dependencies:
uv venv
source .venv/bin/activate
uv pip install -e .[dev]

Testing

pytest

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

glassflow-3.7.2.tar.gz (87.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

glassflow-3.7.2-py3-none-any.whl (25.3 kB view details)

Uploaded Python 3

File details

Details for the file glassflow-3.7.2.tar.gz.

File metadata

  • Download URL: glassflow-3.7.2.tar.gz
  • Upload date:
  • Size: 87.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for glassflow-3.7.2.tar.gz
Algorithm Hash digest
SHA256 b69b5ea35f1546760a750a6db58dd8e7cf4a6a12034d64daac7e610764f48040
MD5 e5e2e0a8e0bfb834bb6c7e694ee464af
BLAKE2b-256 cb751fb203e43d8c90196f2e9a300fa676957644420057e6fe2a63bc9fbfffcd

See more details on using hashes here.

File details

Details for the file glassflow-3.7.2-py3-none-any.whl.

File metadata

  • Download URL: glassflow-3.7.2-py3-none-any.whl
  • Upload date:
  • Size: 25.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for glassflow-3.7.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a3b7ede9d06af0914d963a20cf919cbb4e35e5b3cd8eebcc55bfe85f2d4bf2ae
MD5 d1cadb67a3b9fcb9344a67f6209a4f14
BLAKE2b-256 6f932fb6f9db9f33a93a597bbc510390b7277f3bfd73bef54ba82ec28effa1d4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page