Orchestration library for GCP data pipelines

Project description

gcp-pipeline-orchestration

Control library: Airflow DAG factories, sensors, and operators.

Depends on: gcp-pipeline-core
NO Apache Beam dependency.

Architecture

                      GCP-PIPELINE-ORCHESTRATION
                      ─────────────────────────

  ┌─────────────────────────────────────────────────────────────────┐
  │                     CONTROL LAYER                                │
  │                                                                  │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                      Sensors                             │    │
  │  │  • BasePubSubPullSensor (detect .ok files)              │    │
  │  │  • Filter by extension (.ok, .csv)                      │    │
  │  │  • Extract file metadata to XCom                        │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                    Operators                             │    │
  │  │  • BatchDataflowOperator (start batch ingestion)         │    │
  │  │  • StreamingDataflowOperator (start streaming)           │    │
  │  │  • ReconciliationOperator (validate counts)             │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                 Entity Dependency                        │    │
  │  │  • EntityDependencyChecker (wait for all entities)      │    │
  │  │  • Query job_control table for entity status            │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                   DAG Factories                          │    │
  │  │  • DAGFactory (generate DAGs from config)               │    │
  │  │  • Callbacks (on_failure, on_success)                   │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                                                                  │
  └─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                       Uses: gcp-pipeline-core

Orchestration Flow

  Pub/Sub                    Airflow                       External
  ───────                    ───────                       ────────

  .ok file     ┌─────────────────────────────────────────────────────┐
  notification │                                                     │
      │        │  ┌──────────────┐                                   │
      └───────►│  │ PubSub       │                                   │
               │  │ Pull Sensor  │                                   │
               │  │              │                                   │
               │  │ • Filter .ok │                                   │
               │  │ • Extract    │                                   │
               │  │   metadata   │                                   │
               │  └──────┬───────┘                                   │
               │         │                                           │
               │         ▼ (XCom: file_path, entity, date)           │
               │  ┌──────────────┐                                   │
               │  │ File         │                                   │
               │  │ Discovery    │                                   │
               │  │              │                                   │
               │  │ • Find all   │                                   │
               │  │   split files│                                   │
               │  └──────┬───────┘                                   │
               │         │                                           │
               │         ▼                                           │
               │  ┌──────────────┐    ┌──────────────┐               │
               │  │ Trigger      │───►│ Dataflow     │               │
               │  │ Dataflow     │    │ Job          │               │
               │  └──────────────┘    └──────┬───────┘               │
               │                             │ (Failure)             │
               │                             ▼                       │
               │                      ┌──────────────┐               │
               │                      │ Error Log    │               │
               │                      │ (BigQuery)   │               │
               │                      └──────┬───────┘               │
               │                             │                       │
               │         ┌───────────────────┘ (Success)             │
               │         │                                           │
               │         ▼                                           │
               │  ┌──────────────┐                                   │
               │  │ Dependency   │  (Application1 only, 3 entities)  │
               │  │ Checker      │                                   │
               │  └──────┬───────┘                                   │
               │         │                                           │
               │         ▼ (all ready)                               │
               │  ┌──────────────┐    ┌──────────────┐               │
               │  │ Trigger      │───►│ dbt          │               │
               │  │ dbt          │    │ Transform    │               │
               │  └──────────────┘    └──────────────┘               │
               │                                                     │
               │  ┌──────────────────────────────────────────────────┐
               │  │  PERIODIC MONITORING                             │
               │  │                                                  │
               │  │  ┌──────────────┐        ┌──────────────┐        │
               │  │  │ Error        │◄───────┤ Error Log    │        │
               │  │  │ Handling DAG │        │ (BigQuery)   │        │
               │  │  └──────┬───────┘        └──────────────┘        │
               │  │         │                                        │
               │  │         ▼                                        │
               │  │  ┌──────────────┐        ┌──────────────┐        │
               │  │  │ Automatic    │─Retry─►│ Target       │        │
               │  │  │ Reprocessing │        │ Pipeline     │        │
               │  │  └──────────────┘        └──────────────┘        │
               │  └──────────────────────────────────────────────────┘
               │                                                     │
               └─────────────────────────────────────────────────────┘

Entity Dependency Checker

For systems with multiple entities (such as Application1, which has 3), the checker reports ready only once every required entity has been loaded.

                    ENTITY DEPENDENCY CHECK (Application1)
                    ────────────────────────────

  Customers arrives    ──► Check: [✓] customers
  (4:00 PM)                       [ ] accounts
                                  [ ] decision
                                  → NOT READY

  Accounts arrives     ──► Check: [✓] customers
  (4:00 PM)                       [✓] accounts
                                  [ ] decision
                                  → NOT READY

  Decision arrives     ──► Check: [✓] customers
  (5:00 AM next day)              [✓] accounts
                                  [✓] decision
                                  → ALL READY! → Trigger dbt

How It Works

from datetime import date
from gcp_pipeline_orchestration.dependency import EntityDependencyChecker

# Configure for Application1 system
checker = EntityDependencyChecker(
    project_id="my-project",
    system_id="Application1",
    required_entities=["customers", "accounts", "decision"]
)

# Check if all entities are loaded for today
if checker.all_entities_loaded(extract_date=date.today()):
    # Logic to trigger dbt
    print("Triggering dbt...")
else:
    # Wait - some entities not yet loaded
    pass
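
The check can be pictured as a set comparison between the required entities and the entities that job_control reports as loaded for the extract date. The sketch below is not the library's implementation: the row shape (entity, extract_date, status), the "SUCCESS" status value, and the helper name missing_entities are assumptions made for illustration, and the rows are held in memory rather than queried from BigQuery.

```python
from datetime import date

# Hypothetical job_control rows; the real table schema is not shown in this README.
JOB_CONTROL_ROWS = [
    {"entity": "customers", "extract_date": date(2026, 3, 1), "status": "SUCCESS"},
    {"entity": "accounts", "extract_date": date(2026, 3, 1), "status": "SUCCESS"},
    {"entity": "decision", "extract_date": date(2026, 3, 1), "status": "RUNNING"},
]


def missing_entities(rows, required_entities, extract_date, done_status="SUCCESS"):
    """Return the required entities not yet loaded for extract_date, sorted."""
    loaded = {
        row["entity"]
        for row in rows
        if row["extract_date"] == extract_date and row["status"] == done_status
    }
    return sorted(set(required_entities) - loaded)


required = ["customers", "accounts", "decision"]
pending = missing_entities(JOB_CONTROL_ROWS, required, date(2026, 3, 1))
print("ALL READY" if not pending else f"NOT READY, waiting for: {pending}")
```

Returning the list of missing entities, rather than a bare boolean, makes the "NOT READY" state easier to log and alert on.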

Modules

| Module | Purpose | Key Classes |
|--------|---------|-------------|
| `sensors/` | Pub/Sub sensing | `BasePubSubPullSensor` |
| `operators/` | Custom operators | `BatchDataflowOperator`, `StreamingDataflowOperator` |
| `factories/` | DAG generation | `DAGFactory` |
| `callbacks/` | Error handlers | `on_failure_callback`, `publish_to_dlq` |
| `routing/` | Pipeline routing | `PipelineRouter` |
| `dependency.py` | Entity dependency | `EntityDependencyChecker` |

Key Findings

1. Unified Dataflow Operators

BaseDataflowOperator: Supports both Classic and Flex templates.
Development Stubbing: The operator base class is chosen at import time (BaseOperator if AIRFLOW_AVAILABLE else object), so modules can be imported and unit-tested without a live Airflow/GCP environment.
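
A minimal sketch of this stubbing pattern follows. Only the expression BaseOperator if AIRFLOW_AVAILABLE else object comes from this README; the class name ExampleDataflowOperator, its constructor arguments, and the placeholder template path are hypothetical and do not describe the library's actual operators.

```python
try:
    from airflow.models import BaseOperator  # real base class when Airflow is installed
    AIRFLOW_AVAILABLE = True
except ImportError:
    BaseOperator = object  # placeholder so this module still imports
    AIRFLOW_AVAILABLE = False


class ExampleDataflowOperator(BaseOperator if AIRFLOW_AVAILABLE else object):
    """Hypothetical operator, used only to illustrate the stubbing pattern."""

    def __init__(self, template_path, run_id, **kwargs):
        if AIRFLOW_AVAILABLE:
            super().__init__(**kwargs)  # Airflow needs task_id and similar arguments
        # Without Airflow, the extra keyword arguments are simply ignored.
        self.template_path = template_path
        self.run_id = run_id

    def execute(self, context):
        # A real operator would launch a Dataflow job here.
        return {"template": self.template_path, "run_id": self.run_id}


op = ExampleDataflowOperator(
    template_path="gs://example-bucket/templates/ingest",  # placeholder path
    run_id="run-1",
    task_id="start_ingest",
)
print(op.execute({}))
```

The same module then works in both settings: under Airflow the class is a normal operator, and in a plain unit-test environment it degrades to an ordinary Python object.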

2. Event-Driven Pub/Sub Sensors

BasePubSubPullSensor: Monitors GCS notifications (e.g., waiting for .ok files).
Metadata Extraction: Automated extraction of file paths, entity types, and timestamps into XCom for downstream use.
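
The filtering and extraction step might look like the sketch below. It relies on the bucketId and objectId attributes that Cloud Storage attaches to its Pub/Sub notifications; the object layout <system>/<entity>/<YYYY-MM-DD>/<file>, the function name extract_file_metadata, and the returned keys are assumptions for illustration, not the sensor's real contract.

```python
from datetime import date
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = (".ok",)


def extract_file_metadata(attributes, allowed_extensions=ALLOWED_EXTENSIONS):
    """Return a metadata dict for a matching notification, else None.

    `attributes` is the attribute map of a GCS Pub/Sub notification.
    The layout <system>/<entity>/<YYYY-MM-DD>/<file> is an assumed convention.
    """
    bucket = attributes.get("bucketId")
    object_id = attributes.get("objectId", "")
    path = PurePosixPath(object_id)
    if not bucket or path.suffix not in allowed_extensions or len(path.parts) != 4:
        return None  # wrong extension or unexpected layout: ignore the message
    system, entity, day, _filename = path.parts
    try:
        extract_date = date.fromisoformat(day)
    except ValueError:
        return None  # the date folder is not a valid ISO date
    return {
        "file_path": f"gs://{bucket}/{object_id}",
        "system": system,
        "entity": entity,
        "extract_date": extract_date.isoformat(),
    }


message = {
    "bucketId": "landing",
    "objectId": "application1/customers/2026-03-01/customers.ok",
}
print(extract_file_metadata(message))
```

A sensor built this way would push the returned dict to XCom and keep polling whenever the function returns None.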

3. Entity Dependency Management

EntityDependencyChecker: Coordinates multi-entity systems (like Application1) by ensuring all required datasets (customers, accounts, decision) are present before triggering transformations.

4. Global Error Callbacks

Standardized failure handlers that publish metadata to DLQs (Dead Letter Queues) for automated alerting and manual intervention.

Error Handling & Reprocessing

The framework implements a two-tier error handling strategy: Immediate Capture and Periodic Recovery.

1. Immediate Capture (Callbacks)

When a task fails, Airflow invokes the library's on_failure_callback, which does two things:

DLQ Publishing: Standardized task metadata (run_id, system_id, exception) is published to a Pub/Sub DLQ.
Audit Logging: The error is logged to the BigQuery error_log table for centralized tracking.
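
The DLQ half of that callback could be sketched as follows. This is an illustration rather than the library's on_failure_callback: the helper names, the payload fields beyond run_id, system_id, and exception, and the injected publish callable are assumptions. Injecting the publisher keeps the sketch free of GCP dependencies; in a DAG it would wrap a Pub/Sub client.

```python
import json
from datetime import datetime, timezone


def build_failure_payload(context, system_id):
    """Turn an Airflow callback context into a JSON-serializable dict."""
    task_instance = context.get("task_instance")
    return {
        "system_id": system_id,
        "run_id": context.get("run_id"),
        "dag_id": getattr(task_instance, "dag_id", None),
        "task_id": getattr(task_instance, "task_id", None),
        "exception": repr(context.get("exception")),
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


def make_failure_callback(system_id, publish):
    """Build a callback; `publish` is any callable that accepts bytes."""

    def _callback(context):
        payload = build_failure_payload(context, system_id)
        publish(json.dumps(payload).encode("utf-8"))

    return _callback


# Usage with an in-memory "queue" standing in for the Pub/Sub DLQ.
sent = []
callback = make_failure_callback("Application1", sent.append)
callback({"run_id": "manual__2026-03-01", "exception": ValueError("bad row")})
print(json.loads(sent[0])["exception"])
```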

2. Periodic Recovery (Error Handling DAG)

A dedicated Error Handling DAG (e.g., application1_error_handling_dag.py) runs every 30 minutes to manage the lifecycle of failed records.

Automated Reprocessing Flow

  BigQuery Error Log          Error Handling DAG              Target Pipeline
  ──────────────────          ──────────────────              ───────────────

  [Error Record] ───►  1. Scan for unresolved  ───►  3. Transient? ───► Trigger Rerun
                          errors (<30m)                (Backoff applied)

                       2. Classify (via core)  ───►  4. Permanent? ───► Alert Team
                          (Validation vs Integration)  (Manual Review)

Classification Logic

The Error Handling DAG uses the ErrorClassifier from gcp-pipeline-core to determine the next step:

| Category | Strategy | Example |
|----------|----------|---------|
| INTEGRATION | Automated Retry | Temporary connection timeout to GCS/BQ |
| RESOURCE | Exponential Backoff | Quota exceeded or rate limiting |
| VALIDATION | Manual Review | Schema mismatch, invalid data types |
| CONFIGURATION | Manual Review | Missing Airflow variables or IAM permissions |
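
One way to turn the table above into a routing decision is sketched here. The category names and strategies come from the table; the function names, the action labels "RETRY" and "ALERT", and the backoff parameters (60-second base, one-hour cap) are assumed values, and the ErrorClassifier from gcp-pipeline-core is not called.

```python
# Category -> strategy, copied from the classification table.
STRATEGIES = {
    "INTEGRATION": "AUTOMATED_RETRY",
    "RESOURCE": "EXPONENTIAL_BACKOFF",
    "VALIDATION": "MANUAL_REVIEW",
    "CONFIGURATION": "MANUAL_REVIEW",
}


def backoff_seconds(attempt, base=60, cap=3600):
    """Delay before retry number `attempt` (1-based): base * 2**(attempt - 1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def next_action(category, attempt):
    """Return (action, delay_seconds) for a classified error."""
    # Unknown categories fall back to manual review, the conservative choice.
    strategy = STRATEGIES.get(category, "MANUAL_REVIEW")
    if strategy == "AUTOMATED_RETRY":
        return ("RETRY", 0)
    if strategy == "EXPONENTIAL_BACKOFF":
        return ("RETRY", backoff_seconds(attempt))
    return ("ALERT", None)


for category in ("INTEGRATION", "RESOURCE", "VALIDATION"):
    print(category, next_action(category, attempt=3))
```

Capping the delay keeps a long-running quota problem from pushing retries out indefinitely.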

Manual Intervention

For non-retryable errors (e.g., VALIDATION), the Error Handling DAG:

Quarantines the failed records/files.
Alerts the data engineering team via Email/Slack.
Resumes processing after a fix: once a developer corrects the data and marks the record as RETRY_READY in the error_log table, the DAG picks it up automatically on its next run.
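
The selection step of each 30-minute run might be sketched as below. Only the RETRY_READY marker and the 30-minute cadence come from this README; the field names status and logged_at, the "UNRESOLVED" value, and the reading of "<30m" as "logged within the last 30 minutes" are assumptions.

```python
from datetime import datetime, timedelta, timezone


def select_for_reprocessing(records, now, window=timedelta(minutes=30)):
    """Pick the error_log records that the next DAG run should act on."""
    selected = []
    for record in records:
        if record["status"] == "RETRY_READY":
            selected.append(record)  # fixed by a developer, eligible at any age
        elif record["status"] == "UNRESOLVED" and now - record["logged_at"] <= window:
            selected.append(record)  # recent failure not yet handled
    return selected


now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": 1, "status": "UNRESOLVED", "logged_at": now - timedelta(minutes=15)},
    {"id": 2, "status": "UNRESOLVED", "logged_at": now - timedelta(hours=2)},
    {"id": 3, "status": "RETRY_READY", "logged_at": now - timedelta(days=1)},
    {"id": 4, "status": "QUARANTINED", "logged_at": now - timedelta(minutes=5)},
]
print([record["id"] for record in select_for_reprocessing(records, now)])
```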

Governance & Compliance

Domain Isolation: Depends on gcp-pipeline-core and Apache Airflow; MUST NOT import Apache Beam.
Testing: All custom operators and sensors must be tested using the tester mocks.
Safety: Operators must support idempotency by passing run_id to underlying Dataflow jobs.
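
One possible way to carry run_id into a Dataflow job is to derive the job name from it, so that a retried task asks for the same job name instead of creating a differently named duplicate. The sketch below is an assumption about how that could be done, not the library's mechanism; the function name, the 63-character limit, and the 12-character digest length are illustrative and should be checked against the Dataflow documentation.

```python
import hashlib
import re

MAX_JOB_NAME_LENGTH = 63  # assumed limit; verify against the Dataflow documentation


def dataflow_job_name(system_id, entity, run_id, max_length=MAX_JOB_NAME_LENGTH):
    """Derive a deterministic job name so a retried task reuses the same name."""
    # Hash the run_id: Airflow run ids contain characters not valid in job names.
    digest = hashlib.sha256(run_id.encode("utf-8")).hexdigest()[:12]
    # Keep only lowercase letters, digits, and hyphens in the readable prefix.
    prefix = re.sub(r"[^a-z0-9]+", "-", f"{system_id}-{entity}".lower()).strip("-")
    if not prefix or not prefix[0].isalpha():
        prefix = f"job-{prefix}".rstrip("-")  # names should start with a letter
    prefix = prefix[: max_length - len(digest) - 1].rstrip("-")
    return f"{prefix}-{digest}"


first = dataflow_job_name("Application1", "customers", "scheduled__2026-03-01T00:00:00+00:00")
retry = dataflow_job_name("Application1", "customers", "scheduled__2026-03-01T00:00:00+00:00")
print(first == retry)
```

Because the name depends only on the system, entity, and run_id, a retry of the same DAG run requests the same name, while a new DAG run gets a new one.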

Usage

from gcp_pipeline_orchestration.sensors import BasePubSubPullSensor
from gcp_pipeline_orchestration.factories import DAGFactory
from gcp_pipeline_orchestration.dependency import EntityDependencyChecker
from gcp_pipeline_orchestration.callbacks import on_failure_callback

Tests

PYTHONPATH=src:../gcp-pipeline-core/src python -m pytest tests/unit/ -v
# 52 passed

Project details

Version 1.0.3, uploaded Mar 1, 2026 (source distribution and py3-none-any wheel). Latest release listed: 1.0.29 (Mar 24, 2026).