Beam ingestion library for GCP data pipelines

gcp-pipeline-beam

Ingestion library - Beam pipelines, transforms, file management.

Depends on: gcp-pipeline-core
NO Apache Airflow dependency.


Architecture

                         GCP-PIPELINE-BEAM
                         ─────────────────

  ┌─────────────────────────────────────────────────────────────────┐
  │                     INGESTION LAYER                              │
  │                                                                  │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                    File Management                       │    │
  │  │  • HDR/TRL Parser (header/trailer validation)           │    │
  │  │  • Split File Handler (reassemble split files)           │    │
  │  │  • File Archiver (move to archive bucket)               │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                     Validators                           │    │
  │  │  • SchemaValidator (validate against EntitySchema)      │    │
  │  │  • SSN, Date, Numeric validators                        │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                   Beam Transforms                        │    │
  │  │  • ParseCsvLine (parse CSV to dict)                     │    │
  │  │  • ValidateRecordDoFn (schema validation)               │    │
  │  │  • AddAuditColumnsDoFn (add _run_id, etc.)              │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                   Base Pipeline                          │    │
  │  │  • BasePipeline (abstract class)                        │    │
  │  │  • PipelineConfig, PipelineOptions                      │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                                                                  │
  └─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                       Uses: gcp-pipeline-core

Ingestion Flow

  GCS Landing              Beam Pipeline                    BigQuery
  ───────────              ─────────────                    ────────

  file.csv  ──────►  ┌─────────────────────┐
  file.csv.ok        │                     │
                     │  1. HDRTRLParser    │
                     │     • Validate HDR  │
                     │     • Validate TRL  │
                     │     • Check count   │
                     │                     │
                     │  2. ParseCsvLine    │
                     │     • CSV to dict   │
                     │                     │
                     │  3. SchemaValidator │
                     │     • Required      │────► Valid records ──► BigQuery
                     │     • Types         │
                     │     • Allowed vals  │────► Invalid ──► Error bucket
                     │                     │
                     │  4. AddAuditColumns │
                     │     • _run_id       │
                     │     • _source_file  │
                     │     • _processed_at │
                     │                     │
                     └─────────────────────┘
                              │
                              ▼
                     ┌─────────────────────┐
                     │  Archive to GCS     │
                     └─────────────────────┘
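
Steps 2 to 4 of the flow above can be sketched in plain Python without Beam. This is a minimal illustration, not the library's implementation: the `SCHEMA` dictionary format and the function names `parse_csv_line`, `validate_record`, `add_audit_columns`, and `ingest` are assumptions made for this example, and the real `EntitySchema` in gcp-pipeline-core may look different. The sketch also assumes audit columns are added only to valid records.

```python
import csv
from datetime import datetime, timezone

# Hypothetical schema format for illustration only; the real EntitySchema
# lives in gcp-pipeline-core and may be structured differently.
SCHEMA = {
    "customer_id": {"required": True, "type": int},
    "status": {"required": True, "type": str, "allowed": {"ACTIVE", "CLOSED"}},
    "balance": {"required": False, "type": float},
}


def parse_csv_line(line, fieldnames):
    """Step 2: turn one CSV line into a dict keyed by column name."""
    values = next(csv.reader([line]))
    return dict(zip(fieldnames, values))


def validate_record(record, schema):
    """Step 3: return a list of error strings (an empty list means valid)."""
    errors = []
    for field, rules in schema.items():
        value = record.get(field, "")
        if value == "":
            if rules["required"]:
                errors.append(f"{field}: required value is missing")
            continue
        try:
            rules["type"](value)  # type check by attempting the conversion
        except ValueError:
            errors.append(f"{field}: expected {rules['type'].__name__}")
            continue
        allowed = rules.get("allowed")
        if allowed is not None and value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


def add_audit_columns(record, run_id, source_file):
    """Step 4: attach the audit columns named in the diagram."""
    return {
        **record,
        "_run_id": run_id,
        "_source_file": source_file,
        "_processed_at": datetime.now(timezone.utc).isoformat(),
    }


def ingest(lines, fieldnames, run_id, source_file):
    """Route each line to the valid output or the invalid (error) output."""
    valid, invalid = [], []
    for line in lines:
        record = parse_csv_line(line, fieldnames)
        errors = validate_record(record, SCHEMA)
        if errors:
            invalid.append({"record": record, "errors": errors})
        else:
            valid.append(add_audit_columns(record, run_id, source_file))
    return valid, invalid
```

In a real pipeline each function would typically sit inside a Beam DoFn, with the invalid records sent to a separate tagged output rather than a second list.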

Split File Handling

The library can process a file that the source system has split into multiple parts. The .ok marker file signals that ALL splits have landed and are ready to process.

  GCS Landing Bucket                         Pub/Sub & Processing
  ──────────────────                         ────────────────────

  customers_1.csv  ──┐
  customers_2.csv  ──┼── (data files)
  customers_3.csv  ──┘
         │
         │
  customers.csv.ok ─────► Pub/Sub Notification
         │                      │
         │                      ▼
         │               ┌─────────────────┐
         │               │ Airflow Sensor  │
         │               │ (detects .ok)   │
         │               └────────┬────────┘
         │                        │
         │                        ▼
         │               ┌─────────────────┐
         │               │ File Discovery  │
         │               │ • List bucket   │
         │               │ • Find splits:  │
         │               │  customers_*.csv│
         │               └────────┬────────┘
         │                        │
         └────────────────────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ Process ALL     │
                         │ split files     │
                         │ in single job   │
                         └─────────────────┘

Split File Discovery Logic

# 1. Pub/Sub receives notification for .ok file
#    Message: {"name": "application1/customers/customers.csv.ok", "bucket": "landing"}

# 2. Sensor extracts entity name from .ok file
#    entity = "customers"  (from customers.csv.ok)

# 3. File discovery finds all matching splits
#    pattern = "gs://landing/application1/customers/customers*.csv"
#    files = [
#        "gs://landing/application1/customers/customers_1.csv",
#        "gs://landing/application1/customers/customers_2.csv",
#        "gs://landing/application1/customers/customers_3.csv",
#    ]

# 4. All files processed in single Dataflow job
#    pipeline.read_from_gcs(files)  # Reads all splits
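
Steps 2 and 3 above can be sketched with the standard library alone. The function names `entity_from_ok` and `discover_splits` are hypothetical, and the bucket listing is passed in as a plain list of object names so the example runs without GCS access. The sketch uses the stricter `{entity}_*.csv` pattern.

```python
import fnmatch
import posixpath


def entity_from_ok(object_name):
    """Derive the entity name from an .ok marker such as customers.csv.ok."""
    base = posixpath.basename(object_name)
    if not base.endswith(".csv.ok"):
        raise ValueError(f"not an .ok marker: {object_name}")
    return base[: -len(".csv.ok")]


def discover_splits(ok_object_name, bucket, object_names):
    """Return gs:// URIs of the split files that belong to the marker."""
    entity = entity_from_ok(ok_object_name)
    prefix = posixpath.dirname(ok_object_name)
    # Splits are assumed to sit next to the marker in the same folder.
    pattern = posixpath.join(prefix, f"{entity}_*.csv")
    matches = sorted(
        name for name in object_names if fnmatch.fnmatchcase(name, pattern)
    )
    return [f"gs://{bucket}/{name}" for name in matches]
```

Note that `fnmatch` lets `*` match `/` as well, so a production version would probably restrict the listing to the marker's own folder first.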

Key Points

  Aspect       Behavior
  ──────       ────────
  Trigger      Only .ok file triggers processing
  Discovery    Pattern match: {entity}*.csv or {entity}_*.csv
  Processing   All splits processed in single Dataflow job
  Validation   Each split has own HDR/TRL - all validated
  Audit        All records get same _run_id

Modules

  Module                       Purpose                     Key Classes
  ──────                       ───────                     ───────────
  file_management/             HDR/TRL parsing, archival   HDRTRLParser, FileArchiver
  validators/                  Schema-driven validation    SchemaValidator, ValidationError
  pipelines/base/              Base classes                BasePipeline, PipelineConfig
  pipelines/beam/transforms/   Beam DoFns                  ParseCsvLine, ValidateRecordDoFn

Key Findings

1. Advanced HDR/TRL Parsing

  • Configurable Parser: Regex-based parsing for header and trailer validation.
  • Support: Handles custom patterns, prefixes, and multi-field extraction for diverse source systems.
  • Validation: Automated record count and checksum verification against trailer values.
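
As a rough illustration of regex-based header and trailer checking, the sketch below validates a file held as a list of lines. The `HDR|system|entity|YYYYMMDD` and `TRL|count` layouts, the two regex constants, and the function name `check_header_trailer` are assumptions made for this example; they are not the formats or API of `HDRTRLParser`. Checksum verification is left out.

```python
import re

# Hypothetical record layouts; real source systems define their own.
HEADER_RE = re.compile(r"^HDR\|(?P<system>\w+)\|(?P<entity>\w+)\|(?P<date>\d{8})$")
TRAILER_RE = re.compile(r"^TRL\|(?P<count>\d+)$")


def check_header_trailer(lines):
    """Validate HDR/TRL and return (header_fields, data_lines).

    Raises ValueError if the header or trailer is malformed, or if the
    trailer count does not match the number of data lines.
    """
    if len(lines) < 2:
        raise ValueError("file needs at least a header and a trailer")
    header = HEADER_RE.match(lines[0])
    trailer = TRAILER_RE.match(lines[-1])
    if header is None:
        raise ValueError(f"bad header: {lines[0]!r}")
    if trailer is None:
        raise ValueError(f"bad trailer: {lines[-1]!r}")
    data_lines = lines[1:-1]
    expected = int(trailer["count"])
    if expected != len(data_lines):
        raise ValueError(
            f"trailer says {expected} records, found {len(data_lines)}"
        )
    return header.groupdict(), data_lines
```

Swapping in different compiled patterns is enough to support another source system's layout, which is the main appeal of keeping the patterns configurable.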

2. Fluent Pipeline API

  • BeamPipelineBuilder: Provides a clean, chainable interface for building pipelines:
    • read_csv() / read_avro()
    • validate() (Schema-driven)
    • transform() (Custom business logic)
    • write_to_bigquery() / write_to_gcs()
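
The chaining style can be shown with a toy builder over in-memory rows. `SketchPipelineBuilder` and its method signatures are hypothetical stand-ins written for this example; they do not reproduce `BeamPipelineBuilder`, whose actual parameters are not documented here.

```python
class SketchPipelineBuilder:
    """Toy stand-in for a chainable builder; not the library's class."""

    def __init__(self):
        self._source = None
        self._steps = []

    def read_rows(self, rows):
        self._source = list(rows)
        return self  # returning self is what makes the calls chainable

    def validate(self, is_valid):
        # Keep only the rows that pass the predicate.
        self._steps.append(lambda rows: [r for r in rows if is_valid(r)])
        return self

    def transform(self, fn):
        # Apply custom business logic to every row.
        self._steps.append(lambda rows: [fn(r) for r in rows])
        return self

    def run(self):
        if self._source is None:
            raise ValueError("call read_rows() before run()")
        rows = self._source
        for step in self._steps:
            rows = step(rows)
        return rows


result = (
    SketchPipelineBuilder()
    .read_rows([{"id": "1"}, {"id": ""}, {"id": "3"}])
    .validate(lambda r: r["id"] != "")
    .transform(lambda r: {**r, "id": int(r["id"])})
    .run()
)
```

Each method records a step and returns the builder, so nothing executes until `run()` is called. A Beam-backed builder could defer work in a similar way until the pipeline is run.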

3. Schema Validation & PII Masking

  • SchemaValidator: Validates records against EntitySchema definitions from core.
  • In-flight Masking: Supports PII masking during the ingestion process, ensuring sensitive data is protected before landing in BigQuery.
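
One simple form of in-flight masking keeps the last few characters of a value and hides the rest. The helpers `mask_value` and `mask_record` below are hypothetical, and the strategy itself is an assumption: the library may hash, tokenise, or mask differently.

```python
def mask_value(value, visible=4, mask_char="*"):
    """Replace all but the last `visible` characters with mask_char."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def mask_record(record, pii_fields):
    """Return a copy of the record with the named fields masked.

    Empty values are left as they are.
    """
    return {
        key: mask_value(value) if key in pii_fields and value else value
        for key, value in record.items()
    }
```

Applied before the write step, this keeps unmasked values out of the destination table while leaving non-PII columns untouched.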

4. Split File Handling

  • Specialized logic for reassembling and processing split files from source systems.

Governance & Compliance

  • Domain Isolation: Depends only on gcp-pipeline-core and Apache Beam; MUST NOT import airflow.
  • Testing: Every transform and pipeline component requires unit tests using gcp-pipeline-tester.
  • Reuse: Prefer using BeamPipelineBuilder for consistent pipeline construction.

Usage

from gcp_pipeline_beam.file_management import HDRTRLParser, FileArchiver
from gcp_pipeline_beam.validators import SchemaValidator
from gcp_pipeline_beam.pipelines.base import BasePipeline, PipelineConfig
from gcp_pipeline_beam.pipelines.beam.transforms import ParseCsvLine, ValidateRecordDoFn

Tests

PYTHONPATH=src:../gcp-pipeline-core/src python -m pytest tests/unit/ -v
# 358 passed
