Beam ingestion library for GCP data pipelines

gcp-pipeline-beam

Ingestion library - Beam pipelines, transforms, file management.

Depends on: gcp-pipeline-core
NO Apache Airflow dependency.

Architecture

                         GCP-PIPELINE-BEAM
                         ─────────────────

  ┌─────────────────────────────────────────────────────────────────┐
  │                     INGESTION LAYER                              │
  │                                                                  │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                    File Management                       │    │
  │  │  • HDR/TRL Parser (header/trailer validation)           │    │
  │  │  • Split File Handler (reassemble split files)           │    │
  │  │  • File Archiver (move to archive bucket)               │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                     Validators                           │    │
  │  │  • SchemaValidator (validate against EntitySchema)      │    │
  │  │  • SSN, Date, Numeric validators                        │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                   Beam Transforms                        │    │
  │  │  • ParseCsvLine (parse CSV to dict)                     │    │
  │  │  • ValidateRecordDoFn (schema validation)               │    │
  │  │  • AddAuditColumnsDoFn (add _run_id, etc.)              │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                              │                                   │
  │                              ▼                                   │
  │  ┌─────────────────────────────────────────────────────────┐    │
  │  │                   Base Pipeline                          │    │
  │  │  • BasePipeline (abstract class)                        │    │
  │  │  • PipelineConfig, PipelineOptions                      │    │
  │  └─────────────────────────────────────────────────────────┘    │
  │                                                                  │
  └─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                       Uses: gcp-pipeline-core

Ingestion Flow

  GCS Landing              Beam Pipeline                    BigQuery
  ───────────              ─────────────                    ────────

  file.csv  ──────►  ┌─────────────────────┐
  file.csv.ok        │                     │
                     │  1. HDRTRLParser    │
                     │     • Validate HDR  │
                     │     • Validate TRL  │
                     │     • Check count   │
                     │                     │
                     │  2. ParseCsvLine    │
                     │     • CSV to dict   │
                     │                     │
                     │  3. SchemaValidator │
                     │     • Required      │────► Valid records ──► BigQuery
                     │     • Types         │
                     │     • Allowed vals  │────► Invalid ──► Error bucket
                     │                     │
                     │  4. AddAuditColumns │
                     │     • _run_id       │
                     │     • _source_file  │
                     │     • _processed_at │
                     │                     │
                     └─────────────────────┘
                              │
                              ▼
                     ┌─────────────────────┐
                     │  Archive to GCS     │
                     └─────────────────────┘
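
The record-level steps of this flow (parse, check, add audit columns, route) can be sketched in plain Python without Beam. This is an illustration only, not the library's API: the function names, the simple required-field check, and the shape of the error entries are assumptions. Only the three audit column names come from the diagram above.

```python
import csv
import io
from datetime import datetime, timezone


def parse_csv_line(line, columns):
    """Parse one CSV line into a dict keyed by column name."""
    values = next(csv.reader(io.StringIO(line)), [])
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


def add_audit_columns(record, run_id, source_file):
    """Return a copy of the record with the audit columns from the diagram."""
    return {
        **record,
        "_run_id": run_id,
        "_source_file": source_file,
        "_processed_at": datetime.now(timezone.utc).isoformat(),
    }


def route_lines(lines, columns, required, run_id, source_file):
    """Split data lines into (valid, invalid) lists."""
    valid, invalid = [], []
    for line in lines:
        try:
            record = parse_csv_line(line, columns)
        except ValueError as exc:
            invalid.append({"raw": line, "error": str(exc)})
            continue
        # Hypothetical check: a required field must be non-blank.
        missing = [name for name in required if not record.get(name, "").strip()]
        if missing:
            invalid.append({"raw": line, "error": f"missing required: {missing}"})
        else:
            valid.append(add_audit_columns(record, run_id, source_file))
    return valid, invalid
```

In a real pipeline the valid list would correspond to the BigQuery output and the invalid list to the error bucket.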

Split File Handling

The library can process a file that the source system delivered as multiple split parts. The `.ok` marker file signals that ALL splits have landed and are ready to process.

  GCS Landing Bucket                         Pub/Sub & Processing
  ──────────────────                         ────────────────────

  customers_1.csv  ──┐
  customers_2.csv  ──┼── (data files)
  customers_3.csv  ──┘
         │
         │
  customers.csv.ok ─────► Pub/Sub Notification
         │                      │
         │                      ▼
         │               ┌─────────────────┐
         │               │ Airflow Sensor  │
         │               │ (detects .ok)   │
         │               └────────┬────────┘
         │                        │
         │                        ▼
         │               ┌─────────────────┐
         │               │ File Discovery  │
         │               │ • List bucket   │
         │               │ • Find splits:  │
         │               │  customers_*.csv│
         │               └────────┬────────┘
         │                        │
         └────────────────────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ Process ALL     │
                         │ split files     │
                         │ in single job   │
                         └─────────────────┘

Split File Discovery Logic

# 1. Pub/Sub receives notification for .ok file
#    Message: {"name": "application1/customers/customers.csv.ok", "bucket": "landing"}

# 2. Sensor extracts entity name from .ok file
#    entity = "customers"  (from customers.csv.ok)

# 3. File discovery finds all matching splits
#    pattern = f"gs://landing/application1/customers/{entity}*.csv"
#    files = [
#        "gs://landing/application1/customers/customers_1.csv",
#        "gs://landing/application1/customers/customers_2.csv",
#        "gs://landing/application1/customers/customers_3.csv",
#    ]

# 4. All files processed in single Dataflow job
#    pipeline.read_from_gcs(files)  # Reads all splits
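
Steps 2 and 3 above can be sketched as a small, testable helper. The function names and the idea of passing in a pre-listed set of object names are assumptions for illustration; the library's own discovery code is not shown in this README. The sketch accepts an unsplit `{entity}.csv` as well as splits named `{entity}_*.csv`, following the file names in the examples above.

```python
import fnmatch
import posixpath


def entity_from_ok(object_name):
    """Derive the entity name from an .ok marker object name."""
    base = posixpath.basename(object_name)
    suffix = ".csv.ok"
    if not base.endswith(suffix):
        raise ValueError(f"not an .ok marker: {object_name}")
    return base[: -len(suffix)]


def discover_splits(ok_object_name, bucket, object_names):
    """Return gs:// URIs of the data files that belong to the .ok marker."""
    entity = entity_from_ok(ok_object_name)
    prefix = posixpath.dirname(ok_object_name)
    patterns = (f"{entity}.csv", f"{entity}_*.csv")
    matches = []
    for name in sorted(object_names):
        # Only consider objects in the same "folder" as the marker.
        if posixpath.dirname(name) != prefix:
            continue
        base = posixpath.basename(name)
        if any(fnmatch.fnmatchcase(base, pattern) for pattern in patterns):
            matches.append(f"gs://{bucket}/{name}")
    return matches
```

Matching on the base name with two explicit patterns avoids picking up unrelated objects such as `customers.csv.ok` itself or files for an entity whose name merely starts with `customers`.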

Key Points

Aspect	Behavior
Trigger	Only `.ok` file triggers processing
Discovery	Pattern match: `{entity}.csv` or `{entity}_*.csv`
Processing	All splits processed in single Dataflow job
Validation	Each split has its own HDR/TRL; all are validated
Audit	All records get same `_run_id`

Modules

Module	Purpose	Key Classes
`file_management/`	HDR/TRL parsing, archival	`HDRTRLParser`, `FileArchiver`
`validators/`	Schema-driven validation	`SchemaValidator`, `ValidationError`
`pipelines/base/`	Base classes	`BasePipeline`, `PipelineConfig`
`pipelines/beam/transforms/`	Beam DoFns	`ParseCsvLine`, `ValidateRecordDoFn`

Key Findings

1. Advanced HDR/TRL Parsing

Configurable Parser: Regex-based parsing of header and trailer lines, driven by configurable patterns.
Support: Handles custom patterns, prefixes, and multi-field extraction for diverse source systems.
Validation: Automated record count and checksum verification against trailer values.
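
A minimal sketch of regex-based header/trailer checking with a record-count comparison is shown below. The line formats (`HDR|system|entity|yyyymmdd` and `TRL|count`), the pattern names, and the function name are hypothetical and chosen only for illustration; they are not the formats or API of `HDRTRLParser`. Checksum verification is left out.

```python
import re

# Hypothetical line formats, for illustration only:
#   HDR|<system>|<entity>|<yyyymmdd>
#   TRL|<record_count>
HDR_PATTERN = re.compile(
    r"^HDR\|(?P<system>[^|]+)\|(?P<entity>[^|]+)\|(?P<date>\d{8})$"
)
TRL_PATTERN = re.compile(r"^TRL\|(?P<count>\d+)$")


def check_hdr_trl(lines, hdr_pattern=HDR_PATTERN, trl_pattern=TRL_PATTERN):
    """Validate header, trailer, and record count.

    Returns (header_fields, data_lines) or raises ValueError.
    """
    if len(lines) < 2:
        raise ValueError("file needs at least a header and a trailer line")
    header = hdr_pattern.match(lines[0])
    if header is None:
        raise ValueError(f"bad header: {lines[0]!r}")
    trailer = trl_pattern.match(lines[-1])
    if trailer is None:
        raise ValueError(f"bad trailer: {lines[-1]!r}")
    data_lines = lines[1:-1]
    expected = int(trailer.group("count"))
    if expected != len(data_lines):
        raise ValueError(f"trailer count {expected} != actual {len(data_lines)}")
    return header.groupdict(), data_lines
```

Passing the compiled patterns as parameters is one way to support different source systems without changing the checking logic.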

2. Fluent Pipeline API

BeamPipelineBuilder: Provides a clean, chainable interface for building pipelines:
- read_csv() / read_avro()
- validate() (Schema-driven)
- transform() (Custom business logic)
- write_to_bigquery() / write_to_gcs()
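
The chaining pattern behind such a builder can be illustrated with a toy, in-memory stand-in. `SketchPipelineBuilder` is hypothetical and is not `BeamPipelineBuilder`: the real method signatures are not documented here, so this sketch takes in-memory rows and plain callables, and writes to a Python list instead of BigQuery or GCS.

```python
class SketchPipelineBuilder:
    """Toy, in-memory stand-in that shows the chaining pattern only."""

    def __init__(self):
        self._steps = []

    def read_csv(self, rows):
        # Assumption: takes already-parsed rows instead of a GCS path.
        self._steps.append(lambda _records: [dict(row) for row in rows])
        return self

    def validate(self, is_valid):
        # Keep only records for which the predicate returns True.
        self._steps.append(lambda records: [r for r in records if is_valid(r)])
        return self

    def transform(self, fn):
        # Apply custom business logic to every record.
        self._steps.append(lambda records: [fn(r) for r in records])
        return self

    def write_to_list(self, sink):
        # Stand-in for write_to_bigquery() / write_to_gcs(): appends to a list.
        def step(records):
            sink.extend(records)
            return records

        self._steps.append(step)
        return self

    def run(self):
        records = []
        for step in self._steps:
            records = step(records)
        return records


sink = []
result = (
    SketchPipelineBuilder()
    .read_csv([{"id": "1", "name": "ada"}, {"id": "", "name": "bob"}])
    .validate(lambda r: bool(r["id"]))
    .transform(lambda r: {**r, "name": r["name"].title()})
    .write_to_list(sink)
    .run()
)
```

Each method records a step and returns `self`, which is what makes the calls chainable; nothing executes until `run()`.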

3. Schema Validation & PII Masking

SchemaValidator: Validates records against EntitySchema definitions from core.
In-flight Masking: Supports PII masking during the ingestion process, ensuring sensitive data is protected before landing in BigQuery.
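
The following sketch shows one way schema-driven checks (required, type, allowed values) and in-flight masking can fit together. The schema shape, the function names, and the keep-last-four masking rule are all assumptions for illustration; the real `EntitySchema` lives in gcp-pipeline-core and its structure is not shown in this README.

```python
# Hypothetical schema shape, for illustration only.
SCHEMA = {
    "id": {"required": True, "type": int},
    "status": {"required": True, "allowed": {"active", "closed"}},
    "ssn": {"required": False, "pii": True},
}


def validate_record(record, schema):
    """Return a list of error strings; an empty list means the record is valid."""
    errors = []
    for field, rules in schema.items():
        value = record.get(field)
        if value in ("", None):
            if rules.get("required"):
                errors.append(f"{field}: required")
            continue
        if "type" in rules:
            try:
                rules["type"](value)
            except (TypeError, ValueError):
                errors.append(f"{field}: not {rules['type'].__name__}")
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{field}: {value!r} not allowed")
    return errors


def mask_pii(record, schema, keep_last=4):
    """Return a copy with PII fields masked except the last `keep_last` characters."""
    masked = dict(record)
    for field, rules in schema.items():
        value = masked.get(field)
        if rules.get("pii") and value:
            visible = value[-keep_last:]
            masked[field] = "*" * (len(value) - len(visible)) + visible
    return masked
```

Masking a copy before the write step means the unmasked value never reaches the sink. Note that in this sketch values no longer than `keep_last` are left unmasked, which a production rule would need to handle.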

4. Split File Handling

Discovers every split part of a source file once its `.ok` marker arrives, validates each part's HDR/TRL, and processes all parts together in a single Dataflow job.

Governance & Compliance

Domain Isolation: Depends on gcp-pipeline-core and Apache Beam; MUST NOT import airflow.
Testing: Every transform and pipeline component requires unit tests using gcp-pipeline-tester.
Reuse: Prefer using BeamPipelineBuilder for consistent pipeline construction.

Usage

from gcp_pipeline_beam.file_management import HDRTRLParser, FileArchiver
from gcp_pipeline_beam.validators import SchemaValidator
from gcp_pipeline_beam.pipelines.base import BasePipeline, PipelineConfig
from gcp_pipeline_beam.pipelines.beam.transforms import ParseCsvLine, ValidateRecordDoFn

Tests

PYTHONPATH=src:../gcp-pipeline-core/src python -m pytest tests/unit/ -v
# 358 passed
