Spark Ingestion Framework for Tables (SIFFT) - a Python library providing a consistent interface for ingesting files into Apache Spark environments.

SIFFT

File ingestion pipelines are repetitive and fragile. SIFFT (Spark Ingestion Framework For Tables) is a Python library that aims to:

  • Make file reading format-agnostic with automatic delimiter and header detection
  • Validate data quality with CSVW metadata and constraint checking before it enters your tables
  • Provide consistent error handling across CSV, Excel, TSV, and pipe-delimited files
  • Support schema evolution and merge/upsert operations for Delta tables
  • Reduce boilerplate code for common Spark ingestion patterns
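To make the first aim concrete, the following is a minimal sketch of delimiter and header detection using only the standard library's `csv.Sniffer`. It is not SIFFT's implementation; the function name `sniff_layout`, the candidate delimiter set, and the fallback values are assumptions chosen for illustration.

```python
import csv

# Assumption: these are the delimiters worth trying (comma, tab, pipe, semicolon).
CANDIDATE_DELIMITERS = ",\t|;"


def sniff_layout(sample: str) -> tuple[str, bool]:
    """Guess (delimiter, has_header) from the first few lines of a text file."""
    sniffer = csv.Sniffer()
    try:
        delimiter = sniffer.sniff(sample, delimiters=CANDIDATE_DELIMITERS).delimiter
    except csv.Error:
        # The sniffer could not decide; fall back to a comma.
        delimiter = ","
    try:
        has_header = sniffer.has_header(sample)
    except csv.Error:
        # Assumption: treat the first row as a header when detection fails.
        has_header = True
    return delimiter, has_header


sample = "name|age|city\nalice|30|leeds\nbob|41|york\n"
delimiter, has_header = sniff_layout(sample)
print(repr(delimiter), has_header)
```

`csv.Sniffer` is heuristic, so a production reader would likely combine it with file-extension hints and an explicit override.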

Install

pip install sifft                # Without PySpark (use your Databricks runtime)
pip install "sifft[pyspark3]"    # With PySpark 3.5 + Delta Lake 3.x (quotes stop zsh from globbing the brackets)
pip install "sifft[pyspark4]"    # With PySpark 4.x + Delta Lake 4.x

Quick Start

from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)

# Validate (if CSVW metadata exists alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    assert report.valid, f"{len(report.violations)} violations"

# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
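The Quick Start stops the pipeline with `assert`, which Python skips entirely when run with `-O`. One possible alternative is to raise an explicit exception. The sketch below assumes only what the Quick Start shows: a report with a boolean `valid` attribute and a sized `violations` collection. The exception class and `require_valid` helper are hypothetical and not part of SIFFT, and the `SimpleNamespace` objects stand in for real reports.

```python
from types import SimpleNamespace


class IngestionValidationError(Exception):
    """Hypothetical exception type for this sketch; not part of SIFFT."""


def require_valid(report, max_shown: int = 5) -> None:
    """Raise if the report is invalid; unlike assert, this survives `python -O`."""
    if report.valid:
        return
    # Assumption: violations can be iterated and converted to strings.
    shown = ", ".join(str(v) for v in list(report.violations)[:max_shown])
    raise IngestionValidationError(
        f"{len(report.violations)} violations (showing up to {max_shown}: {shown})"
    )


# Stand-in reports for illustration only; a real one would come from
# validate_csvw_constraints.
ok_report = SimpleNamespace(valid=True, violations=[])
bad_report = SimpleNamespace(valid=False, violations=["age above maximum in row 7"])

require_valid(ok_report)  # passes silently
try:
    require_valid(bad_report)
except IngestionValidationError as exc:
    message = str(exc)
    print(message)
```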

Features

  • File Processing — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
  • Data Validation — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys.
  • Table Writing — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution.
  • File Management — Safe move/list across local, S3, Azure, GCS.
  • Extensibility — Custom format handlers, write modes, and constraint validators.
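To illustrate the kind of constraints listed under Data Validation, here is a small pure-Python sketch that checks rows against a trimmed CSVW metadata document. The metadata keys (`tableSchema`, `columns`, `required`, `datatype`, `minimum`, `maximum`, `format`, `primaryKey`) follow the W3C CSVW vocabulary; the sample values, the `check_rows` function, and the problem messages are invented for illustration and do not reflect how SIFFT implements validation on Spark DataFrames.

```python
import re

# Trimmed CSVW metadata; vocabulary from the W3C spec, values invented.
metadata = {
    "tableSchema": {
        "columns": [
            {"name": "id", "required": True, "datatype": "string"},
            {"name": "age", "datatype": {"base": "integer", "minimum": 0, "maximum": 120}},
            {"name": "postcode", "datatype": {"base": "string", "format": "^[A-Z0-9 ]+$"}},
        ],
        "primaryKey": "id",
    }
}


def check_rows(rows, metadata):
    """Return a list of (row_index, column, problem) tuples."""
    schema = metadata["tableSchema"]
    key = schema.get("primaryKey") or []
    key_columns = [key] if isinstance(key, str) else list(key)
    seen_keys = set()
    problems = []
    for index, row in enumerate(rows):
        for column in schema["columns"]:
            name = column["name"]
            value = row.get(name)
            if value is None or value == "":
                if column.get("required"):
                    problems.append((index, name, "required value missing"))
                continue
            datatype = column.get("datatype")
            if not isinstance(datatype, dict):
                continue  # plain type name, no facets to check in this sketch
            if datatype.get("base") == "integer":
                try:
                    number = int(value)
                except ValueError:
                    problems.append((index, name, "not an integer"))
                    continue
                if "minimum" in datatype and number < datatype["minimum"]:
                    problems.append((index, name, "below minimum"))
                if "maximum" in datatype and number > datatype["maximum"]:
                    problems.append((index, name, "above maximum"))
            elif "format" in datatype and not re.fullmatch(datatype["format"], str(value)):
                problems.append((index, name, "pattern mismatch"))
        if key_columns:
            key_value = tuple(row.get(k) for k in key_columns)
            if key_value in seen_keys:
                problems.append((index, ",".join(key_columns), "duplicate primary key"))
            seen_keys.add(key_value)
    return problems


rows = [
    {"id": "a1", "age": "34", "postcode": "CF10 1AA"},
    {"id": "a1", "age": "140", "postcode": "cf10"},
    {"id": "", "age": "20", "postcode": "SA1 1AA"},
]
problems = check_rows(rows, metadata)
for problem in problems:
    print(problem)
```

Running the same checks row by row would not scale on Spark; the sketch is only meant to show what each constraint type means.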

Compatibility

  • Python 3.10+
  • PySpark 3.5.x, 4.0.x, or 4.1.x
  • Databricks / Unity Catalog compatible
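A quick way to check an environment against this list is sketched below using only the standard library. The `check_environment` helper is hypothetical and not part of SIFFT. It also assumes PySpark is discoverable through package metadata, which may not hold on every managed runtime.

```python
import sys
from importlib import metadata

# Version prefixes taken from the compatibility list above.
SUPPORTED_PYSPARK_PREFIXES = ("3.5.", "4.0.", "4.1.")


def check_environment() -> list[str]:
    """Return human-readable compatibility problems (empty list if none found)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ required, found {sys.version.split()[0]}")
    try:
        pyspark_version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        problems.append("pyspark is not installed (or not visible via package metadata)")
    else:
        if not pyspark_version.startswith(SUPPORTED_PYSPARK_PREFIXES):
            problems.append(
                f"pyspark {pyspark_version} is outside 3.5.x / 4.0.x / 4.1.x"
            )
    return problems


env_problems = check_environment()
print(env_problems or "environment looks compatible")
```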

Development

just test        # Run tests in Docker
just test-local  # Run tests locally (requires Java)
just --list      # All available commands

Design Decisions

Architecture Decision Records are in design_decisions/.

License

MIT — see LICENSE.

Contributors

  • Iwan Dyke
  • Fahad Khan
  • Michal Poreba
