Spark Ingestion Framework for Tables (SIFFT) - a Python library providing a consistent interface for ingesting files into Apache Spark environments.

SIFFT

Spark Ingestion Framework For Tables — format-agnostic file reading, data validation, and table writing for PySpark.

Install

pip install sifft                # Without PySpark (relies on the PySpark provided by your Databricks runtime)
pip install "sifft[pyspark3]"    # With PySpark 3.5 + Delta Lake 3.x
pip install "sifft[pyspark4]"    # With PySpark 4.x + Delta Lake 4.x

Quick Start

from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)

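# Validate (only if CSVW metadata was found alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    # Raise instead of assert: assert statements are skipped under `python -O`
    if not report.valid:
        raise ValueError(f"{len(report.violations)} CSVW constraint violations")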

# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
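
The validation step above only runs when CSVW metadata is found next to the data file. As a rough sketch of what such a metadata document can look like, the snippet below builds one following the W3C CSVW vocabulary. The helper names, the column names, and the `data.csv-metadata.json` file name are assumptions made for illustration (the file name follows the W3C convention of appending `-metadata.json`); how SIFFT actually locates metadata is not stated here.

```python
import json
from pathlib import Path


def metadata_path_for(csv_path: str) -> Path:
    """Return the conventional CSVW sidecar path, e.g. data.csv-metadata.json.

    Hypothetical helper: the naming follows the W3C CSVW convention, not a
    documented SIFFT rule.
    """
    csv_file = Path(csv_path)
    return csv_file.with_name(csv_file.name + "-metadata.json")


def build_csvw_metadata(csv_name: str) -> dict:
    """Build a minimal CSVW description with a few constraints.

    The columns (id, age, postcode) are made up for this example.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        "tableSchema": {
            "columns": [
                # required: the cell must not be empty
                {"name": "id", "datatype": "integer", "required": True},
                # min/max: numeric bounds on the value
                {
                    "name": "age",
                    "datatype": {"base": "integer", "minimum": 0, "maximum": 130},
                },
                # pattern: for string datatypes, "format" is a regular expression
                {
                    "name": "postcode",
                    "datatype": {"base": "string", "format": "^[A-Z0-9 ]+$"},
                },
            ],
            "primaryKey": "id",
        },
    }


if __name__ == "__main__":
    # Print the document; saving it to metadata_path_for("data.csv") is left to the caller.
    print(metadata_path_for("data.csv"))
    print(json.dumps(build_csvw_metadata("data.csv"), indent=2))
```

Keeping the builder free of file I/O makes it easy to inspect or test the metadata before placing it beside the data file.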

Features

  • File Processing — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
  • Data Validation — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys.
  • Table Writing — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution.
  • File Management — Safe move/list across local, S3, Azure, GCS.
  • Extensibility — Custom format handlers, write modes, and constraint validators.
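
To make the "checksum-based deduplication" idea in the list above concrete, here is a minimal standard-library sketch. It is not SIFFT's implementation: the function names, the choice of SHA-256, and the in-memory `seen` set are all assumptions for illustration. A real pipeline would more likely persist checksums in a table.

```python
import hashlib
import io
from typing import BinaryIO


def stream_checksum(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_new_content(stream: BinaryIO, seen: set[str]) -> bool:
    """Return True and record the checksum if this content was not seen before.

    Hypothetical helper: `seen` stands in for whatever store keeps the
    checksums of files that were already ingested.
    """
    checksum = stream_checksum(stream)
    if checksum in seen:
        return False
    seen.add(checksum)
    return True


if __name__ == "__main__":
    seen: set[str] = set()
    first = io.BytesIO(b"id,name\n1,Ada\n")
    duplicate = io.BytesIO(b"id,name\n1,Ada\n")
    print(is_new_content(first, seen))
    print(is_new_content(duplicate, seen))
```

Hashing the content rather than comparing file names means a re-delivered file is still recognised after it has been renamed.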

Compatibility

  • Python 3.10+
  • PySpark 3.5.x, 4.0.x, or 4.1.x
  • Databricks / Unity Catalog compatible

Development

just test        # Run tests in Docker
just test-local  # Run tests locally (requires Java)
just --list      # All available commands

Design Decisions

Architecture Decision Records are in design_decisions/.

License

MIT — see LICENSE.

Contributors

  • Iwan Dyke
  • Fahad Khan
  • Michal Poreba

Download files

Source Distribution

sifft-0.8.3.tar.gz (97.0 kB)

Uploaded Source

Built Distribution

sifft-0.8.3-py3-none-any.whl (47.0 kB)

Uploaded Python 3

File details

Details for the file sifft-0.8.3.tar.gz.

File metadata

  • Download URL: sifft-0.8.3.tar.gz
  • Upload date:
  • Size: 97.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sifft-0.8.3.tar.gz
Algorithm    Hash digest
SHA256       d3040d2ccfaa17cdaf9dc574101705f8fe28bfe4f88619bd28cc08cfc1f5f7e0
MD5          f9ad6327dc8a91e800a6903cde34bc8d
BLAKE2b-256  e62d2a0f9649375655645c596a9d47855647e63287d8ebe41949a834592f6261

Provenance

The following attestation bundles were made for sifft-0.8.3.tar.gz:

Publisher: publish.yml on dvla/sifft

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sifft-0.8.3-py3-none-any.whl.

File metadata

  • Download URL: sifft-0.8.3-py3-none-any.whl
  • Upload date:
  • Size: 47.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sifft-0.8.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       c1deb8511c78b4994c122cf490248a7b5a4ce79c129de2b113750a5f847f7c0d
MD5          e770899cde816d1d689b1a6a60676fd5
BLAKE2b-256  505436cdaa03952824fb736ea2312400eb75a67178ee80f4147ee137758f6d06

Provenance

The following attestation bundles were made for sifft-0.8.3-py3-none-any.whl:

Publisher: publish.yml on dvla/sifft

