Spark Ingestion Framework for Tables (SIFFT) - a Python library providing a consistent interface for ingesting files into Apache Spark environments.

SIFFT

Spark Ingestion Framework For Tables — format-agnostic file reading, data validation, and table writing for PySpark.

Install

pip install sifft                # Without PySpark (use your Databricks runtime)
pip install "sifft[pyspark3]"    # With PySpark 3.5 + Delta Lake 3.x
pip install "sifft[pyspark4]"    # With PySpark 4.x + Delta Lake 4.x

Quick Start

from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)

# Validate (if CSVW metadata exists alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    assert report.valid, f"{len(report.violations)} violations"

# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
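
To make the validation step more concrete, here is a rough sketch of what CSVW-style column constraints look like and how they might be checked row by row. It uses plain Python rather than SIFFT or Spark. The metadata dict, the `check_rows` helper, and the column names are all hypothetical, and SIFFT's own `validate_csvw_constraints` may well behave differently.

```python
import re

# Hypothetical CSVW-style metadata; in practice this would be a JSON file
# sitting next to the CSV. Only a small subset of CSVW is modelled here.
metadata = {
    "tableSchema": {
        "columns": [
            {"name": "id", "datatype": "integer", "required": True},
            {"name": "age",
             "datatype": {"base": "integer", "minimum": 0, "maximum": 120}},
            {"name": "postcode",
             "datatype": {"base": "string",
                          "format": "[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}"}},
        ],
        "primaryKey": "id",  # assumed to be a single column name
    }
}


def check_rows(rows, metadata):
    """Return (row_index, column, constraint) for every violation found."""
    schema = metadata["tableSchema"]
    pk = schema.get("primaryKey")
    seen_keys = set()
    violations = []
    for i, row in enumerate(rows):
        for col in schema["columns"]:
            name, value = col["name"], row.get(col["name"])
            if value in (None, ""):
                if col.get("required"):
                    violations.append((i, name, "required"))
                continue
            dtype = col.get("datatype")
            if not isinstance(dtype, dict):
                continue  # bare type name, no range or pattern to check
            if "minimum" in dtype and int(value) < dtype["minimum"]:
                violations.append((i, name, "minimum"))
            if "maximum" in dtype and int(value) > dtype["maximum"]:
                violations.append((i, name, "maximum"))
            if dtype.get("base") == "string" and "format" in dtype:
                if not re.fullmatch(dtype["format"], value):
                    violations.append((i, name, "pattern"))
        key = row.get(pk)
        if key not in (None, ""):
            if key in seen_keys:
                violations.append((i, pk, "primary key"))
            seen_keys.add(key)
    return violations


rows = [
    {"id": "1", "age": "34", "postcode": "SA6 7JL"},
    {"id": "1", "age": "130", "postcode": "bad"},
    {"id": "", "age": "", "postcode": ""},
]
violations = check_rows(rows, metadata)
```

Returning a list of violations, rather than failing on the first one, lets a pipeline decide whether to reject the file or only log the problems.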

Features

  • File Processing — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
  • Data Validation — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys.
  • Table Writing — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution.
  • File Management — Safe move/list across local, S3, Azure, GCS.
  • Extensibility — Custom format handlers, write modes, and constraint validators.
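
As an illustration of two ideas from the list above, delimiter auto-detection and checksum-based deduplication, here is a minimal standard-library sketch. It is not SIFFT's implementation: `detect_delimiter`, `file_checksum`, and `SeenFiles` are hypothetical names, and the choice of SHA-256 is an assumption.

```python
import csv
import hashlib


def detect_delimiter(sample: str, candidates: str = ",\t|") -> str:
    """Guess the delimiter of a text sample with the stdlib sniffer.

    Only the characters in `candidates` are considered.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def file_checksum(data: bytes) -> str:
    """Hex SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


class SeenFiles:
    """Remembers checksums so identical content is only processed once."""

    def __init__(self) -> None:
        self._checksums: set[str] = set()

    def is_new(self, data: bytes) -> bool:
        digest = file_checksum(data)
        if digest in self._checksums:
            return False
        self._checksums.add(digest)
        return True


pipe_sample = "id|name|age\n1|Ann|34\n2|Bob|29\n"
tab_sample = "a\tb\n1\t2\n3\t4\n"

seen = SeenFiles()
first = seen.is_new(pipe_sample.encode())   # content not seen before
second = seen.is_new(pipe_sample.encode())  # same bytes again
```

Hashing the file contents rather than the file name means a re-delivered file is caught even when it arrives under a different name. A real pipeline would need to persist the checksums somewhere durable instead of an in-memory set.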

Documentation

Compatibility

  • Python 3.10+
  • PySpark 3.5.x, 4.0.x, or 4.1.x
  • Databricks / Unity Catalog compatible
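
When installing without the PySpark extras, it can help to confirm that the environment matches the versions listed above. The following is a small sketch using only the standard library; `is_supported_pyspark` and `check_environment` are hypothetical helpers, not part of SIFFT, and the sketch assumes the runtime exposes PySpark's package metadata, which may not hold on every managed platform.

```python
import sys
from importlib import metadata

# Major/minor pairs taken from the compatibility list above.
SUPPORTED_PYSPARK = {(3, 5), (4, 0), (4, 1)}


def is_supported_pyspark(version: str) -> bool:
    """True if the version string starts with a supported major.minor."""
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        return False
    return major_minor in SUPPORTED_PYSPARK


def check_environment() -> list[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    try:
        pyspark_version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        problems.append("PySpark package metadata was not found")
    else:
        if not is_supported_pyspark(pyspark_version):
            problems.append(f"PySpark {pyspark_version} is not in the supported list")
    return problems
```

Calling `check_environment()` at the start of a job and logging the result is usually enough; whether to fail hard on a mismatch is a policy decision for the pipeline.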

Development

just test        # Run tests in Docker
just test-local  # Run tests locally (requires Java)
just --list      # All available commands

Design Decisions

Architecture Decision Records are in design_decisions/.

License

MIT — see LICENSE.

Contributors

  • Iwan Dyke
  • Fahad Khan
  • Michal Poreba

Download files

Source Distribution

sifft-0.8.2.tar.gz (96.9 kB view details)

Uploaded Source

Built Distribution

sifft-0.8.2-py3-none-any.whl (47.0 kB view details)

Uploaded Python 3

File details

Details for the file sifft-0.8.2.tar.gz.

File metadata

  • Download URL: sifft-0.8.2.tar.gz
  • Upload date:
  • Size: 96.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sifft-0.8.2.tar.gz
Algorithm    Hash digest
SHA256       a2e5f1485d83f5874065eb6d07936386fcfc957f9ce25b677ed6df38ba8b8701
MD5          19b2a19af250b1b0d8c9f0d8b73f12f6
BLAKE2b-256  6bb370722a1a293173f6f1eea04d13397bb1c1062f0073e5aea7236ded2e7b88

Provenance

The following attestation bundles were made for sifft-0.8.2.tar.gz:

Publisher: publish.yml on dvla/sifft

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sifft-0.8.2-py3-none-any.whl.

File metadata

  • Download URL: sifft-0.8.2-py3-none-any.whl
  • Upload date:
  • Size: 47.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sifft-0.8.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       c55b0f7316cc711304304eaa4435cae4ad6132289b2377c9b019e73436ff92c3
MD5          5889f3e3c300bd8604979cb1654dcca3
BLAKE2b-256  ca22e5aea379e08a8bce89d6f736015e0ef8ae5d6e52ab236a86e7fcdf00ea7b

Provenance

The following attestation bundles were made for sifft-0.8.2-py3-none-any.whl:

Publisher: publish.yml on dvla/sifft

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
