Spark Ingestion Framework for Tables (SIFFT) - a Python library providing a consistent interface for ingesting files into Apache Spark environments.
SIFFT
Spark Ingestion Framework For Tables — format-agnostic file reading, data validation, and table writing for PySpark.
Install
pip install sifft # Without PySpark (use your Databricks runtime)
pip install sifft[pyspark3] # With PySpark 3.5 + Delta Lake 3.x
pip install sifft[pyspark4] # With PySpark 4.x + Delta Lake 4.x
Quick Start
from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions
spark = SparkSession.builder.appName("Pipeline").getOrCreate()
# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)
# Validate (if CSVW metadata exists alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    assert report.valid, f"{len(report.violations)} violations"
# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
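The validation step above depends on CSVW metadata sitting next to the data file. As a rough illustration of what such constraints mean, the sketch below defines a small metadata document using standard W3C CSVW properties (`required`, `datatype.minimum`, `datatype.maximum`, `datatype.format`, `primaryKey`) and checks plain Python rows against it. This is not SIFFT's implementation: `SAMPLE_METADATA` and `find_violations` are hypothetical names, and the exact set of properties SIFFT reads is described in its CSVW Metadata documentation.

```python
import re

# Hypothetical CSVW metadata using standard W3C CSVW properties only.
SAMPLE_METADATA = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "data.csv",
    "tableSchema": {
        "columns": [
            {"name": "id", "datatype": "integer", "required": True},
            {"name": "age",
             "datatype": {"base": "integer", "minimum": 0, "maximum": 130}},
            {"name": "code",
             "datatype": {"base": "string", "format": "[A-Z]{3}"}},
        ],
        "primaryKey": "id",
    },
}


def find_violations(rows, metadata):
    """Return (row_index, column, constraint) tuples for every broken rule.

    Rows are dicts of already-typed Python values; this is an assumption
    made to keep the sketch short.
    """
    schema = metadata["tableSchema"]
    key_column = schema.get("primaryKey")
    seen_keys = set()
    violations = []
    for index, row in enumerate(rows):
        for column in schema["columns"]:
            name = column["name"]
            value = row.get(name)
            if value is None or value == "":
                if column.get("required"):
                    violations.append((index, name, "required"))
                continue
            datatype = column.get("datatype")
            if not isinstance(datatype, dict):
                continue  # a bare type name carries no extra constraints
            if "minimum" in datatype and value < datatype["minimum"]:
                violations.append((index, name, "minimum"))
            if "maximum" in datatype and value > datatype["maximum"]:
                violations.append((index, name, "maximum"))
            if datatype.get("base") == "string" and "format" in datatype:
                if not re.fullmatch(datatype["format"], value):
                    violations.append((index, name, "format"))
        # Primary key values must not repeat across rows.
        key = row.get(key_column) if key_column else None
        if key is not None:
            if key in seen_keys:
                violations.append((index, key_column, "primaryKey"))
            seen_keys.add(key)
    return violations
```

In a Spark pipeline the same checks would normally be expressed as DataFrame filters or aggregations rather than a Python loop; the loop is used here only to make each rule easy to read.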
Features
- File Processing — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
- Data Validation — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys.
- Table Writing — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution.
- File Management — Safe move/list across local, S3, Azure, GCS.
- Extensibility — Custom format handlers, write modes, and constraint validators.
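To make the file processing ideas above more concrete, here is a minimal standard-library sketch of delimiter detection and checksum-based deduplication. It does not show SIFFT's internals: `detect_delimiter`, `file_checksum`, and `is_new_file` are hypothetical helpers, the delimiter heuristic ignores quoting, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def detect_delimiter(sample, candidates=(",", "\t", "|", ";")):
    """Pick the candidate that occurs the same non-zero number of times per line.

    A deliberately simple heuristic: it does not account for quoted fields.
    Falls back to the first candidate when nothing matches.
    """
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_count = candidates[0], 0
    for delimiter in candidates:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1:  # identical count on every line
            count = next(iter(counts))
            if count > best_count:
                best, best_count = delimiter, count
    return best


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_new_file(path, seen_checksums):
    """Return True and record the checksum if this content has not been seen."""
    checksum = file_checksum(path)
    if checksum in seen_checksums:
        return False
    seen_checksums.add(checksum)
    return True
```

Keying deduplication on content rather than file name means a re-delivered file with a new name is still skipped, while a corrected file that reuses an old name is still processed. A real pipeline would keep the seen checksums in a durable store instead of an in-memory set.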
Documentation
- Getting Started — installation, requirements, full quick start
- File Processing — reading files, tracking, deduplication
- Data Validation — schema validation and CSVW constraints
- Table Writing — write modes, formats, schema evolution
- File Management — safe move/list with cloud support
- CSVW Metadata — metadata format and constraint reference
- Extensibility — custom handlers and validators
- AutoLoader Integration — using SIFFT with Databricks AutoLoader
Compatibility
- Python 3.10+
- PySpark 3.5.x, 4.0.x, or 4.1.x
- Databricks / Unity Catalog compatible
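When installing without a PySpark extra (for example on a Databricks runtime), it can help to confirm that the environment falls inside the ranges listed above. The snippet below is a small sketch of such a check. `check_environment` and `pyspark_version_supported` are hypothetical helpers, not part of SIFFT, and the version prefixes are copied from the list above.

```python
import sys

# Version ranges taken from the compatibility list above.
MIN_PYTHON = (3, 10)
SUPPORTED_PYSPARK_PREFIXES = ("3.5.", "4.0.", "4.1.")


def pyspark_version_supported(version):
    """Return True if a PySpark version string starts with a supported prefix."""
    return version.startswith(SUPPORTED_PYSPARK_PREFIXES)


def check_environment():
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} is older than 3.10")
    try:
        import pyspark  # provided by the runtime or by a sifft extra
    except ImportError:
        problems.append("pyspark is not importable in this environment")
    else:
        if not pyspark_version_supported(pyspark.__version__):
            problems.append(f"PySpark {pyspark.__version__} is outside 3.5.x, 4.0.x, 4.1.x")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```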
Development
just test # Run tests in Docker
just test-local # Run tests locally (requires Java)
just --list # All available commands
Design Decisions
Architecture Decision Records are in design_decisions/.
License
MIT — see LICENSE.
Contributors
- Iwan Dyke
- Fahad Khan
- Michal Poreba