SDMF - Standard Data Management Framework
This project has been archived. The maintainers have marked it as archived; no new releases are expected.
Project description
A modular, scalable, and Python-based Data Management Framework designed to standardize data ingestion, validation, transformation, metadata handling, and storage across enterprise workflows.
This framework eliminates repetitive boilerplate and provides a consistent structure for building reliable, maintainable data pipelines.
About
SDMF (Standard Data Management Framework) is an open-source, Spark-based data engineering framework created and maintained by Harsh Handoo, Data Engineer. Built for reliable, production-grade data pipelines, it standardizes common data movement patterns, reduces boilerplate in real-world Spark workloads, and focuses on schema enforcement, incremental processing, and SCD Type-2 handling using Delta Lake.
Features
- Modular Design – Plug-and-play components for ingestion, validation, transformation, and storage.
- Schema Alignment & Partitioning – Built-in support for CDC (Change Data Capture) and MERGE operations.
- Metadata Management – Centralized handling of feed specifications and lineage.
- Scalable – Works seamlessly with Spark, Delta Lake, and distributed environments like Databricks.
- Logging & Monitoring – Custom logging with retention and rotation policies.
Installation
pip install sdmf
Requirements
Cluster Resources (Typical)
| Workload | Minimum | Recommended |
|---|---|---|
| Local development | 4 vCPU, 8 GB RAM | 8 vCPU, 16 GB RAM |
| Small datasets (<10M rows) | 2 executors × 4 GB | 4 executors × 8 GB |
| Medium datasets (10–100M rows) | 4 executors × 8 GB | 8 executors × 16 GB |
| Large datasets (>100M rows) | 8+ executors × 16 GB | Cluster-specific tuning |
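The sizing table above can be encoded as a small lookup when automating cluster provisioning. The helper below is purely illustrative (it is not part of sdmf); the thresholds and executor shapes come from the recommended column of the table:

```python
# Hypothetical sizing helper mirroring the table above; thresholds come from
# the documented recommendations, the function itself does not exist in sdmf.
def recommended_executors(rows: int) -> str:
    """Map a dataset's row count to the recommended executor shape."""
    if rows < 10_000_000:
        return "4 executors x 8 GB"
    if rows <= 100_000_000:
        return "8 executors x 16 GB"
    return "cluster-specific tuning"
```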
Recommended Production Setup
- Linux-based Spark cluster
- Spark FAIR scheduler enabled
- Delta Lake tables stored on cloud object storage
- Versioned releases via PyPI + GitHub Releases
Storage
- Local filesystem (dev only)
- HDFS / ADLS / S3 / GCS (recommended)
- DBFS (Databricks)
Operating System
- Linux (recommended)
- macOS
- Windows (WSL recommended for local development)
⚠️ Production deployments are strongly recommended on Linux-based systems.
Note: This library is tested on Databricks.
Usage
Prerequisites

- Dedicate a directory to SDMF. Example: /sdmf_dir/

- Set up a config.ini file:

  ```ini
  [DEFAULT]
  outbound_directory_name=sdmf_outbound
  log_directory_name=sdmf_logs
  temp_log_location=/sdmf_dir/temp
  file_hunt_path=/sdmf_dir/
  log_retention_policy_in_days=7
  max_concurrent_batches=4

  [FILES]
  master_spec_name = master_specs.xlsx

  [LINEAGE_DIAGRAM]
  BOX_WIDTH=4.4
  BOX_HEIGHT=2.2
  X_GAP=2.0
  Y_GAP=2.5
  ROOT_GAP=2.0
  ```
- Set up the master spec: a master_specs.xlsx file (the name can be changed via master_spec_name in config.ini) with the following columns:

- feed_id
- system_name
- subsystem_name
- category
- sub_category
- data_flow_direction
- residing_layer
- feed_name
- feed_type
- feed_specs
- load_type
- target_unity_catalog
- target_schema_name
- target_table_name
- suggested_feed_name
- parallelism_group_number
- parent_feed_id
- is_active
- Feed Spec JSON:

  ```json
  {
    "primary_key": "col1",
    "composite_key": [],
    "partition_keys": [],
    "vacuum_hours": 168,
    "source_table_name": "test.test",
    "selection_query": null,
    "selection_schema": {
      "type": "struct",
      "fields": [
        {"name": "col1", "type": "string", "nullable": true, "metadata": {"comment": "test"}},
        {"name": "col2", "type": "string", "nullable": true, "metadata": {"comment": "test"}},
        {"name": "col3", "type": "string", "nullable": true, "metadata": {"comment": "test"}},
        {"name": "col4", "type": "string", "nullable": true, "metadata": {"comment": "test"}}
      ]
    },
    "standard_checks": [
      {"check_sequence": ["_check_primary_key"], "column_name": "col1", "threshold": 0},
      {"check_sequence": ["_check_nulls"], "column_name": "col2", "threshold": 0}
    ],
    "comprehensive_checks": [
      {"check_name": "Some unique check name", "query": "Select 1;", "severity": "WARNING", "threshold": 0, "load_stage": "PRE_LOAD", "dependency_dataset": []},
      {"check_name": "Some unique check name 1", "query": "Select 1;", "severity": "WARNING", "threshold": 0, "load_stage": "PRE_LOAD", "dependency_dataset": []},
      {"check_name": "Some unique check name 2", "query": "Select 1;", "severity": "WARNING", "threshold": 0, "load_stage": "PRE_LOAD", "dependency_dataset": []},
      {"check_name": "Some unique check name 3", "query": "Select 1;", "severity": "WARNING", "threshold": 0, "load_stage": "POST_LOAD", "dependency_dataset": ["demo.customers"]}
    ]
  }
  ```
- Ensure the Spark FAIR scheduler is enabled. On a cluster, append the setting to spark-defaults.conf (e.g. via an init script):

  ```bash
  #!/bin/bash
  echo "Configuring Spark FAIR scheduler..."
  cat <<EOF >> /databricks/spark/conf/spark-defaults.conf
  spark.scheduler.mode FAIR
  EOF
  echo "Spark FAIR scheduler enabled."
  ```

  Alternatively, set it when building the session:

  ```python
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("SDMF")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()
  )
  ```
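The feed-spec JSON described above can be sanity-checked before a feed is registered. A minimal, standard-library-only sketch (`load_feed_spec` is an illustrative name, not part of the sdmf API; the inline spec is a trimmed example):

```python
import json

# Illustrative sketch (not part of the sdmf API): load a feed-spec JSON
# string and run a few structural sanity checks before registering the feed.
def load_feed_spec(raw: str) -> dict:
    spec = json.loads(raw)
    # Every comprehensive check must carry a recognised load stage.
    for check in spec.get("comprehensive_checks", []):
        assert check["load_stage"] in {"PRE_LOAD", "POST_LOAD"}, check
    # The selection schema follows Spark's StructType JSON layout.
    assert spec["selection_schema"]["type"] == "struct"
    return spec

spec = load_feed_spec("""{
  "primary_key": "col1",
  "selection_schema": {"type": "struct", "fields": []},
  "comprehensive_checks": [
    {"check_name": "demo", "query": "Select 1;", "severity": "WARNING",
     "threshold": 0, "load_stage": "PRE_LOAD", "dependency_dataset": []}
  ]
}""")
```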
Execution
```python
import configparser

from sdmf import Orchestrator

# spark: an already-available SparkSession (e.g. on Databricks)

cfg = configparser.ConfigParser()
cfg.read("/sdmf_dir/config.ini")

orchestrator = Orchestrator(spark, config=cfg)
orchestrator.run()
```
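Before handing control to the orchestrator, it can help to verify that the master spec sheet carries every column the documentation lists. A small stdlib-only sketch; the helper is illustrative and not an sdmf API (the header row would come from reading the sheet, e.g. with openpyxl or pandas):

```python
# Columns taken from the documented master spec layout; the checker itself
# is a hypothetical helper, not part of sdmf.
REQUIRED_COLUMNS = [
    "feed_id", "system_name", "subsystem_name", "category", "sub_category",
    "data_flow_direction", "residing_layer", "feed_name", "feed_type",
    "feed_specs", "load_type", "target_unity_catalog", "target_schema_name",
    "target_table_name", "suggested_feed_name", "parallelism_group_number",
    "parent_feed_id", "is_active",
]

def missing_master_spec_columns(headers):
    """Return the required columns absent from the given header row."""
    present = set(headers)
    return [c for c in REQUIRED_COLUMNS if c not in present]
```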
Logging
- Logs are first written to the log directory specified in config.ini.
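A retention-and-rotation setup like the one config.ini describes can be approximated with the standard logging module. The helper below is an assumption about the behaviour, not sdmf's actual logger:

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path

# Illustrative sketch of a retention-aware logger like the one the config
# describes; the helper name and wiring are assumptions, not the sdmf API.
def make_sdmf_logger(log_dir: str, retention_days: int) -> logging.Logger:
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    handler = logging.handlers.TimedRotatingFileHandler(
        Path(log_dir) / "sdmf.log",
        when="D",                   # rotate once per day
        backupCount=retention_days  # keep roughly retention_days old files
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("sdmf")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()
logger = make_sdmf_logger(log_dir, retention_days=7)
logger.info("pipeline started")
```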
Project details
Download files

Source Distribution

Built Distribution
File details
Details for the file sdmf-0.1.5.tar.gz.
File metadata
- Download URL: sdmf-0.1.5.tar.gz
- Upload date:
- Size: 41.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b7cc94b77e3e587ee452e8940e6e95aa040e6a4f4fe9e4da715664d5d54c93c |
| MD5 | c90cb982c65ac2df644d6c0f91a52564 |
| BLAKE2b-256 | 0ac63aace537344c2ebbebd97d9f5132e9002961d7cd1f21f7566f8dd2261598 |
File details
Details for the file sdmf-0.1.5-py3-none-any.whl.
File metadata
- Download URL: sdmf-0.1.5-py3-none-any.whl
- Upload date:
- Size: 62.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2c2748fb1f060b90639a86524d6aaa34483947ac735c79f9b2081476070986df |
| MD5 | 2b417c5142e610ded280bf630f0a26d6 |
| BLAKE2b-256 | 54cc03c09b3406718f8a6d36cfe86fd6f555d970152688e16c70a5c4ac0c46be |