SDMF - Standard Data Management Framework

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Standard Data Management Framework (SDMF)

SDMF is a modular, scalable, Python-based data management framework that standardizes data ingestion, validation, transformation, metadata handling, and storage across enterprise workflows.

This framework eliminates repetitive boilerplate and provides a consistent structure for building reliable, maintainable data pipelines.

About

SDMF is created and maintained by Harsh Handoo, Data Engineer. It is designed to standardize common data movement patterns and reduce boilerplate in real-world Spark workloads.

SDMF (Standard Data Management Framework) is an open-source Spark-based data engineering framework built for reliable, production-grade data pipelines. It focuses on schema enforcement, incremental processing, and SCD Type-2 handling using Delta Lake.

Features

Modular Design – Plug-and-play components for ingestion, validation, transformation, and storage.
Schema Alignment & Partitioning – Built-in support for CDC (Change Data Capture) and MERGE operations.
Metadata Management – Centralized handling of feed specifications and lineage.
Scalable – Works seamlessly with Spark, Delta Lake, and distributed environments like Databricks.
Logging & Monitoring – Custom logging with retention and rotation policies.

Installation

pip install sdmf

Requirements

Cluster Resources (Typical)

Workload	Minimum	Recommended
Local development	4 vCPU, 8 GB RAM	8 vCPU, 16 GB RAM
Small datasets (<10M rows)	2 executors × 4 GB	4 executors × 8 GB
Medium datasets (10–100M rows)	4 executors × 8 GB	8 executors × 16 GB
Large datasets (>100M rows)	8+ executors × 16 GB	Cluster-specific tuning
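
The table above can be read as a simple lookup from row count to executor sizing. The sketch below is only an illustration: `sizing_for` and `spark_conf_for` are hypothetical helpers, not part of SDMF. `spark.executor.instances` and `spark.executor.memory` are standard Spark properties, but on managed platforms such as Databricks, cluster size is usually set in the cluster configuration instead.

```python
from typing import Optional, Tuple

Sizing = Tuple[int, int]  # (number of executors, GB of memory per executor)


def sizing_for(row_count: int) -> Tuple[Sizing, Optional[Sizing]]:
    """Return (minimum, recommended) sizing, transcribed from the table above.

    The recommended value is None for large datasets, where the table only
    says "Cluster-specific tuning".
    """
    if row_count < 10_000_000:        # small datasets (<10M rows)
        return (2, 4), (4, 8)
    if row_count <= 100_000_000:      # medium datasets (10-100M rows)
        return (4, 8), (8, 16)
    return (8, 16), None              # large datasets (>100M rows), "8+" executors


def spark_conf_for(executors: int, memory_gb: int) -> dict:
    """Translate a sizing into standard Spark properties (illustrative only)."""
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.memory": f"{memory_gb}g",
    }
```

For example, `spark_conf_for(*sizing_for(50_000_000)[1])` gives the recommended settings for a medium dataset as a dictionary that could be passed to `SparkSession.builder.config(...)` one key at a time.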

Recommended Production Setup

Linux-based Spark cluster
Spark FAIR scheduler enabled
Delta Lake tables stored on cloud object storage
Versioned releases via PyPI + GitHub Releases

Storage

Local filesystem (dev only)
HDFS / ADLS / S3 / GCS (recommended)
DBFS (Databricks)

Operating System

Linux (recommended)
macOS
Windows (WSL recommended for local development)

⚠️ Production deployments are strongly recommended on Linux-based systems.

Note: This library is tested on Databricks.

Usage

Prerequisites

Dedicate a directory to SDMF. Example: /sdmf_dir/

Set up the config.ini file.

[DEFAULT]
outbound_directory_name=sdmf_outbound
log_directory_name=sdmf_logs
temp_log_location=/sdmf_dir/temp
file_hunt_path=/sdmf_dir/
log_retention_policy_in_days=7
max_concurrent_batches=4

[FILES]
master_spec_name = master_specs.xlsx

[LINEAGE_DIAGRAM]
BOX_WIDTH=4.4
BOX_HEIGHT=2.2
X_GAP=2.0
Y_GAP=2.5
ROOT_GAP=2.0
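
Before handing the file to SDMF, it can help to confirm that it parses and contains the keys shown above. The following is a minimal sketch using only the standard-library `configparser`. `EXPECTED_KEYS` and `find_missing_keys` are hypothetical names, and whether SDMF actually requires every key is an assumption based on the example.

```python
import configparser

# Keys copied from the example config.ini above. Whether SDMF requires all of
# them is an assumption; adjust this mapping to your installation.
EXPECTED_KEYS = {
    "DEFAULT": [
        "outbound_directory_name",
        "log_directory_name",
        "temp_log_location",
        "file_hunt_path",
        "log_retention_policy_in_days",
        "max_concurrent_batches",
    ],
    "FILES": ["master_spec_name"],
    "LINEAGE_DIAGRAM": ["BOX_WIDTH", "BOX_HEIGHT", "X_GAP", "Y_GAP", "ROOT_GAP"],
}


def find_missing_keys(cfg: configparser.ConfigParser) -> list:
    """Return 'SECTION.key' entries that are expected but absent (hypothetical helper)."""
    return [
        f"{section}.{key}"
        for section, keys in EXPECTED_KEYS.items()
        for key in keys
        # has_option() matches option names case-insensitively by default and
        # returns False when the section itself is missing.
        if not cfg.has_option(section, key)
    ]


if __name__ == "__main__":
    cfg = configparser.ConfigParser()
    cfg.read("/sdmf_dir/config.ini")  # read() silently skips files that do not exist
    for name in find_missing_keys(cfg):
        print(f"missing: {name}")
```

Numeric settings can then be read with the typed getters, for example `cfg.getint("DEFAULT", "max_concurrent_batches")` or `cfg.getfloat("LINEAGE_DIAGRAM", "BOX_WIDTH")`.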

Set up the Master Spec file, master_specs.xlsx (the name can be changed in config.ini), with the following columns:
- feed_id
- system_name
- subsystem_name
- category
- sub_category
- data_flow_direction
- residing_layer
- feed_name
- feed_type
- feed_specs
- load_type
- target_unity_catalog
- target_schema_name
- target_table_name
- suggested_feed_name
- parallelism_group_number
- parent_feed_id
- is_active
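
A quick header check can catch a misspelled or missing column before a run. This is a sketch only: `check_master_spec_header` is a hypothetical helper, not an SDMF function, and it assumes the spreadsheet's first row holds exactly the column names listed above.

```python
# Column names copied from the list above.
MASTER_SPEC_COLUMNS = [
    "feed_id", "system_name", "subsystem_name", "category", "sub_category",
    "data_flow_direction", "residing_layer", "feed_name", "feed_type",
    "feed_specs", "load_type", "target_unity_catalog", "target_schema_name",
    "target_table_name", "suggested_feed_name", "parallelism_group_number",
    "parent_feed_id", "is_active",
]


def check_master_spec_header(header):
    """Return (missing, unexpected) column names for a header row."""
    # Strip stray whitespace, which is easy to introduce in spreadsheets.
    normalized = [str(name).strip() for name in header]
    missing = [col for col in MASTER_SPEC_COLUMNS if col not in normalized]
    unexpected = [col for col in normalized if col not in MASTER_SPEC_COLUMNS]
    return missing, unexpected
```

The `header` argument can be any iterable of names, for instance the `columns` of a DataFrame loaded with `pandas.read_excel`.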

Feed Spec JSON

{
    "primary_key": "col1",
    "composite_key": [],
    "partition_keys": [],
    "vacuum_hours": 168,
    "source_table_name": "test.test",
    "selection_query": null,
    "selection_schema": {
        "type": "struct",
        "fields": [
            {
                "name": "col1",
                "type": "string",
                "nullable": true,
                "metadata": {
                    "comment": "test"
                }
            },
            {
                "name": "col2",
                "type": "string",
                "nullable": true,
                "metadata": {
                    "comment": "test"
                }
            },
            {
                "name": "col3",
                "type": "string",
                "nullable": true,
                "metadata": {
                    "comment": "test"
                }
            },
            {
                "name": "col4",
                "type": "string",
                "nullable": true,
                "metadata": {
                    "comment": "test"
                }
            }
        ]
    },
    "standard_checks": [
        {
            "check_sequence": [
                "_check_primary_key"
            ],
            "column_name": "col1",
            "threshold": 0
        },
        {
            "check_sequence": [
                "_check_nulls"
            ],
            "column_name": "col2",
            "threshold": 0
        }
    ],
    "comprehensive_checks": [
        {
            "check_name": "Some unique check name",
            "query": "Select 1;",
            "severity": "WARNING",
            "threshold": 0,
            "load_stage": "PRE_LOAD",
            "dependency_dataset": []
        },
        {
            "check_name": "Some unique check name 1",
            "query": "Select 1;",
            "severity": "WARNING",
            "threshold": 0,
            "load_stage": "PRE_LOAD",
            "dependency_dataset": []
        },
        {
            "check_name": "Some unique check name 2",
            "query": "Select 1;",
            "severity": "WARNING",
            "threshold": 0,
            "load_stage": "PRE_LOAD",
            "dependency_dataset": []
        },
        {
            "check_name": "Some unique check name 3",
            "query": "Select 1;",
            "severity": "WARNING",
            "threshold": 0,
            "load_stage": "POST_LOAD",
            "dependency_dataset": [
                "demo.customers"
            ]
        }
    ]
}
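
Because the feed spec is plain JSON, some internal consistency checks can be run before loading it into the master spec. The sketch below is illustrative: `lint_feed_spec` is a hypothetical helper, the set of `load_stage` values is inferred only from the example above, and the uniqueness rule for `check_name` is an assumption suggested by the example names.

```python
import json

# Values observed in the example above; the full set SDMF accepts is an assumption.
KNOWN_LOAD_STAGES = {"PRE_LOAD", "POST_LOAD"}


def lint_feed_spec(spec: dict) -> list:
    """Return human-readable problems found in a feed spec (hypothetical helper)."""
    problems = []
    schema = spec.get("selection_schema") or {}
    columns = {field["name"] for field in schema.get("fields", [])}

    # The primary key should be one of the selected columns.
    primary_key = spec.get("primary_key")
    if primary_key and primary_key not in columns:
        problems.append(f"primary_key '{primary_key}' is not in selection_schema")

    # Each standard check should target a column that exists in the schema.
    for check in spec.get("standard_checks", []):
        column = check.get("column_name")
        if column not in columns:
            problems.append(f"standard check targets unknown column '{column}'")

    # Comprehensive checks: names assumed unique, stages limited to known values.
    seen_names = set()
    for check in spec.get("comprehensive_checks", []):
        name = check.get("check_name")
        if name in seen_names:
            problems.append(f"duplicate check_name '{name}'")
        seen_names.add(name)
        if check.get("load_stage") not in KNOWN_LOAD_STAGES:
            problems.append(f"check '{name}' has unexpected load_stage")
    return problems
```

Calling `lint_feed_spec(json.loads(text))` on the example spec above should return an empty list, since its checks only reference col1 and col2 and its check names are distinct.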

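Ensure the Spark FAIR scheduler is enabled. On Databricks, one way is to append the setting to spark-defaults.conf, for example from a cluster init script: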

#!/bin/bash

echo "Configuring Spark FAIR scheduler..."

cat <<EOF >> /databricks/spark/conf/spark-defaults.conf
spark.scheduler.mode FAIR
EOF

echo "Spark FAIR scheduler enabled."

Alternatively, set the scheduler mode when building the Spark session:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("SDMF")
        .config("spark.scheduler.mode", "FAIR")
        .getOrCreate()
)

Execution

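import configparser
from sdmf import Orchestrator

# `spark` must be an existing SparkSession: it is predefined in Databricks
# notebooks, or can be created with SparkSession.builder as in the Prerequisites.

# Load the SDMF settings from the dedicated directory.
cfg = configparser.ConfigParser()
cfg.read("/sdmf_dir/config.ini")

# Build the orchestrator and start the run.
orchestrator = Orchestrator(spark, config=cfg)
orchestrator.run()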

Logging

Logs are first written to the log directory specified in config.ini.

Release history

Version	Release date
0.1.16	Feb 7, 2026
0.1.15	Feb 7, 2026
0.1.14	Feb 7, 2026
0.1.13	Feb 7, 2026
0.1.12	Feb 7, 2026
0.1.11	Feb 7, 2026
0.1.10	Feb 7, 2026
0.1.9	Feb 6, 2026
0.1.8	Feb 5, 2026
0.1.7	Feb 4, 2026
0.1.6	Feb 3, 2026
0.1.5	Feb 3, 2026
0.1.4	Feb 3, 2026
0.1.3 (this version)	Feb 3, 2026
0.1.2	Feb 3, 2026
0.1.1	Jan 28, 2026
0.1.0	Jan 28, 2026

Download files

Source Distribution

sdmf-0.1.3.tar.gz (41.1 kB)

Uploaded Feb 3, 2026 Source

Built Distribution

sdmf-0.1.3-py3-none-any.whl (62.2 kB)

Uploaded Feb 3, 2026 Python 3

File details

Details for the file sdmf-0.1.3.tar.gz.

File metadata

Download URL: sdmf-0.1.3.tar.gz
Upload date: Feb 3, 2026
Size: 41.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for sdmf-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`5c313300800129beab6ba8971d6fa6d6554362368253b2e754824b71543f5874`
MD5	`aa5e2502d60d6f02a98d2ddb5fee2066`
BLAKE2b-256	`ed670141add5700e8bc66e0e3a9c7c6fbc8c1337f5ccc44947d10b630e78b895`

File details

Details for the file sdmf-0.1.3-py3-none-any.whl.

File metadata

Download URL: sdmf-0.1.3-py3-none-any.whl
Upload date: Feb 3, 2026
Size: 62.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for sdmf-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2f868e8a111937633ecdb2e13992a778ca4223a2dfac7ec967e2f49973181d1b`
MD5	`0edc84528248c65255e7d98492bd43c3`
BLAKE2b-256	`f65372ff6e7c540d226fbae1acffc7ef7f1a8afc257d25dd31f205a6ec101a17`
