A modular framework for data engineering, validation, and analytics workflows.

Data Manager

Data Manager is a modular Python framework for data engineering, validation, and analytics workflows. It provides a unified, job-based architecture on top of interchangeable storage backends, enabling structured and reproducible processing of tabular datasets.

Table of Contents

Features
Installation
Architecture Overview
Quick Start
Storage Backends
Jobs
Example Pipeline
Running Tests
Roadmap
Contributing
License

Features

Pluggable storage backends — swap between CSV, JSON, and in-memory storage with a consistent interface
Job-based execution model — cleanly separates engineering, validation, and analytics concerns
Data engineering utilities — remove duplicates, handle missing values
Data validation — schema validation, data type checks, and nullability checks
Data analytics & profiling — summary statistics, column-level analysis, missing value reports, and full dataset profiling
Fully tested — comprehensive test suite using pytest with coverage reporting

Installation

Install from PyPI:

pip install data-manager-framework

Requirements: Python ≥ 3.12, pandas, numpy

Architecture Overview

Data Manager is built around two core concepts:

Storage Backends: responsible for loading, holding, and writing data. All backends expose a consistent interface (load, write, data), so jobs are fully decoupled from the underlying file format.
Jobs: stateless workers that accept a storage object and operate on its data. The three built-in job classes are DataEngineer, DataValidator, and DataAnalytics.

┌─────────────────────────────────────────────────────┐
│                    Your Application                  │
└──────────────────────────┬──────────────────────────┘
                           │
              ┌────────────▼────────────┐
              │     Storage Backend     │
              │  CSVStorage             │
              │  JSONStorage            │
              │  InMemoryStorage        │
              └────────────┬────────────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
┌────────▼───────┐ ┌───────▼──────┐ ┌───────▼────────┐
│  DataEngineer  │ │DataValidation│ │ DataAnalytics  │
│  - duplicates  │ │  - schema    │ │  - summary     │
│  - null values │ │  - types     │ │  - profiling   │
└────────────────┘ │  - nulls     │ │  - statistics  │
                   └──────────────┘ └────────────────┘
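
To illustrate the decoupling described above, here is a minimal sketch of a job that depends only on a storage interface. It does not reproduce Data Manager's actual classes: `Storage`, `ListStorage`, and `RowCounter` are hypothetical names invented for this example. The `load`, `write`, and `data` members mirror the ones used in the examples in this document.

```python
from typing import Any, Protocol


class Storage(Protocol):
    """Hypothetical structural type: anything with `data`, `load`, and `write`."""

    data: Any

    def load(self, path: str) -> None: ...

    def write(self, path: str) -> None: ...


class ListStorage:
    """Toy backend holding rows as a list of dicts (illustration only)."""

    def __init__(self) -> None:
        self.data: list[dict[str, Any]] = []

    def load(self, path: str) -> None:
        # A real backend would parse the file at `path`; this toy one ignores it.
        self.data = [{"name": "Alice"}, {"name": "Bob"}]

    def write(self, path: str) -> None:
        # No-op here; a real backend would serialise `self.data` to `path`.
        pass


class RowCounter:
    """Toy job: relies on the Storage interface, not on any file format."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def count(self) -> int:
        return len(self.storage.data)


storage = ListStorage()
storage.load("ignored.csv")
row_count = RowCounter(storage).count()
print(row_count)  # prints: 2
```

Because `RowCounter` only touches `storage.data`, any object with the same three members could be passed in without changing the job.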

Quick Start

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_analytics import DataAnalytics

# Load data
storage = CSVStorage()
storage.load("data.csv")

# Run analytics
analytics = DataAnalytics(storage)
print(analytics.summary())

Storage Backends

All storage backends share the same interface. You can swap one for another without changing any of your job code.

CSVStorage

Reads and writes CSV files using pandas.

from data_manager.storage.csv_backend import CSVStorage

storage = CSVStorage()
storage.load("data.csv")

# Access the underlying DataFrame
print(storage.data.head())

# Write back to disk
storage.write("output.csv")

JSONStorage

Reads and writes JSON files.

from data_manager.storage.json_backend import JSONStorage

storage = JSONStorage()
storage.load("data.json")
storage.write("output.json")

InMemoryStorage

Holds data in memory — useful for testing or for passing a pre-built pandas DataFrame directly into the job pipeline.

import pandas as pd
from data_manager.storage.In_memory_backend import InMemoryStorage

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

storage = InMemoryStorage()
storage.data = df

Jobs

Jobs accept a storage object in their constructor and operate on storage.data. They do not read or write files themselves — that is the storage layer's responsibility.

DataEngineer

Handles data cleaning and preprocessing tasks.

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_engineer import DataEngineer

storage = CSVStorage()
storage.load("raw_data.csv")

engineer = DataEngineer(storage)

# Remove duplicate rows
engineer.removeDuplicates()

# Drop rows with any null values
engineer.removeNull()

# Persist the cleaned dataset
storage.write("cleaned_data.csv")

Method	Description
`removeDuplicates()`	Drops all fully duplicate rows from the dataset
`removeNull()`	Drops all rows containing one or more null values
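
For readers who think in pandas terms, the two operations in the table can be sketched directly on a DataFrame. This assumes `removeDuplicates()` and `removeNull()` behave like `DataFrame.drop_duplicates()` and `DataFrame.dropna()` with default arguments. That matches the descriptions above but has not been checked against the library's source.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Alice", "Bob", None],
        "age": [30, 30, 25, 41],
    }
)

# Drop rows that exactly repeat an earlier row (the first copy is kept).
deduplicated = df.drop_duplicates()

# Drop every remaining row that contains at least one null value.
cleaned = deduplicated.dropna()

print(len(df), len(deduplicated), len(cleaned))  # prints: 4 3 2
```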

DataValidation

Validates the structure and content of a dataset against expected rules.

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_validator import DataValidator

storage = CSVStorage()
storage.load("data.csv")

validator = DataValidator(storage)

# Check for required columns and expected types
validator.validateSchema({"name": "object", "age": "int64", "salary": "float64"})

# Check that specific columns have no null values
validator.checkNullability(["name", "age"])

Method	Description
`validateSchema(schema)`	Validates that columns exist and match their expected dtypes
`checkNullability(columns)`	Raises or reports when specified columns contain null values
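
The checks described in the table can be approximated in plain pandas. The sketch below is an assumption about what such validation might look like, not the library's implementation: `find_schema_problems` and `find_null_columns` are hypothetical helpers, and they return lists of problems rather than raising.

```python
import pandas as pd


def find_schema_problems(df: pd.DataFrame, schema: dict[str, str]) -> list[str]:
    """Return readable problems; an empty list means the schema matches."""
    problems = []
    for column, expected_dtype in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(
                f"{column}: expected {expected_dtype}, got {df[column].dtype}"
            )
    return problems


def find_null_columns(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Return the subset of `columns` that contain at least one null value."""
    return [column for column in columns if df[column].isna().any()]


df = pd.DataFrame(
    {
        "name": ["Alice", None],
        "age": pd.Series([30, 25], dtype="int64"),
    }
)

schema_problems = find_schema_problems(df, {"age": "int64", "salary": "float64"})
null_columns = find_null_columns(df, ["name", "age"])

print(schema_problems)  # prints: ['missing column: salary']
print(null_columns)     # prints: ['name']
```

Returning a list of problems makes it easy to report every issue at once. A stricter variant could raise on the first mismatch instead.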

DataAnalytics

Profiles and summarises a loaded dataset.

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_analytics import DataAnalytics

storage = CSVStorage()
storage.load("data.csv")

analytics = DataAnalytics(storage)

# High-level dataset summary (shape, dtypes, null counts)
print(analytics.summary())

# Per-column descriptive statistics
print(analytics.columnStats())

# Count and percentage of missing values per column
print(analytics.missingValueAnalysis())

# Count and percentage of duplicate rows
print(analytics.duplicateAnalysis())

# Full dataset profile combining all of the above
print(analytics.profile())

Method	Description
`summary()`	Returns shape, column names, dtypes, and null counts
`columnStats()`	Returns descriptive statistics for each column
`missingValueAnalysis()`	Returns missing value counts and percentages per column
`duplicateAnalysis()`	Returns the number and percentage of duplicate rows
`profile()`	Returns a comprehensive profile of the entire dataset
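
As a rough guide to what the missing-value and duplicate reports contain, the same figures can be computed with a few pandas calls. This is a sketch under the assumption that the percentages are taken relative to the total row count. The exact return types of `missingValueAnalysis()` and `duplicateAnalysis()` are not documented here, so the structures below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Bob", None],
        "age": [30.0, 25.0, 25.0, None],
    }
)

# Missing values: count per column, and that count as a percentage of all rows.
missing_count = df.isna().sum()
missing_report = pd.DataFrame(
    {"missing": missing_count, "percent": missing_count / len(df) * 100}
)

# Duplicates: rows that exactly repeat an earlier row.
duplicate_count = int(df.duplicated().sum())
duplicate_percent = duplicate_count / len(df) * 100

print(missing_report)
print(duplicate_count, duplicate_percent)  # prints: 1 25.0
```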

Example Pipeline

The following example shows a complete end-to-end workflow — loading raw data, cleaning it, validating it, and profiling the result.

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_engineer import DataEngineer
from data_manager.jobs.data_validator import DataValidator
from data_manager.jobs.data_analytics import DataAnalytics

# --- Step 1: Load ---
storage = CSVStorage()
storage.load("raw_data.csv")

# --- Step 2: Engineer ---
engineer = DataEngineer(storage)
engineer.removeDuplicates()
engineer.removeNull()

# --- Step 3: Validate ---
validator = DataValidator(storage)
validator.validateSchema({"name": "object", "age": "int64"})
validator.checkNullability(["name"])

# --- Step 4: Analyse ---
analytics = DataAnalytics(storage)
print(analytics.profile())

# --- Step 5: Save ---
storage.write("cleaned_data.csv")

Running Tests

The full test suite is written with pytest and includes coverage reporting.

Run all tests:

pytest

Run with verbose output:

pytest -v

Run with coverage report:

pytest --cov=data_manager

Roadmap

Status	Feature
✅ Done	CSV Backend
✅ Done	JSON Backend
✅ Done	In-Memory Backend
✅ Done	Data Validation
✅ Done	Data Analytics & Profiling
🔲 Planned	Excel Backend
🔲 Planned	Parquet Backend
🔲 Planned	SQL Backend
🔲 Planned	Automated EDA Reports

Contributing

Contributions are welcome. To get started:

Fork the repository on GitHub.

Clone your fork and create a new branch:

git checkout -b feature/your-feature-name

Install the project in editable mode:
```
pip install -e .
```
Make your changes and ensure all tests pass:
```
pytest -v
```
Open a pull request against the main branch with a clear description of your changes.

Please keep pull requests focused and scoped to a single concern. Bug fixes, new storage backends, and additional job methods are all welcome.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Krish Kumar — GitHub

Release history

Version	Released
0.2.0 (this version)	Jun 18, 2026
0.1.0	Jun 3, 2026

File details

Details for the file data_manager_framework-0.2.0.tar.gz.

File metadata

Download URL: data_manager_framework-0.2.0.tar.gz
Upload date: Jun 18, 2026
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for data_manager_framework-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a83d092dd5c1d3a78225c272139c7876fb0b1fd94dd040ea4dab7573c27103fa`
MD5	`f27ddc06283bda4f6a06189be830429b`
BLAKE2b-256	`a4ff4d9a6c8caeaaf3209fdf8996cd227f064a1eca4e3a911112092ccced212b`

File details

Details for the file data_manager_framework-0.2.0-py3-none-any.whl.

File metadata

Download URL: data_manager_framework-0.2.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 17.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for data_manager_framework-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1ddfa2323eb141bd359bacbda2115a95a8557d12e4bde5e8bdfa22ca39e1695d`
MD5	`c1f48d6d199469b5c62199cd29ef4308`
BLAKE2b-256	`3f1224ceb390c1a0924f1d9a1bbf2425748e7a0bae5382dab947c55fdd6993f9`
