A modular framework for data engineering, validation, and analytics workflows.
Data Manager
Data Manager is a modular Python framework for data engineering, validation, and analytics workflows. It provides a unified, job-based architecture on top of interchangeable storage backends, enabling structured and reproducible processing of tabular datasets.
Table of Contents
- Features
- Installation
- Architecture Overview
- Quick Start
- Storage Backends
- Jobs
- Example Pipeline
- Running Tests
- Roadmap
- Contributing
- License
Features
- Pluggable storage backends — swap between CSV, JSON, and in-memory storage with a consistent interface
- Job-based execution model — cleanly separates engineering, validation, and analytics concerns
- Data engineering utilities — remove duplicates, handle missing values
- Data validation — schema validation, data type checks, and nullability checks
- Data analytics & profiling — summary statistics, column-level analysis, missing value reports, and full dataset profiling
- Fully tested, with a comprehensive `pytest` suite and coverage reporting
Installation
Install from PyPI:
pip install data-manager-framework
Requirements: Python ≥ 3.12, pandas, numpy
Architecture Overview
Data Manager is built around two core concepts:
- Storage Backends: responsible for reading, holding, and writing data. All backends expose a consistent interface (`read`, `write`, `data`), so jobs are fully decoupled from the underlying file format.
- Jobs: stateless workers that accept a storage object and operate on its data. The three built-in job classes are `DataEngineer`, `DataValidation`, and `DataAnalytics`.
┌─────────────────────────────────────────────────────┐
│                  Your Application                   │
└──────────────────────────┬──────────────────────────┘
                           │
              ┌────────────▼────────────┐
              │     Storage Backend     │
              │       CSVStorage        │
              │       JSONStorage       │
              │     InMemoryStorage     │
              └────────────┬────────────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
┌────────▼───────┐ ┌───────▼──────┐  ┌───────▼────────┐
│  DataEngineer  │ │DataValidation│  │ DataAnalytics  │
│ - duplicates   │ │ - schema     │  │ - summary      │
│ - null values  │ │ - types      │  │ - profiling    │
└────────────────┘ │ - nulls      │  │ - statistics   │
                   └──────────────┘  └────────────────┘
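The decoupling between storage and jobs can be sketched with a structural interface. This is a minimal illustration under stated assumptions, not the framework's actual code: `StorageLike`, `ListStorage`, and `RowCounter` are hypothetical names, and a plain list of dictionaries stands in for the pandas DataFrame that the real backends hold.

```python
from typing import Any, Protocol


class StorageLike(Protocol):
    """Hypothetical structural type: anything exposing a `data` attribute."""

    data: Any


class ListStorage:
    """Hypothetical stand-in backend that keeps rows in a plain list."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.data = rows


class RowCounter:
    """Hypothetical job: it only touches `storage.data`, never a file."""

    def __init__(self, storage: StorageLike) -> None:
        self.storage = storage

    def count(self) -> int:
        return len(self.storage.data)


storage = ListStorage([{"name": "Alice"}, {"name": "Bob"}])
job = RowCounter(storage)
print(job.count())
```

Because the job depends only on the `data` attribute, any backend that provides it could be passed in without changing the job.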
Quick Start
from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_analytics import DataAnalytics
# Load data
storage = CSVStorage()
storage.load("data.csv")
# Run analytics
analytics = DataAnalytics(storage)
print(analytics.summary())
Storage Backends
All storage backends share the same interface. You can swap one for another without changing any of your job code.
CSVStorage
Reads and writes CSV files using pandas.
from data_manager.storage.csv_backend import CSVStorage
storage = CSVStorage()
storage.load("data.csv")
# Access the underlying DataFrame
print(storage.data.head())
# Write back to disk
storage.write("output.csv")
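For orientation, the pandas calls that a CSV backend of this kind would typically wrap are `pandas.read_csv` and `DataFrame.to_csv`. The sketch below is an assumption about the general approach, not the actual `CSVStorage` implementation. It uses an in-memory buffer instead of a file so that it runs anywhere.

```python
import io

import pandas as pd

# Parse CSV text into a DataFrame (a file path works the same way).
csv_text = "name,age\nAlice,30\nBob,25\n"
df = pd.read_csv(io.StringIO(csv_text))

# Serialise back to CSV; index=False keeps the row index out of the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
print(round_trip)
```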
JSONStorage
Reads and writes JSON files.
from data_manager.storage.json_backend import JSONStorage
storage = JSONStorage()
storage.load("data.json")
storage.write("output.json")
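The JSON layout that `JSONStorage` expects is not stated here. A common choice for tabular data is an array with one object per row, which pandas can read and write directly. The sketch below assumes that layout and uses plain pandas rather than the backend itself.

```python
import io
import json

import pandas as pd

# Assumed layout: a JSON array with one object per row.
json_text = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
df = pd.read_json(io.StringIO(json_text))

# orient="records" writes the same one-object-per-row layout back out.
round_trip = json.loads(df.to_json(orient="records"))
print(round_trip)
```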
InMemoryStorage
Holds data in memory — useful for testing or for passing a pre-built pandas DataFrame directly into the job pipeline.
import pandas as pd
from data_manager.storage.In_memory_backend import InMemoryStorage
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
storage = InMemoryStorage()
storage.data = df
Jobs
Jobs accept a storage object in their constructor and operate on storage.data. They do not read or write files themselves — that is the storage layer's responsibility.
DataEngineer
Handles data cleaning and preprocessing tasks.
from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_engineer import DataEngineer
storage = CSVStorage()
storage.load("raw_data.csv")
engineer = DataEngineer(storage)
# Remove duplicate rows
engineer.removeDuplicates()
# Drop rows with any null values
engineer.removeNull()
# Persist the cleaned dataset
storage.write("cleaned_data.csv")
| Method | Description |
|---|---|
| `removeDuplicates()` | Drops all fully duplicate rows from the dataset |
| `removeNull()` | Drops all rows containing one or more null values |
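In pandas terms, the two operations in the table correspond to `DataFrame.drop_duplicates()` and `DataFrame.dropna()`. The snippet below illustrates that behaviour on a small frame. It is a sketch of the described semantics, not the `DataEngineer` source, and the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Alice", "Bob", None],
        "age": [30, 30, 25, 41],
    }
)

# Drop rows that duplicate an earlier row in every column.
deduplicated = df.drop_duplicates()

# Drop rows that contain at least one null value.
cleaned = deduplicated.dropna()

print(len(df), len(deduplicated), len(cleaned))  # 4 3 2
```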
DataValidation
Validates the structure and content of a dataset against expected rules.
from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_validator import DataValidator
storage = CSVStorage()
storage.load("data.csv")
validator = DataValidator(storage)
# Check for required columns and expected types
validator.validateSchema({"name": "object", "age": "int64", "salary": "float64"})
# Check that specific columns have no null values
validator.checkNullability(["name", "age"])
| Method | Description |
|---|---|
| `validateSchema(schema)` | Validates that columns exist and match their expected dtypes |
| `checkNullability(columns)` | Raises or reports when specified columns contain null values |
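To make the two checks concrete, here is one way they could be written against a plain DataFrame. The helper names `validate_schema` and `null_columns` are hypothetical, and returning a list of problems is an assumption; the framework's own methods may raise instead.

```python
import pandas as pd


def validate_schema(df: pd.DataFrame, schema: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the schema matches."""
    problems = []
    for column, expected in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected:
            problems.append(
                f"{column}: expected {expected}, found {df[column].dtype}"
            )
    return problems


def null_columns(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Return the subset of `columns` that contain at least one null."""
    return [column for column in columns if df[column].isna().any()]


df = pd.DataFrame({"name": ["Alice", None], "age": [30, 25]})
print(validate_schema(df, {"age": "int64", "salary": "float64"}))
print(null_columns(df, ["name", "age"]))
```

Here the schema check reports the absent `salary` column, and the nullability check reports `name`.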
DataAnalytics
Profiles and summarises a loaded dataset.
from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_analytics import DataAnalytics
storage = CSVStorage()
storage.load("data.csv")
analytics = DataAnalytics(storage)
# High-level dataset summary (shape, dtypes, null counts)
print(analytics.summary())
# Per-column descriptive statistics
print(analytics.columnStats())
# Count and percentage of missing values per column
print(analytics.missingValueAnalysis())
# Count and percentage of duplicate rows
print(analytics.duplicateAnalysis())
# Full dataset profile combining all of the above
print(analytics.profile())
| Method | Description |
|---|---|
| `summary()` | Returns shape, column names, dtypes, and null counts |
| `columnStats()` | Returns descriptive statistics for each column |
| `missingValueAnalysis()` | Returns missing value counts and percentages per column |
| `duplicateAnalysis()` | Returns the number and percentage of duplicate rows |
| `profile()` | Returns a comprehensive profile of the entire dataset |
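The missing-value and duplicate figures described in the table can be computed with a few pandas calls. The snippet below is a sketch of those calculations on made-up data. The column names `missing_count` and `missing_pct` are illustrative, and the return format of the real methods may differ.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Alice", "Bob", None],
        "age": [30.0, 30.0, None, 41.0],
    }
)

# Missing values per column, as counts and as a percentage of all rows.
missing = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),
        "missing_pct": df.isna().mean() * 100,
    }
)

# Rows that repeat an earlier row in every column.
duplicate_count = int(df.duplicated().sum())
duplicate_pct = duplicate_count / len(df) * 100

print(missing)
print(duplicate_count, duplicate_pct)  # 1 25.0
```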
Example Pipeline
The following example shows a complete end-to-end workflow — loading raw data, cleaning it, validating it, and profiling the result.
from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_engineer import DataEngineer
from data_manager.jobs.data_validator import DataValidator
from data_manager.jobs.data_analytics import DataAnalytics
# --- Step 1: Load ---
storage = CSVStorage()
storage.load("raw_data.csv")
# --- Step 2: Engineer ---
engineer = DataEngineer(storage)
engineer.removeDuplicates()
engineer.removeNull()
# --- Step 3: Validate ---
validator = DataValidator(storage)
validator.validateSchema({"name": "object", "age": "int64"})
validator.checkNullability(["name"])
# --- Step 4: Analyse ---
analytics = DataAnalytics(storage)
print(analytics.profile())
# --- Step 5: Save ---
storage.write("cleaned_data.csv")
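To try the pipeline without real data, a small input file can be generated with the standard library. The rows below are made up for illustration: they include one exact duplicate and one row with a missing name, so both cleaning steps have something to remove.

```python
import csv
import tempfile
from pathlib import Path

# Made-up sample rows: one exact duplicate and one row with a missing name.
rows = [
    {"name": "Alice", "age": "30"},
    {"name": "Alice", "age": "30"},
    {"name": "", "age": "25"},
    {"name": "Carol", "age": "41"},
]

# Written to a temporary directory here; point this at your working
# directory to use the file with the pipeline above.
path = Path(tempfile.mkdtemp()) / "raw_data.csv"
with path.open("w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

print(path.read_text(encoding="utf-8"))
```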
Running Tests
The full test suite is written with pytest and includes coverage reporting.
Run all tests:
pytest
Run with verbose output:
pytest -v
Run with coverage report:
pytest --cov=data_manager
Roadmap
| Status | Feature |
|---|---|
| ✅ Done | CSV Backend |
| ✅ Done | JSON Backend |
| ✅ Done | In-Memory Backend |
| ✅ Done | Data Validation |
| ✅ Done | Data Analytics & Profiling |
| 🔲 Planned | Excel Backend |
| 🔲 Planned | Parquet Backend |
| 🔲 Planned | SQL Backend |
| 🔲 Planned | Automated EDA Reports |
Contributing
Contributions are welcome. To get started:
- Fork the repository on GitHub.
- Clone your fork and create a new branch:
git checkout -b feature/your-feature-name
- Install the project in editable mode:
pip install -e .
- Make your changes and ensure all tests pass:
pytest -v
- Open a pull request against the `main` branch with a clear description of your changes.
Please keep pull requests focused and scoped to a single concern. Bug fixes, new storage backends, and additional job methods are all welcome.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
Krish Kumar — GitHub