Data Manager
A modular Python framework for data engineering, validation, and analytics workflows.
DataManager uses a job-based architecture and pluggable storage backends to provide a structured approach for processing tabular datasets.
Features
- Modular Storage Backends: Decoupled storage interfaces allowing seamless switching between In-Memory, CSV, and JSON data layers.
- Extensible Job Architecture: Abstract base classes (base_job.py) that make it easy to write custom logic for engineering, validation, and analytics.
- Comprehensive Test Suite: Test-driven design featuring deep unit testing and interactive Jupyter Notebooks for debugging and validation.
System Architecture
The framework is divided into the following primary layers:
- Storage Layer (src/data_manager/storage/)
  - base.py: The abstract base class defining standard data operations.
  - csv_backend.py & json_backend.py: File-based storage implementations.
  - In_memory_backend.py: RAM-based storage for high-speed, temporary data transformations.
- Execution Layer (src/data_manager/jobs/)
  - data_engineer.py: Logic for data cleaning, transformation, and feature engineering.
  - data_validator.py: Rules and assertions to ensure data quality and schema integrity.
  - data_analytics.py: Calculation of metrics, aggregations, and business logic.
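As a rough illustration of how decoupled storage backends can sit behind one interface, here is a minimal sketch. The class names (`BaseStorage`, `CsvTextStorage`, `JsonTextStorage`) and the `dumps`/`loads` methods are assumptions made for this example; they are not taken from base.py, and the real backends work on files rather than strings.

```python
from abc import ABC, abstractmethod
import csv
import io
import json


class BaseStorage(ABC):
    """Hypothetical interface: holds rows in memory, subclasses define the format."""

    def __init__(self, rows=None):
        # Copy each row so callers cannot mutate the stored data by accident.
        self.rows = [dict(row) for row in (rows or [])]

    @abstractmethod
    def dumps(self) -> str:
        """Serialize the held rows to text."""

    @abstractmethod
    def loads(self, text: str) -> None:
        """Replace the held rows with rows parsed from text."""


class CsvTextStorage(BaseStorage):
    def dumps(self) -> str:
        if not self.rows:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=list(self.rows[0]), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(self.rows)
        return buffer.getvalue()

    def loads(self, text: str) -> None:
        # CSV carries no type information, so every value comes back as a string.
        self.rows = list(csv.DictReader(io.StringIO(text)))


class JsonTextStorage(BaseStorage):
    def dumps(self) -> str:
        return json.dumps(self.rows)

    def loads(self, text: str) -> None:
        self.rows = json.loads(text)


def row_count(storage: BaseStorage) -> int:
    """Job-like helper that depends only on the interface, not on the format."""
    return len(storage.rows)
```

Because `row_count` only touches the shared interface, swapping `CsvTextStorage` for `JsonTextStorage` needs no change in the calling code, which is the property the storage layer is described as providing.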
Directory Structure
data-manager
├── Notebooks                      # Exploratory data analysis and experimental workflows
│   └── notes_1.ipynb
├── src
│   └── data_manager
│       ├── config
│       ├── core
│       │   └── base_job.py        # Abstract base class for all pipeline jobs
│       ├── jobs
│       │   ├── data_analytics.py  # Analytics and metrics processing
│       │   ├── data_engineer.py   # ETL and transformation logic
│       │   └── data_validator.py  # Data quality assurance
│       ├── storage
│       │   ├── In_memory_backend.py
│       │   ├── base.py            # Storage interface definitions
│       │   ├── csv_backend.py
│       │   └── json_backend.py
│       └── runner.py
├── tests                          # Unit and integration tests (Pytest)
│   ├── Data                       # Mock datasets for testing pipelines
│   ├── jobs                       # Tests for specific job implementations
│   └── storage                    # Tests for storage backends
├── main.py                        # Application entry point
├── conftest.py                    # Pytest configuration and fixtures
└── pytest.ini                     # Pytest environment settings
Getting Started
Prerequisites
Ensure you have Python 3.x installed. It is recommended to use a virtual environment.
# Clone the repository
git clone https://github.com/krish50507kumar/data-manager.git
cd data-manager
# Set up a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Installation
pip install -r requirements.txt
Quick Start
from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_analytics import DataAnalytics
storage = CSVStorage()
storage.read("data.csv")
analytics = DataAnalytics(storage)
print(analytics.summary())
Example
import pandas as pd

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_engineer import DataEngineer
from data_manager.jobs.data_analytics import DataAnalytics

# Placeholder: replace the empty DataFrame with your own data
df = pd.DataFrame()

mystorage = CSVStorage()
mystorage.store(df)

# Cleaning steps: each entry names a task, the function to call, and its parameters
mydataengineer = DataEngineer(mystorage)
dataengineercontext = [
    {
        "task": "removeDuplicates",
        "function": "removeDuplicates",
        "params": {}
    },
    {
        "task": "removeNull",
        "function": "removeNull",
        "params": {
            "method": "const",
            "num_const": 0,
            "category_const": "Unknown"
        }
    }
]
mydataengineer.run(dataengineercontext)

# Analytics steps run against the same storage object
mydataanalytics = DataAnalytics(mystorage)
dataanalyticscontext = [
    {
        "task": "Summary_of_the_data",
        "function": "summary",
        "params": {}
    },
    {
        "task": "Checking_name_column",
        "function": "column_stats",
        "params": {}
    },
    {
        "task": "Data_profile",
        "function": "profile",
        "params": {}
    },
    {
        "task": "grouping_name_with_salary",
        "function": "groupby_analysis",
        "params": {}
    }
]
mydataanalytics.run(dataanalyticscontext)

# Results are stored under each task name
# print(mydataanalytics.results.get("Summary_of_the_data"))
# print(mydataanalytics.results.get("Checking_name_column"))
# print(mydataanalytics.results.get("Data_profile"))

# Adjust the output path for your environment
mystorage.write(path="tests/Data/test_data_3.csv")
print("THE END")
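The context lists in the example suggest that a job looks up each "function" by name, calls it with "params", and stores the return value under "task". The sketch below shows one way such dispatch could work. `SketchJob` and `SketchEngineer` are hypothetical stand-ins, not the classes in base_job.py or data_engineer.py. Plain lists of dicts replace the DataFrame, and the sketch assumes that removeNull with method "const" fills missing values rather than dropping rows.

```python
class SketchJob:
    """Hypothetical base job: runs a list of task dicts against shared rows."""

    def __init__(self, rows):
        self.rows = rows
        self.results = {}

    def run(self, context):
        for step in context:
            # Resolve the handler by name on the job instance.
            handler = getattr(self, step["function"], None)
            if not callable(handler):
                raise ValueError(f"Unknown function: {step['function']!r}")
            # Store each return value under the task label.
            self.results[step["task"]] = handler(**step.get("params", {}))
        return self.results


class SketchEngineer(SketchJob):
    def removeDuplicates(self):
        seen, unique = set(), []
        for row in self.rows:
            key = tuple(sorted(row.items()))  # hashable fingerprint of the row
            if key not in seen:
                seen.add(key)
                unique.append(row)
        self.rows[:] = unique  # update the shared list in place
        return len(self.rows)

    def removeNull(self, method="const", num_const=0, category_const="Unknown"):
        if method != "const":
            raise NotImplementedError("this sketch only covers method='const'")
        # Assumption: a column is numeric if any of its values is a number.
        numeric = {
            column
            for row in self.rows
            for column, value in row.items()
            if isinstance(value, (int, float))
        }
        for row in self.rows:
            for column, value in row.items():
                if value is None:
                    row[column] = num_const if column in numeric else category_const
        return len(self.rows)


rows = [
    {"name": "a", "salary": 10},
    {"name": "a", "salary": 10},
    {"name": None, "salary": None},
]
job = SketchEngineer(rows)
job.run([
    {"task": "dedupe", "function": "removeDuplicates", "params": {}},
    {"task": "fill", "function": "removeNull", "params": {"method": "const"}},
])
```

Name-based lookup keeps the task list declarative, so a pipeline can be changed by editing data rather than code. The cost is that a misspelled function name only fails at run time, which is why the sketch raises a clear error for unknown names.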
Usage
To trigger the data pipelines, execute the main entry point:
python main.py
Testing
The framework utilizes pytest to ensure reliability across all modules. The test suite covers isolated unit tests for storage backends, validation logic, and analytics generation, as well as notebook-based integration tests.
To run the full test suite:
# Run all tests in the /tests directory
pytest
# Run tests with detailed verbose output
pytest -v
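For readers new to pytest, an isolated storage test could look roughly like the sketch below. The helpers `write_rows` and `read_rows` are hypothetical and use the standard library's csv module rather than the framework's CSV backend, whose test layout is not shown here. `tmp_path` is pytest's built-in fixture for a per-test temporary directory.

```python
import csv
from pathlib import Path


def write_rows(path: Path, rows: list[dict]) -> None:
    """Hypothetical helper: write rows to a CSV file with a header line."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path: Path) -> list[dict]:
    """Hypothetical helper: read a CSV file back into a list of dicts."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def test_csv_round_trip(tmp_path):
    # CSV stores text only, so the sample values are strings.
    rows = [{"name": "Asha", "salary": "100"}, {"name": "Ravi", "salary": "200"}]
    target = tmp_path / "people.csv"
    write_rows(target, rows)
    assert read_rows(target) == rows
```

Saved under the tests directory in a file whose name starts with `test_`, a function like this would be collected by the `pytest` commands above.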
Current Capabilities
| Component | Features |
|---|---|
| Storage | CSV, JSON, In-Memory |
| Engineering | Remove duplicates, Handle missing values |
| Validation | Schema validation, Nullability checks |
| Analytics | Summary, Profiling, Missing value analysis, Column statistics |
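To make the Validation row concrete, here is a small sketch of what schema validation and nullability checks can mean in practice. The `validate` function and its schema format (column name mapped to an expected type and a nullable flag) are assumptions for illustration and do not describe the API of data_validator.py.

```python
def validate(rows, schema):
    """Return a list of human-readable problems; an empty list means the data passed.

    schema maps column name -> (expected type, nullable flag).
    """
    errors = []
    for index, row in enumerate(rows):
        # Schema check: every declared column must be present.
        for column in sorted(schema.keys() - row.keys()):
            errors.append(f"row {index}: missing column {column!r}")
        for column, (expected_type, nullable) in schema.items():
            if column not in row:
                continue
            value = row[column]
            if value is None:
                # Nullability check: None is only allowed in nullable columns.
                if not nullable:
                    errors.append(f"row {index}: {column!r} is null")
            elif not isinstance(value, expected_type):
                errors.append(
                    f"row {index}: {column!r} expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return errors


schema = {"name": (str, False), "salary": (int, True)}
rows = [
    {"name": "a", "salary": 1},       # valid
    {"name": None, "salary": None},   # name may not be null
    {"name": "b", "salary": "x"},     # wrong type for salary
    {"name": "c"},                    # salary column missing
]
problems = validate(rows, schema)
```

Collecting all problems instead of stopping at the first one lets a single validation run report every issue in a dataset.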
Roadmap
- [x] CSV Backend
- [x] JSON Backend
- [x] In-Memory Backend
- [x] Data Validation
- [x] Data Analytics
- [ ] Excel Backend
- [ ] Parquet Backend
- [ ] SQL Backend
- [ ] Automated EDA Reports
Author
Krish Kumar
Download files
File details
Details for the file data_manager_framework-0.1.0.tar.gz.
File metadata
- Download URL: data_manager_framework-0.1.0.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72fc9b5eb0d8e617088d8ca6d71692194356adab0ae73f52d9532adc946d574d |
| MD5 | 62e4b9142cbc3537f5dced1a82461055 |
| BLAKE2b-256 | 42b4b02230a846b7cdf8825e4e2080ab798521f3cacc67f455630d2d02d0260e |
File details
Details for the file data_manager_framework-0.1.0-py3-none-any.whl.
File metadata
- Download URL: data_manager_framework-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a05abb2917f6c8b6414b19aa23d714f6e4295e33f3bdcd2de92274c43becefc4 |
| MD5 | ab5a0ed9855bc7842cb8787bda9cf157 |
| BLAKE2b-256 | 9a85e719fab6c2392f2764b3947d472e27aeec4f3d14d8e551945e757fc69a0b |