Data Manager

A modular Python framework for data engineering, validation, and analytics workflows.

DataManager uses a job-based architecture and pluggable storage backends to provide a structured approach for processing tabular datasets.

License: MIT

Features

  • Modular Storage Backends: Decoupled storage interfaces allowing seamless switching between In-Memory, CSV, and JSON data layers.
  • Extensible Job Architecture: Abstracted base classes (base_job.py) that make it easy to write custom logic for engineering, validation, and analytics.
  • Comprehensive Test Suite: Test-driven design featuring deep unit testing and interactive Jupyter Notebooks for debugging and validation.

System Architecture

The framework is organized into the following layers:

  1. Storage Layer (src/data_manager/storage/)
       • base.py: The abstract base class defining standard data operations.
       • csv_backend.py & json_backend.py: File-based storage implementations.
       • In_memory_backend.py: RAM-based storage for fast, temporary data transformations.
  2. Execution Layer (src/data_manager/jobs/)
       • data_engineer.py: Logic for data cleaning, transformation, and feature engineering.
       • data_validator.py: Rules and assertions to ensure data quality and schema integrity.
       • data_analytics.py: Calculation of metrics, aggregations, and business logic.
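
The idea of a decoupled storage layer can be pictured with a minimal sketch. This is not the package's actual code: the class names `StorageBackend`, `InMemoryBackend`, and `CsvBackend` are hypothetical, and rows are held as plain dictionaries using only the standard library. The method names `read`, `store`, and `write` echo the calls used in the examples in this README, but the signatures shown here are assumptions.

```python
import csv
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical interface: every backend holds rows and can load/save them."""

    def __init__(self):
        self.rows = []  # list of dicts, one per record

    def store(self, rows):
        """Keep a copy of the given rows in the backend."""
        self.rows = [dict(r) for r in rows]

    @abstractmethod
    def read(self, path):
        ...

    @abstractmethod
    def write(self, path):
        ...


class InMemoryBackend(StorageBackend):
    """Keeps data in RAM only; read and write do no file I/O in this sketch."""

    def read(self, path=None):
        return self.rows

    def write(self, path=None):
        return None


class CsvBackend(StorageBackend):
    """File-based backend; note that csv reads every value back as a string."""

    def read(self, path):
        with open(path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        return self.rows

    def write(self, path):
        fieldnames = list(self.rows[0]) if self.rows else []
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(self.rows)


def count_rows(storage: StorageBackend) -> int:
    """A consumer that depends only on the interface, not on a concrete backend."""
    return len(storage.rows)
```

Because `count_rows` only touches the shared interface, either backend can be passed to it unchanged, which is the property that makes backends swappable.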

Directory Structure

📦 data-manager
 ┣ 📂 Notebooks                 # Exploratory data analysis and experimental workflows
 ┃ ┗ 📜 notes_1.ipynb
 ┣ 📂 src
 ┃ ┗ 📂 data_manager
 ┃   ┣ 📂 config
 ┃   ┣ 📂 core
 ┃   ┃ ┗ 📜 base_job.py         # Abstract base class for all pipeline jobs
 ┃   ┣ 📂 jobs
 ┃   ┃ ┣ 📜 data_analytics.py   # Analytics and metrics processing
 ┃   ┃ ┣ 📜 data_engineer.py    # ETL and transformation logic
 ┃   ┃ ┗ 📜 data_validator.py   # Data quality assurance
 ┃   ┣ 📂 storage
 ┃   ┃ ┣ 📜 In_memory_backend.py
 ┃   ┃ ┣ 📜 base.py             # Storage interface definitions
 ┃   ┃ ┣ 📜 csv_backend.py
 ┃   ┃ ┗ 📜 json_backend.py
 ┃   ┗ 📜 runner.py
 ┣ 📂 tests                     # Unit and integration tests (Pytest)
 ┃ ┣ 📂 Data                    # Mock datasets for testing pipelines
 ┃ ┣ 📂 jobs                    # Tests for specific job implementations
 ┃ ┗ 📂 storage                 # Tests for storage backends
 ┣ 📜 main.py                   # Application entry point
 ┣ 📜 conftest.py               # Pytest configuration and fixtures
 ┗ 📜 pytest.ini                # Pytest environment settings

Getting Started

Prerequisites

Ensure you have Python 3.x installed. It is recommended to use a virtual environment.

# Clone the repository
git clone https://github.com/krish50507kumar/data-manager.git
cd data-manager

# Set up a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Installation

pip install -r requirements.txt

Quick Start

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_analytics import DataAnalytics

storage = CSVStorage()
storage.read("data.csv")

analytics = DataAnalytics(storage)

print(analytics.summary())

Example

import pandas as pd

from data_manager.storage.csv_backend import CSVStorage
from data_manager.jobs.data_engineer import DataEngineer
from data_manager.jobs.data_analytics import DataAnalytics

# Placeholder: replace this empty frame with the data you want to process
df = pd.DataFrame()
mystorage = CSVStorage()
mystorage.store(df)

mydataengineer = DataEngineer(mystorage)

dataengineercontext = [
    {
        "task":"removeDuplicates",
        "function":"removeDuplicates",
        "params":{}
    },
    {
        "task":"removeNull",
        "function":"removeNull",
        "params":{
            "method":"const",
            "num_const":0,
            "category_const":"Unknown"
        }
    }
]

mydataengineer.run(dataengineercontext)

mydataanalytics = DataAnalytics(mystorage)

dataanalyticscontext = [
    {
        "task":"Summary_of_the_data",
        "function":"summary",
        "params":{}
    },
    {
        "task":"Checking_name_column",
        "function":"column_stats",
        "params":{
        }
    },
    {
        "task":"Data_profile",
        "function":"profile",
        "params":{}
    },
    {
        "task":"grouping_name_with_salary",
        "function":"groupby_analysis",
        "params":{}
    }
]

mydataanalytics.run(dataanalyticscontext)

# Inspect individual results by task name:
# print(mydataanalytics.results.get("Summary_of_the_data"))
# print(mydataanalytics.results.get("Checking_name_column"))
# print(mydataanalytics.results.get("Data_profile"))

mystorage.write(path = "D:\\workspace\\Dev tools\\PythonProjects\\DataManager\\tests\\Data\\test_data_3.csv")

print("THE END")
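
The task lists above follow a common pattern: each entry names a label (`task`), a method to call (`function`), and keyword arguments (`params`). The sketch below shows one way such a list could be dispatched. It is an illustration only, not the package's implementation: `BaseJob` and `RowCounter` here are hypothetical classes, the storage is a plain list of dictionaries, and the `results` attribute is assumed from the commented-out lines in the example.

```python
class BaseJob:
    """Hypothetical base class: runs a list of task dicts against its own methods."""

    def __init__(self, storage):
        self.storage = storage
        self.results = {}

    def run(self, context):
        for step in context:
            handler = getattr(self, step["function"], None)
            if not callable(handler):
                raise ValueError(f"Unknown function: {step['function']!r}")
            # Store each result under the task label so it can be looked up later.
            self.results[step["task"]] = handler(**step.get("params", {}))
        return self.results


class RowCounter(BaseJob):
    """Toy job used only to exercise the dispatcher."""

    def count(self):
        return len(self.storage)

    def count_where(self, column, value):
        return sum(1 for row in self.storage if row.get(column) == value)


rows = [{"name": "Ann", "dept": "ops"}, {"name": "Bo", "dept": "dev"}]
job = RowCounter(rows)
job.run([
    {"task": "total", "function": "count", "params": {}},
    {"task": "ops_only", "function": "count_where",
     "params": {"column": "dept", "value": "ops"}},
])
print(job.results)  # {'total': 2, 'ops_only': 1}
```

Keeping the pipeline as data means steps can be reordered or loaded from a config file without touching the job classes; the trade-off is that a misspelled function name only fails at run time.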

Usage

To trigger the data pipelines, execute the main entry point:

python main.py

Testing

The framework uses pytest to check each module. The test suite covers isolated unit tests for storage backends, validation logic, and analytics generation, as well as notebook-based integration tests.

To run the full test suite:

# Run all tests in the /tests directory
pytest

# Run tests with detailed verbose output
pytest -v
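
A storage test in this style might look like the sketch below. The helper `round_trip_csv` is a hypothetical stand-in written with the standard library so the example runs on its own; a real test would exercise the package's CSV backend instead. `tmp_path` is pytest's built-in temporary-directory fixture.

```python
import csv
from pathlib import Path


def round_trip_csv(rows, path):
    """Hypothetical stand-in for a backend: write rows to CSV, then read them back."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def test_csv_round_trip_preserves_rows(tmp_path):
    # Values are strings because the csv module does not infer types on read.
    rows = [{"name": "Ann", "salary": "100"}, {"name": "Bo", "salary": "200"}]
    assert round_trip_csv(rows, tmp_path / "sample.csv") == rows
```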

Current Capabilities

Component   | Features
Storage     | CSV, JSON, In-Memory
Engineering | Remove duplicates, Handle missing values
Validation  | Schema validation, Nullability checks
Analytics   | Summary, Profiling, Missing value analysis, Column statistics
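
The Validation entry above mentions schema validation and nullability checks. As a rough illustration of what such checks involve, the sketch below validates rows against a small schema. Everything in it is an assumption made for the example: the function `validate_rows`, the schema format (column name mapped to an expected type and a nullable flag), and the use of plain dictionaries are not taken from the package.

```python
def validate_rows(rows, schema):
    """Return a list of human-readable problems; an empty list means the data passed.

    schema maps column name -> (expected_type, nullable).
    """
    problems = []
    for index, row in enumerate(rows):
        for column, (expected_type, nullable) in schema.items():
            if column not in row:
                problems.append(f"row {index}: missing column {column!r}")
                continue
            value = row[column]
            if value is None:
                # Nullability check: None is only allowed in nullable columns.
                if not nullable:
                    problems.append(f"row {index}: {column!r} must not be null")
            elif not isinstance(value, expected_type):
                # Schema check: non-null values must match the declared type.
                problems.append(
                    f"row {index}: {column!r} expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems


schema = {"name": (str, False), "salary": (int, True)}
rows = [
    {"name": "Ann", "salary": 100},
    {"name": None, "salary": None},
    {"name": "Bo", "salary": "high"},
]
for problem in validate_rows(rows, schema):
    print(problem)
```

Collecting all problems instead of raising on the first one lets a pipeline report every issue in a dataset in a single pass.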

Roadmap

  • CSV Backend (done)
  • JSON Backend (done)
  • In-Memory Backend (done)
  • Data Validation (done)
  • Data Analytics (done)
  • Excel Backend (planned)
  • Parquet Backend (planned)
  • SQL Backend (planned)
  • Automated EDA Reports (planned)

Author

Krish Kumar

Download files

Source Distribution

data_manager_framework-0.1.0.tar.gz (11.4 kB view details)

Uploaded Source

Built Distribution

data_manager_framework-0.1.0-py3-none-any.whl (13.2 kB view details)

Uploaded Python 3

File details

Details for the file data_manager_framework-0.1.0.tar.gz.

File metadata

  • Download URL: data_manager_framework-0.1.0.tar.gz
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for data_manager_framework-0.1.0.tar.gz
Algorithm   | Hash digest
SHA256      | 72fc9b5eb0d8e617088d8ca6d71692194356adab0ae73f52d9532adc946d574d
MD5         | 62e4b9142cbc3537f5dced1a82461055
BLAKE2b-256 | 42b4b02230a846b7cdf8825e4e2080ab798521f3cacc67f455630d2d02d0260e

File details

Details for the file data_manager_framework-0.1.0-py3-none-any.whl.

File hashes

Hashes for data_manager_framework-0.1.0-py3-none-any.whl
Algorithm   | Hash digest
SHA256      | a05abb2917f6c8b6414b19aa23d714f6e4295e33f3bdcd2de92274c43becefc4
MD5         | ab5a0ed9855bc7842cb8787bda9cf157
BLAKE2b-256 | 9a85e719fab6c2392f2764b3947d472e27aeec4f3d14d8e551945e757fc69a0b
