
A reusable data science toolkit for production-ready pipelines


deepsim-dskit - A Reusable Data Science Framework

deepsim-dskit is an installable Python package for reproducible, configuration-driven data science pipelines. It provides reusable building blocks for loading data, preprocessing, splitting, modeling, artifact management, and experiment runs.

Installation

From a source checkout, install in editable mode with the development extras:

pip install -e ".[dev]"

Optional extras:

pip install "deepsim-dskit[polars]"
pip install "deepsim-dskit[yaml]"

Quick Start

from dskit import load_dataset, create_split

df = load_dataset("data/advertising.csv", index_col=0)
split = create_split(df, target="sales", test_size=0.2, random_state=42)
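The random_state argument is what makes a split repeatable: the same seed yields the same partition on every run. The sketch below illustrates that idea using only the standard library. It is not dskit's implementation, and seeded_split is a hypothetical helper introduced here for illustration.

```python
import random

def seeded_split(n_rows, test_size=0.2, random_state=42):
    """Return (train_idx, test_idx) for n_rows rows; hypothetical helper."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(random_state).shuffle(indices)
    n_test = round(n_rows * test_size)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_a, test_a = seeded_split(10)
train_b, test_b = seeded_split(10)
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```

Changing random_state reshuffles the indices and will generally produce a different partition, which is why the seed belongs in the experiment config.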

Run a full experiment from a config dictionary:

from dskit import run_full_pipeline

config = {
    "experiment_id": "advertising_baseline",
    "seed": 42,
    "data": {
        "path": "data/advertising.csv",
        "target": "sales",
        "read_kwargs": {"index_col": 0},
    },
    "splitting": {"test_size": 0.2, "val_size": 0.1, "random_state": 42},
    "preprocessing": {
        "missing": {"strategies": {}, "indicator_columns": []},
        "outliers": {"columns": [], "method": "iqr", "multiplier": 1.5},
        "scaling": {"columns": ["TV", "radio", "newspaper"], "method": "standard"},
    },
    "models": {
        "linear": {"class": "LinearRegression", "params": {}},
        "ridge": {"class": "Ridge", "params": {"alpha": 1.0}},
    },
    "output": {
        "experiments_dir": "experiments",
        "registry_path": "registry/experiments.json",
    },
}

result = run_full_pipeline(config)
print(result["best_model_name"])
print(result["metrics"]["test_r2"])

The output block may also include logs_dir (defaults to "logs").
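Because the whole run is driven by one dictionary, a quick sanity check before launching can catch typos early. The following is a minimal sketch of such a check, not dskit's own validation: check_config is a hypothetical helper, the list of required sections is inferred from the example config, and treating test_size and val_size as fractions of the full dataset is an assumption.

```python
def check_config(config):
    """Return a list of problems found in a pipeline config (hypothetical check)."""
    problems = []
    # Top-level sections taken from the example config; assumed to be required.
    for key in ("experiment_id", "data", "splitting", "models", "output"):
        if key not in config:
            problems.append(f"missing section: {key}")
    split = config.get("splitting", {})
    held_out = split.get("test_size", 0.0) + split.get("val_size", 0.0)
    # Assumption: both sizes are fractions of the full dataset.
    if not 0.0 < held_out < 1.0:
        problems.append(f"test_size + val_size must be in (0, 1), got {held_out}")
    return problems

example = {
    "experiment_id": "advertising_baseline",
    "data": {"path": "data/advertising.csv", "target": "sales"},
    "splitting": {"test_size": 0.2, "val_size": 0.1},
    "models": {"linear": {"class": "LinearRegression", "params": {}}},
    "output": {"experiments_dir": "experiments"},
}
print(check_config(example))  # []
```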

CLI

dskit-run --version
dskit-run --config configs/advertising.json --dry-run
dskit-run --config configs/advertising.json --env production
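The CLI reads its configuration from a JSON file. Assuming that file mirrors the dictionary passed to run_full_pipeline (the schema of the file is not stated here), a config can be written with the standard json module. write_config is a hypothetical helper, and the shortened config is for illustration only.

```python
import json
import tempfile
from pathlib import Path

def write_config(config, path):
    """Serialize a config dict to a JSON file, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path

config = {
    "experiment_id": "advertising_baseline",
    "seed": 42,
    "data": {"path": "data/advertising.csv", "target": "sales"},
}

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as root:
    written = write_config(config, Path(root) / "configs" / "advertising.json")
    round_trip = json.loads(written.read_text(encoding="utf-8"))

print(round_trip == config)  # True
```

Keeping the config in a file under version control makes it easier to reproduce a run from the exact settings that produced it.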

What's Included

Module               Purpose
data_io              Load, validate, and save datasets
eda                  Exploratory summaries
preprocessing        Imputation, outlier treatment, scaling
splitting            Reproducible train/test/validation splits
pipeline             Fit/transform preprocessing pipeline
feature_engineering  Encoding and feature construction
modeling             Training, evaluation, and ModelRegistry
persistence          Save and load artifacts
artifacts            Experiment artifacts and registry helpers
reproducibility      Config-driven experiment execution
config               Config validation and environment profiles
performance          Profiling and optimization helpers
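To make the preprocessing options in the example config concrete ("method": "iqr" with "multiplier": 1.5, and "method": "standard"), here is a standard-library sketch of the usual definitions of IQR fences and standard scaling. It is an illustration under assumptions, not dskit's code: iqr_bounds and standard_scale are hypothetical helpers, the data is made up, and the quartile convention of statistics.quantiles may differ from the one the package uses.

```python
import statistics

def iqr_bounds(values, multiplier=1.5):
    """Return (lower, upper) fences: Q1 - m*IQR and Q3 + m*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def standard_scale(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

spend = [10.0, 12.0, 11.0, 13.0, 12.0, 95.0]  # made-up values; 95.0 is an outlier
low, high = iqr_bounds(spend)
kept = [v for v in spend if low <= v <= high]  # drop values outside the fences
scaled = standard_scale(kept)
print(kept)
```

A constant column has zero standard deviation, so a real implementation would need to guard against division by zero before scaling.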

License

MIT License. See LICENSE.

Author

Shouke Wei, PhD · Deepsim Press Author Page
Affiliation: Deepsim Intelligence Technology Inc. (deepsim.ca)


