
Machine learning-specific feature engineering utilities including models and evaluation tools.


dsr-feature-eng-ml


Comprehensive machine learning model evaluation and feature engineering framework.

Version 1.0.0: This release is breaking and not backward-compatible with prior 0.x versions.

Release scope: Regression workflows have been tested. Classification workflows are implemented but not yet tested; a follow-up release will expand validation and coverage.

Features

  • Model Evaluation: Automatic hyperparameter tuning and model comparison for Decision Trees, Random Forests, and Logistic Regression
  • Data Balancing: Support for imbalanced dataset handling (upsampling, downsampling, balanced class weights)
  • Feature Importance: Automatic feature selection and importance ranking
  • Data Splitting: Intelligent train/validation/test splitting with automatic feature scaling
  • Result Tracking: Comprehensive model configuration and performance metrics tracking

Installation

pip install dsr-feature-eng-ml

Quick Start

import pandas as pd
from dsr_feature_eng_ml import DataSplits, ModelEvaluation

# Load your data
df = pd.read_csv('data.csv')

# Create data splits (with automatic scaling)
data_splits = DataSplits.from_data_source(
    src=df,
    features_to_include=['feature1', 'feature2', 'feature3'],
    target_column='target',
    test_size=0.2,
    valid_size=0.25,
    random_state=42,
    scale_features=True
)

# Evaluate models
results = ModelEvaluation.evaluate_dataset(
    data_splits=data_splits,
    dtree_param_grid={'max_depth': [5, 10, 20]},
    rf_param_grid={'n_estimators': [50, 100]},
    lr_param_grid={'C': [0.1, 1.0, 10.0]},
    cv=5,
    n_iter=50,
    max_iter=1000,
    scoring='f1',
    n_jobs=-1,
    viable_f1_gap=0.01,
    report_title='Model Evaluation',
    perform_dtree_feature_selection=True,
    perform_rf_feature_selection=True
)

Key Components

DataSplits

Manages train/validation/test splits with automatic feature scaling:

  • Fits scaler on training data only (prevents data leakage)
  • Transforms validation and test sets consistently
  • Supports upsampling and downsampling for class imbalance
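The leakage-safe scaling described above can be sketched with plain scikit-learn. This is a minimal illustration, not the library's internals: it assumes `valid_size` is applied to the data remaining after the test split, and the synthetic data and variable names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Carve off the test set first, then split the remainder into train/validation.
# Assumption: the validation fraction applies to the remainder, not the full data.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Fit the scaler on the training split only, so no statistics from the
# validation or test sets leak into preprocessing.
scaler = StandardScaler().fit(X_train)

# Apply the same fitted transform to every split.
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)
X_test_s = scaler.transform(X_test)
```

After this, only the training split is guaranteed to have zero mean and unit variance per feature; the validation and test splits are merely expressed in the training split's units, which is the intended behavior.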

ModelEvaluation

Orchestrates comprehensive model evaluation:

  • Evaluates multiple model types in parallel
  • Supports four balancing strategies
  • Tracks best performing models
  • Generates detailed evaluation reports
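For readers unfamiliar with the balancing techniques named under Features, the sketch below shows upsampling, downsampling, and balanced class weights using scikit-learn directly. It does not use this library's API; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Imbalanced toy data: 90 majority (class 0) and 10 minority (class 1) samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Upsampling: draw minority samples with replacement until classes match.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.concatenate([y_maj, y_min_up])

# Downsampling: draw a majority subset without replacement.
X_maj_down, y_maj_down = resample(
    X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=0
)
X_down = np.vstack([X_maj_down, X_min])
y_down = np.concatenate([y_maj_down, y_min])

# Balanced class weights: keep the data as is and reweight the loss instead.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Resampling is normally applied to the training split only, so that validation and test metrics still reflect the original class distribution.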

Model Classes

  • DecisionTree: Decision Tree classifier with feature importance
  • RandomForest: Random Forest classifier (an ensemble of decision trees)
  • LogisticRegression: Logistic Regression with convergence control
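The kind of tuning and comparison these model classes wrap can be approximated with scikit-learn alone. The sketch below is an assumption about the general approach (randomized search per model, then comparison on a validation split), not a description of this library's implementation; the parameter grids mirror the Quick Start, and everything else is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (
    ParameterGrid,
    RandomizedSearchCV,
    train_test_split,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration only.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# One (estimator, parameter grid) pair per model type.
candidates = {
    "dtree": (DecisionTreeClassifier(random_state=42), {"max_depth": [5, 10, 20]}),
    "rf": (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100]}),
    "lr": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
}

scores = {}
for name, (estimator, grid) in candidates.items():
    # Cap n_iter at the grid size so the search does not request more
    # candidates than exist.
    n_iter = min(50, len(ParameterGrid(grid)))
    search = RandomizedSearchCV(
        estimator,
        param_distributions=grid,
        n_iter=n_iter,
        cv=5,
        scoring="f1",
        random_state=42,
    )
    search.fit(X_train, y_train)
    # Score the refit best estimator on the held-out validation split.
    scores[name] = f1_score(y_valid, search.predict(X_valid))

best_name = max(scores, key=scores.get)
print(f"Best model by validation F1: {best_name} ({scores[best_name]:.3f})")
```

Comparing on a separate validation split, rather than on the cross-validation score alone, leaves the test set untouched for a final estimate.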

Requirements

  • Python >= 3.10
  • pandas
  • numpy
  • scikit-learn >= 1.5.0
  • seaborn >= 0.13.0
  • dsr-data-tools >= 1.0.0
  • dsr-utils >= 1.0.0

Architecture

The library uses a modular approach:

  • evaluation/: Core evaluation pipeline (DataSplits, ModelEvaluation, ModelResults)
  • models/: Model implementations and hyperparameter tuning
  • enums.py: Enumeration types for model states and configurations
  • constants.py: Global configuration and defaults

Preferences and Overrides

You can override library defaults (such as the constants used in evaluation and reporting) without modifying the library's code.

Precedence (highest to lowest)

  • Runtime override via set_pref()
  • Environment variables prefixed with DSR_FEML_
  • User config file: ~/.config/dsr-feature-eng-ml/config.toml (Linux) or ~/Library/Application Support/dsr-feature-eng-ml/config.toml (macOS)
  • Project-level ./dsr_feature_eng_ml.toml
  • In-library default value

Examples

  • Runtime (Python):
    from dsr_feature_eng_ml import set_pref
    set_pref("REPORT_WIDTH", 120)
    set_pref("SCORE_FORMAT", ".3f")
    
  • Environment (shell):
    export DSR_FEML_REPORT_WIDTH=120
    export DSR_FEML_SCORE_FORMAT=.3f
    export DSR_FEML_DEFAULT_ACCEPTABLE_GAP=0.03
    
  • Config file (TOML):
    [constants]
    REPORT_WIDTH = 120
    SCORE_FORMAT = ".3f"
    DEFAULT_ACCEPTABLE_GAP = 0.03
    

How it works

  • constants.py defines defaults and resolves effective values through the preferences system:
    from dsr_feature_eng_ml.preferences import resolve_constant
    SCORE_FORMAT = resolve_constant("SCORE_FORMAT", ".4f")
    REPORT_WIDTH = resolve_constant("REPORT_WIDTH", 100)
    
  • Most code should continue to import these constants (e.g., from dsr_feature_eng_ml import REPORT_WIDTH).

Should I call resolve_constant() directly?

  • No, for typical usage: import the constants as usual; they already reflect preferences at import time.
  • Yes, if you need late binding (e.g., to react to set_pref() calls made after modules are imported). In that case, call get_pref("REPORT_WIDTH", 100) or resolve_constant("REPORT_WIDTH", 100) at the point where you need the value.
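The difference between import-time and late binding is plain Python behavior and can be demonstrated without the library. In the sketch below, `set_pref_demo` and `get_pref_demo` are hypothetical stand-ins for `set_pref()` and `get_pref()`, backed by a simple dictionary.

```python
# Hypothetical in-memory preference store for the demonstration.
_prefs = {}


def set_pref_demo(name, value):
    _prefs[name] = value


def get_pref_demo(name, default):
    return _prefs.get(name, default)


# Early binding: the value is looked up once, when this line runs
# (the equivalent of a module-level constant resolved at import time).
REPORT_WIDTH = get_pref_demo("REPORT_WIDTH", 100)


def rule_early():
    # Uses the value captured above; later overrides are not seen.
    return "-" * REPORT_WIDTH


def rule_late():
    # Looks the value up on every call; later overrides are seen.
    return "-" * get_pref_demo("REPORT_WIDTH", 100)


# An override made after the constant was resolved.
set_pref_demo("REPORT_WIDTH", 120)

early_len = len(rule_early())
late_len = len(rule_late())
print(early_len, late_len)  # prints "100 120"
```

The same reasoning explains why overrides made through environment variables or config files need no special handling: they are already in place when the constants are first resolved.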

This keeps defaults centralized while giving users clean override hooks at runtime, via environment, or via config files.

License

MIT License - see LICENSE file for details


