dsr-feature-eng-ml
Machine learning-specific feature engineering utilities, including models and evaluation tools.
This suite provides a high-fidelity framework for training, evaluating, and auditing machine learning models. It is designed to move beyond simple accuracy metrics, providing deep insights into model generalization, data drift, and hardware efficiency.
Version 1.2.0: This release adds new defaults and incremental improvements while remaining backward compatible with earlier 1.x releases.
Release scope: Regression workflows have been tested. Classification workflows are implemented but not yet tested; a follow-up release will expand validation and coverage.
Core Capabilities
- Automated Model Auditing: Orchestrates competitive sweeps across multiple model architectures with built-in hyperparameter tuning and cross-validation.
- Statistical Drift Analysis: Automatically calculates mean and standard deviation deltas, skewness, and kurtosis across train/val/test splits to identify data inconsistency.
- Intelligent Resampling: Features exact balancing strategies for classification tasks, ensuring minority and majority classes are perfectly aligned during training.
- Memory-Safe Operations: Includes predictive memory auditing to prevent Out-of-Memory (OOM) errors during large-scale tuning and a side-car serialization strategy for handling massive prediction arrays.
- Comprehensive Reporting: Generates multi-page PDF audit reports featuring leaderboards, deep-dive residual analysis, and feature importance visualizations.
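The exact-balancing idea mentioned under Intelligent Resampling can be illustrated with a small pandas sketch. This is not the library's API; `exact_balance` is a hypothetical helper showing the principle: every class is downsampled to the size of the smallest class so the counts match exactly.

```python
import pandas as pd

def exact_balance(df: pd.DataFrame, target: str, random_state: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=random_state)
             for _, grp in df.groupby(target)]
    return pd.concat(parts).reset_index(drop=True)

# 7 rows of class 0 vs. 3 rows of class 1 -> 3 rows of each after balancing
df = pd.DataFrame({"x": range(10), "y": [0] * 7 + [1] * 3})
balanced = exact_balance(df, "y")
print(balanced["y"].value_counts().to_dict())  # each class now has 3 rows
```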
Multi-Format Export Capabilities
The ModelAuditor provides a robust export engine via the ModelAuditSummary class, allowing audit results to be persisted in several formats for different downstream use cases:
- Audit PDF Report: Generates a high-fidelity visual document including the executive summary, model leaderboards, and deep-dive residual analysis.
- JOBLIB Snapshot: Persists the full, executable state of the ModelAuditSummary object. This includes a memory-optimized "side-car" process that detaches large prediction arrays during the write operation to ensure system stability before reattaching them.
- Excel Workbook: Creates a multi-sheet report containing the Audit Summary (metadata), the Leaderboard (performance results), an Anomaly Log (outlier data), and comprehensive Feature Metadata.
- JSON Payload: Exports a serializable, nested dictionary containing the complete audit snapshot, metadata, and per-model results, suitable for web integration or programmatic review.
- CSV Collection: Produces a set of tabular files for flat-file analysis, including distinct files for the leaderboard results, metadata summary, anomaly data, and dynamic feature context.
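The "side-car" strategy described for the JOBLIB snapshot can be sketched generically: detach the large prediction array, persist it separately, then reattach it so the live object is never left in a broken state. This is an illustration of the pattern, not the library's implementation; it uses `pickle` and `numpy` instead of joblib, and all names are hypothetical.

```python
import pickle
from pathlib import Path

import numpy as np

class AuditSummary:
    """Stand-in for a summary object carrying a large prediction array."""
    def __init__(self, predictions: np.ndarray):
        self.predictions = predictions

def save_with_sidecar(summary: AuditSummary, base: Path) -> None:
    # Detach the large array, write it to a side-car file, then reattach.
    preds, summary.predictions = summary.predictions, None
    try:
        np.save(f"{base}.preds.npy", preds)
        Path(f"{base}.pkl").write_bytes(pickle.dumps(summary))
    finally:
        summary.predictions = preds  # reattach so the live object stays intact

def load_with_sidecar(base: Path) -> AuditSummary:
    summary = pickle.loads(Path(f"{base}.pkl").read_bytes())
    summary.predictions = np.load(f"{base}.preds.npy")
    return summary
```

The `try/finally` matters: even if serialization fails mid-write, the in-memory object gets its array back.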
Audit Metrics Definitions
- Quality Score: A 0–100 metric assessing model stability; penalized if "cleaned" performance (after outlier removal) significantly diverges from raw performance.
- Drift Index: The percentage difference between training and validation target means, used to identify potential data shift.
- Generalization Gap: The absolute difference between training and validation scores (e.g., R² Gap); used to classify models as Well-Fit, Marginal, or Overfit.
- Efficiency: Measured in rows processed per second, providing context on model throughput relative to hardware resources.
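The drift and generalization metrics above reduce to simple formulas, sketched below. The Well-Fit/Marginal/Overfit thresholds used here are illustrative assumptions, not the library's actual cut-offs.

```python
def drift_index(train_mean: float, valid_mean: float) -> float:
    """Percentage difference between train and validation target means."""
    return abs(valid_mean - train_mean) / abs(train_mean) * 100.0

def generalization_gap(train_score: float, valid_score: float) -> float:
    """Absolute difference between train and validation scores (e.g. R²)."""
    return abs(train_score - valid_score)

def fit_label(gap: float, well_fit: float = 0.02, marginal: float = 0.05) -> str:
    # Thresholds are illustrative, not the library's defaults.
    if gap <= well_fit:
        return "Well-Fit"
    if gap <= marginal:
        return "Marginal"
    return "Overfit"

print(drift_index(100.0, 105.0))                  # 5.0
print(fit_label(generalization_gap(0.95, 0.90)))  # Marginal
```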
Installation
```shell
pip install dsr-feature-eng-ml
```
Quick Start
```python
import pandas as pd

from dsr_feature_eng_ml import DataSplits, ModelEvaluation

# Load your data
df = pd.read_csv('data.csv')

# Create data splits (with automatic scaling)
data_splits = DataSplits.from_data_source(
    src=df,
    features_to_include=['feature1', 'feature2', 'feature3'],
    target_column='target',
    test_size=0.2,
    valid_size=0.25,
    random_state=42,
    scale_features=True,
)

# Evaluate models
results = ModelEvaluation.evaluate_dataset(
    data_splits=data_splits,
    dtree_param_grid={'max_depth': [5, 10, 20]},
    rf_param_grid={'n_estimators': [50, 100]},
    lr_param_grid={'C': [0.1, 1.0, 10.0]},
    cv=5,
    n_iter=50,
    max_iter=1000,
    scoring='f1',
    n_jobs=-1,
    viable_f1_gap=0.01,
    report_title='Model Evaluation',
    perform_dtree_feature_selection=True,
    perform_rf_feature_selection=True,
)
```
Key Components
DataSplits
Manages train/validation/test splits with automatic feature scaling:
- Fits scaler on training data only (prevents data leakage)
- Transforms validation and test sets consistently
- Supports upsampling and downsampling for class imbalance
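The leakage-prevention rule, fit the scaler on the training split only and reuse its parameters everywhere else, can be shown with plain NumPy. This is a sketch of the principle, not DataSplits internals.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_valid = X[:80], X[80:]

# Fit scaling parameters on the training split only (prevents leakage) ...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma

# ... then apply the same parameters, unchanged, to the validation split.
X_valid_s = (X_valid - mu) / sigma
```

Fitting a second scaler on the validation or test split would let information from held-out data leak into the preprocessing step.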
ModelEvaluation
Orchestrates comprehensive model evaluation:
- Evaluates multiple model types in parallel
- Supports four balancing strategies
- Tracks best performing models
- Generates detailed evaluation reports
Model Classes
- DecisionTree: Decision Tree classifier with feature importance
- RandomForest: Random Forest classifier with ensemble methods
- LogisticRegression: Logistic Regression with convergence control
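For context on the "feature importance" the DecisionTree wrapper exposes: the underlying scikit-learn estimators report per-feature importances that sum to 1. The snippet below is a generic scikit-learn example, independent of this library's wrappers.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with 4 features, 2 informative
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)

clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# One importance per feature; the values sum to 1.0
importances = clf.feature_importances_
print(importances.round(3))
```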
Requirements
- Python >= 3.11
- dsr-utils >= 1.3.0
- dsr-data-tools >= 1.2.0
- numpy >= 2.4.4
- pandas >= 3.0.2
- scikit-learn >= 1.8.0
- matplotlib >= 3.10.8
- seaborn >= 0.13.2
Architecture
The library uses a modular approach:
- evaluation/: Core evaluation pipeline (DataSplits, ModelEvaluation, ModelResults)
- models/: Model implementations and hyperparameter tuning
- enums.py: Enumeration types for model states and configurations
- constants.py: Global configuration and defaults
Preferences and Overrides
You can override library defaults (like constants used in evaluation and reporting) without changing code in the library.
Precedence (highest to lowest)
1. Runtime override via set_pref()
2. Environment variables prefixed with DSR_FEML_
3. User config file in ~/.config/dsr-feature-eng-ml/config.toml or ~/Library/Application Support/dsr-feature-eng-ml/config.toml
4. Project-level ./dsr_feature_eng_ml.toml
5. In-library default value
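A minimal sketch of how such a precedence chain can be resolved. This is a hypothetical implementation, not the library's code; only the runtime and environment layers are shown, with the config-file layers stubbed out.

```python
import os

_RUNTIME_PREFS: dict = {}      # populated by set_pref()
_ENV_PREFIX = "DSR_FEML_"

def set_pref(name: str, value) -> None:
    _RUNTIME_PREFS[name] = value

def resolve_constant(name: str, default):
    """Check each layer in precedence order; the first hit wins."""
    if name in _RUNTIME_PREFS:                    # 1. runtime override
        return _RUNTIME_PREFS[name]
    env = os.environ.get(_ENV_PREFIX + name)      # 2. environment variable
    if env is not None:
        return env
    # 3./4. user and project config files would be consulted here
    return default                                # 5. in-library default
```

Note that environment values arrive as strings; a real implementation would also coerce them to the default's type.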
Examples
- Runtime (Python):

```python
from dsr_feature_eng_ml import set_pref

set_pref("REPORT_WIDTH", 120)
set_pref("SCORE_FORMAT", ".3f")
```

- Environment (shell):

```shell
export DSR_FEML_REPORT_WIDTH=120
export DSR_FEML_SCORE_FORMAT=.3f
export DSR_FEML_DEFAULT_ACCEPTABLE_GAP=0.03
```

- Config file (TOML):

```toml
[constants]
REPORT_WIDTH = 120
SCORE_FORMAT = ".3f"
DEFAULT_ACCEPTABLE_GAP = 0.03
```
How it works
- constants.py defines defaults and resolves effective values through the preferences system:

```python
from dsr_feature_eng_ml.preferences import resolve_constant

SCORE_FORMAT = resolve_constant("SCORE_FORMAT", ".4f")
REPORT_WIDTH = resolve_constant("REPORT_WIDTH", 100)
```

- Most code should continue to import these constants (e.g., from dsr_feature_eng_ml import REPORT_WIDTH).

Should I call resolve_constant() directly?
- No for typical usage: import constants as usual; they already reflect preferences at import time.
- Yes if you need late binding (e.g., to react to set_pref() after modules are imported). In that case, call get_pref("REPORT_WIDTH", 100) or resolve_constant("REPORT_WIDTH", 100) where you need the value.
This keeps defaults centralized while giving users clean override hooks at runtime, via environment, or via config files.
License
MIT License - see LICENSE file for details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file dsr_feature_eng_ml-1.2.0.tar.gz.
File metadata
- Download URL: dsr_feature_eng_ml-1.2.0.tar.gz
- Upload date:
- Size: 19.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3e81912607f7516e704d11059e6b1ae40a7bdb849ec1e50d7032768aa6e90d4a |
| MD5 | e9cdfdb508f4638531cc25ba6d4c2490 |
| BLAKE2b-256 | 7c8243c3efc027b126c35793f4388de4bf20af3705b35f55c8bc290f8d331d27 |
Provenance
The following attestation bundles were made for dsr_feature_eng_ml-1.2.0.tar.gz:
Publisher: python-publish.yml on scottroberts140/dsr-feature-eng-ml

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dsr_feature_eng_ml-1.2.0.tar.gz
- Subject digest: 3e81912607f7516e704d11059e6b1ae40a7bdb849ec1e50d7032768aa6e90d4a
- Sigstore transparency entry: 1278875681
- Sigstore integration time:
- Permalink: scottroberts140/dsr-feature-eng-ml@f71708506aa84f50eb18eed672e28f0f80d53834
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/scottroberts140
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@f71708506aa84f50eb18eed672e28f0f80d53834
- Trigger Event: release
File details
Details for the file dsr_feature_eng_ml-1.2.0-py3-none-any.whl.
File metadata
- Download URL: dsr_feature_eng_ml-1.2.0-py3-none-any.whl
- Upload date:
- Size: 16.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c9e4071379a369670ee62b452f65f4412dc809b529e8bbd4e2673b03d5a619db |
| MD5 | eefb8153f97cc081f336db6bf2076659 |
| BLAKE2b-256 | 7cfe7206bfeb196e81780ae6ca7da7b43e4018ec101d7a9798e57a963e070d13 |
Provenance
The following attestation bundles were made for dsr_feature_eng_ml-1.2.0-py3-none-any.whl:
Publisher: python-publish.yml on scottroberts140/dsr-feature-eng-ml

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dsr_feature_eng_ml-1.2.0-py3-none-any.whl
- Subject digest: c9e4071379a369670ee62b452f65f4412dc809b529e8bbd4e2673b03d5a619db
- Sigstore transparency entry: 1278875726
- Sigstore integration time:
- Permalink: scottroberts140/dsr-feature-eng-ml@f71708506aa84f50eb18eed672e28f0f80d53834
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/scottroberts140
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@f71708506aa84f50eb18eed672e28f0f80d53834
- Trigger Event: release