DSTools: Data Science Tools Library

DSTools: Data Science Research Toolkit

Short intro
Key Features
Installation
Function Overview
Examples of usage
Available Tools
Authors
Contributing
References
License

Short intro

DSTools is a Python library designed to assist data scientists and researchers by providing a collection of helpful functions for various stages of a data science project, from data exploration and preprocessing to model evaluation and synthetic data generation.

The library is built upon the author's extensive multi-decade experience (30+ years) in data science, statistical modeling, and enterprise software development. Drawing from real-world challenges encountered across diverse industries including finance, banking, healthcare, insurance, and e-commerce, this toolkit addresses common pain points that practitioners face daily in their analytical workflows.

The development philosophy emphasizes practical utility over theoretical complexity, incorporating battle-tested patterns and methodologies that have proven effective in production environments. Each function and module reflects lessons learned from managing large-scale data projects, optimizing computational performance, and ensuring code maintainability in collaborative team settings.

The library encapsulates best practices developed through years of consulting work, academic research collaborations, and hands-on problem-solving in high-stakes business environments. It represents a distillation of proven techniques, streamlined workflows, and robust error-handling approaches that have evolved through countless iterations and real-world applications.

This toolkit serves as a bridge between theoretical data science concepts and practical implementation needs. It offers developers and researchers a foundation built on decades of field-tested expertise and refined continuously based on community feedback and emerging industry requirements. Its helper functions accelerate and simplify the various stages of the data science research cycle.

This toolkit is built on top of popular libraries like Pandas, Polars, Scikit-learn, Optuna, and Matplotlib, providing a higher-level API for common tasks in Exploratory Data Analysis (EDA), feature preprocessing, model evaluation, and synthetic data generation. It is designed for data scientists, analysts, and researchers who want to write cleaner, more efficient, and more reproducible code.

Key Features

Advanced Data Analysis: Get quick and detailed statistics for numerical and categorical columns.
Powerful Visualizations: Generate insightful correlation matrices and confusion matrices with a single function call.
Comprehensive Model Evaluation: Calculate a wide range of classification metrics and visualize performance curves effortlessly.
Synthetic Data Generation: Create datasets with specific statistical properties (mean, median, std, skew, kurtosis) for robust testing and simulation. Create complex numerical distributions matching specific statistical moments (generate_distribution, generate_distribution_from_metrics).
Efficient Preprocessing: Encode categorical variables, handle outliers, and create features from missing values.
Utility Functions: A collection of helpers for stationarity testing, data validation, and file I/O operations.
Data Exploration: Quickly get statistics for numerical and categorical features (describe_numeric, describe_categorical), check for missing values (check_NINF), and visualize correlations (corr_matrix).
Model Evaluation: Comprehensive classification model evaluation (evaluate_classification, compute_metrics) with clear visualizations (plot_confusion_matrix).
Data Preprocessing: Encode categorical variables (labeling), handle outliers (remove_outliers_iqr), and scale features (min_max_scale).
Time Series Analysis: Test for stationarity using the Dickey-Fuller test (test_stationarity).
Advanced Statistics: Calculate non-parametric correlation (chatterjee_correlation), entropy, and KL-divergence.
Utilities: Save/load DataFrames to/from ZIP archives, generate random alphanumeric codes, and more.
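
To make the "Advanced Statistics" item more concrete, here is a minimal NumPy sketch of Shannon entropy and KL-divergence for discrete distributions. The function names shannon_entropy and kl_divergence and the base parameter are illustrative assumptions; they are not the library's own signatures.

```python
import numpy as np


def shannon_entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()              # normalize so the probabilities sum to 1
    nz = p > 0                   # 0 * log(0) is treated as 0
    return float(-np.sum(p[nz] * np.log(p[nz])) / np.log(base))


def kl_divergence(p, q, base=2.0):
    """KL divergence D(p || q); infinite if q is zero where p is not."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    nz = p > 0
    if np.any(q[nz] == 0):
        return float("inf")
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])) / np.log(base))


fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(shannon_entropy(fair))        # a fair coin carries 1 bit
print(kl_divergence(fair, fair))    # identical distributions give 0
print(kl_divergence(biased, fair))  # positive, because the distributions differ
```

KL-divergence is not symmetric, so kl_divergence(p, q) and kl_divergence(q, p) generally differ.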

What's New in Version 2.0.0

This version marks a major architectural refactoring of the library, focusing on modularity, performance, and advanced ML features.

✨ Modular Design: The toolkit is now re-organized into logical namespaces. Instead of a single flat API, you now access functionality through tools.metrics, tools.distance, etc.
🚀 High-Performance Backends: Major functions in metrics and distance now automatically leverage GPU acceleration (CuPy) and parallel CPU execution (Numba) for significant speedups on large datasets.
🤖 Gradient Calculation: Key loss functions (like mse, mae, huber_loss) can now return their gradients (return_grad=True), making them suitable for custom training loops in ML frameworks.
📈 Training Monitoring: A new real-time monitoring system has been added to the metrics module to track and plot metrics during model training.
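
The automatic backend choice described above can be pictured with the following sketch. It only checks whether a module is importable, in order of preference; pick_backend is a hypothetical helper, and the library's actual selection logic (for example, checking that a GPU is really present) may differ.

```python
from importlib.util import find_spec


def pick_backend(candidates=("cupy", "numba", "numpy")):
    """Return the first candidate module name that can be imported."""
    for name in candidates:
        # find_spec returns None when a top-level module is not installed
        if find_spec(name) is not None:
            return name
    raise ImportError(f"None of the backends are installed: {candidates}")


print(pick_backend())  # e.g. "numpy" on a machine without CuPy and Numba
```

Keeping the preference order in one tuple makes it easy to force a specific backend in tests by passing a single-element tuple.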

TODO & Future Plans

This library is actively maintained and will be expanded to cover more aspects of the daily data science workflow. The focus remains on providing high-performance, easy-to-use tools for common and resource-intensive tasks.

Here is the development roadmap:

Expand Core Modules:
- Add more loss functions and metrics to tools.metrics (e.g., for classification and clustering).
- Implement more distance measures in tools.distance (e.g., Levenshtein for strings, Silhouette, etc.).
New Preprocessing Module:
- Develop high-performance feature scaling and encoding functions.
- Add utilities for handling time-series data.
New Visualization Module:
- Create simple wrappers around Matplotlib/Seaborn for common plots (e.g., feature distribution, ROC curves).
Community & Contributions:
- Improve documentation with more examples.
- Create contribution guidelines (CONTRIBUTING.md).

Your feature requests and contributions are highly encouraged! Please open an issue to suggest a new function.

Installation

Install dscience-tools directly from PyPI:

pip install dscience-tools

Alternatively, install from source as follows.

Clone the Repository

git clone https://github.com/s-kav/ds_tools.git

Navigate to the Project Directory

cd ds_tools

Install Dependencies

Ensure you have Python version 3.8 or higher and install the required packages:

pip install -r requirements.txt
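
After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a small sketch; installed_version is a hypothetical helper, and it relies only on importlib.metadata, which is available from Python 3.8.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "dscience-tools" is the PyPI distribution name; the import name is ds_tools.
print(installed_version("dscience-tools"))
```

A result of None usually means the package was installed into a different environment than the one running the script.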

Function Overview

The library provides a wide range of functions. To see a full, formatted list of available tools, you can use the function_list method:

from ds_tools import DSTools

tools = DSTools()
tools.function_list()

Examples of usage

Here are some simple examples of how to use this library.

Using the Metrics Module

Calculate Mean Absolute Error and its gradient. The best backend (GPU/Numba/NumPy) is chosen automatically.

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative target values
y_pred = np.array([1.1, 2.2, 2.8, 4.3])

# Calculate only the loss value
mae_loss = tools.metrics.mae(y_true, y_pred)
print(f"MAE Loss: {mae_loss:.4f}")

# Calculate both loss and its gradient
loss, grad = tools.metrics.mae(y_true, y_pred, return_grad=True)
print(f"Gradient: {grad}")
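
For cross-checking, the sketch below computes MAE and MSE together with their gradients with respect to y_pred in plain NumPy. The y_true values are assumed for illustration, and the gradients are normalized by the number of samples; the library may use a different convention, so treat this as a reference calculation rather than its implementation.

```python
import numpy as np


def mae_with_grad(y_true, y_pred):
    """Mean absolute error and its subgradient with respect to y_pred."""
    diff = y_pred - y_true
    loss = np.mean(np.abs(diff))
    grad = np.sign(diff) / diff.size   # sign is 0 where the prediction is exact
    return loss, grad


def mse_with_grad(y_true, y_pred):
    """Mean squared error and its gradient with respect to y_pred."""
    diff = y_pred - y_true
    loss = np.mean(diff ** 2)
    grad = 2.0 * diff / diff.size
    return loss, grad


y_true = np.array([1.0, 2.0, 3.0, 4.0])   # assumed targets
y_pred = np.array([1.1, 2.2, 2.8, 4.3])

loss, grad = mae_with_grad(y_true, y_pred)
print(f"MAE: {loss:.4f}")   # 0.2000
print(grad)                 # [ 0.25  0.25 -0.25  0.25]
```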

Using the Distance Module

u = np.array([1.0, 2.0, 3.0])  # illustrative vector
v = np.array([4.0, 5.0, 6.0])  # illustrative vector

euc_dist = tools.distance.euclidean(u, v)
print(f"Euclidean Distance: {euc_dist:.4f}")
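
As a point of comparison, several of the vector-to-vector measures can be written directly in NumPy. The vectors below are made up for illustration, and this sketch does not go through the library's API.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

euclidean = np.linalg.norm(u - v)      # sqrt(3^2 + 4^2 + 0^2) = 5.0
manhattan = np.sum(np.abs(u - v))      # 3 + 4 + 0 = 7.0
chebyshev = np.max(np.abs(u - v))      # max(3, 4, 0) = 4.0
cosine_similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(f"Euclidean: {euclidean:.4f}")
print(f"Manhattan: {manhattan:.4f}")
print(f"Chebyshev: {chebyshev:.4f}")
print(f"Cosine similarity: {cosine_similarity:.4f}")
```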

Real-time Training Monitoring

# 1. Start monitoring
tools.metrics.start_monitoring()

# 2. Simulate training loop
for epoch in range(10):
    # Dummy loss values
    loss = 1 / (epoch + 1)
    val_loss = 1.2 / (epoch + 1) + np.random.rand() * 0.1
    
    # Update history at the end of each epoch
    tools.metrics.update(epoch, logs={'loss': loss, 'val_loss': val_loss})

# 3. Get history as a DataFrame or plot it
history_df = tools.metrics.get_history_df()
print(history_df)

tools.metrics.plot_history()
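
Conceptually, this kind of monitoring amounts to appending one row of metrics per epoch and exposing the rows as a DataFrame. The class below is a hypothetical, stripped-down illustration of that idea using pandas; it is not the library's monitoring code and it omits plotting.

```python
import pandas as pd


class HistoryTracker:
    """Hypothetical minimal tracker that stores one row of metrics per epoch."""

    def __init__(self):
        self._rows = []

    def update(self, epoch, logs):
        # Build a new dict so later changes to `logs` do not alter the history
        self._rows.append({"epoch": epoch, **logs})

    def to_frame(self):
        return pd.DataFrame(self._rows).set_index("epoch")


tracker = HistoryTracker()
for epoch in range(3):
    tracker.update(epoch, {"loss": 1 / (epoch + 1), "val_loss": 1.2 / (epoch + 1)})

history = tracker.to_frame()
print(history)
```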

Evaluate a classification model.

Tired of writing boilerplate code to see your model's performance? Use evaluate_classification for a complete summary.

import numpy as np
from ds_tools import DSTools

# 1. Initialize the toolkit
tools = DSTools()

# 2. Generate some dummy data
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_probs = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.2, 0.4, 0.7])

# 3. Get a comprehensive evaluation report
# This will print metrics and show plots for ROC and Precision-Recall curves.
results = tools.evaluate_classification(true_labels=y_true, pred_probs=y_probs)

# The results are also returned as a dictionary
print(f"\nROC AUC Score: {results['roc_auc']:.4f}")

This will produce:

A detailed printout of key metrics (Accuracy, ROC AUC, Average Precision, etc.).
A full classification report.
A confusion matrix.
Beautifully plotted ROC and Precision-Recall curves.

Example of classification metrics, report, and confusion matrix (at threshold = 0.7)

Example of precision vs recall and ROC (TPR vs FPR) curves

Example of classification metrics, report, and confusion matrix (at threshold = 0.5, for comparison)
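
To see how the threshold changes the confusion matrix, the sketch below recomputes ROC AUC and the confusion counts for the dummy data above in plain NumPy. It assumes that a score greater than or equal to the threshold is labeled positive; the library's own thresholding rule is not stated here and may differ. The helper names are illustrative.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_probs = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.2, 0.4, 0.7])


def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)


def confusion_counts(y_true, y_score, threshold):
    """Return (tn, fp, fn, tp), labeling scores >= threshold as positive."""
    y_pred = (y_score >= threshold).astype(int)
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    return tn, fp, fn, tp


print(roc_auc(y_true, y_probs))                # 1.0: every positive outranks every negative
print(confusion_counts(y_true, y_probs, 0.5))  # (4, 0, 0, 4)
print(confusion_counts(y_true, y_probs, 0.7))  # (4, 0, 1, 3): the 0.6 positive is now missed
```

Raising the threshold from 0.5 to 0.7 trades one true positive for a false negative while ROC AUC, which is threshold-independent, stays the same.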

Generating a Synthetic Distribution.

Need to create a dataset with specific statistical properties? generate_distribution_from_metrics can do that.

import numpy as np
from ds_tools import DSTools, DistributionConfig

tools = DSTools()

# Define the desired metrics
metrics_config = DistributionConfig(
    mean=1042,
    median=330,
    std=1500,
    min_val=1,
    max_val=120000,
    skewness=13.2,
    kurtosis=245, # Excess kurtosis
    n=10000
)

# Generate the data
generated_data = tools.generate_distribution_from_metrics(n=10000, metrics=metrics_config)

print(f"Generated Mean: {np.mean(generated_data):.2f}")
print(f"Generated Std: {np.std(generated_data):.2f}")

Comparative analysis of target statistical parameters against actual generated data results (scenario A)

Comparative analysis of target statistical parameters against actual generated data results (scenario B)
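
A comparison of target and actual statistics like the ones above can be reproduced with a few lines of NumPy. In this sketch, summarize is a hypothetical helper, skewness and excess kurtosis are computed from standardized moments with the population standard deviation, and the log-normal sample is only a stand-in for generated data.

```python
import numpy as np


def summarize(sample):
    """Descriptive statistics, including skewness and excess kurtosis."""
    x = np.asarray(sample, dtype=float)
    mean = x.mean()
    std = x.std()                 # population std (ddof=0), the np.std default
    z = (x - mean) / std          # standardized values
    return {
        "mean": float(mean),
        "median": float(np.median(x)),
        "std": float(std),
        "min": float(x.min()),
        "max": float(x.max()),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4) - 3.0),   # excess kurtosis
    }


rng = np.random.default_rng(0)
sample = rng.lognormal(mean=6.0, sigma=1.0, size=10_000)   # stand-in for generated data
targets = {"mean": 1042, "median": 330, "std": 1500}        # a subset of the targets above

actual = summarize(sample)
for name, target in targets.items():
    print(f"{name}: target={target}, actual={actual[name]:.2f}")
```

Heavy-tailed targets such as a skewness of 13.2 are sensitive to a handful of extreme values, so some gap between target and actual values is to be expected for finite samples.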

Correlation Matrix Heatmap

Visualize the relationships in your data with a highly customizable correlation matrix.

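import numpy as np
import pandas as pd
from ds_tools import CorrelationConfig  # assumed to be importable like DistributionConfig

# --- Sample Data ---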
data = {
    'feature_a': np.random.rand(100) * 100,
    'feature_b': np.random.rand(100) * 50 + 25,
    'feature_c': np.random.rand(100) * -80,
}
df = pd.DataFrame(data)
df['feature_d'] = df['feature_a'] * 1.5 + np.random.normal(0, 10, 100)

# --- Generate a Spearman correlation matrix ---
config = CorrelationConfig(build_method='spearman', font_size=12)
tools.corr_matrix(df, config=config)

This will display a publication-quality heatmap, masked to show only the lower triangle for clarity, using the Spearman correlation method.

Example of correlation matrix (by Pearson)

Example of correlation matrix (by Spearman)
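
For readers who want to see what sits behind such a heatmap, the sketch below computes a Spearman matrix with pandas (as the Pearson correlation of ranks) and masks the upper triangle. Masking the diagonal as well is an assumption about the plot, and no figure is drawn here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100)})
df["b"] = np.exp(df["a"])      # monotonic in a, but not linear
df["c"] = rng.random(100)      # unrelated noise

# Spearman correlation is the Pearson correlation of the ranks
spearman = df.rank().corr(method="pearson")

# Hide the diagonal and upper triangle, which only repeat the lower triangle
upper = np.triu(np.ones(spearman.shape, dtype=bool))
lower_only = spearman.mask(upper)
print(lower_only.round(2))
```

Because b is a monotonic function of a, their Spearman correlation is 1 even though the relationship is not linear.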

Detailed Categorical Analysis.

Quickly understand the distribution of your categorical features.

import pandas as pd

# --- Sample Data ---
data = {
    'city': ['London', 'Paris', 'London', 'New York', 'Paris', 'London'],
    'status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive', 'Active']
}
df = pd.DataFrame(data)

# --- Get stats for a column ---
tools.category_stats(df, 'city')

=========================== output
           city
     uniq_names amount_values  percentage
0        London             3       50.00
1         Paris             2       33.33
2      New York             1       16.67
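
The same numbers can be derived with plain pandas, which may help when checking the output. This sketch reuses the column names from the table above; it is an equivalent calculation, not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Paris", "London", "New York", "Paris", "London"],
})

counts = df["city"].value_counts()     # sorted by frequency, descending
stats = pd.DataFrame({
    "uniq_names": counts.index,
    "amount_values": counts.to_numpy(),
    "percentage": (counts.to_numpy() / len(df) * 100).round(2),
})
print(stats)
```

Note that value_counts skips missing values by default, so the percentages here are relative to all rows, including any rows with NaN.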

Plot confusion matrix.

Plots a confusion matrix graphically, which is especially useful for classification tasks.

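np.random.seed(42)
N_SAMPLES = 1500

# True labels for three classes (0, 1, 2)
y_true_multi = np.random.randint(0, 3, size=N_SAMPLES)
# Roughly 75% of the predictions are correct
correct_preds = np.random.rand(N_SAMPLES) < 0.75
# For wrong predictions, shift the label by 1 or 2 so it never equals the true class
random_errors = np.random.randint(1, 3, size=N_SAMPLES)
y_pred_multi = np.where(correct_preds, y_true_multi, (y_true_multi + random_errors) % 3)

tools.plot_confusion_matrix(
    y_true_multi,
    y_pred_multi,
    class_labels=['Cat', 'Dog', 'Bird'],
    title='Multi-Class Classification (Animals)',
    cmap='YlGnBu'
)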

Example of confusion matrix plotting (for binary classification)

Example of confusion matrix plotting (for multiclass classification)
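
The counts behind a multiclass confusion matrix plot can be computed with NumPy alone. The sketch below assumes that rows are true classes and columns are predicted classes, which is a common convention but is not confirmed for the library's plot.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated index pairs
    return cm


rng = np.random.default_rng(42)
n_samples = 1500
y_true = rng.integers(0, 3, size=n_samples)
is_correct = rng.random(n_samples) < 0.75
shift = rng.integers(1, 3, size=n_samples)   # 1 or 2, so a wrong label never equals the true one
y_pred = np.where(is_correct, y_true, (y_true + shift) % 3)

cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
print(f"Accuracy: {np.trace(cm) / cm.sum():.3f}")
```

The diagonal holds the correct predictions, so the trace divided by the total gives the accuracy.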

Example of benchmarking for MAE implementation

Example of benchmarking for RMSE implementation

Example of benchmarking for R2 implementation

The full code base for testing the other functions can be found here.

Available Tools

The library is now organized into logical modules. Here is an overview of the available toolkits:

Core Toolkit (`tools.*`)

General-purpose utilities for data analysis and manipulation.

function_list: Prints a list of all available tools.
corr_matrix: Calculates and visualizes a correlation matrix.
category_stats: Provides detailed statistics for categorical columns.
remove_outliers_iqr: Replaces or removes outliers using the IQR method.
stat_normal_testing: Performs normality tests on a distribution.
... and more.
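
As an illustration of the IQR method mentioned above, the sketch below computes the usual fences and then either removes or clips the outliers. The multiplier k=1.5 is the conventional default and is an assumption here, as is the helper name iqr_bounds; the library's parameters may differ.

```python
import numpy as np


def iqr_bounds(x, k=1.5):
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
low, high = iqr_bounds(values)

removed = values[(values >= low) & (values <= high)]   # drop the outliers
clipped = np.clip(values, low, high)                   # or replace them with the nearest fence

print(low, high)   # 8.75 14.75
print(removed)
print(clipped)
```

Removing changes the number of rows, while clipping keeps the shape of the data, which matters when other columns must stay aligned.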

Metrics Toolkit (`tools.metrics.*`)

A high-performance toolkit for calculating loss functions and their gradients.

Regression Losses: mae, mse, rmse, huber_loss, quantile_loss.
Classification Losses: hinge_loss, log_loss (Binary Cross-Entropy).
Embedding Losses: triplet_loss.
Monitoring: start_monitoring, update, get_history_df, plot_history.
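
For reference, three of the listed losses can be written in a few lines of NumPy. The label conventions (0/1 for log loss, -1/+1 for hinge loss) and the parameter names are assumptions made for this sketch and may not match the library's signatures.

```python
import numpy as np


def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy; y_true in {0, 1}, y_prob is the predicted P(class 1)."""
    p = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def hinge_loss(y_true, scores):
    """Hinge loss; y_true in {-1, +1}, scores are raw decision values."""
    return float(np.mean(np.maximum(0.0, 1.0 - y_true * scores)))


def quantile_loss(y_true, y_pred, q=0.5):
    """Pinball loss for quantile q; q=0.5 gives half of the MAE."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


print(log_loss(np.array([1, 0]), np.array([0.5, 0.5])))            # ln(2), about 0.6931
print(hinge_loss(np.array([1, -1]), np.array([2.0, 0.5])))          # 0.75
print(quantile_loss(np.array([1.0, 2.0]), np.array([2.0, 0.0])))    # 0.75
```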

Distance Toolkit (`tools.distance.*`)

A high-performance toolkit for calculating distances and similarities.

Vector-to-Vector: euclidean, manhattan, cosine_similarity, minkowski, chebyshev, mahalanobis, haversine, hamming, jaccard.
Matrix Operations: pairwise_euclidean, kmeans_distance.
Neighbor Searches: knn_distances, radius_neighbors.
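
The matrix and neighbor operations can be sketched in NumPy as follows. The function names and return shapes are illustrative assumptions, and this brute-force version makes no attempt at the GPU or Numba acceleration described earlier.

```python
import numpy as np


def pairwise_euclidean(X, Y):
    """Distance matrix D with D[i, j] = ||X[i] - Y[j]||."""
    # Uses the expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(axis=1)[None, :]
    return np.sqrt(np.maximum(sq, 0.0))   # clamp tiny negatives caused by rounding


def knn(X, query, k):
    """Indices and distances of the k rows of X closest to each query row."""
    d = pairwise_euclidean(query, X)
    idx = np.argsort(d, axis=1)[:, :k]
    return idx, np.take_along_axis(d, idx, axis=1)


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
idx, dist = knn(X, query=np.array([[0.0, 1.0]]), k=2)
print(idx)    # nearest rows first
print(dist)
```

The expansion avoids building an explicit array of all pairwise differences, which keeps memory use proportional to the size of the distance matrix.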

Authors

@sergiikavun

Contributing

See CONTRIBUTING

References

To cite this library, use:

Sergii Kavun. (2025). s-kav/ds_tools: Version 2.0.0 (v.2.0.0). Zenodo. https://doi.org/10.5281/zenodo.17080822

License

This project uses dual licensing:

🎓 Free for Academic & Research: PolyForm Noncommercial 1.0.0
💼 Commercial License Available: Contact us for business use License

📋 Full License Details | 💰 Get Commercial License

Release history

2.3.4: May 10, 2026
2.3.3: Mar 23, 2026
2.3.2: Dec 30, 2025
2.3.0: Dec 19, 2025
2.0.1: Sep 17, 2025 (this version)
2.0.0: Sep 8, 2025
1.0.9: Jul 22, 2025
1.0.8: Jul 22, 2025
1.0.7: Jul 21, 2025
1.0.6: Jul 15, 2025
1.0.0: Jul 14, 2025

