DSTools: Data Science Tools Library

DSTools is a Python library designed to assist data scientists and researchers by providing a collection of helpful functions for various stages of a data science project, from data exploration and preprocessing to model evaluation and synthetic data generation.

Features

  • Data Exploration: Quickly get statistics for numerical and categorical features (describe_numeric, describe_categorical), check for missing values (check_NINF), and visualize correlations (corr_matrix).
  • Model Evaluation: Comprehensive classification model evaluation (evaluate_classification, compute_metrics) with clear visualizations (plot_confusion_matrix).
  • Data Preprocessing: Encode categorical variables (labeling), handle outliers (remove_outliers_iqr), and scale features (min_max_scale).
  • Time Series Analysis: Test for stationarity using the Dickey-Fuller test (test_stationarity).
  • Synthetic Data Generation: Create complex numerical distributions matching specific statistical moments (generate_distribution, generate_distribution_from_metrics).
  • Advanced Statistics: Calculate non-parametric correlation (chatterjee_correlation), entropy, and KL-divergence.
  • Utilities: Save/load DataFrames to/from ZIP archives, generate random alphanumeric codes, and more.
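The "Advanced Statistics" measures listed above can be illustrated with a minimal, standard-library sketch. This is not the library's implementation: the function names chatterjee_xi, shannon_entropy, and kl_divergence are hypothetical, the rank correlation assumes no tied values, and the divergence assumes q is positive wherever p is.

```python
import math


def chatterjee_xi(x, y):
    """Chatterjee's rank correlation, assuming no ties in x or y."""
    n = len(x)
    # Reorder the y values by increasing x.
    y_by_x = [y[i] for i in sorted(range(n), key=lambda i: x[i])]
    # Rank each reordered y value (1 = smallest).
    order = sorted(range(n), key=lambda i: y_by_x[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    # Sum of absolute differences between consecutive ranks.
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


def shannon_entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (bits by default)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)


def kl_divergence(p, q, base=2.0):
    """KL divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


xi = chatterjee_xi([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
h = shannon_entropy([0.5, 0.5])
d = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(xi, h, d)
```

For a perfectly monotonic relationship with n points and no ties, this estimator equals 1 - 3/(n + 1), so it approaches 1 only as n grows.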

Installation

Clone the Repository

git clone https://github.com/s-kav/ds_tools.git

Navigate to the Project Directory

cd ds_tools

Install Dependencies

Ensure you have Python version 3.8 or higher and install the required packages:

pip install -r requirements.txt

Usage

Here's a simple example of how to use the library to evaluate a classification model.

import numpy as np
from ds_tools import DSTools

# 1. Initialize the toolkit
tools = DSTools()

# 2. Generate some dummy data
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_probs = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.2, 0.4, 0.7])

# 3. Get a comprehensive evaluation report
# This will print metrics and show plots for ROC and Precision-Recall curves.
results = tools.evaluate_classification(true_labels=y_true, pred_probs=y_probs)

# The results are also returned as a dictionary
print(f"\nROC AUC Score: {results['roc_auc']:.4f}")
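As a sanity check on the dummy data above, ROC AUC can be computed by hand as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses only the standard library; the helper name pairwise_roc_auc is hypothetical, and the README does not state how evaluate_classification computes the metric internally.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_probs = [0.1, 0.8, 0.6, 0.3, 0.9, 0.2, 0.4, 0.7]
auc = pairwise_roc_auc(y_true, y_probs)
print(auc)  # 1.0: every positive score exceeds every negative score
```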

The full code for testing the other functions is available in the project's GitHub repository.

Function Overview

The library provides a wide range of functions. To see a full, formatted list of available tools, you can use the function_list method:

from ds_tools import DSTools

tools = DSTools()
tools.function_list()

Example

Generating a synthetic distribution: if you need a dataset with specific statistical properties, generate_distribution_from_metrics can create one.

import numpy as np
from ds_tools import DSTools, DistributionConfig

tools = DSTools()

# Define the desired metrics
metrics_config = DistributionConfig(
    mean=1042,
    median=330,
    std=1500,
    min_val=1,
    max_val=120000,
    skewness=13.2,
    kurtosis=245,  # Excess kurtosis
    n=10000
)

# Generate the data
generated_data = tools.generate_distribution_from_metrics(n=10000, metrics=metrics_config)

print(f"Generated Mean: {np.mean(generated_data):.2f}")
print(f"Generated Std: {np.std(generated_data):.2f}")
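The example above prints only the mean and standard deviation. To compare the higher moments against the requested skewness and excess kurtosis, a small helper like the following could be used. It is a sketch with a hypothetical name, sample_moments, and it uses population (biased) moment estimators, matching the default of np.std. Whether the library targets the same estimators is an assumption.

```python
import math


def sample_moments(data):
    """Return mean, std, skewness and excess kurtosis (population estimators)."""
    n = len(data)
    mean = sum(data) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


moments = sample_moments([1, 2, 3, 4, 5])
for name, value in moments.items():
    print(f"{name}: {value:.4f}")
```

Passing the generated array as a list (for example, generated_data.tolist(), if it is a NumPy array) would give values to compare against the configuration.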

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue on the GitHub repository.

To contribute:

  • Fork the repository.
  • Create a new branch for your feature or bugfix.
  • Commit your changes with clear messages.
  • Push to your fork and submit a pull request.

Please ensure your code adheres to PEP 8 and includes appropriate docstrings and comments.

References

To cite this library, use:

Sergii Kavun. (2025). s-kav/ds_tools: Version 0.9.1 (v0.9.1). Zenodo. https://doi.org/10.5281/zenodo.15864146

License

This project is licensed under the MIT License - see the LICENSE file for details.

