
Generic data handling utilities including data splitting and analysis.

Project description

dsr-data-tools


Data analysis and exploration tools for exploratory data analysis (EDA).

Version 1.2.0: This release introduces a more robust Recommendation Engine, including semantic boolean mapping, optimized datetime duration calculations, and enhanced execution priority management.

Features

  • Dataset Analysis: Comprehensive statistical summaries and data quality assessment
  • Data Exploration: Tools for understanding data distributions, correlations, and patterns
  • Quality Metrics: Missing value detection, data type analysis, and anomaly identification
  • Statistically Guided Feature Interactions: Automatic discovery of meaningful feature interactions using Mutual Information and Pearson Correlation
  • Recommendation Engine: Intelligent pipeline for Boolean mapping, Numerical casting, and Datetime standardization with customizable execution priority
  • Intelligent Boolean Mapping: Detects and standardizes diverse truthiness indicators (e.g., "Y/N", "Active/Inactive", "1/0") into proper boolean types
  • Explicit Numerical Casting: Dedicated workers for Float and Integer conversions that handle "Float-as-String" and dirty data safely
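As a rough illustration of the statistically guided interaction discovery described above, pairwise product features can be ranked by mutual information with a target. This is a sketch built on scikit-learn; top_interactions is a hypothetical helper for illustration, not part of the dsr-data-tools API:

```python
# Hypothetical sketch; the helper below is illustrative, not library API.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def top_interactions(df, target, k=3):
    """Rank pairwise product features by mutual information with the target."""
    cols = [c for c in df.columns if c != target]
    scores = {}
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            interaction = (df[a] * df[b]).to_numpy().reshape(-1, 1)
            scores[(a, b)] = mutual_info_regression(
                interaction, df[target], random_state=0
            )[0]
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Synthetic data where the true signal is the x*y interaction
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=300),
    "y": rng.normal(size=300),
    "z": rng.normal(size=300),
})
df["t"] = df["x"] * df["y"] + rng.normal(scale=0.1, size=300)

print(top_interactions(df, "t", k=1))
```

With this construction, the (x, y) pair carries nearly all the information about the target, so it ranks first.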

Installation

pip install dsr-data-tools

Usage

import pandas as pd
from dsr_data_tools import analyze_dataset

# Load your data
df = pd.read_csv('data.csv')

# Perform comprehensive analysis
analyze_dataset(df)

Datetime Conversion Recommendation

generate_recommendations() detects object/string columns that are likely datetimes and recommends converting them to a proper datetime dtype.

import pandas as pd
from dsr_data_tools.analysis import generate_recommendations
from dsr_data_tools.recommendations import apply_recommendations

# Example column with mostly valid date strings
df = pd.DataFrame({
    'date_str': [
        '2025-01-01', '2025-01-02', '2025-01-03',
        '2025-01-04', 'invalid',  # one invalid value
    ] * 10  # scale up rows
})

recs = generate_recommendations(df)

# If detected, apply the datetime conversion recommendation
if 'date_str' in recs and 'datetime_conversion' in recs['date_str']:
    df_converted = apply_recommendations(df, {
        'date_str': recs['date_str']['datetime_conversion']
    })
    # Column is now datetime64; invalid entries coerced to NaT
    print(df_converted['date_str'].dtype)  # datetime64[ns]
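The detection step can be approximated with coercive parsing and a validity threshold. This is a minimal sketch of the general idea; looks_like_datetime and the 0.8 threshold are illustrative assumptions, not the library's internals:

```python
import pandas as pd

def looks_like_datetime(series, threshold=0.8):
    """Heuristic: recommend conversion if enough values parse as dates."""
    parsed = pd.to_datetime(series, errors="coerce")  # unparseable -> NaT
    valid_ratio = parsed.notna().mean()
    return valid_ratio >= threshold

s = pd.Series(["2025-01-01", "2025-01-02", "invalid", "2025-01-04"])
print(looks_like_datetime(s))  # 3/4 = 0.75 valid, below 0.8 -> False
```

The threshold keeps occasional dirty values from blocking a conversion while rejecting columns that only incidentally contain date-like strings.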

Boolean Classification

# The engine now handles semantic mapping, recognizing 'Y' as True
# based on common indicators rather than just alphabetical order
import pandas as pd
from dsr_data_tools.recommendations import BooleanClassificationRecommendation

df = pd.DataFrame({"active": ["Y", "N", "Y"]})
rec = BooleanClassificationRecommendation(
    column_name="active",
    description="Convert to bool",
    values=["Y", "N"]
)

# The 'active' column becomes [True, False, True]
df_bool = rec.apply(df)
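A semantic mapping of this kind can be sketched with an explicit indicator table. This is illustrative only; the indicator sets and missing-value handling used by dsr-data-tools may differ:

```python
# Illustrative truthiness tables; the engine's actual indicator sets may differ.
import pandas as pd

TRUTHY = {"y", "yes", "true", "t", "1", "active"}
FALSY = {"n", "no", "false", "f", "0", "inactive"}

def semantic_bool_map(series):
    """Map recognized indicators to booleans; unknown values become pd.NA."""
    def convert(value):
        key = str(value).strip().lower()
        if key in TRUTHY:
            return True
        if key in FALSY:
            return False
        return pd.NA
    return series.map(convert).astype("boolean")  # nullable boolean dtype

s = pd.Series(["Y", "N", "Active", "maybe"])
print(semantic_bool_map(s).tolist())  # [True, False, True, <NA>]
```

Normalizing case and whitespace before lookup is what lets "Y", "y", and " yes " all land on the same indicator.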

Date Durations

# Calculate the difference between two datetime columns in a specific unit.
import pandas as pd
from dsr_data_tools.recommendations import DatetimeDurationRecommendation

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2025-01-01", "2025-01-05"]),
    "end_date": pd.to_datetime(["2025-01-03", "2025-01-10"]),
})

rec = DatetimeDurationRecommendation(
    column_name="start_date",
    start_column="start_date",
    end_column="end_date",
    output_column="days_to_complete",
    unit="days"  # Supports 'seconds', 'minutes', 'hours', 'days'
)

# Adds a 'days_to_complete' column with the elapsed duration
df_duration = rec.apply(df)

Performance

This library is optimized for large-scale data processing using vectorized operations.

  • Vectorized Integer Checks: Replaced per-element Python-level apply() calls with a single vectorized modulo pass, yielding a 5–6× speedup.

  • Cached Data Scans: Implemented caching for common operations like dropna() and unique() to ensure each data column is scanned as few times as possible, maintaining high efficiency for wide datasets.
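The caching idea can be sketched as a small per-column memo. ColumnScanCache is a hypothetical illustration of the pattern, not the library's implementation:

```python
# Sketch of per-column scan caching (illustrative, not the library's code).
import pandas as pd

class ColumnScanCache:
    """Compute dropna()/unique() once per column and reuse the results."""

    def __init__(self, df):
        self._df = df
        self._dropna = {}
        self._unique = {}

    def dropna(self, col):
        if col not in self._dropna:
            self._dropna[col] = self._df[col].dropna()
        return self._dropna[col]

    def unique(self, col):
        if col not in self._unique:
            # Reuses the cached dropna() result instead of rescanning
            self._unique[col] = self.dropna(col).unique()
        return self._unique[col]

cache = ColumnScanCache(pd.DataFrame({"a": [1, 1, None, 2]}))
print(cache.unique("a"))  # first call scans; repeated calls hit the cache
```

For wide DataFrames this turns many redundant column scans into dictionary lookups, which is where the efficiency claim above comes from.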

Benchmarks

A benchmark script compares per-element apply(is_integer) against a vectorized modulo check for detecting integer-like floats. On large series, the vectorized approach is typically 5–6× faster.
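The comparison the benchmark script makes can be reproduced in a few lines (a sketch of the same measurement; the actual script may differ):

```python
import time

import numpy as np
import pandas as pd

# 1,000,000 floats, rounded so some are exactly integer-valued
s = pd.Series(np.random.default_rng(0).uniform(0, 100, 1_000_000).round(1))

t0 = time.perf_counter()
slow = s.apply(float.is_integer)  # per-element Python call
t1 = time.perf_counter()
fast = s.mod(1).eq(0)             # single vectorized modulo pass
t2 = time.perf_counter()

assert (slow == fast).all()       # both approaches agree on every element
print(f"apply: {t1 - t0:.3f}s  vectorized: {t2 - t1:.3f}s")
```

For a finite float x, x % 1 == 0 holds exactly when x.is_integer() is true, so the vectorized check is a drop-in replacement.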

Run via Python:

python scripts/benchmark_integer_checks.py           # default size (2,000,000)
python scripts/benchmark_integer_checks.py 5000000  # custom size

Or via Makefile target:

make benchmark                # default N=2,000,000
make benchmark N=5000000      # custom size

Requirements

  • Python >= 3.10
  • dsr-utils >= 1.3.0
  • numpy >= 2.4.4
  • pandas >= 3.0.2
  • scikit-learn >= 1.8.0

License

MIT License - see LICENSE file for details

Download files

Download the file for your platform.

Source Distribution

dsr_data_tools-1.2.0.tar.gz (46.9 kB)

Uploaded Source

Built Distribution


dsr_data_tools-1.2.0-py3-none-any.whl (45.3 kB)

Uploaded Python 3

File details

Details for the file dsr_data_tools-1.2.0.tar.gz.

File metadata

  • Download URL: dsr_data_tools-1.2.0.tar.gz
  • Upload date:
  • Size: 46.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dsr_data_tools-1.2.0.tar.gz:

  • SHA256: 19b3c7954bdcdb4c31f077905159c1c82f0d006054e53556f75d2bcda067a4e5
  • MD5: 646fade780c07d7ae3967cef6a04d61d
  • BLAKE2b-256: 365229afdf72bf1b0b217bfcc547554848009fc6e0349fe7dfac56bf9b285183


Provenance

The following attestation bundles were made for dsr_data_tools-1.2.0.tar.gz:

Publisher: python-publish.yml on scottroberts140/dsr-data-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dsr_data_tools-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: dsr_data_tools-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 45.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dsr_data_tools-1.2.0-py3-none-any.whl:

  • SHA256: 5aabd8b4da2afa2d5a93030d39fb7909bf23df6ef46d489901c670f8ed1a3832
  • MD5: c38d39dcc2dde32c2466facc63d0f3af
  • BLAKE2b-256: 973891a9ff1c91520a4cbefee165a3cd822d9c84d5b058ab8b36c08ffa64c99b


Provenance

The following attestation bundles were made for dsr_data_tools-1.2.0-py3-none-any.whl:

Publisher: python-publish.yml on scottroberts140/dsr-data-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
