Generic data handling utilities including data splitting and analysis.

dsr-data-tools

Tools for exploratory data analysis (EDA) and data quality assessment.

Version 1.4.0: This release matures the Recommendation Engine into an Audit-Aware Framework. It introduces deterministic object hashing for data lineage and a metadata-driven discovery system for "Human-in-the-Loop" configuration.

Features

  • Dataset Analysis: Comprehensive statistical summaries and data quality assessment.
  • Data Exploration: Tools for understanding data distributions, correlations, and patterns.
  • Quality Metrics: Missing value detection, data type analysis, and anomaly identification.
  • Statistically Guided Feature Interactions: Automatic discovery of meaningful feature interactions using Mutual Information and Pearson Correlation.
  • Recommendation Engine: Intelligent pipeline for Boolean mapping, Numerical casting, and Datetime standardization with customizable execution priority.
  • User-Guided ColumnHints: Explicitly guide the engine with metadata for financial, geospatial, or temporal data to override automated heuristics.
  • Intelligent Boolean Mapping: Detects and standardizes diverse truthiness indicators (e.g., "Y/N", "Active/Inactive", "1/0") into proper boolean types.
  • Cyclic Feature Extraction: Decomposes datetimes into periodic Sine/Cosine features to preserve temporal relationships for machine learning.
  • Numerical Precision Optimization: Standardize decimal depth using configurable rounding modes (Nearest, Bankers, Up, Down).
  • Audit Lineage & Integrity: Generate deterministic fingerprints for DataFrames and Python objects using joblib-based memory buffer inspection.
  • Metadata-Driven Customization: Use class-level metadata to define "editable" fields, enabling seamless integration with YAML-based orchestration.
  • Memory-Efficient File Hashing: Chunked SHA-256 validation for raw data files, ensuring integrity on memory-constrained systems.
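
The cyclic feature extraction listed above can be illustrated with plain NumPy and pandas. This is a minimal sketch of the general sine/cosine technique, not the library's implementation; the helper name `cyclic_encode` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def cyclic_encode(values: pd.Series, period: int) -> pd.DataFrame:
    """Map a periodic feature onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({
        f"{values.name}_sin": np.sin(angle),
        f"{values.name}_cos": np.cos(angle),
    })


timestamps = pd.Series(pd.to_datetime([
    "2025-01-01 00:00", "2025-01-01 06:00", "2025-01-01 23:00",
]))
hours = timestamps.dt.hour.rename("hour")

# Hour 23 and hour 0 end up close together, unlike in the raw integer encoding
encoded = cyclic_encode(hours, period=24)
print(encoded)
```

Encoding the hour as a point on a circle keeps 23:00 adjacent to 00:00, which a raw integer column cannot express.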

Installation

pip install dsr-data-tools

Usage

import pandas as pd
from dsr_data_tools import analyze_dataset

# Load your data
df = pd.read_csv('data.csv')

# Perform comprehensive analysis
analyze_dataset(df)

Datetime Conversion Recommendation

generate_recommendations() detects object/string columns that are likely datetimes and recommends converting them to a proper datetime dtype.

import pandas as pd
from dsr_data_tools.analysis import generate_recommendations
from dsr_data_tools.recommendations import apply_recommendations

# Example column with mostly valid date strings
df = pd.DataFrame({
    'date_str': [
        '2025-01-01', '2025-01-02', '2025-01-03',
        '2025-01-04', 'invalid',  # one invalid value
    ] * 10  # repeat to get 50 rows
})

recs = generate_recommendations(df)

# If detected, apply the datetime conversion recommendation
if 'date_str' in recs and 'datetime_conversion' in recs['date_str']:
    df_converted = apply_recommendations(df, {
        'date_str': recs['date_str']['datetime_conversion']
    })
    # Column now has a datetime64 dtype; invalid entries are coerced to NaT
    print(df_converted['date_str'].dtype)  # e.g. datetime64[ns]; the resolution can vary with the pandas version
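
To show what such a detection and conversion involves, here is a minimal sketch in plain pandas. It is not the library's heuristic: the helper `looks_like_datetime`, the fixed `%Y-%m-%d` format, and the 0.75 threshold are assumptions made for illustration.

```python
import pandas as pd


def looks_like_datetime(series: pd.Series, min_valid_ratio: float = 0.75) -> bool:
    """Return True if enough values parse as dates (hypothetical heuristic)."""
    parsed = pd.to_datetime(series, errors="coerce", format="%Y-%m-%d")
    return bool(parsed.notna().mean() >= min_valid_ratio)


dates = pd.Series(
    ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "invalid"] * 10,
    name="date_str",
)

if looks_like_datetime(dates):
    # errors="coerce" turns unparseable strings into NaT instead of raising
    converted = pd.to_datetime(dates, errors="coerce", format="%Y-%m-%d")
    print(converted.dtype, converted.isna().sum())
```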

Boolean Classification

import pandas as pd
from dsr_data_tools.recommendations import BooleanClassificationRecommendation

# The engine maps values semantically: 'Y' is recognized as True
# from common indicators rather than from alphabetical order
df = pd.DataFrame({"active": ["Y", "N", "Y"]})
rec = BooleanClassificationRecommendation(
    column_name="active",
    description="Convert to bool",
    values=["Y", "N"]
)

# The 'active' column becomes [True, False, True]
df_bool = rec.apply(df)
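
The idea behind semantic mapping can be sketched without the library. The lookup sets and the `to_boolean` helper below are illustrative assumptions, not the indicators the engine actually recognizes.

```python
import pandas as pd

# Hypothetical indicator sets; the library's own lists may differ
TRUTHY = {"y", "yes", "true", "t", "1", "active"}
FALSY = {"n", "no", "false", "f", "0", "inactive"}


def to_boolean(series: pd.Series) -> pd.Series:
    """Convert common truthiness indicators to a nullable boolean Series."""
    normalized = series.astype("string").str.strip().str.lower()
    unknown = set(normalized.dropna().unique()) - TRUTHY - FALSY
    if unknown:
        raise ValueError(f"Unrecognized boolean indicators: {sorted(unknown)}")
    # Missing values stay missing instead of silently becoming False
    return normalized.isin(TRUTHY).astype("boolean").mask(normalized.isna(), pd.NA)


print(to_boolean(pd.Series(["Y", "N", "Y"])).tolist())
print(to_boolean(pd.Series(["Active", "Inactive", None])).tolist())
```

Rejecting unknown values up front avoids treating a typo such as "Yse" as False.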

Date Durations

Calculate the numeric duration between two datetime columns in specific units such as 'seconds', 'minutes', 'hours', or 'days'.

from dsr_data_tools.recommendations import DatetimeDurationRecommendation

rec = DatetimeDurationRecommendation(
    start_column="order_date",
    end_column="delivery_date",
    output_column="days_to_deliver",
    unit="days"
)

# `df` must already contain both columns with a datetime dtype
df = rec.apply(df)

Interactive Missing Value Handling

The engine allows choosing between statistical imputation (mean/median/mode), constant filling, or row/column removal.

from dataclasses import fields

# `rec` is a missing-value recommendation produced by the engine.
# Discover which fields are whitelisted for user edits in your pipeline
editable_fields = [
    f.name for f in fields(rec) 
    if f.metadata.get("editable", False)
]
# Returns: ['strategy', 'fill_value', 'notes', 'enabled', 'alias']

Guided Recommendations with ColumnHints

Users can provide a ColumnHint to specify the 'logical type' of a column and set constraints like rounding, bounds, or specific feature extraction needs.

import pandas as pd
from dsr_data_tools.analysis import RecommendationManager
from dsr_data_tools.recommendations import ColumnHint, RoundingMode

# Load data
df = pd.read_csv('data.csv')

# Define explicit hints to override or guide the engine
hints = {
    "unit_price": ColumnHint.financial(decimal_places=2, rounding_mode=RoundingMode.BANKERS),
    "user_id": ColumnHint.numeric(convert_to_int=True),
    "internal_notes": ColumnHint.ignore()
}

manager = RecommendationManager()
manager.generate_recommendations(df, hints=hints)

# Display the recommended pipeline
for rec in manager._pipeline:
    rec.info()
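
The rounding modes behave differently on ties, which matters for financial columns. The sketch below uses the standard `decimal` module to show the difference. The mapping from the mode names (Nearest, Bankers, Up, Down) to `decimal` constants is an assumption, and the library may define them differently.

```python
from decimal import (
    Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_EVEN, ROUND_HALF_UP,
)

# Assumed mapping of mode names to decimal rounding constants
MODES = {
    "nearest": ROUND_HALF_UP,    # ties round away from zero
    "bankers": ROUND_HALF_EVEN,  # ties round to the even neighbor
    "up": ROUND_CEILING,         # always toward +infinity
    "down": ROUND_FLOOR,         # always toward -infinity
}


def round_decimal(value: str, decimal_places: int, mode: str) -> Decimal:
    """Round a decimal string to a fixed number of places (hypothetical helper)."""
    quantum = Decimal(1).scaleb(-decimal_places)  # e.g. 0.01 for 2 places
    return Decimal(value).quantize(quantum, rounding=MODES[mode])


for mode in MODES:
    print(mode, round_decimal("2.345", 2, mode))
```

Values are passed as strings so that the tie in 2.345 is represented exactly rather than as a binary float.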

Data Integrity & Hashing

from dsr_data_tools.hashing import calculate_object_hash, calculate_file_hash
from pathlib import Path

# Generate a deterministic fingerprint of a DataFrame for audit tracking
df_hash = calculate_object_hash(df)
print(f"DataFrame Signature: {df_hash}")

# Verify raw data integrity before ingestion
file_path = Path("data/raw/adult.csv")
file_hash = calculate_file_hash(file_path)
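
Chunked file hashing can be sketched with the standard library alone. This is an illustration of the general technique, not the source of `calculate_file_hash`; the function name `sha256_of_file` and the 1 MiB default chunk size are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"a,b\n1,2\n" * 1000)
    chunked = sha256_of_file(sample, chunk_size=4096)
    whole = hashlib.sha256(sample.read_bytes()).hexdigest()

print(chunked == whole)
```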

Performance

This library is optimized for large-scale data processing using vectorized operations.

  • Vectorized Integer Checks: Replaces a per-element Python-level apply() with a vectorized modulo operation. Both are $O(N)$, but the vectorized form avoids per-element interpreter overhead and is typically 5–6× faster on large series.
  • Cached Data Scans: Common operations like dropna() and unique() are cached to minimize redundant data scans across wide datasets.
  • Efficient Scaling: Outlier handling and scaling utilize NumPy vectorized operations and Scikit-Learn transformers for high throughput.
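
The two integer-check approaches from the first bullet can be sketched as follows. Function names are hypothetical, and the code only demonstrates that both variants agree; actual speedups depend on the data and environment.

```python
import numpy as np
import pandas as pd


def all_integers_apply(series: pd.Series) -> bool:
    """Per-element check: one Python-level call for every value."""
    return bool(series.dropna().apply(float.is_integer).all())


def all_integers_vectorized(series: pd.Series) -> bool:
    """Vectorized check: a single modulo over the whole array."""
    values = series.dropna().to_numpy(dtype="float64")
    return bool(np.all(values % 1 == 0))


whole_numbers = pd.Series([1.0, 2.0, np.nan, 40.0])
mixed = pd.Series([1.5, 2.0, 3.0])

print(all_integers_apply(whole_numbers), all_integers_vectorized(whole_numbers))
print(all_integers_apply(mixed), all_integers_vectorized(mixed))
```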

Benchmarks

A benchmark script compares per-element apply(is_integer) against a vectorized modulo check. On large series, the vectorized approach is typically 5–6× faster.

python scripts/benchmark_integer_checks.py           # default size (2,000,000)
python scripts/benchmark_integer_checks.py 5000000  # custom size

Or via Makefile target:

make benchmark                # default N=2,000,000
make benchmark N=5000000      # custom size

Requirements

  • Python >= 3.10
  • dsr-utils >= 1.4.0
  • joblib >= 1.4.0
  • numpy >= 2.4.4
  • pandas >= 3.0.2
  • scikit-learn >= 1.8.0

License

MIT License - see LICENSE file for details
