Generic data handling utilities including data splitting and analysis.

dsr-data-tools

Tools for exploratory data analysis (EDA): dataset summaries, data quality checks, and feature exploration.

Version 1.1.0: This release adds new functionality and bug fixes while remaining backward compatible with 1.0.0.

Features

  • Dataset Analysis: Comprehensive statistical summaries and data quality assessment
  • Data Exploration: Tools for understanding data distributions, correlations, and patterns
  • Quality Metrics: Missing value detection, data type analysis, and anomaly identification
  • Statistically Guided Feature Interactions: Automatic discovery of meaningful feature interactions using Mutual Information and Pearson Correlation
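To make the quality-metric and correlation features concrete, here is a minimal sketch in plain pandas of the kind of per-column summary and Pearson correlation matrix the list above describes. It does not use or reproduce dsr-data-tools itself; the function name `quick_quality_summary` and the sample data are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def quick_quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value fraction, and distinct-value count.

    Hypothetical helper for illustration; not part of dsr-data-tools.
    """
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })


df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": ["x", "y", None, "y"],
})

summary = quick_quality_summary(df)
print(summary)

# Pairwise Pearson correlation over the numeric columns only;
# rows with a missing value are dropped pair by pair.
pearson = df.select_dtypes("number").corr(method="pearson")
print(pearson)
```

A summary like this is usually the first thing to check before looking for feature interactions, since heavily missing or near-constant columns tend to produce misleading correlations.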

Installation

pip install dsr-data-tools

Usage

import pandas as pd
from dsr_data_tools import analyze_dataset

# Load your data
df = pd.read_csv('data.csv')

# Perform comprehensive analysis
analyze_dataset(df)

Datetime Conversion Recommendation

generate_recommendations() detects object/string columns that are likely datetimes and recommends converting them to a proper datetime dtype.

import pandas as pd
from dsr_data_tools.analysis import generate_recommendations
from dsr_data_tools.recommendations import apply_recommendations

# Example column with mostly valid date strings
df = pd.DataFrame({
	'date_str': [
		'2025-01-01', '2025-01-02', '2025-01-03',
		'2025-01-04', 'invalid',  # one invalid value
	] * 10  # scale up rows
})

recs = generate_recommendations(df)

# If detected, apply the datetime conversion recommendation
if 'date_str' in recs and 'datetime_conversion' in recs['date_str']:
	df_converted = apply_recommendations(df, {
		'date_str': recs['date_str']['datetime_conversion']
	})
	# Column is now datetime64; invalid entries coerced to NaT
	print(df_converted['date_str'].dtype)  # datetime64[ns]
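For intuition, one plausible way to detect a datetime-like string column is to parse it with coercion and check what fraction of values survive. The sketch below uses only pandas; the function `looks_like_datetime`, the `min_valid_frac` threshold, and the fixed `%Y-%m-%d` format are assumptions made for illustration and are not taken from the library's implementation.

```python
import pandas as pd


def looks_like_datetime(series: pd.Series, min_valid_frac: float = 0.75) -> bool:
    """Heuristic: True if most non-null values parse as ISO dates.

    Hypothetical helper; `min_valid_frac` is an illustrative threshold.
    """
    non_null = series.dropna()
    if non_null.empty:
        return False
    # Unparseable strings become NaT instead of raising.
    parsed = pd.to_datetime(non_null, format="%Y-%m-%d", errors="coerce")
    return bool(parsed.notna().mean() >= min_valid_frac)


dates = pd.Series(
    ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "invalid"] * 10
)
names = pd.Series(["alice", "bob", "carol"])

print(looks_like_datetime(dates))  # 80% of values parse
print(looks_like_datetime(names))  # nothing parses

# Converting with coercion turns the invalid entries into NaT.
converted = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(converted.isna().sum())
```

A threshold below 100% lets a column with a few bad entries still qualify, which matches the example above where one value in five is invalid.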

Performance

This library is optimized for large-scale data processing using vectorized operations.

  • Vectorized Integer Checks: Per-element, Python-level apply calls were replaced with a single vectorized modulo operation. Both approaches are $O(N)$, but the vectorized version avoids per-element interpreter overhead and is typically 5–6× faster.

  • Cached Data Scans: Results of common operations such as dropna() and unique() are cached so that each column is scanned as few times as possible, which matters most for wide datasets.
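The two integer-check strategies mentioned above can be sketched as follows. This is an illustration under assumptions, not the library's code: the function names are hypothetical, and missing values are simply dropped before checking.

```python
import numpy as np
import pandas as pd


def is_integer_like_apply(s: pd.Series) -> bool:
    """Per-element check: one Python-level call per value."""
    return bool(s.dropna().apply(lambda x: float(x).is_integer()).all())


def is_integer_like_vectorized(s: pd.Series) -> bool:
    """Vectorized check: a single modulo over the whole array."""
    values = s.dropna().to_numpy()
    return bool(np.all(values % 1 == 0))


whole = pd.Series([1.0, 2.0, np.nan, 4.0])
fractional = pd.Series([1.0, 2.5, 3.0])

for s in (whole, fractional):
    # Both strategies should agree on every input.
    print(is_integer_like_apply(s), is_integer_like_vectorized(s))
```

The vectorized form does the same amount of arithmetic but runs it inside NumPy rather than calling back into the interpreter for each element, which is where the speedup comes from.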

Benchmarks

A benchmark script compares per-element apply(is_integer) against a vectorized modulo check for detecting integer-like floats. On large series, the vectorized approach is typically 5–6× faster.

Run via Python:

python scripts/benchmark_integer_checks.py           # default size (2,000,000)
python scripts/benchmark_integer_checks.py 5000000  # custom size

Or via Makefile target:

make benchmark                # default N=2,000,000
make benchmark N=5000000      # custom size
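If the repository scripts are not at hand, a rough comparison can be reproduced with `timeit`. The snippet below is a stand-alone sketch and not the contents of scripts/benchmark_integer_checks.py; the series size, seed, and repeat count are arbitrary choices, and the measured ratio will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 200_000  # much smaller than the script's default, to keep the sketch quick
rng = np.random.default_rng(0)
series = pd.Series(rng.integers(0, 1_000, size=n).astype("float64"))


def per_element() -> bool:
    # One Python-level is_integer() call per value.
    return bool(series.apply(lambda x: x.is_integer()).all())


def vectorized() -> bool:
    # A single modulo over the whole series.
    return bool((series % 1 == 0).all())


t_apply = timeit.timeit(per_element, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s  ratio: {t_apply / t_vec:.1f}x")
```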

Requirements

  • Python >= 3.10
  • pandas
  • numpy
  • scikit-learn
  • dsr-utils >= 1.0.0

License

MIT License - see LICENSE file for details

Download files

Source Distribution

dsr_data_tools-1.1.0.tar.gz (46.6 kB view details)

Uploaded Source

Built Distribution

dsr_data_tools-1.1.0-py3-none-any.whl (44.1 kB view details)

Uploaded Python 3

File details

Details for the file dsr_data_tools-1.1.0.tar.gz.

File metadata

  • Download URL: dsr_data_tools-1.1.0.tar.gz
  • Upload date:
  • Size: 46.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dsr_data_tools-1.1.0.tar.gz
Algorithm Hash digest
SHA256 09be2107228f5275b8794ee170e47c04e94587078c98ce89eacd0368c5734a4c
MD5 01c498fd96718834eb49d066ec8a9c61
BLAKE2b-256 6ccc41ea63223333069242794e12b493c8a39d1b47e6d45829e4a4c23d580ff4

Provenance

The following attestation bundles were made for dsr_data_tools-1.1.0.tar.gz:

Publisher: python-publish.yml on scottroberts140/dsr-data-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dsr_data_tools-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: dsr_data_tools-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 44.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dsr_data_tools-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1bbc26001a413a49fe15fc7e476c91b95fb72f5cc789d1c85914cbe2ece68dd9
MD5 3ffb8b6af1d55d5999cbf5887b44651b
BLAKE2b-256 0362b6b975706fad74f044e076978bbec8f1ddabd285d728d8362d2eaf06da3c

Provenance

The following attestation bundles were made for dsr_data_tools-1.1.0-py3-none-any.whl:

Publisher: python-publish.yml on scottroberts140/dsr-data-tools
