Generic data handling utilities including data splitting and analysis.
dsr-data-tools
Tools for exploratory data analysis (EDA).
Version 1.1.0: This release adds new functionality and bug fixes while remaining compatible with 1.0.0.
Features
- Dataset Analysis: Comprehensive statistical summaries and data quality assessment
- Data Exploration: Tools for understanding data distributions, correlations, and patterns
- Quality Metrics: Missing value detection, data type analysis, and anomaly identification
- Statistically Guided Feature Interactions: Automatic discovery of meaningful feature interactions using Mutual Information and Pearson Correlation
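To make the feature-interaction idea concrete, here is a minimal sketch that scores pairwise product features by their absolute Pearson correlation with a target. It is not the library's implementation: the function name `rank_product_interactions`, the choice of product interactions, and the synthetic data are all assumptions made for illustration, and the mutual information half of the scoring is left out to keep the example dependent on pandas and numpy only.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def rank_product_interactions(df: pd.DataFrame, target: str) -> pd.Series:
    """Score each pairwise product feature by |Pearson r| with the target.

    Hypothetical helper for illustration; not part of dsr-data-tools.
    """
    features = [c for c in df.columns if c != target]
    scores = {}
    for a, b in combinations(features, 2):
        product = df[a] * df[b]
        # Series.corr defaults to the Pearson correlation coefficient.
        scores[(a, b)] = abs(product.corr(df[target]))
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic data where the target depends on the x1 * x2 interaction.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
demo["y"] = demo["x1"] * demo["x2"] + rng.normal(scale=0.1, size=500)

ranking = rank_product_interactions(demo, target="y")
print(ranking)
```

On this synthetic data the `x1`/`x2` pair should rank first, since the target was built from that product.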
Installation
pip install dsr-data-tools
Usage
import pandas as pd
from dsr_data_tools import analyze_dataset
# Load your data
df = pd.read_csv('data.csv')
# Perform comprehensive analysis
analyze_dataset(df)
Datetime Conversion Recommendation
generate_recommendations() detects object/string columns that are likely datetimes and recommends converting them to a proper datetime dtype.
import pandas as pd
from dsr_data_tools.analysis import generate_recommendations
from dsr_data_tools.recommendations import apply_recommendations
# Example column with mostly valid date strings
df = pd.DataFrame({
    'date_str': [
        '2025-01-01', '2025-01-02', '2025-01-03',
        '2025-01-04', 'invalid',  # one invalid value
    ] * 10  # scale up rows
})
recs = generate_recommendations(df)
# If detected, apply the datetime conversion recommendation
if 'date_str' in recs and 'datetime_conversion' in recs['date_str']:
    df_converted = apply_recommendations(df, {
        'date_str': recs['date_str']['datetime_conversion']
    })
    # Column is now datetime64; invalid entries coerced to NaT
    print(df_converted['date_str'].dtype)  # datetime64[ns]
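For intuition about how a column might be judged "likely datetime", the following sketch measures what fraction of non-null values parse successfully with `pd.to_datetime(..., errors="coerce")`. This is only one plausible heuristic, not necessarily what `generate_recommendations()` does internally; the helper name `datetime_parse_ratio` and the 0.7 threshold are assumptions.

```python
import pandas as pd


def datetime_parse_ratio(series: pd.Series) -> float:
    """Return the fraction of non-null values that parse as datetimes.

    Hypothetical helper for illustration; not part of dsr-data-tools.
    """
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    # Unparseable values become NaT instead of raising an error.
    parsed = pd.to_datetime(non_null, errors="coerce")
    return float(parsed.notna().mean())


values = pd.Series(
    ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "invalid"] * 10
)
ratio = datetime_parse_ratio(values)

THRESHOLD = 0.7  # assumed cutoff; the library's actual threshold is not documented here
looks_like_datetime = ratio >= THRESHOLD
print(ratio, looks_like_datetime)
```

A ratio-based rule like this tolerates a few invalid entries, which then end up as NaT after conversion.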
Performance
This library is optimized for large-scale data processing using vectorized operations.
- Vectorized Integer Checks: Replaced the per-element, Python-level apply ($O(N)$ interpreter calls) with vectorized modulo operations, which is typically 5–6× faster.
- Cached Data Scans: Results of common operations such as dropna() and unique() are cached so that each column is scanned as few times as possible, which keeps analysis efficient on wide datasets.
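The caching idea in the list above can be sketched as a small per-column memo. This is an illustration under assumptions, not the library's code: the class `ColumnScanCache` and its method names are hypothetical.

```python
import pandas as pd


class ColumnScanCache:
    """Memoize dropna() and unique() results per column.

    Hypothetical class for illustration; not part of dsr-data-tools.
    """

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df
        self._non_null: dict[str, pd.Series] = {}
        self._unique: dict[str, object] = {}

    def non_null(self, column: str) -> pd.Series:
        # Scan the column for missing values only on first access.
        if column not in self._non_null:
            self._non_null[column] = self._df[column].dropna()
        return self._non_null[column]

    def unique(self, column: str):
        if column not in self._unique:
            # Reuse the cached non-null values instead of rescanning the column.
            self._unique[column] = self.non_null(column).unique()
        return self._unique[column]


df = pd.DataFrame({"a": [1.0, 2.0, None, 2.0]})
cache = ColumnScanCache(df)
first = cache.non_null("a")
second = cache.non_null("a")  # served from the cache, no second scan
print(first is second, cache.unique("a"))
```

A cache like this assumes the DataFrame is not modified while the analysis runs; otherwise the stored results would go stale.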
Benchmarks
A benchmark script compares per-element apply(is_integer) against a vectorized modulo check for detecting integer-like floats. On large series, the vectorized approach is typically 5–6× faster.
Run via Python:
python scripts/benchmark_integer_checks.py # default size (2,000,000)
python scripts/benchmark_integer_checks.py 5000000 # custom size
Or via Makefile target:
make benchmark # default N=2,000,000
make benchmark N=5000000 # custom size
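The comparison this section describes can be approximated with a few lines of pandas. The sketch below is not the contents of scripts/benchmark_integer_checks.py; the function names and the series size are assumptions, and timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def integer_like_apply(series: pd.Series) -> bool:
    """Per-element check: calls float.is_integer() on every value."""
    return bool(series.apply(lambda x: float(x).is_integer()).all())


def integer_like_vectorized(series: pd.Series) -> bool:
    """Vectorized check: a value is integer-like when value % 1 == 0."""
    return bool((series % 1 == 0).all())


rng = np.random.default_rng(0)
# Floats that all hold whole numbers, plus a copy with one fractional value.
whole = pd.Series(rng.integers(0, 1000, size=200_000).astype("float64"))
mixed = whole.copy()
mixed.iloc[0] = 0.5

for name, func in [("apply", integer_like_apply), ("vectorized", integer_like_vectorized)]:
    seconds = timeit.timeit(lambda: func(whole), number=3)
    print(f"{name}: {seconds:.3f}s for 3 runs")
```

Both functions treat NaN as not integer-like, so a real check would usually drop missing values first.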
Requirements
- Python >= 3.10
- pandas
- numpy
- scikit-learn
- dsr-utils >= 1.0.0
License
MIT License - see LICENSE file for details
File details
Details for the file dsr_data_tools-1.1.0.tar.gz.
File metadata
- Download URL: dsr_data_tools-1.1.0.tar.gz
- Upload date:
- Size: 46.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 09be2107228f5275b8794ee170e47c04e94587078c98ce89eacd0368c5734a4c |
| MD5 | 01c498fd96718834eb49d066ec8a9c61 |
| BLAKE2b-256 | 6ccc41ea63223333069242794e12b493c8a39d1b47e6d45829e4a4c23d580ff4 |
Provenance
The following attestation bundles were made for dsr_data_tools-1.1.0.tar.gz:
Publisher: python-publish.yml on scottroberts140/dsr-data-tools
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dsr_data_tools-1.1.0.tar.gz
- Subject digest: 09be2107228f5275b8794ee170e47c04e94587078c98ce89eacd0368c5734a4c
- Sigstore transparency entry: 937708305
- Sigstore integration time:
- Permalink: scottroberts140/dsr-data-tools@07aa5660a7ecde6a0e602cf0fdef800c4a177b6c
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/scottroberts140
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@07aa5660a7ecde6a0e602cf0fdef800c4a177b6c
- Trigger Event: release
File details
Details for the file dsr_data_tools-1.1.0-py3-none-any.whl.
File metadata
- Download URL: dsr_data_tools-1.1.0-py3-none-any.whl
- Upload date:
- Size: 44.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1bbc26001a413a49fe15fc7e476c91b95fb72f5cc789d1c85914cbe2ece68dd9 |
| MD5 | 3ffb8b6af1d55d5999cbf5887b44651b |
| BLAKE2b-256 | 0362b6b975706fad74f044e076978bbec8f1ddabd285d728d8362d2eaf06da3c |
Provenance
The following attestation bundles were made for dsr_data_tools-1.1.0-py3-none-any.whl:
Publisher: python-publish.yml on scottroberts140/dsr-data-tools
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dsr_data_tools-1.1.0-py3-none-any.whl
- Subject digest: 1bbc26001a413a49fe15fc7e476c91b95fb72f5cc789d1c85914cbe2ece68dd9
- Sigstore transparency entry: 937708307
- Sigstore integration time:
- Permalink: scottroberts140/dsr-data-tools@07aa5660a7ecde6a0e602cf0fdef800c4a177b6c
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/scottroberts140
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@07aa5660a7ecde6a0e602cf0fdef800c4a177b6c
- Trigger Event: release