Generic data handling utilities including data splitting and analysis.

Project description

dsr-data-tools

Tools for exploratory data analysis (EDA): dataset summaries, data quality checks, and feature exploration.

Features

  • Dataset Analysis: Comprehensive statistical summaries and data quality assessment
  • Data Exploration: Tools for understanding data distributions, correlations, and patterns
  • Quality Metrics: Missing value detection, data type analysis, and anomaly identification
  • Statistically Guided Feature Interactions: Automatic discovery of meaningful feature interactions using Mutual Information and Pearson Correlation
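
To make the last feature concrete, here is a minimal sketch of how pairwise interactions could be ranked with Mutual Information and Pearson correlation. It is not the library's API: the helper `rank_pairwise_interactions`, its parameters, and the choice of a simple product as the interaction term are all assumptions made for illustration. It uses scikit-learn's `mutual_info_regression` and NumPy's `corrcoef`.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def rank_pairwise_interactions(df, target, top_k=3, random_state=0):
    """Hypothetical helper: score products of numeric feature pairs against the target."""
    features = [c for c in df.select_dtypes("number").columns if c != target]
    y = df[target].to_numpy()
    rows = []
    for a, b in combinations(features, 2):
        # Assumption: the interaction term is the element-wise product a * b.
        product = (df[a] * df[b]).to_numpy()
        mi = mutual_info_regression(
            product.reshape(-1, 1), y, random_state=random_state
        )[0]
        r = np.corrcoef(product, y)[0, 1]
        rows.append({"pair": f"{a}*{b}", "mutual_info": mi, "pearson_r": r})
    ranked = pd.DataFrame(rows).sort_values("mutual_info", ascending=False)
    return ranked.head(top_k).reset_index(drop=True)


# Synthetic data in which y depends on the product of x1 and x2.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
demo["y"] = demo["x1"] * demo["x2"] + 0.1 * rng.normal(size=500)

ranking = rank_pairwise_interactions(demo, target="y")
print(ranking)
```

Mutual Information can pick up nonlinear dependence that Pearson correlation misses, so reporting both gives a fuller picture of each candidate interaction.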

Installation

pip install dsr-data-tools

Usage

import pandas as pd
from dsr_data_tools import analyze_dataset

# Load your data
df = pd.read_csv('data.csv')

# Perform comprehensive analysis
analyze_dataset(df)

Performance

This library is optimized for large-scale data processing using vectorized operations.

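  • Vectorized Integer Checks: Replaced the per-element, Python-level apply with a vectorized modulo operation. Both approaches are $O(N)$, but the vectorized version avoids per-element interpreter overhead and is typically 5–6× faster on large series.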

  • Cached Data Scans: Results of common operations such as dropna() and unique() are cached so that each column is scanned as few times as possible, which keeps analysis efficient on wide datasets.
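
The caching idea can be sketched as follows. This is an illustration under stated assumptions, not the library's internals: the `ColumnScanCache` class and its method names are hypothetical, and only plain pandas calls are used.

```python
import pandas as pd


class ColumnScanCache:
    """Hypothetical helper: memoize dropna() and unique() results per column."""

    def __init__(self, df):
        self._df = df
        self._non_null = {}
        self._unique = {}

    def non_null(self, column):
        # Scan the column for missing values only on first request.
        if column not in self._non_null:
            self._non_null[column] = self._df[column].dropna()
        return self._non_null[column]

    def unique(self, column):
        # Reuse the cached non-null values instead of rescanning the column.
        if column not in self._unique:
            self._unique[column] = self.non_null(column).unique()
        return self._unique[column]


df = pd.DataFrame({"a": [1.0, 2.0, None, 2.0], "b": ["x", None, "y", "x"]})
cache = ColumnScanCache(df)

# Several metrics share the same cached scans.
missing_a = len(df) - len(cache.non_null("a"))
cardinality_a = len(cache.unique("a"))
print(missing_a, cardinality_a)
```

A cache like this assumes the DataFrame is not mutated between calls; otherwise the stored results would go stale.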

Benchmarks

A benchmark script compares per-element apply(is_integer) against a vectorized modulo check for detecting integer-like floats. On large series, the vectorized approach is typically 5–6× faster.

Run via Python:

python scripts/benchmark_integer_checks.py           # default size (2,000,000)
python scripts/benchmark_integer_checks.py 5000000  # custom size

Or via Makefile target:

make benchmark                # default N=2,000,000
make benchmark N=5000000      # custom size
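
The comparison the benchmark makes can be sketched roughly as below. This is not the contents of scripts/benchmark_integer_checks.py; the function names and the smaller series size are assumptions chosen for illustration, and timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd


def is_integer_like_apply(series):
    """Per-element check: calls float.is_integer() once per value in Python."""
    return bool(series.dropna().apply(float.is_integer).all())


def is_integer_like_vectorized(series):
    """Vectorized check: a single modulo over the whole array."""
    values = series.dropna().to_numpy()
    return bool(np.all(values % 1 == 0))


# Integer-valued floats with some missing entries (size chosen for a quick run).
n = 200_000
series = pd.Series(np.arange(n, dtype="float64"))
series.iloc[::10] = np.nan

start = time.perf_counter()
apply_result = is_integer_like_apply(series)
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_result = is_integer_like_vectorized(series)
vectorized_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.4f}s  vectorized: {vectorized_seconds:.4f}s")
```

Both functions drop missing values first, so a float column that holds integers plus NaN is still reported as integer-like.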

Requirements

  • Python >= 3.10
  • pandas
  • numpy
  • scikit-learn
  • dsr-utils

License

MIT License - see LICENSE file for details

Download files

Source Distribution

dsr_data_tools-0.0.6.tar.gz (23.6 kB)

Uploaded Source

Built Distribution

dsr_data_tools-0.0.6-py3-none-any.whl (21.9 kB)

Uploaded Python 3

File details

Details for the file dsr_data_tools-0.0.6.tar.gz.

File metadata

  • Download URL: dsr_data_tools-0.0.6.tar.gz
  • Upload date:
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for dsr_data_tools-0.0.6.tar.gz
Algorithm     Hash digest
SHA256        ebc415ff73dfe66a2dc9350109565bee937fc791bf1c9d2239e307ac1ef445c3
MD5           2885e12c05df8f4fcd770a892143a34a
BLAKE2b-256   072cb7d8f2350e5b01fa55aa36b8fab328e6b40cfe10616d69fe4c00d30a72e3

File details

Details for the file dsr_data_tools-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: dsr_data_tools-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 21.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for dsr_data_tools-0.0.6-py3-none-any.whl
Algorithm     Hash digest
SHA256        493106f4cec22483e8b6227bf1fd7e933ae53f5d4dce9724c030f2360b0a3662
MD5           8f3d9cab4107bb3aa9c8f714ed779030
BLAKE2b-256   55a8a0818165813803dc950e64bb361e96f2a7abe1c2bf1f9ab188410dff07fd
