A Python library for exploratory data analysis with advanced statistical features

DataPrism

A Python library for exploratory data analysis with data profiling, quality assessment, and stability monitoring.

Python 3.9+ | License: MIT

Interactive Viewer

DataPrism includes a built-in interactive dashboard to explore your analysis results in the browser.

from dataprism import DataPrism, DataLoader

# Load data from CSV or Parquet
df = DataLoader.load_csv("data.csv")
# df = DataLoader.load_parquet("data.parquet")

# Run analysis and launch viewer
prism = DataPrism()
prism.analyze(
    data=df,
    target_variable="target",
    exclude_columns=["id", "split", "onboarding_date"],
    output_path="eda_results.json",
)
prism.view()

Summary — Dataset overview, insights, top features by IV, data quality score, and provider match rates.
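For readers unfamiliar with Information Value (IV), the ranking metric mentioned above, here is a minimal sketch of the standard Weight of Evidence / IV calculation for a numeric feature against a binary target. It is not DataPrism's implementation: the function name `information_value`, the quantile binning, and the additive smoothing constant `eps` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def information_value(feature: pd.Series, target: pd.Series,
                      bins: int = 10, eps: float = 0.5) -> float:
    """IV of a numeric feature against a binary (0/1) target.

    Hypothetical helper: quantile bins and additive smoothing are
    illustrative choices, not DataPrism's documented behavior.
    """
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    grouped = pd.DataFrame({"bin": binned, "y": target}).groupby(
        "bin", observed=True
    )["y"]
    # Smoothed event / non-event counts per bin (avoids log(0)).
    events = grouped.sum() + eps
    non_events = grouped.count() - grouped.sum() + eps
    pct_event = events / events.sum()
    pct_non_event = non_events / non_events.sum()
    # Weight of Evidence per bin, then IV as the weighted sum.
    woe = np.log(pct_event / pct_non_event)
    return float(((pct_event - pct_non_event) * woe).sum())


# Synthetic demo: one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, size=2000))
informative = target + pd.Series(rng.normal(0, 0.5, size=2000))
noise = pd.Series(rng.normal(0, 1, size=2000))

iv_informative = information_value(informative, target)
iv_noise = information_value(noise, target)
print(f"informative: {iv_informative:.3f}, noise: {iv_noise:.3f}")
```

The informative feature should score well above the noise feature, which is the property a "top features by IV" ranking relies on.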

Catalog — Sortable feature table with type, provider, target correlation, IV, and PSI at a glance.

Deep Dive — Per-feature detail view with statistics, violin plots, distribution charts, PSI trend analysis, target associations, and correlations.
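As background for the PSI figures shown in the viewer, the Population Stability Index can be sketched as follows. This is a generic formulation, not DataPrism's code: the function name `population_stability_index`, the quantile bin edges taken from the baseline sample, and the `eps` floor are assumptions.

```python
import numpy as np


def population_stability_index(expected, actual,
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a baseline sample and a comparison sample.

    Hypothetical helper: bins come from baseline quantiles, and the two
    outer bins are open-ended so out-of-range values are still counted.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    interior = edges[1:-1]
    n_bins = len(interior) + 1
    # Bin index of every observation, then the share of rows per bin.
    e_idx = np.searchsorted(interior, expected, side="right")
    a_idx = np.searchsorted(interior, actual, side="right")
    e_pct = np.bincount(e_idx, minlength=n_bins) / len(expected)
    a_pct = np.bincount(a_idx, minlength=n_bins) / len(actual)
    # Floor the shares so empty bins do not produce log(0).
    e_pct = np.clip(e_pct, eps, None)
    a_pct = np.clip(a_pct, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


# Synthetic demo: an unchanged sample versus a mean-shifted one.
rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
same_dist = rng.normal(0, 1, 5000)
shifted = rng.normal(1, 1, 5000)

psi_same = population_stability_index(baseline, same_dist)
psi_shifted = population_stability_index(baseline, shifted)
print(f"same: {psi_same:.4f}, shifted: {psi_shifted:.4f}")
```

A trend view like the one described above would repeat this comparison per time slice against a fixed baseline.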

Associations — Mixed-method heatmap (Pearson, Theil's U, Eta) showing relationships across all features.
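To illustrate the two less familiar measures named above, here is a rough sketch of Theil's U (categorical vs. categorical) and the correlation ratio Eta (categorical vs. numeric) using pandas. The function names are hypothetical, and how DataPrism actually selects and computes each measure is not shown in this README.

```python
import numpy as np
import pandas as pd


def entropy(s: pd.Series) -> float:
    """Shannon entropy (natural log) of a categorical series."""
    p = s.value_counts(normalize=True).to_numpy()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def theils_u(x: pd.Series, y: pd.Series) -> float:
    """Uncertainty coefficient U(x|y): share of x's entropy explained by y.

    Asymmetric: theils_u(x, y) generally differs from theils_u(y, x).
    """
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    # Conditional entropy H(x|y) as a weighted sum over groups of y.
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for _, g in x.groupby(y))
    return float((h_x - h_x_given_y) / h_x)


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta: sqrt of between-group variance over total variance."""
    grand_mean = values.mean()
    groups = values.groupby(categories)
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0


# Tiny demo with hand-checkable values.
colour = pd.Series(["a", "a", "b", "b"])
shape = pd.Series(["c", "d", "c", "d"])   # independent of colour
price = pd.Series([1.0, 1.0, 3.0, 3.0])   # fully determined by colour

print(theils_u(colour, colour), theils_u(colour, shape))
print(correlation_ratio(colour, price))
```

Pearson correlation for numeric pairs is available directly through `DataFrame.corr()` in pandas, so it is omitted here.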

How DataPrism Compares

Capability DataPrism ydata-profiling Sweetviz D-Tale AutoViz DataPrep
Predictive power (IV / WoE) 🟡 🟡
Drift detection (PSI) 🟡 🟡 🟡
Data quality score
Multi-source match rates
Schema-aware profiling 🟡 🟡 🟡 🟡
Structured JSON output 🟡 🟡
Interactive explorer 🟡 🟡

✅ Supported 🟡 Partial ➖ Not supported

Installation

pip install dataprism

Quick Start

from dataprism import DataPrism, DataLoader

df = DataLoader.load_csv("data.csv")

prism = DataPrism()
results = prism.analyze(
    data=df,
    exclude_columns=["customer_id", "created_at"],
    target_variable="target",
    output_path="eda_results.json"
)

For schema-aware profiling, stability analysis, and advanced configuration, see the Usage Guide.
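Because `analyze` writes its results to `output_path`, other programs can read the file with nothing more than the standard library. The layout of `eda_results.json` is not described in this README, so the sketch below makes no assumptions about its keys; `load_results` and `describe_top_level` are hypothetical helpers that only load the file and report what is at the top level.

```python
import json
from pathlib import Path
from typing import Any


def load_results(path: str) -> Any:
    """Parse a JSON results file written by an earlier analysis run."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def describe_top_level(results: Any) -> dict[str, str]:
    """Map each top-level key to the type name of its value."""
    if not isinstance(results, dict):
        return {"<root>": type(results).__name__}
    return {key: type(value).__name__ for key, value in results.items()}


# Only runs if a results file from the Quick Start is present.
if Path("eda_results.json").exists():
    for key, kind in describe_top_level(load_results("eda_results.json")).items():
        print(f"{key}: {kind}")
```

Inspecting the top-level structure first is a cautious way to explore the output before relying on any particular field.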

Roadmap

DataPrism is being built for the AI era — where data analysis is increasingly driven by LLM agents, automated pipelines, and programmatic consumers rather than humans clicking through dashboards.

AI-Native Analysis

  • Natural language insights — Auto-generated plain-English summaries of each feature, anomalies, and recommendations that LLMs can directly incorporate into reports.

Closing the Gaps

  • Dataset comparison — Side-by-side train/test/production profiling with automatic drift highlights.
  • Scatter & pair plots — Interactive scatter matrices for continuous feature pairs with target coloring.
  • Auto-visualization — One-line generation of per-feature visual summaries exportable as images.
  • Spark/Dask support — Distributed computation for datasets that don't fit in memory.
  • Streaming analysis — Incremental profiling for real-time data pipelines without re-analyzing the full dataset.

Deeper Intelligence

  • Automated feature recommendations — Go beyond flagging issues to suggesting transformations (log, binning, encoding) based on distribution shape and target relationship.
  • Anomaly explanations — When outliers or drift are detected, surface the likely cause (data pipeline issues, population shift, seasonality).
  • Cross-dataset lineage — Track how feature distributions evolve across model versions and data refreshes.

Documentation

  • Usage Guide — schema, stability analysis, advanced configuration, provider match rates
  • Architecture — internals, module structure, data flow
  • Decision Records — key design decisions and rationale
  • Examples — usage examples and demos

Development

pip install -e .           # Install for development
python -m build            # Build package
python -m pytest tests/    # Run tests

Requirements

  • Python 3.9+
  • pandas >= 2.0.0
  • numpy >= 1.24.0
  • scipy >= 1.10.0
  • pyarrow >= 10.0.0 (for Parquet support)

License

MIT License - see LICENSE file for details.

Contact

For questions or suggestions:

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Source Distribution

dataprism-0.1.5.tar.gz (89.3 kB)

Uploaded Source

Built Distribution

dataprism-0.1.5-py3-none-any.whl (86.8 kB)

Uploaded Python 3

File details

Details for the file dataprism-0.1.5.tar.gz.

File metadata

  • Download URL: dataprism-0.1.5.tar.gz
  • Upload date:
  • Size: 89.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataprism-0.1.5.tar.gz
Algorithm Hash digest
SHA256 c3633f3608d7e5e59754e43a1d35f4d65d303b26586d658522120c32e0be2b90
MD5 7637c92f20ced2834a7fd232e96baf48
BLAKE2b-256 2a6bfceef1587afa138f7b03d6a01982de934708e03c6e56d18254b1e3c3c419
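A downloaded archive can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch: the helper names are hypothetical, and the local file path is assumed to be wherever you saved the download.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for dataprism-0.1.5.tar.gz.
EXPECTED_SHA256 = "c3633f3608d7e5e59754e43a1d35f4d65d303b26586d658522120c32e0be2b90"


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    """Hex SHA256 digest of a file on disk."""
    with open(path, "rb") as fh:
        return sha256_of_stream(fh)


# Assumed local path; adjust to where the archive was saved.
archive = Path("dataprism-0.1.5.tar.gz")
if archive.exists():
    ok = sha256_of_file(str(archive)) == EXPECTED_SHA256
    print("hash matches" if ok else "HASH MISMATCH")
```

pip can perform the same check automatically when a requirements file pins hashes and is installed with `--require-hashes`.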

Provenance

The following attestation bundles were made for dataprism-0.1.5.tar.gz:

Publisher: pipeline.yaml on lattiq/dataprism

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dataprism-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: dataprism-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 86.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataprism-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 2d3496cff0cdd70f5be2e293a7a17bcf35386e0807689d3048aa6fc8118cf225
MD5 7b784851b703eae65284f8b553fe35b2
BLAKE2b-256 1deb6a56ebc73d0c4460080fe726bd948199d65611a7b97a51d12c6cbb635578

Provenance

The following attestation bundles were made for dataprism-0.1.5-py3-none-any.whl:

Publisher: pipeline.yaml on lattiq/dataprism

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
