DataPrism

A Python library for exploratory data analysis with data profiling, quality assessment, and stability monitoring.

Python 3.9+ License: MIT

Interactive Viewer

DataPrism includes a built-in interactive dashboard to explore your analysis results in the browser.

from dataprism import DataPrism, DataLoader

# Load data from CSV or Parquet
df = DataLoader.load_csv("data.csv")
# df = DataLoader.load_parquet("data.parquet")

# Run analysis and launch viewer
prism = DataPrism()
prism.analyze(
    data=df,
    target_variable="target",
    exclude_columns=["id", "split", "onboarding_date"],
    output_path="eda_results.json",
)
prism.view()

Summary — Dataset overview, insights, top features by IV, data quality score, and provider match rates.

Catalog — Sortable feature table with type, provider, target correlation, IV, and PSI at a glance.

Deep Dive — Per-feature detail view with statistics, violin plots, distribution charts, PSI trend analysis, target associations, and correlations.

Associations — Mixed-method heatmap (Pearson, Theil's U, Eta) showing relationships across all features.
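The three measures named in the Associations view are typically paired with different column types: Pearson for numeric/numeric pairs, Theil's U for categorical/categorical pairs, and Eta (the correlation ratio) for categorical/numeric pairs. The sketch below shows their textbook definitions using only pandas and NumPy. It is not DataPrism's implementation; the function names `entropy`, `theils_u`, and `correlation_ratio` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def entropy(s: pd.Series) -> float:
    """Shannon entropy (natural log) of a categorical series."""
    p = s.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())


def theils_u(x: pd.Series, y: pd.Series) -> float:
    """U(x | y): share of the entropy of x removed by knowing y (asymmetric, 0 to 1)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so y trivially determines it
    # Conditional entropy H(x | y): weighted average of the entropy within each level of y.
    weights = y.value_counts(normalize=True)
    h_x_given_y = sum(w * entropy(x[y == level]) for level, w in weights.items())
    return float((h_x - h_x_given_y) / h_x)


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0
    groups = values.groupby(categories)
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic"],
    "region": ["eu", "us", "eu", "us", "eu", "us"],
    "spend": [10.0, 12.0, 30.0, 28.0, 35.0, 11.0],
    "visits": [1.0, 2.0, 6.0, 5.0, 7.0, 2.0],
})
print("Pearson(spend, visits):", df["spend"].corr(df["visits"]))
print("Theil's U(plan | region):", theils_u(df["plan"], df["region"]))
print("Eta(plan, spend):", correlation_ratio(df["plan"], df["spend"]))
```

Theil's U is asymmetric, so `theils_u(a, b)` and `theils_u(b, a)` generally differ; a heatmap built on it is not necessarily symmetric across the diagonal.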

How DataPrism Compares

Capability DataPrism ydata-profiling Sweetviz D-Tale AutoViz DataPrep
Predictive power (IV / WoE) 🟡 🟡
Drift detection (PSI) 🟡 🟡 🟡
Data quality score
Multi-source match rates
Schema-aware profiling 🟡 🟡 🟡 🟡
Structured JSON output 🟡 🟡
Interactive explorer 🟡 🟡

✅ Supported 🟡 Partial ➖ Not supported
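For readers unfamiliar with the first two rows of the table, the following is a minimal sketch of how Weight of Evidence (WoE), Information Value (IV), and the Population Stability Index (PSI) are commonly defined. It assumes a binary 0/1 target and quantile binning, and it is not taken from DataPrism's source; the function names `woe_iv` and `psi`, the default bin counts, and the smoothing constants are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def woe_iv(feature: pd.Series, target: pd.Series, bins: int = 5, eps: float = 0.5):
    """Return (per-bin WoE series, total IV) for a numeric feature and a 0/1 target."""
    # Quantile binning; duplicates="drop" merges repeated bin edges.
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, target)
    # eps smoothing keeps the logarithm finite when a bin has no events or no non-events.
    events = counts[1] + eps
    non_events = counts[0] + eps
    pct_events = events / events.sum()
    pct_non_events = non_events / non_events.sum()
    # Sign conventions for WoE vary between sources; IV is the same under either one.
    woe = np.log(pct_events / pct_non_events)
    iv = float(((pct_events - pct_non_events) * woe).sum())
    return woe, iv


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a baseline sample (expected) and a comparison sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from the baseline sample's quantiles.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    # Only interior edges are used, so values outside the baseline range
    # fall into the first or last bin instead of being dropped.
    inner = edges[1:-1]
    n_bins = len(inner) + 1
    e_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    # Clip shares away from zero so the logarithm is defined for empty bins.
    e_pct = np.clip(e_counts / expected.size, eps, None)
    a_pct = np.clip(a_counts / actual.size, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
score = pd.Series(rng.normal(size=2000))
label = pd.Series((score + rng.normal(size=2000) > 0).astype(int))
woe, iv = woe_iv(score, label)
print("IV:", iv)
print("PSI vs. shifted sample:", psi(score, score + 1.0))
```

IV summarises how well a single feature separates the two target classes, while PSI compares the binned distribution of one feature across two samples (for example, training data versus a later snapshot).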

Installation

pip install dataprism

Quick Start

from dataprism import DataPrism, DataLoader

df = DataLoader.load_csv("data.csv")

prism = DataPrism()
results = prism.analyze(
    data=df,
    exclude_columns=["customer_id", "created_at"],
    target_variable="target",
    output_path="eda_results.json"
)

For schema-aware profiling, stability analysis, and advanced configuration, see the Usage Guide.
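Because `analyze` is given an `output_path`, the results can also be consumed without the viewer. The schema of that JSON file is not described on this page, so the sketch below makes no assumption about it beyond the file being valid JSON: it only lists the top-level keys and the type of each value. The helper name `top_level_summary` is hypothetical.

```python
import json
from pathlib import Path


def top_level_summary(path: str) -> dict:
    """Map each top-level key of a JSON file to the type name of its value."""
    with Path(path).open(encoding="utf-8") as fh:
        payload = json.load(fh)
    if not isinstance(payload, dict):
        # The root is a list or scalar rather than an object.
        return {"<root>": type(payload).__name__}
    return {key: type(value).__name__ for key, value in payload.items()}


# Usage, assuming the Quick Start above has already written the file:
# print(top_level_summary("eda_results.json"))
```

Inspecting the keys first is a cheap way to discover the layout before writing code that depends on specific fields.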

Roadmap

DataPrism is being built for the AI era — where data analysis is increasingly driven by LLM agents, automated pipelines, and programmatic consumers rather than humans clicking through dashboards.

AI-Native Analysis

  • Natural language insights — Auto-generated plain-English summaries of each feature, anomalies, and recommendations that LLMs can directly incorporate into reports.

Closing the Gaps

  • Dataset comparison — Side-by-side train/test/production profiling with automatic drift highlights.
  • Scatter & pair plots — Interactive scatter matrices for continuous feature pairs with target coloring.
  • Auto-visualization — One-line generation of per-feature visual summaries exportable as images.
  • Spark/Dask support — Distributed computation for datasets that don't fit in memory.
  • Streaming analysis — Incremental profiling for real-time data pipelines without re-analyzing the full dataset.

Deeper Intelligence

  • Automated feature recommendations — Go beyond flagging issues to suggesting transformations (log, binning, encoding) based on distribution shape and target relationship.
  • Anomaly explanations — When outliers or drift are detected, surface the likely cause (data pipeline issues, population shift, seasonality).
  • Cross-dataset lineage — Track how feature distributions evolve across model versions and data refreshes.

Documentation

  • Usage Guide — schema, stability analysis, advanced configuration, provider match rates
  • Architecture — internals, module structure, data flow
  • Decision Records — key design decisions and rationale
  • Examples — usage examples and demos

Development

pip install -e .           # Install for development
python -m build            # Build package
python -m pytest tests/    # Run tests

Requirements

  • Python 3.9+
  • pandas >= 2.0.0
  • numpy >= 1.24.0
  • scipy >= 1.10.0
  • pyarrow >= 10.0.0 (for Parquet support)

License

MIT License - see LICENSE file for details.

Contact

For questions or suggestions:

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
