A Python library for exploratory data analysis with advanced statistical features

DataPrism

A comprehensive Python library for exploratory data analysis with advanced features for data profiling, quality assessment, and stability monitoring.

Interactive Viewer

DataPrism includes a built-in interactive dashboard to explore your analysis results in the browser.

from dataprism import DataPrism, DataLoader

# Load data from CSV or Parquet
df = DataLoader.load_csv("data.csv")
# df = DataLoader.load_parquet("data.parquet")

# Run analysis and launch viewer
prism = DataPrism()
prism.analyze(
    data=df,
    target_variable="target",
    exclude_columns=["id", "split", "onboarding_date"],
    output_path="eda_results.json",
)
prism.view()

Summary — Dataset overview, insights, top features by IV, data quality score, and provider match rates.

Catalog — Sortable feature table with type, provider, target correlation, IV, and PSI at a glance.

Deep Dive — Per-feature detail view with statistics, violin plots, distribution charts, PSI trend analysis, target associations, and correlations.

Associations — Mixed-method heatmap (Pearson, Theil's U, Eta) showing relationships across all features.
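Two of the measures named for the heatmap are less familiar than Pearson: Theil's U is conventionally used for categorical-categorical pairs and Eta (the correlation ratio) for categorical-numeric pairs. The sketch below implements the textbook definitions with the standard library only. The function names are illustrative, and this is not necessarily how DataPrism computes them internally.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    total = len(labels)
    return -sum((n / total) * math.log(n / total) for n in Counter(labels).values())


def theils_u(x, y):
    """U(x|y): share of the entropy of x that knowing y removes (0 to 1)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    by_y = defaultdict(list)
    for xi, yi in zip(x, y):
        by_y[yi].append(xi)
    # Conditional entropy H(x|y): entropy of x within each y group, weighted by group size.
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    mean = sum(values) / len(values)
    total_ss = sum((v - mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    between_ss = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return math.sqrt(between_ss / total_ss)
```

Note that Theil's U is asymmetric, so theils_u(x, y) and theils_u(y, x) generally differ, which is why a heatmap built on it need not be symmetric.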

How DataPrism Compares

Feature	DataPrism	Sweetviz	ydata-profiling	AutoViz	D-Tale	DataPrep
Programmatic API	Yes	Yes	Yes	Yes	Yes	Yes
Interactive Viewer	Yes	Yes	Yes	Partial	Yes	Yes
Correlation Analysis	Pearson, Spearman, Theil's U, Eta	Pearson, UC, Eta	Pearson, Spearman, Kendall, Phi-k	Pearson	Pearson, PPS	Pearson, Spearman, Kendall
Histogram / Bar Chart	Yes	Yes	Yes	Yes	Yes	Yes
Box Plot	Yes	—	Yes	—	Yes	Yes
Association Heatmap	Yes	Yes	Yes	—	Yes	Yes
Target-Overlaid Distribution	Yes	Yes	—	—	—	—
Scatter / Pair Plot	—	—	Yes	Yes	Yes	Yes
Violin Plot	Yes	—	—	Yes	—	—
Time Series / Trend	Yes	—	Yes	—	Yes	—
Schema-Driven Analysis	Yes	Partial	Yes	—	Partial	Partial
Mixed-Type Associations	Yes	Yes	Yes	—	Partial	Partial
Structured JSON Export	Yes	—	Yes	—	Partial	—
Target Analysis (IV/WoE)	Yes	—	—	—	—	—
Drift / PSI Stability	Yes	—	—	—	—	—
Data Quality Score	Yes	—	—	—	—	—
Sentinel Value Handling	Yes	—	—	—	—	—
Provider Match Rates	Yes	—	—	—	—	—

Where DataPrism leads: Schema-aware profiling with column roles and sentinel codes, IV/WoE for credit risk, PSI-based stability monitoring (cohort and time-based), automated data quality scoring, and provider-level match rates. Of the libraries compared above, none covers these out of the box.

Where DataPrism lags: No dataset comparison (train vs. test side by side), no scatter or pair plots, no auto-visualization per feature, and no Spark/Dask support for distributed datasets. These are on the roadmap.

Roadmap

DataPrism is being built for the AI era — where data analysis is increasingly driven by LLM agents, automated pipelines, and programmatic consumers rather than humans clicking through dashboards.

AI-Native Analysis

LLM-consumable output — Structured JSON output designed for AI agents to read, reason about, and act on. No screen-scraping HTML reports or parsing PDFs.
Natural language insights — Auto-generated plain-English summaries of each feature, anomalies, and recommendations that LLMs can directly incorporate into reports.
Agent-friendly API — Minimal, predictable interface (analyze() → view()) that AI coding assistants can invoke without ambiguity. Schema-driven configuration over magic defaults.
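Since analyze() already writes its results to the file given as output_path, a script or agent can consume them with the standard library alone. The sketch below lists the top-level sections of a results file without assuming any particular key names, because the output layout is not documented here; describe_results is a hypothetical helper, not part of DataPrism.

```python
import json
from pathlib import Path


def describe_results(path):
    """Return {top-level key: type name} for a JSON results file."""
    with Path(path).open(encoding="utf-8") as handle:
        results = json.load(handle)
    # Assumption: the file holds a JSON object at the top level.
    if not isinstance(results, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in results.items()}
```

Calling describe_results("eda_results.json") after running the viewer example gives a quick map of what the file contains before any deeper parsing.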

Closing the Gaps

Dataset comparison — Side-by-side train/test/production profiling with automatic drift highlights.
Scatter & pair plots — Interactive scatter matrices for continuous feature pairs with target coloring.
Auto-visualization — One-line generation of per-feature visual summaries exportable as images.
Spark/Dask support — Distributed computation for datasets that don't fit in memory.
Streaming analysis — Incremental profiling for real-time data pipelines without re-analyzing the full dataset.

Deeper Intelligence

Automated feature recommendations — Go beyond flagging issues to suggesting transformations (log, binning, encoding) based on distribution shape and target relationship.
Anomaly explanations — When outliers or drift are detected, surface the likely cause (data pipeline issues, population shift, seasonality).
Cross-dataset lineage — Track how feature distributions evolve across model versions and data refreshes.

Features

Automated Feature Analysis — Continuous and categorical profiling with automatic type inference and missing value detection
Target Relationship Analysis — Information Value (IV), Weight of Evidence (WoE), optimal binning, predictive power classification
Correlation & Association Analysis — Pearson, Spearman, Theil's U, Eta with unified association matrix across all feature types
Quality Assessment — Automated scoring (0-10), per-feature quality flags, actionable recommendations
Sentinel Value Handling — Automatic detection and replacement of no-hit values with nullable type preservation
Cohort-Based Stability — PSI and KS test for train/test drift detection
Time-Based Stability — Monthly, weekly, quartile, or custom time windows with temporal trend analysis
Provider Match Rates — Automatic data coverage statistics by provider
Large Dataset Support — CSV and Parquet formats, chunked reading, configurable sampling
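For readers new to the target-analysis terms: with a binary target, the Weight of Evidence of a bin is commonly defined as ln(share of non-events in the bin / share of events in the bin), and Information Value sums (non-event share minus event share) times WoE over all bins. The NumPy sketch below computes both for data that is already binned. The function name, the 0.5 smoothing constant, and the non-event/event orientation are assumptions for illustration, not DataPrism's implementation.

```python
import numpy as np


def woe_iv(bins, target, smoothing=0.5):
    """Return ({bin: WoE}, IV) for a binary target (1 = event, 0 = non-event)."""
    bins = np.asarray(bins)
    target = np.asarray(target)
    labels = np.unique(bins)
    # Per-bin counts of events and non-events, smoothed to avoid log(0).
    events = np.array([np.sum((bins == b) & (target == 1)) for b in labels]) + smoothing
    non_events = np.array([np.sum((bins == b) & (target == 0)) for b in labels]) + smoothing
    event_share = events / events.sum()
    non_event_share = non_events / non_events.sum()
    woe = np.log(non_event_share / event_share)
    iv = float(np.sum((non_event_share - event_share) * woe))
    return dict(zip(labels.tolist(), woe.tolist())), iv
```

Every term in the IV sum is non-negative, so a feature whose bins all have the same event rate scores 0 and the value grows as the bins separate events from non-events.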

Installation

pip install dataprism

Quick Start

Basic Usage

from dataprism import DataPrism, DataLoader
import pandas as pd

# Option 1: Load from file using DataLoader
df = DataLoader.load_csv("data.csv")

# Option 2: Use an existing DataFrame instead (from pandas, a database, etc.)
# df = pd.read_csv("data.csv")

# Initialize prism
prism = DataPrism(
    max_categories=50,
    top_correlations=10
)

# Run analysis (exclude non-feature columns when no schema is available)
results = prism.analyze(
    data=df,
    exclude_columns=["customer_id", "created_at"],
    target_variable="target",
    output_path="eda_results.json"
)

With DatasetSchema

from dataprism import (
    DataPrism, DataLoader,
    ColumnConfig, ColumnType, ColumnRole, Sentinels, DatasetSchema,
)

# Load data and schema
df = DataLoader.load_csv("data.csv")
schema = DataLoader.load_schema("schema.json")

# Or create schema programmatically
schema = DatasetSchema([
    ColumnConfig('age', ColumnType.CONTINUOUS, ColumnRole.FEATURE,
                 provider='demographics', description='User age',
                 sentinels=Sentinels(not_found='-1')),
    ColumnConfig('zip_code', ColumnType.CATEGORICAL, ColumnRole.FEATURE,
                 provider='address', description='ZIP code',
                 sentinels=Sentinels(not_found='', missing='00000')),
    ColumnConfig('target', ColumnType.BINARY, ColumnRole.TARGET),
])

# Run with schema
prism = DataPrism()
results = prism.analyze(
    data=df,
    schema=schema,
    target_variable="target",
    output_path="eda_results.json"
)
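The sentinels in the schema mark placeholder codes such as '-1' for "not found". As a rough illustration of what replacing them with nulls while keeping a nullable dtype can look like in plain pandas, here is a small sketch; replace_sentinels is a hypothetical helper and not DataPrism's internal logic.

```python
import pandas as pd


def replace_sentinels(series, sentinels):
    """Mask sentinel codes as <NA>, comparing on the string form of each value."""
    # Note: a float column renders -1 as "-1.0", so its sentinel must be written that way.
    is_sentinel = series.astype("string").isin([str(s) for s in sentinels])
    # convert_dtypes() picks a nullable dtype (Int64, string, ...) so that
    # integer columns stay integers after nulls are introduced.
    return series.convert_dtypes().mask(is_sentinel, pd.NA)


age = pd.Series([34, -1, 52, -1], name="age")
cleaned = replace_sentinels(age, ["-1"])
print(cleaned.dtype)         # Int64
print(cleaned.isna().sum())  # 2
```

Without the nullable dtype, introducing nulls would silently turn the integer column into floats, which is the situation the "nullable type preservation" feature is meant to avoid.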

Schema JSON format (schema.json):

{
  "columns": [
    {
      "name": "age",
      "type": "continuous",
      "role": "feature",
      "provider": "demographics",
      "description": "User age",
      "sentinels": {
        "not_found": "-1",
        "missing": null
      }
    }
  ]
}
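Before handing a hand-written schema file to the loader, a quick structural check can catch missing keys early. The sketch below uses only the standard library and the field names shown in the JSON above. Which keys are mandatory is an assumption here (name, type, and role), so adjust it to whatever DataLoader.load_schema actually requires.

```python
import json

REQUIRED_KEYS = {"name", "type", "role"}  # assumption: the minimum per column


def check_schema(document):
    """Return a list of human-readable problems found in a schema dict."""
    columns = document.get("columns")
    if not isinstance(columns, list):
        return ["top-level 'columns' must be a list"]
    problems = []
    seen = set()
    for index, column in enumerate(columns):
        if not isinstance(column, dict):
            problems.append(f"column {index}: expected an object")
            continue
        missing = REQUIRED_KEYS - set(column)
        if missing:
            problems.append(f"column {index}: missing {sorted(missing)}")
        name = column.get("name")
        if name is not None:
            if name in seen:
                problems.append(f"column {index}: duplicate name {name!r}")
            seen.add(name)
    return problems


schema_text = '{"columns": [{"name": "age", "type": "continuous", "role": "feature"}]}'
print(check_schema(json.loads(schema_text)))  # []
```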

Stability Analysis

Cohort-Based (Train/Test)

from dataprism import DataPrism, DataLoader

# Load data and schema
df = DataLoader.load_parquet("data.parquet")
schema = DataLoader.load_schema("schema.json")

# Configure for stability analysis
prism = DataPrism(
    calculate_stability=True,
    cohort_column='dataTag',
    baseline_cohort='training',
    comparison_cohort='test'
)

results = prism.analyze(
    data=df,
    schema=schema
)
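As background for the cohort comparison, the Population Stability Index measures how differently a feature's values fall into bins in the baseline cohort versus the comparison cohort: PSI is the sum over bins of (q - p) * ln(q / p), where p and q are the bin shares in each cohort. A minimal NumPy sketch follows. The decile binning on the baseline, the epsilon floor, and the function name are illustrative assumptions rather than DataPrism's exact method.

```python
import numpy as np


def population_stability_index(baseline, comparison, n_bins=10, eps=1e-6):
    """PSI of `comparison` against `baseline` using baseline quantile bins."""
    baseline = np.asarray(baseline, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    # Interior bin edges come from baseline quantiles; values outside the
    # baseline range fall into the first or last bin.
    edges = np.unique(np.quantile(baseline, np.linspace(0, 1, n_bins + 1))[1:-1])
    p = np.bincount(np.searchsorted(edges, baseline), minlength=len(edges) + 1)
    q = np.bincount(np.searchsorted(edges, comparison), minlength=len(edges) + 1)
    # Floor the shares so empty bins do not produce log(0) or division by zero.
    p = np.clip(p / p.sum(), eps, None)
    q = np.clip(q / q.sum(), eps, None)
    return float(np.sum((q - p) * np.log(q / p)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5_000)
test = rng.normal(0.5, 1.0, 5_000)
print(population_stability_index(train, train))  # 0.0
print(population_stability_index(train, test))   # clearly above 0
```

A commonly cited rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift, though the thresholds DataPrism applies may differ.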

Time-Based

from dataprism import DataPrism, DataLoader

# Load data and schema
df = DataLoader.load_parquet("data.parquet")
schema = DataLoader.load_schema("schema.json")

# Configure for time-based stability
prism = DataPrism(
    time_based_stability=True,
    time_column='onboarding_time',
    time_window_strategy='monthly',  # or 'weekly', 'quartiles', 'custom'
    baseline_period='first',
    comparison_periods='all',
    min_samples_per_period=100
)

results = prism.analyze(
    data=df,
    schema=schema
)
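One plausible reading of the configuration above is: bucket rows by calendar month, drop months with fewer than min_samples_per_period rows, and compare each remaining month against the earliest one. The pandas sketch below illustrates only that windowing step under those assumptions; monthly_windows is a hypothetical helper, not part of the DataPrism API.

```python
import pandas as pd


def monthly_windows(df, time_column, min_samples=100):
    """Split df into {Period('YYYY-MM'): rows}, skipping months with too few rows."""
    months = pd.to_datetime(df[time_column]).dt.to_period("M")
    # groupby sorts its keys, so the resulting dict runs from earliest to latest month.
    return {
        month: group
        for month, group in df.groupby(months)
        if len(group) >= min_samples
    }


frame = pd.DataFrame({
    "onboarding_time": pd.date_range("2025-01-01", periods=90, freq="D"),
    "score": range(90),
})
windows = monthly_windows(frame, "onboarding_time", min_samples=30)
baseline_month = next(iter(windows))  # earliest month that met the threshold
print(baseline_month, [str(m) for m in windows])  # 2025-01 ['2025-01', '2025-03']
```

February has only 28 rows in this toy frame, so it falls below the threshold and is skipped, which is the kind of gap a minimum-sample rule produces in a trend chart.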

Development

pip install -e .           # Install for development
python -m build            # Build package
python -m pytest tests/    # Run tests

Documentation

Architecture — internals, module structure, data flow
Usage Guide — advanced configuration, provider match rates, feature counts reference
Decision Records — key design decisions and rationale
Examples — usage examples and demos

Requirements

Python 3.9+
pandas >= 2.0.0
numpy >= 1.24.0
scipy >= 1.10.0
pyarrow >= 10.0.0 (for Parquet support)

License

MIT License - see LICENSE file for details.

Contact

For questions or suggestions:

Email: dev@lattiq.com
GitHub: https://github.com/lattiq/dataprism

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Maintainers

gunasekar lattiq

Release history

0.1.7 (Apr 3, 2026)
0.1.6 (Mar 31, 2026)
0.1.5 (Mar 31, 2026)
0.1.4 (Mar 31, 2026)
0.1.3 (Mar 12, 2026)
0.1.2 (Mar 7, 2026), this version
0.1.1 (Feb 19, 2026)
0.1.0 (Feb 18, 2026)
0.0.1 (Jul 29, 2025)

Download files

Source Distribution

dataprism-0.1.2.tar.gz (67.4 kB)

Uploaded Mar 7, 2026 Source

Built Distribution

dataprism-0.1.2-py3-none-any.whl (63.5 kB)

Uploaded Mar 7, 2026 Python 3

File details

Details for the file dataprism-0.1.2.tar.gz.

File metadata

Download URL: dataprism-0.1.2.tar.gz
Upload date: Mar 7, 2026
Size: 67.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataprism-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`d4810713a576f8d140c0a9378b8d5b56b55cb114e5fa6c0f0023c08583482365`
MD5	`28596b779505615b140cbb3f789670dc`
BLAKE2b-256	`651130523d8ef20af14594efe49fb4f911d087165fb275875ce20c19c87d72d8`

Provenance

The following attestation bundles were made for dataprism-0.1.2.tar.gz:

Publisher: publish.yml on lattiq/dataprism

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataprism-0.1.2.tar.gz
- Subject digest: d4810713a576f8d140c0a9378b8d5b56b55cb114e5fa6c0f0023c08583482365
- Sigstore transparency entry: 1055742149
- Sigstore integration time: Mar 7, 2026
Source repository:
- Permalink: lattiq/dataprism@103dbffa03fd6bb1067cca3387c2339e0fe451e6
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/lattiq
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@103dbffa03fd6bb1067cca3387c2339e0fe451e6
- Trigger Event: release

File details

Details for the file dataprism-0.1.2-py3-none-any.whl.

File metadata

Download URL: dataprism-0.1.2-py3-none-any.whl
Upload date: Mar 7, 2026
Size: 63.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataprism-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`687d98377a39ff67d908e2e7746ce6e390c249dd4e67a313d92d6ea2a63d9753`
MD5	`6cd080a02c2fe2674a854c49dfd53ddb`
BLAKE2b-256	`64762995d5bd1bbffadd53cce572a1aefabc4f17c8f7547bbe470b8f8f0cd1ef`

Provenance

The following attestation bundles were made for dataprism-0.1.2-py3-none-any.whl:

Publisher: publish.yml on lattiq/dataprism

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataprism-0.1.2-py3-none-any.whl
- Subject digest: 687d98377a39ff67d908e2e7746ce6e390c249dd4e67a313d92d6ea2a63d9753
- Sigstore transparency entry: 1055742214
- Sigstore integration time: Mar 7, 2026
Source repository:
- Permalink: lattiq/dataprism@103dbffa03fd6bb1067cca3387c2339e0fe451e6
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/lattiq
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@103dbffa03fd6bb1067cca3387c2339e0fe451e6
- Trigger Event: release
