A Python library for exploratory data analysis with advanced statistical features

DataPrism

A comprehensive Python library for exploratory data analysis with advanced features for data profiling, quality assessment, and stability monitoring.

Interactive Viewer

DataPrism includes a built-in interactive dashboard to explore your analysis results in the browser.

from dataprism.viewer.server import serve_results

# From a saved JSON file
serve_results("eda_results.json")

# Or directly from EDA results
# (runner, df, and schema are set up as shown in Quick Start below)
results = runner.run(data=df, schema=schema, target_variable="target")
serve_results(results)

Summary — Dataset overview, insights, top features by IV, data quality score, and provider match rates.

Catalog — Sortable feature table with type, provider, target correlation, IV, and PSI at a glance.

Deep Dive — Per-feature detail view with statistics, box plots, distribution charts, target associations, and correlations.

Associations — Mixed-method heatmap (Pearson, Theil's U, Eta) showing relationships across all features.
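To make the mixed-method idea concrete, here is a minimal, dependency-free sketch of the two less familiar measures: Theil's U for categorical vs. categorical pairs and the correlation ratio (Eta) for categorical vs. numeric pairs. The function names are illustrative and are not part of the DataPrism API; how DataPrism computes these internally is not shown in this README.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (natural log) of a sequence of category labels."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def theils_u(x, y):
    """Uncertainty coefficient U(x|y): share of x's entropy explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    # Conditional entropy H(x|y) = sum over y-groups of p(group) * H(x within group)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    n = len(x)
    h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    mean = sum(values) / len(values)
    ss_total = sum((v - mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # a constant numeric column carries no association
    ss_between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return math.sqrt(ss_between / ss_total)


colour = ["red", "blue", "red", "blue"]
size = ["S", "S", "L", "L"]
price = [1.0, 1.0, 3.0, 3.0]

u_same = theils_u(colour, colour)            # a column fully determines itself
u_unrelated = theils_u(colour, size)         # colour is evenly split within each size
eta_size_price = correlation_ratio(size, price)
```

Theil's U is asymmetric (U(x|y) generally differs from U(y|x)), so a heatmap built from it need not be symmetric across the diagonal.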

Features

Automated Feature Analysis — Continuous and categorical profiling with automatic type inference and missing value detection
Target Relationship Analysis — Information Value (IV), Weight of Evidence (WoE), optimal binning, predictive power classification
Correlation & Association Analysis — Pearson, Spearman, Theil's U, Eta with unified association matrix across all feature types
Quality Assessment — Automated scoring (0-10), per-feature quality flags, actionable recommendations
Sentinel Value Handling — Automatic detection and replacement of no-hit values with nullable type preservation
Cohort-Based Stability — PSI and KS test for train/test drift detection
Time-Based Stability — Monthly, weekly, quartile, or custom time windows with temporal trend analysis
Provider Match Rates — Automatic data coverage statistics by provider
Large Dataset Support — CSV and Parquet formats, chunked reading, configurable sampling
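
As background for the Target Relationship Analysis item above, the following sketch shows how Weight of Evidence and Information Value are conventionally computed from per-bin counts against a binary target. It is a textbook formulation, not DataPrism's code: the function name, the sign convention (non-events over events), and the 0.5 smoothing for empty cells are all assumptions made for illustration.

```python
import math


def woe_iv(bins):
    """Compute WoE per bin and total IV.

    bins: list of (non_event_count, event_count) tuples, one per bin.
    Uses WoE = ln(share of non-events / share of events). Some references
    flip the ratio, which flips the sign of WoE but leaves IV unchanged.
    Assumes both classes occur at least once overall.
    """
    total_non_events = sum(n for n, _ in bins)
    total_events = sum(e for _, e in bins)
    woe, iv = [], 0.0
    for non_events, events in bins:
        # Replace empty cells with 0.5 so the logarithm stays finite.
        n = non_events if non_events > 0 else 0.5
        e = events if events > 0 else 0.5
        share_non_events = n / total_non_events
        share_events = e / total_events
        w = math.log(share_non_events / share_events)
        woe.append(w)
        iv += (share_non_events - share_events) * w
    return woe, iv


# A feature whose two bins separate the classes well...
woe_strong, iv_strong = woe_iv([(80, 20), (20, 80)])
# ...and one whose bins carry no information about the target.
woe_flat, iv_flat = woe_iv([(50, 50), (50, 50)])
```

Each bin contributes a non-negative term to IV, so a larger total means the binned feature separates the two target classes more strongly.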

Installation

pip install dataprism

Quick Start

Basic Usage

from dataprism import EDARunner, DataLoader
import pandas as pd

# Option 1: Load from file using DataLoader
df = DataLoader.load_csv("data.csv")

# Option 2: Use existing DataFrame
df = pd.read_csv("data.csv")  # or from database, etc.

# Initialize runner
runner = EDARunner(
    max_categories=50,
    top_correlations=10
)

# Run analysis
results = runner.run(
    data=df,
    output_path="eda_results.json"
)

With DatasetSchema

from dataprism import (
    EDARunner, DataLoader,
    ColumnConfig, ColumnType, ColumnRole, Sentinels, DatasetSchema,
)

# Load data and schema
df = DataLoader.load_csv("data.csv")
schema = DataLoader.load_schema("schema.json")

# Or create schema programmatically
schema = DatasetSchema([
    ColumnConfig('age', ColumnType.CONTINUOUS, ColumnRole.FEATURE,
                 provider='demographics', description='User age',
                 sentinels=Sentinels(not_found='-1')),
    ColumnConfig('zip_code', ColumnType.CATEGORICAL, ColumnRole.FEATURE,
                 provider='address', description='ZIP code',
                 sentinels=Sentinels(not_found='', missing='00000')),
    ColumnConfig('target', ColumnType.BINARY, ColumnRole.TARGET),
])

# Run with schema
runner = EDARunner()
results = runner.run(
    data=df,
    schema=schema,
    target_variable="target",
    output_path="eda_results.json"
)

Schema JSON format (schema.json):

{
  "columns": [
    {
      "name": "age",
      "type": "continuous",
      "role": "feature",
      "provider": "demographics",
      "description": "User age",
      "sentinels": {
        "not_found": "-1",
        "missing": null
      }
    }
  ]
}
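
To illustrate what the sentinels block in the schema is for, here is a small standard-library sketch that reads a schema in the format above and replaces configured sentinel values with None. It only illustrates the idea; the helper names are hypothetical, and DataPrism's own handling (including nullable type preservation) is not reproduced here.

```python
import json

SCHEMA_JSON = """
{
  "columns": [
    {
      "name": "age",
      "type": "continuous",
      "role": "feature",
      "sentinels": {"not_found": "-1", "missing": null}
    }
  ]
}
"""


def sentinel_map(schema):
    """Map each column name to the set of its configured, non-null sentinel strings."""
    mapping = {}
    for column in schema["columns"]:
        sentinels = column.get("sentinels") or {}
        mapping[column["name"]] = {v for v in sentinels.values() if v is not None}
    return mapping


def replace_sentinels(rows, schema):
    """Return copies of the rows with sentinel values replaced by None."""
    mapping = sentinel_map(schema)
    return [
        {
            # Sentinels are stored as strings, so compare on the string form.
            key: None if str(value) in mapping.get(key, set()) else value
            for key, value in row.items()
        }
        for row in rows
    ]


schema = json.loads(SCHEMA_JSON)
rows = [{"age": 34}, {"age": -1}, {"age": 51}]
cleaned = replace_sentinels(rows, schema)
```

Treating -1 as missing rather than as a real age keeps it from distorting means, quantiles, and bin edges computed afterwards.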

Stability Analysis

Cohort-Based (Train/Test)

from dataprism import EDARunner, DataLoader

# Load data and schema
df = DataLoader.load_parquet("data.parquet")
schema = DataLoader.load_schema("schema.json")

# Configure for stability analysis
runner = EDARunner(
    calculate_stability=True,
    cohort_column='dataTag',
    baseline_cohort='training',
    comparison_cohort='test'
)

results = runner.run(
    data=df,
    schema=schema
)
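
For readers unfamiliar with the two drift measures named in the feature list, the sketch below computes a Population Stability Index and a two-sample Kolmogorov-Smirnov statistic for one numeric feature, using only the standard library. The binning scheme (baseline quantiles), the eps floor, and the function names are assumptions for illustration, not DataPrism internals.

```python
import math
from bisect import bisect_right


def psi(baseline, comparison, n_bins=10, eps=1e-6):
    """Population Stability Index with bin edges taken from baseline quantiles."""
    ordered = sorted(baseline)
    # Interior cut points at baseline quantiles; duplicates collapse for tied data.
    edges = sorted({ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)})

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    base, comp = shares(baseline), shares(comparison)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, comp))


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


training = list(range(1000))
test_shifted = [v + 500 for v in training]  # same shape, shifted upward

psi_same = psi(training, training)
psi_shifted = psi(training, test_shifted)
ks_shifted = ks_statistic(training, test_shifted)
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift. This README does not say which thresholds DataPrism applies.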

Time-Based

from dataprism import EDARunner, DataLoader

# Load data and schema
df = DataLoader.load_parquet("data.parquet")
schema = DataLoader.load_schema("schema.json")

# Configure for time-based stability
runner = EDARunner(
    time_based_stability=True,
    time_column='onboarding_time',
    time_window_strategy='monthly',  # or 'weekly', 'quartiles', 'custom'
    baseline_period='first',
    comparison_periods='all',
    min_samples_per_period=100
)

results = runner.run(
    data=df,
    schema=schema
)
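
The windowing step behind a time-based comparison can be sketched as follows: group observations by calendar month, drop months with too few samples, and treat the earliest remaining month as the baseline. This is an assumed reading of time_window_strategy='monthly', baseline_period='first', comparison_periods='all', and min_samples_per_period. The helper below is hypothetical and uses only the standard library.

```python
from collections import defaultdict
from datetime import date


def monthly_windows(records, min_samples=1):
    """Group (date, value) pairs by calendar month, dropping sparse months.

    Returns a chronologically sorted list of ((year, month), values) pairs.
    """
    buckets = defaultdict(list)
    for day, value in records:
        buckets[(day.year, day.month)].append(value)
    return [
        (period, values)
        for period, values in sorted(buckets.items())
        if len(values) >= min_samples
    ]


records = [
    (date(2025, 1, 5), 0.2), (date(2025, 1, 20), 0.3),
    (date(2025, 2, 3), 0.4), (date(2025, 2, 17), 0.5),
    (date(2025, 3, 9), 0.9),  # March has a single sample and is dropped below
]

windows = monthly_windows(records, min_samples=2)
baseline_period, baseline_values = windows[0]  # earliest retained month
comparison_windows = windows[1:]               # every later retained month
```

Each comparison window would then be scored against the baseline with a drift measure such as PSI, giving one value per period that can be tracked over time.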

Development

pip install -e .           # Install for development
python -m build            # Build package
python -m pytest tests/    # Run tests

Documentation

Architecture — internals, module structure, data flow
Usage Guide — advanced configuration, provider match rates, feature counts reference
Decision Records — key design decisions and rationale
Examples — usage examples and demos

Requirements

Python 3.9+
pandas >= 2.0.0
numpy >= 1.24.0
scipy >= 1.10.0
pyarrow >= 10.0.0 (for Parquet support)

License

MIT License - see LICENSE file for details.

Contact

For questions or suggestions:

Email: dev@lattiq.com
GitHub: https://github.com/lattiq/dataprism

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project details

Maintainers

gunasekar lattiq

Release history

0.1.7 (Apr 3, 2026)
0.1.6 (Mar 31, 2026)
0.1.5 (Mar 31, 2026)
0.1.4 (Mar 31, 2026)
0.1.3 (Mar 12, 2026)
0.1.2 (Mar 7, 2026)
0.1.1 (Feb 19, 2026)
0.1.0 (Feb 18, 2026), this version
0.0.1 (Jul 29, 2025)

Download files

Source Distribution

dataprism-0.1.0.tar.gz (81.5 kB)

Uploaded Feb 18, 2026 Source

Built Distribution

dataprism-0.1.0-py3-none-any.whl (80.1 kB)

Uploaded Feb 18, 2026 Python 3

File details

Details for the file dataprism-0.1.0.tar.gz.

File metadata

Download URL: dataprism-0.1.0.tar.gz
Upload date: Feb 18, 2026
Size: 81.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataprism-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`62ee69bd163e1b023b1a450877557766459c856ef307368a8ef6760249ae89d7`
MD5	`0518fba4b1e236fa1f1b1afd93688a0c`
BLAKE2b-256	`58c0c29276d4b342db467a319996d8081fb11bde3f4c98aef0ec253dac407fb1`
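
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is one way to do it. The helper names and the local file path passed to verify are assumptions, and the expected digest is copied from the table above.

```python
import hashlib
import io

# SHA256 digest listed above for dataprism-0.1.0.tar.gz
EXPECTED_SHA256 = "62ee69bd163e1b023b1a450877557766459c856ef307368a8ef6760249ae89d7"


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file at path (a local download) matches the expected digest."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle) == expected.lower()


# Self-check of the helper on in-memory data, using well-known SHA256 test vectors.
digest_abc = sha256_of_stream(io.BytesIO(b"abc"))
digest_empty = sha256_of_stream(io.BytesIO(b""))
```

Calling verify("dataprism-0.1.0.tar.gz") on the downloaded source distribution should return True if the file is intact.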

Provenance

The following attestation bundles were made for dataprism-0.1.0.tar.gz:

Publisher: publish.yml on lattiq/dataprism

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataprism-0.1.0.tar.gz
- Subject digest: 62ee69bd163e1b023b1a450877557766459c856ef307368a8ef6760249ae89d7
- Sigstore transparency entry: 963266532
- Sigstore integration time: Feb 18, 2026
Source repository:
- Permalink: lattiq/dataprism@46f83293fff06003e2ecb7f6a8c3d30005377e6e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/lattiq
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@46f83293fff06003e2ecb7f6a8c3d30005377e6e
- Trigger Event: release

File details

Details for the file dataprism-0.1.0-py3-none-any.whl.

File metadata

Download URL: dataprism-0.1.0-py3-none-any.whl
Upload date: Feb 18, 2026
Size: 80.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataprism-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`39ee07a9e0ae136196361a7d0aa7793354b9d92395ee36c7e9aacad8d102e824`
MD5	`3199a59d64f6923999ded6edfc220fcc`
BLAKE2b-256	`0a4f31a1d6bfdb739c479349ac0e85372e110fcab44c452d6a204d227a78cb1d`

Provenance

The following attestation bundles were made for dataprism-0.1.0-py3-none-any.whl:

Publisher: publish.yml on lattiq/dataprism

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataprism-0.1.0-py3-none-any.whl
- Subject digest: 39ee07a9e0ae136196361a7d0aa7793354b9d92395ee36c7e9aacad8d102e824
- Sigstore transparency entry: 963266537
- Sigstore integration time: Feb 18, 2026
Source repository:
- Permalink: lattiq/dataprism@46f83293fff06003e2ecb7f6a8c3d30005377e6e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/lattiq
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@46f83293fff06003e2ecb7f6a8c3d30005377e6e
- Trigger Event: release
