
Explainable instance significance discovery for scientific datasets

Reason this release was yanked: numba dependency error

Project description

DataTypical

Explainable Instance Significance Discovery for Scientific Datasets

Python 3.8+ | License: MIT

DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.


Key Features

  • Three Significance Types: Archetypal, prototypical, stereotypical (all computed simultaneously)
  • Shapley Explanations: Feature-level attributions for why samples are significant
  • Formative Discovery: Distinguish samples that ARE significant from those that CREATE structure
  • Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
  • Multi-Modal Support: Tabular data, text, and graph networks through unified API
  • Performance Optimized: Fast exploration mode and efficient Shapley computation

Installation

pip install datatypical

Quick Start

from datatypical import DataTypical
from datatypical import significance_plot, heatmap, profile_plot
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)

# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])

# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')

# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal', order='local')

What DataTypical Does

Three Complementary Lenses

| Lens | Finds | Use cases |
|---|---|---|
| Archetypal | Extreme, boundary samples | Edge-case discovery, outlier detection, range understanding |
| Prototypical | Representative, central samples | Dataset summarization, cluster centers, data coverage |
| Stereotypical | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |

The Power: All three computed simultaneously—different perspectives reveal different insights.
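Because all three ranks land in the same results frame, cross-lens comparisons are a one-liner. A minimal sketch with plain pandas, using the documented rank columns on a made-up results frame (the values here are illustrative, not real output):

```python
import pandas as pd

# Hypothetical results frame with the three 0-1 normalized rank columns
# that fit_transform returns (values invented for illustration).
results = pd.DataFrame({
    'archetypal_rank':    [0.95, 0.10, 0.40, 0.85, 0.20],
    'prototypical_rank':  [0.05, 0.90, 0.50, 0.15, 0.80],
    'stereotypical_rank': [0.30, 0.60, 0.95, 0.70, 0.10],
})

# Edge cases: extreme but not representative.
edge_cases = results[(results['archetypal_rank'] > 0.8) &
                     (results['prototypical_rank'] < 0.2)]

# Coverage set: central samples useful for summarization.
summary_set = results[results['prototypical_rank'] > 0.8]

print(edge_cases.index.tolist())   # [0, 3]
print(summary_set.index.tolist())  # [1]
```

The 0.8/0.2 cutoffs are arbitrary; in practice you would tune them to your dataset.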

Dual Perspective (with Shapley)

When shapley_mode=True, DataTypical reveals two views:

  • Actual Significance (*_rank): Samples that ARE significant
  • Formative Significance (*_shapley_rank): Samples that CREATE the structure

This distinction between what IS significant and what CREATES structure is unique to DataTypical.


Example: Drug Discovery

# Analyze compound library
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)

# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")

# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")

# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism

Discovery: Multiple structural pathways to high activity.


Key Parameters

DataTypical(
    shapley_mode=False,           # True for explanations
    fast_mode=True,               # False for publication quality
    n_archetypes=8,               # Number of extreme corners
    n_prototypes=8,               # Number of representatives
    stereotype_column=None,       # Target column for stereotypical
    shapley_top_n=500,            # Limit explanations to top N
    shapley_n_permutations=100,   # Number of permutations
    random_state=None,            # Set for reproducible results
    max_memory_mb=8000            # Memory limit
)

Visualization Functions

from datatypical import significance_plot, heatmap, profile_plot

# 1. Dual-perspective scatter plot
significance_plot(results, significance='archetypal')

# 2. Feature attribution heatmap
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# 3. Individual sample profile
profile_plot(dt, sample_idx, significance='archetypal', order='local')

Multi-Modal Support

Tabular Data

df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)

Text Data

texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)

Graph Networks

node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)

Performance

| Dataset size | Without Shapley | With Shapley |
|---|---|---|
| 1,000 samples | ~5 seconds | ~5 minutes |
| 10,000 samples | ~30 seconds | ~60 minutes |

Optimization Strategy:

  1. Fast exploration (fast_mode=True, no Shapley)
  2. Identify interesting samples
  3. Detailed analysis (shapley_mode=True, subset)
  4. Generate publication figures
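Step 2 of this strategy can be as simple as keeping any sample that scores highly under at least one lens. A sketch on a hypothetical fast-pass results frame (ranks invented; the 0.9 threshold is arbitrary):

```python
import pandas as pd

# Hypothetical output of a fast_mode=True pass (ranks invented).
results = pd.DataFrame({
    'archetypal_rank':    [0.99, 0.20, 0.10, 0.95, 0.50],
    'prototypical_rank':  [0.10, 0.97, 0.30, 0.20, 0.60],
    'stereotypical_rank': [0.40, 0.30, 0.98, 0.10, 0.55],
})

# Keep any sample that is top-ranked under at least one lens.
rank_cols = ['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']
interesting = results[(results[rank_cols] > 0.9).any(axis=1)]

print(sorted(interesting.index))  # [0, 1, 2, 3]
```

The reduced subset can then go through a second, detailed pass with `shapley_mode=True`.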

Use Cases

Scientific Discovery: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification

Dataset Curation: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling

Model Understanding: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations
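The curation use case above can be sketched with plain pandas: keep the most representative samples for coverage plus the most extreme ones for range. This sketch assumes a results frame with the documented rank columns (values invented; the keep counts are arbitrary):

```python
import pandas as pd

# Hypothetical results frame (ranks invented for illustration).
results = pd.DataFrame({
    'archetypal_rank':   [0.9, 0.2, 0.1, 0.8, 0.5, 0.3],
    'prototypical_rank': [0.1, 0.9, 0.7, 0.2, 0.8, 0.4],
})

# Reduced set: union of the most central and the most extreme samples.
keep = (set(results['prototypical_rank'].nlargest(2).index) |
        set(results['archetypal_rank'].nlargest(2).index))
curated = results.loc[sorted(keep)]

print(len(curated))  # 4 of 6 samples retained
```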


What Makes DataTypical Different

From outlier detection: Finds extremes AND explains why

From clustering: Finds representatives maximizing coverage AND explains why

From feature selection: Explains which features matter for which samples

From PCA/t-SNE: Maintains interpretability in original feature space

The Novel Contribution: formative instances distinguish samples that ARE significant from samples that CREATE structure. This enables redundancy detection, surfaces structurally important samples, and separates irreplaceable samples from interchangeable ones.
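One simple way to operationalize this distinction is to bucket each sample by its actual and formative ranks. A sketch (the thresholds and bucket names are arbitrary choices, not part of the DataTypical API; column semantics as in the dual-perspective section above):

```python
def classify(actual, formative, hi=0.8, lo=0.3):
    """Bucket a sample by its actual vs formative significance ranks."""
    if actual > hi and formative > hi:
        return 'critical'    # significant and creates structure
    if actual > hi and formative < lo:
        return 'redundant'   # significant but replaceable
    if actual < lo and formative > hi:
        return 'formative'   # shapes structure without being extreme itself
    return 'ordinary'

print(classify(0.9, 0.9))  # critical
print(classify(0.9, 0.1))  # redundant
```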


Documentation

Complete documentation, examples, and guides available at:
https://github.com/amaxiom/DataTypical

Includes:

  • Getting started tutorials
  • Comprehensive examples across scientific domains
  • Visualization interpretation guides
  • Advanced usage and computation details
  • Test suite and benchmarks



Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.20
  • Pandas ≥ 1.3
  • SciPy ≥ 1.7
  • scikit-learn ≥ 1.0
  • Matplotlib ≥ 3.3
  • Seaborn ≥ 0.11
  • Numba ≥ 0.55

Citation

If you use DataTypical in your research, please cite:

@software{datatypical2025,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7}
}

License

MIT License - Copyright (c) 2026 Amanda S. Barnard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.


Acknowledgments

DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).

