
Deterministic dataset shape and semantic inference for Invariant

Datasculpt

CI Python 3.11+ License: MIT

Deterministic dataset shape and semantic inference for tabular data.

The Problem

Before data can be governed, queried, or compared across systems, its structural intent must be understood. Most data systems (catalogs, semantic layers, governance engines) assume this understanding exists but don't produce it.

The Solution

Datasculpt infers and explains structural intent:

  • Shape — Is this long or wide? Time in headers or rows?
  • Grain — What uniquely identifies each row?
  • Roles — Which columns are dimensions, measures, or keys?
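To make "grain" concrete: a set of columns is a valid grain when no two rows share the same combination of values in those columns. The sketch below checks that property in plain Python. The `is_valid_grain` helper and the sample rows are illustrative assumptions, not part of Datasculpt's API, and the library's own grain detection may work differently.

```python
def is_valid_grain(rows: list[dict], key_columns: list[str]) -> bool:
    """Return True if key_columns uniquely identify every row."""
    seen = set()
    for row in rows:
        # Build the candidate key for this row from the chosen columns.
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            return False  # two rows share the same key, so this is not a grain
        seen.add(key)
    return True


# Hypothetical sample data, loosely modelled on the Quick Start columns.
rows = [
    {"geo_id": "A", "sex": "F", "population": 100},
    {"geo_id": "A", "sex": "M", "population": 110},
    {"geo_id": "B", "sex": "F", "population": 90},
    {"geo_id": "B", "sex": "M", "population": 95},
]

print(is_valid_grain(rows, ["geo_id"]))         # False
print(is_valid_grain(rows, ["geo_id", "sex"]))  # True
```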

What It Is Not

  • Not a data catalog (produces metadata, doesn't store it)
  • Not an ETL tool (analyzes structure, doesn't transform data)
  • Not a semantic layer (understands layout, not meaning)

Quick Start

pip install datasculpt

from datasculpt import infer

result = infer("data.csv")

print(result.proposal.shape_hypothesis)      # wide_observations
print(result.decision_record.grain.key_columns)  # ['geo_id', 'sex', 'age_group']

for col in result.proposal.columns:
    print(f"{col.name}: {col.role.value}")
# geo_id: dimension
# sex: dimension
# age_group: dimension
# population: measure
# unemployed: measure

Try It

🔬 Live Demo — Analyze datasets in your browser. No installation, no data leaves your machine.

Documentation

📚 Full Documentation

  • Quickstart — First inference in 5 minutes
  • Examples — See inference on different dataset shapes
  • Concepts — Understand shapes, roles, and grain
  • API Reference — Function signatures and types

Key Features

Five Dataset Shapes

Shape               Description
long_observations   Rows are atomic observations
long_indicators     Unpivoted indicator/value pairs
wide_observations   Measures as columns
wide_time_columns   Time periods in column headers
series_column       Time series as arrays in cells
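The difference between two of these shapes is easiest to see on the same data. The sketch below unpivots a `wide_time_columns` layout (one column per year) into a `long_observations` layout (one row per observation). It is written in plain Python for illustration only. The `unpivot_time_columns` helper and its parameter names are assumptions, and Datasculpt itself analyzes structure rather than transforming data.

```python
def unpivot_time_columns(rows, id_columns, time_column="year", value_column="value"):
    """Turn one-column-per-period rows into one-row-per-observation rows."""
    long_rows = []
    for row in rows:
        ids = {c: row[c] for c in id_columns}
        for header, cell in row.items():
            if header in id_columns:
                continue  # identifier columns are repeated on every output row
            # The column header becomes a value in the time column.
            long_rows.append({**ids, time_column: header, value_column: cell})
    return long_rows


# wide_time_columns: time periods live in the headers.
wide_rows = [
    {"geo_id": "A", "2020": 100, "2021": 104},
    {"geo_id": "B", "2020": 90, "2021": 93},
]

# long_observations: each row is a single (geo_id, year) observation.
long_rows = unpivot_time_columns(wide_rows, ["geo_id"])
for r in long_rows:
    print(r)
```

In the wide layout the grain is `geo_id` alone; after unpivoting it becomes `geo_id` plus `year`.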

Eight Column Roles

Role             Purpose
key              Contributes to uniqueness
dimension        Categorical grouping
measure          Numeric, aggregatable
time             Temporal dimension
indicator_name   Names in unpivoted data
value            Values in unpivoted data
series           Embedded time series
metadata         Descriptive, non-analytical
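As a rough intuition for how three of these roles differ, here is a deliberately crude single-column guesser. It is not Datasculpt's algorithm: the `guess_role` function and its rules are assumptions made for illustration, and it covers only `measure`, `time`, and `dimension`.

```python
from datetime import date
from numbers import Number


def _is_iso_date(value) -> bool:
    """True if value is a string that parses as an ISO date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def guess_role(values: list) -> str:
    """Crude guess from values alone: 'measure', 'time', or 'dimension'."""
    non_null = [v for v in values if v is not None]
    # All numeric (booleans excluded) -> treat as an aggregatable measure.
    if non_null and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in non_null
    ):
        return "measure"
    # All ISO date strings -> treat as a temporal column.
    if non_null and all(_is_iso_date(v) for v in non_null):
        return "time"
    # Everything else falls back to a categorical dimension.
    return "dimension"


print(guess_role([100, 110.5, None]))            # measure
print(guess_role(["2024-01-31", "2024-02-29"]))  # time
print(guess_role(["F", "M"]))                    # dimension
```

A rule this simple mislabels numeric identifiers such as postal codes as measures, which is why a single type check is not enough on its own.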

Deterministic Inference

Same input → same output. No LLMs, no randomness, no hidden state.

Evidence-Based

Every decision is scored and justified:

>>> result.decision_record.hypotheses
[
    HypothesisScore(hypothesis=WIDE_OBSERVATIONS, score=0.72, reasons=[...]),
    HypothesisScore(hypothesis=LONG_OBSERVATIONS, score=0.65, reasons=[...]),
]
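One way to consume scored hypotheses like these is to pick the top score and flag the result as ambiguous when the runner-up is close. The sketch below uses a stand-in `ScoredHypothesis` dataclass and an arbitrary `ambiguity_margin`. Both are assumptions for illustration and do not describe the library's `HypothesisScore` type or its actual decision rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScoredHypothesis:
    """Hypothetical stand-in for a scored shape hypothesis."""
    hypothesis: str
    score: float
    reasons: list[str] = field(default_factory=list)


def pick_best(hypotheses, ambiguity_margin=0.1):
    """Return (best hypothesis, is_ambiguous)."""
    # Sort by score descending; break ties by name so the order never
    # depends on input order, which keeps the result deterministic.
    ordered = sorted(hypotheses, key=lambda h: (-h.score, h.hypothesis))
    best = ordered[0]
    ambiguous = (
        len(ordered) > 1 and (best.score - ordered[1].score) < ambiguity_margin
    )
    return best, ambiguous


scores = [
    ScoredHypothesis("long_observations", 0.65),
    ScoredHypothesis("wide_observations", 0.72),
]

best, ambiguous = pick_best(scores)
print(best.hypothesis, ambiguous)  # wide_observations True
```

With the scores shown above, the gap of 0.07 falls inside the assumed margin of 0.1, so the result is flagged as ambiguous.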

Interactive Mode

Resolve ambiguity with questions:

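# Assumption: apply_answers is exported from the top-level package, like infer.
from datasculpt import apply_answers, infer

result = infer("data.csv", interactive=True)

if result.pending_questions:
    # Answer the first pending question by choosing a shape.
    answers = {result.pending_questions[0].id: "long_indicators"}
    result = apply_answers(result, answers)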

Installation Options

# Core only
pip install datasculpt

# With optional adapters (quoted so shells such as zsh don't expand the brackets)
pip install "datasculpt[frictionless]"   # Schema validation
pip install "datasculpt[dataprofiler]"   # Statistical profiling
pip install "datasculpt[all]"            # Everything
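To see which optional adapters are available in the current environment, one option is to probe for their import names with the standard library. The sketch assumes each extra installs a package importable as `frictionless` or `dataprofiler`. The `is_installed` helper is illustrative and not part of Datasculpt.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


# Assumed mapping from extra name to top-level import name.
optional_adapters = {
    "frictionless": "frictionless",
    "dataprofiler": "dataprofiler",
}

for extra, module in optional_adapters.items():
    status = "available" if is_installed(module) else "missing"
    print(f"{extra}: {status}")
```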

Requirements

  • Python 3.11+
  • pandas 2.0+

Development

# Install with dev dependencies
make install-dev

# Run tests
make test

# Lint and format
make lint
make format

# Type checking
make typecheck

# Serve docs locally
make docs-serve

License

MIT
