Structural profiling for any dataset. Point it at your data. It tells you what you have.
database-whisper
Auto-discovers structural patterns in datasets. Point it at a file and it tells you which fields matter for disambiguation, recommends indexes, and characterizes the structure.
Zero configuration. The core package has zero dependencies. Works on CSV, TSV, JSON, NDJSON, SQLite, Excel, Parquet, and SQL dumps.
Install
pip install database-whisper
Optional format support:
pip install openpyxl # for Excel .xlsx
pip install pyarrow # for Parquet
Quick start
import database_whisper as dw
report = dw.profile("your_data.csv")
print(report)
Output:
=== Structural Profile: your_data.csv ===
Records: 114,000 | Fields: 20
Structural Density: HIGH (112,871x speedup)
This dataset has deep categorical structure.
Auto-detected Identity: track_id, track_name, duration_ms
Discriminator Ladder:
1. track_genre 98.7% reduction #################### dominant
Recommended Indexes:
CREATE INDEX idx_track_genre ON tracks (track_genre);
-- Standalone index: 99% reduction alone.
Data Quality:
Ambiguous neighborhoods: 16,641 / 89,741 (18.5%)
Fully resolved by ladder: YES (100% accuracy)
Structural Fingerprint: SINGLE-AXIS
One field dominates. Minimal disambiguation depth needed.
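To try the recommended index from the sample output, the statement can be run against a small SQLite table and checked with `EXPLAIN QUERY PLAN`. This is a minimal sketch using only the standard-library `sqlite3` module. The table schema and the sample rows are assumptions made up for illustration; only the `CREATE INDEX` statement and the column names come from the output above.

```python
import sqlite3

# In-memory database with a hypothetical `tracks` schema (an assumption:
# only these four column names appear in the sample output).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks ("
    "track_id TEXT, track_name TEXT, duration_ms INTEGER, track_genre TEXT)"
)

# Made-up rows: the same identity (id, name, duration) under different genres.
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?, ?)",
    [
        ("t1", "Intro", 180000, "rock"),
        ("t1", "Intro", 180000, "pop"),
        ("t2", "Outro", 200000, "jazz"),
    ],
)

# The index statement exactly as printed in the report.
conn.execute("CREATE INDEX idx_track_genre ON tracks (track_genre);")

# Ask SQLite how it would execute a lookup by genre.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT track_id FROM tracks WHERE track_genre = ?",
    ("jazz",),
).fetchall()

# The last column of each plan row is the human-readable detail string.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

If the index is picked up, the printed plan mentions `idx_track_genre` rather than a full table scan. The exact wording of the plan varies between SQLite versions.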
What it does
Given any structured dataset, the algorithm:
- Auto-detects which fields are identity (primary keys) and which are provenance (record IDs to exclude)
- Discovers a discriminator ladder — the ordered sequence of fields that best resolves ambiguity among records sharing the same identity
- Measures structural density and the retrieval speedup relative to a flat scan
- Recommends database indexes based on the discovered structure
- Classifies the dataset by its structural fingerprint (SINGLE-AXIS, DEEP-PIPELINE, ALREADY-UNIQUE, LOW-STRUCTURE)
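The idea of a discriminator ladder can be illustrated with a small greedy search. The sketch below is not the package's implementation; it is one plausible reading of the description above, with assumptions stated plainly: records are dicts, identity fields are given rather than auto-detected, and a field's usefulness is scored by the average number of records that remain indistinguishable after splitting on it. All function names here are hypothetical.

```python
from collections import defaultdict


def ambiguous_groups(records, identity_fields):
    """Group records by identity and keep groups with more than one record."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in identity_fields)
        groups[key].append(rec)
    return [g for g in groups.values() if len(g) > 1]


def mean_candidates(groups, fields):
    """Average number of same-identity records sharing a record's values on `fields`."""
    total = 0
    count = 0
    for group in groups:
        buckets = defaultdict(int)
        for rec in group:
            buckets[tuple(rec.get(f) for f in fields)] += 1
        # Each record in a bucket of size n still has n candidates.
        total += sum(n * n for n in buckets.values())
        count += len(group)
    return total / count if count else 1.0


def greedy_ladder(records, identity_fields, candidate_fields):
    """Pick fields one at a time, each time taking the one that leaves the fewest candidates."""
    groups = ambiguous_groups(records, identity_fields)
    ladder = []
    remaining = list(candidate_fields)
    current = mean_candidates(groups, ladder)
    while remaining and current > 1.0:
        best = min(remaining, key=lambda f: mean_candidates(groups, ladder + [f]))
        best_score = mean_candidates(groups, ladder + [best])
        if best_score >= current:
            break  # no remaining field reduces ambiguity further
        ladder.append(best)
        remaining.remove(best)
        current = best_score
    return ladder


# Made-up example: three records share an identity and differ by genre.
records = [
    {"track_name": "Intro", "artist": "A", "track_genre": "rock", "explicit": False},
    {"track_name": "Intro", "artist": "A", "track_genre": "pop", "explicit": False},
    {"track_name": "Intro", "artist": "A", "track_genre": "jazz", "explicit": True},
    {"track_name": "Outro", "artist": "B", "track_genre": "rock", "explicit": False},
]
ladder = greedy_ladder(records, ["track_name", "artist"], ["explicit", "track_genre"])
print(ladder)
```

In this toy data `track_genre` separates all three ambiguous records on its own, so the ladder stops after one rung, which loosely mirrors the SINGLE-AXIS case in the sample output.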
Supported formats
| Format | Extension | Dependencies |
|---|---|---|
| CSV / TSV | .csv, .tsv | none |
| JSON (array or nested) | .json | none |
| NDJSON | .ndjson, .jsonl | none |
| SQLite | .db, .sqlite | none |
| SQL dump | .sql | none |
| Excel | .xlsx | openpyxl |
| Parquet | .parquet | pyarrow |
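Before profiling an Excel or Parquet file, it can help to confirm that the optional dependency is installed. The helper below is a hypothetical sketch and not part of the package: it copies the extension table above into a dict and uses the standard-library `importlib.util.find_spec` to check for `openpyxl` or `pyarrow`.

```python
import importlib.util
from pathlib import Path

# Extension -> (format label, optional dependency or None), taken from the table above.
FORMATS = {
    ".csv": ("CSV / TSV", None),
    ".tsv": ("CSV / TSV", None),
    ".json": ("JSON", None),
    ".ndjson": ("NDJSON", None),
    ".jsonl": ("NDJSON", None),
    ".db": ("SQLite", None),
    ".sqlite": ("SQLite", None),
    ".sql": ("SQL dump", None),
    ".xlsx": ("Excel", "openpyxl"),
    ".parquet": ("Parquet", "pyarrow"),
}


def check_format(path):
    """Return (format label, missing dependency name or None) for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"Unsupported extension: {suffix!r}")
    label, dependency = FORMATS[suffix]
    # find_spec returns None when a top-level module is not installed.
    if dependency and importlib.util.find_spec(dependency) is None:
        return label, dependency
    return label, None


label, missing = check_format("data.csv")
print(label, missing)
```

A non-`None` second value names the package to `pip install` before calling `dw.profile` on that file.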
API
import database_whisper as dw
# Profile a file (auto-detects format)
report = dw.profile("data.csv")
report = dw.profile("data.db")
report = dw.profile("data.xlsx")
# Profile in-memory records
report = dw.profile_records(records, field_names=["col1", "col2", ...])
# Batch router
router = dw.Router()
router.ingest(records, identity_fields=["gene", "disease"])
result = router.query({"gene": "BRAF", "disease": "Melanoma"}, ask_field="therapy")
# Streaming / incremental
live = dw.LiveRouter(identity_fields=["gene", "disease"])
for record in stream:
    event = live.insert(record)
# Memory with sleep consolidation
mem = dw.Memory(identity_fields=["gene", "disease"])
for fact in facts:
    mem.insert(fact)
Tested domains
The algorithm has been validated on 9 datasets across different domains. Same code, different data, different structures discovered.
| Domain | Records | Speedup | Accuracy |
|---|---|---|---|
| Oncology (CIViC) | 4,825 | 4,761x | 100% |
| Pharma safety (FAERS) | 50,000 | 7,462x | 100% |
| Weather (NOAA Storm) | 50,000 | 50,000x | 100% |
| Astronomy (NASA Exoplanets) | 6,158 | 6,109x | 100% |
| Seismology (USGS Earthquakes) | 20,000 | 20,000x | 100% |
| Particle physics (CERN CMS) | 100,000 | 100,000x | 100% |
| Music (Spotify) | 114,000 | 112,871x | 100% |
| Astronomy (LSST PLAsTiCC) | 7,848 | 7,848x | 100% |
| Cosmology (LSST CosmoDC2) | 50,000 | 3x | 100% |
Research
- Paper I: Discriminator Ladder Learning — the algorithm and 3-domain validation
- Paper II: Five Consequences of One Algorithm — anomaly detection, reasoning traces, compression, federated bridging across 9 domains
Requirements
Python 3.9+. Core package has zero external dependencies.
License
MIT
File details
Details for the file database_whisper-0.1.0.tar.gz.
File metadata
- Download URL: database_whisper-0.1.0.tar.gz
- Upload date:
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ec6e660343cdc028ac456472f609241e5314d08eeef5a44d05d96a24629bc75 |
| MD5 | f82ef936de57ef6b3f8ae15a12599620 |
| BLAKE2b-256 | ed38e9d7a90136281a3d4fabd7e0faa653d6aa84964f72abe6b5392ef3c993dc |
File details
Details for the file database_whisper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: database_whisper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 03f99c690b5f26ca449c9610ca5020ff74a7b02177bef6d6405177d15af60b6b |
| MD5 | 637dde456c3be1d214e0527f0439be66 |
| BLAKE2b-256 | 5991c10c6066b64a7e35d0ad1ed2b74b66d5146e23802fa7a6ce72c32a2b4841 |