Structural profiling for any dataset. Point it at your data. It tells you what you have.

database-whisper

Auto-discovers structural patterns in datasets. Point it at a file and it tells you which fields matter for disambiguation, recommends indexes, and characterizes the structure.

Zero configuration. The core package has zero dependencies. Works on CSV, TSV, JSON, NDJSON, SQLite, SQL dumps, Excel, and Parquet (Excel and Parquet need the optional packages listed under Install).

Install

pip install database-whisper

Optional format support:

pip install openpyxl    # for Excel .xlsx
pip install pyarrow     # for Parquet

Quick start

import database_whisper as dw

report = dw.profile("your_data.csv")
print(report)

Output:

=== Structural Profile: your_data.csv ===
Records: 114,000 | Fields: 20

Structural Density: HIGH (112,871x speedup)
  This dataset has deep categorical structure.

Auto-detected Identity: track_id, track_name, duration_ms

Discriminator Ladder:
  1. track_genre          98.7% reduction  ####################  dominant

Recommended Indexes:
  CREATE INDEX idx_track_genre ON tracks (track_genre);
    -- Standalone index: 99% reduction alone.

Data Quality:
  Ambiguous neighborhoods: 16,641 / 89,741 (18.5%)
  Fully resolved by ladder: YES (100% accuracy)

Structural Fingerprint: SINGLE-AXIS
  One field dominates. Minimal disambiguation depth needed.
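
The recommended index from the sample report can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: the table name tracks and the index statement come from the report, while the three-column schema and the rows are made up for the example. It asks SQLite for its query plan to confirm that a genre-filtered lookup would use the index.

```python
import sqlite3

# Hypothetical miniature of the "tracks" table; schema and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (track_id TEXT, track_name TEXT, track_genre TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [
        ("t1", "Intro", "ambient"),
        ("t2", "Intro", "techno"),
        ("t3", "Outro", "ambient"),
    ],
)

# The index statement as it appears in the report.
conn.execute("CREATE INDEX idx_track_genre ON tracks (track_genre)")

# Ask SQLite how it would execute a lookup filtered on track_genre.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT track_name FROM tracks WHERE track_genre = ?",
    ("ambient",),
).fetchall()

# The last column of each plan row is the human-readable detail string.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

If the plan text names idx_track_genre, the lookup is served by the index rather than a full table scan.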

What it does

Given any structured dataset, the algorithm:

  1. Auto-detects which fields are identity (primary keys) and which are provenance (record IDs to exclude)
  2. Discovers a discriminator ladder — the ordered sequence of fields that best resolves ambiguity among records sharing the same identity
  3. Measures structural density and the retrieval speedup relative to a flat scan
  4. Recommends database indexes based on the discovered structure
  5. Classifies the dataset by its structural fingerprint (SINGLE-AXIS, DEEP-PIPELINE, ALREADY-UNIQUE, LOW-STRUCTURE)
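
To make the idea of a discriminator ladder in step 2 concrete, here is a minimal greedy sketch. It is not the package's implementation, which the README does not describe. It assumes one plausible strategy: group records by identity, then repeatedly pick the field that leaves the fewest records still sharing a group. The helper names (ambiguous_groups, excess, split, greedy_ladder) and the toy records are hypothetical.

```python
from collections import defaultdict

def ambiguous_groups(records, identity_fields):
    """Group records by identity and keep only groups with more than one member."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(f) for f in identity_fields)].append(rec)
    return [g for g in groups.values() if len(g) > 1]

def excess(groups):
    """Number of records beyond the first in each group (0 means fully resolved)."""
    return sum(len(g) - 1 for g in groups)

def split(groups, field):
    """Split every group by the value of `field`, dropping resolved singletons."""
    out = []
    for g in groups:
        parts = defaultdict(list)
        for rec in g:
            parts[rec.get(field)].append(rec)
        out.extend(p for p in parts.values() if len(p) > 1)
    return out

def greedy_ladder(records, identity_fields, candidate_fields):
    """Return ([(field, fractional reduction), ...], fully_resolved)."""
    groups = ambiguous_groups(records, identity_fields)
    remaining = list(candidate_fields)
    ladder = []
    while groups and remaining:
        before = excess(groups)
        # Pick the field that leaves the least ambiguity behind.
        best = min(remaining, key=lambda f: excess(split(groups, f)))
        after_groups = split(groups, best)
        after = excess(after_groups)
        if after == before:
            break  # no remaining field separates anything
        ladder.append((best, 1 - after / before))
        groups = after_groups
        remaining.remove(best)
    return ladder, not groups

# Toy records (invented values) using the field names from the API examples.
records = [
    {"gene": "BRAF", "disease": "Melanoma", "therapy": "T1", "evidence": "A"},
    {"gene": "BRAF", "disease": "Melanoma", "therapy": "T2", "evidence": "A"},
    {"gene": "BRAF", "disease": "Melanoma", "therapy": "T2", "evidence": "B"},
    {"gene": "EGFR", "disease": "Lung", "therapy": "T3", "evidence": "A"},
]

ladder, resolved = greedy_ladder(records, ["gene", "disease"], ["therapy", "evidence"])
print(ladder, resolved)
```

Each ladder entry reports the fraction of the remaining ambiguity that the field removes at that step, which is one way to read the "% reduction" column in the sample report.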

Supported formats

Format                  Extension         Dependencies
CSV / TSV               .csv, .tsv        none
JSON (array or nested)  .json             none
NDJSON                  .ndjson, .jsonl   none
SQLite                  .db, .sqlite      none
SQL dump                .sql              none
Excel                   .xlsx             openpyxl
Parquet                 .parquet          pyarrow
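
Before profiling a file, it can be handy to check whether its extension is supported and whether the optional dependency is installed. The helper below is a hypothetical pre-flight check, not part of database_whisper. The mapping is copied from the table above, and the check_format name is an assumption made for this sketch.

```python
from importlib.util import find_spec
from pathlib import Path

# Extension -> (format name, optional module it needs), copied from the table.
FORMATS = {
    ".csv": ("CSV", None),
    ".tsv": ("TSV", None),
    ".json": ("JSON", None),
    ".ndjson": ("NDJSON", None),
    ".jsonl": ("NDJSON", None),
    ".db": ("SQLite", None),
    ".sqlite": ("SQLite", None),
    ".sql": ("SQL dump", None),
    ".xlsx": ("Excel", "openpyxl"),
    ".parquet": ("Parquet", "pyarrow"),
}

def check_format(path):
    """Return (format name, missing module or None) for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    name, module = FORMATS[suffix]
    # find_spec looks the module up without importing it.
    missing = module if module and find_spec(module) is None else None
    return name, missing

print(check_format("your_data.csv"))
print(check_format("events.jsonl"))
```

A non-None second element means the matching pip install line from the Install section is still needed.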

API

import database_whisper as dw

# Profile a file (auto-detects format)
report = dw.profile("data.csv")
report = dw.profile("data.db")
report = dw.profile("data.xlsx")

# Profile in-memory records
report = dw.profile_records(records, field_names=["col1", "col2", ...])

# Batch router
router = dw.Router()
router.ingest(records, identity_fields=["gene", "disease"])
result = router.query({"gene": "BRAF", "disease": "Melanoma"}, ask_field="therapy")

# Streaming / incremental
live = dw.LiveRouter(identity_fields=["gene", "disease"])
for record in stream:
    event = live.insert(record)

# Memory with sleep consolidation
mem = dw.Memory(identity_fields=["gene", "disease"])
for fact in facts:
    mem.insert(fact)

Tested domains

The algorithm has been validated on 9 datasets across different domains. Same code, different data, different structures discovered.

Domain                          Records   Speedup    Accuracy
Oncology (CIViC)                4,825     4,761x     100%
Pharma safety (FAERS)           50,000    7,462x     100%
Weather (NOAA Storm)            50,000    50,000x    100%
Astronomy (NASA Exoplanets)     6,158     6,109x     100%
Seismology (USGS Earthquakes)   20,000    20,000x    100%
Particle physics (CERN CMS)     100,000   100,000x   100%
Music (Spotify)                 114,000   112,871x   100%
Astronomy (LSST PLAsTiCC)       7,848     7,848x     100%
Cosmology (LSST CosmoDC2)       50,000    3x         100%
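
The README does not define how speedup is computed. Under the assumption that it is the number of records a flat scan examines divided by the number examined after structural routing, the table implies how many candidates remain per lookup. The short sketch below does that arithmetic on three rows copied from the table. The implied_candidates helper is hypothetical, and its result is only meaningful if the assumption holds.

```python
# (domain, records, reported speedup) copied from the table above.
rows = [
    ("Oncology (CIViC)", 4825, 4761),
    ("Music (Spotify)", 114000, 112871),
    ("Cosmology (LSST CosmoDC2)", 50000, 3),
]

def implied_candidates(records, speedup):
    """Records left to examine per lookup, ASSUMING speedup = records / candidates."""
    return records / speedup

for domain, records, speedup in rows:
    print(f"{domain}: ~{implied_candidates(records, speedup):.2f} records per lookup")
```

On this reading, most datasets narrow to roughly one record per lookup, while the CosmoDC2 row would still leave thousands of candidates, which fits a dataset with little categorical structure to exploit.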

Requirements

Python 3.9+. Core package has zero external dependencies.

License

MIT
