SuperTable — versioned data lake library for SQL analytics on Parquet + Redis.

SuperTable

SuperTable — versioned data lake library for SQL analytics.

SuperTable stores structured data as immutable Parquet snapshots on object storage (S3, MinIO, Azure Blob, Google Cloud Storage, or local disk), keeps metadata, locks, and audit state in Redis, and runs queries through DuckDB (embedded) or Spark SQL. It is a Python library; there is no separate server process.

Installation

pip install supertable                # core + local storage
pip install "supertable[s3]"          # AWS S3
pip install "supertable[minio]"       # MinIO
pip install "supertable[azure]"       # Azure Blob
pip install "supertable[gcp]"         # Google Cloud Storage
pip install "supertable[all]"         # everything

Requirements: Python 3.10+, a reachable Redis 6+, and a configured storage backend (or local disk for development). See docs/02_configuration.md for environment variables.
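
The version requirements above can be checked before a job starts. The following is a minimal sketch using only the standard library; the helper names (parse_version, python_ok, redis_ok) are hypothetical and not part of SuperTable. It assumes the Redis version arrives as a dotted string, which is how Redis reports redis_version in its INFO server output.

```python
import sys

MIN_PYTHON = (3, 10)
MIN_REDIS = (6, 0)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '6.2.14' into (6, 2, 14).

    Non-numeric parts (for example '0-rc1') are dropped.
    """
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def python_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter is at least Python 3.10."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def redis_ok(redis_version: str) -> bool:
    """True when the reported Redis version is at least 6.0."""
    # Pad with zeros so that a bare "6" compares as (6, 0).
    major_minor = (parse_version(redis_version) + (0, 0))[:2]
    return major_minor >= MIN_REDIS
```

With the redis-py client installed, the version string could be read with `redis.Redis(...).info("server")["redis_version"]` and passed to redis_ok; the connection settings themselves belong in the environment variables described in docs/02_configuration.md.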


Architecture

┌──────────────────────────────────────────────────┐
│                Python application                 │
│   (notebooks, ETL jobs, FastAPI handlers, etc.)   │
└──────────┬─────────────────────────┬──────────────┘
           │ DataWriter / DataReader │
           ▼                         ▼
   ┌───────────────┐        ┌────────────────────┐
   │  RedisCatalog │        │  StorageInterface  │
   │  metadata     │        │  Parquet files     │
   │  locks        │        │  S3 / MinIO /      │
   │  audit chain  │        │  Azure / GCP /     │
   └───────────────┘        │  Local             │
                            └────────────────────┘

Data is organised as Organization → SuperTable → SimpleTable. Each SimpleTable is a versioned, append-only collection of Parquet files backed by a snapshot linked list — every write produces a new immutable snapshot whose previous_snapshot points at the predecessor.
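
To make the snapshot linked list concrete, here is a small in-memory model. It is an illustration only: the Snapshot class, the write and history helpers, and the file names are hypothetical, and the assumption that each new snapshot carries forward the files of its predecessor is ours, not something the library documents here. Only the previous_snapshot pointer comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one immutable snapshot of a SimpleTable."""
    version: int
    files: tuple[str, ...]                   # Parquet files visible in this snapshot
    previous_snapshot: Optional["Snapshot"]  # None for the very first write


def write(head: Optional[Snapshot], new_files: list[str]) -> Snapshot:
    """Append-only write: build a new snapshot that points at the old head."""
    inherited = head.files if head else ()
    version = head.version + 1 if head else 1
    return Snapshot(version, inherited + tuple(new_files), head)


def history(head: Optional[Snapshot]) -> Iterator[Snapshot]:
    """Walk the linked list from the newest snapshot back to the first."""
    while head is not None:
        yield head
        head = head.previous_snapshot


# Two writes produce two snapshots; the older one is never modified.
head = write(None, ["part-0001.parquet"])
head = write(head, ["part-0002.parquet"])
versions = [snapshot.version for snapshot in history(head)]
```

Because snapshots are frozen and only ever point backwards, any earlier state stays reachable by following previous_snapshot from the current head.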

Layer                    Technology
Language                 Python 3.10+
Metadata store           Redis 6+ (standalone or Sentinel HA)
Query engine (primary)   DuckDB
Query engine (large)     Spark SQL via Thrift
Data format              Apache Parquet
Object storage           MinIO / S3 / Azure / GCP / local
Mirror formats           Delta Lake, Apache Iceberg, Parquet
Audit storage            Redis Streams + Parquet

Quick example

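import pyarrow as pa

from supertable import SuperTable, DataWriter, DataReader, engine

# Sample input (illustrative values; assumes `data` accepts a PyArrow table,
# as the variable name arrow_table suggests)
arrow_table = pa.table({
    "day": ["2024-01-01", "2024-01-02"],
    "client": ["acme", "acme"],
    "value": [10, 20],
})

# Bootstrap catalogue + storage
SuperTable(super_name="example", organization="my-org")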

# Write
dw = DataWriter(super_name="example", organization="my-org")
columns, rows, inserted, deleted = dw.write(
    role_name="superadmin",
    simple_name="facts",
    data=arrow_table,
    overwrite_columns=["day", "client"],
    lineage={"source_type": "manual", "source_id": "my-job"},
)

# Read
dr = DataReader(
    super_name="example",
    organization="my-org",
    query="SELECT day, sum(value) FROM facts GROUP BY day LIMIT 10",
)
df, status, message = dr.execute(role_name="superadmin", engine=engine.AUTO)
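
The write call above passes overwrite_columns=["day", "client"], which this README associates with upserts. The sketch below shows one common reading of upsert semantics on plain Python dictionaries: incoming rows replace existing rows that share the same values in the key columns, and all other rows are kept. This is an assumption for illustration, not the library's implementation, and the upsert function is hypothetical; see docs/06 (Data Writer) for the actual write pipeline.

```python
def upsert(existing, incoming, overwrite_columns):
    """Merge incoming rows into existing rows, keyed on overwrite_columns.

    Existing rows whose key matches any incoming row are dropped,
    then all incoming rows are appended.
    """
    def key(row):
        return tuple(row[column] for column in overwrite_columns)

    incoming_keys = {key(row) for row in incoming}
    kept = [row for row in existing if key(row) not in incoming_keys]
    return kept + list(incoming)


existing = [
    {"day": "2024-01-01", "client": "acme", "value": 10},
    {"day": "2024-01-02", "client": "acme", "value": 20},
]
incoming = [
    {"day": "2024-01-02", "client": "acme", "value": 25},  # replaces a row
    {"day": "2024-01-03", "client": "acme", "value": 30},  # new row
]
merged = upsert(existing, incoming, ["day", "client"])
```

Here merged holds three rows: the untouched row for 2024-01-01, the replacement for 2024-01-02, and the new row for 2024-01-03.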

Demos

The package ships two runnable demos under supertable.demo:

# Numbered tutorial — runs the full lifecycle end-to-end.
supertable-demo-quickstart
# or
python -m supertable.demo.quickstart

# Synthetic webshop dataset.
supertable-demo-webshop-generate    # build ~1.2M rows on disk
supertable-demo-webshop-load        # load them into SuperTable
supertable-demo-webshop-topup       # continuous incremental refresh

Both demos are also runnable as module steps. Examples:

python -m supertable.demo.quickstart.s01_01_01_create_super_table
python -m supertable.demo.quickstart.s03_08_read_snapshot_history
python -m supertable.demo.webshop.generate

See supertable/demo/README.md for the full script index.


What's included

  • Versioned tables with snapshot isolation, upsert (overwrite_columns), soft deletes (delete_only=True), schema evolution, and staleness filtering
  • DuckDB query engine — embedded, zero-copy reads from object storage
  • Spark SQL via Thrift — for queries exceeding DuckDB memory limits
  • RBAC — role types (superadmin, admin, writer, reader, meta) with row-level and column-level security enforced through view chains
  • Audit logging — tamper-evident SHA-256 hash chain in Redis Streams with Parquet export
  • Monitoring — MonitoringWriter pushes read/write/metric payloads to Redis lists; structured JSON logging with correlation IDs
  • Ingestion — staging areas (Staging) and automated ingestion pipes (SuperPipe)
  • Mirroring — optional Delta Lake / Iceberg / Parquet export after every write
  • Snapshot history — every write chains to previous_snapshot, enabling point-in-time inspection without separate historical tables
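
The tamper-evident audit chain listed above relies on a general technique: each entry stores a SHA-256 hash computed over its own payload plus the hash of the previous entry, so changing any earlier entry invalidates everything after it. The sketch below illustrates that idea with the standard library only. The entry layout, the all-zero genesis hash, and the function names are assumptions for illustration and do not reflect how SuperTable lays out its Redis Streams entries.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (assumption)


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Add an entry whose hash commits to the whole chain before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, payload),
    })


def verify(chain: list) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"action": "write", "table": "facts", "role": "superadmin"})
append(log, {"action": "read", "table": "facts", "role": "superadmin"})
intact = verify(log)

# Editing an old entry after the fact is detected on the next verification.
log[0]["payload"]["role"] = "reader"
tampered_detected = not verify(log)
```

Serialising the payload with sorted keys keeps the hash stable regardless of dictionary ordering, which matters when entries are written and verified by different processes.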

Documentation

See docs/00_index.md for the full table of contents.

#   Document               Description
01  Platform Overview      Architecture, package layout, deployment, data flow
02  Configuration          Environment variables and runtime settings
03  Data Model             Organization → SuperTable → SimpleTable hierarchy
04  Storage Backends       StorageInterface, S3, MinIO, Azure, GCP, local
05  Redis Catalog          Metadata store, key naming, operations, CAS
06  Data Writer            Write pipeline, locking, dedup, tombstones
07  Ingestion & Pipes      Staging areas, automated ingestion pipes
08  Distributed Locking    Redis locks, file locks, deadlock prevention
09  Query Engine           DuckDB Lite/Pro, Spark SQL, auto selection
10  Data Reader            Read facade, snapshot history, view chain
11  RBAC & Access Control  Roles, users, row/column security
12  Audit Logging          SHA-256 hash chain, DORA/SOC 2, SIEM
13  Table Mirroring        Delta Lake, Iceberg, Parquet export
14  Monitoring             Metrics writer, structured logging
15  Python SDK             Core classes, demos, example index

License

Super Table Public Use License (STPUL) v1.0 — see LICENSE.

Copyright © Kladna Soft Kft. All rights reserved.
