SuperTable — versioned data lake library for SQL analytics on Parquet + Redis.

SuperTable

SuperTable stores structured data as immutable Parquet snapshots on object storage (S3, MinIO, Azure Blob, Google Cloud Storage, or local disk), keeps metadata, locks, and audit state in Redis, and runs queries through DuckDB (embedded) or Spark SQL. It is a Python library; there is no separate server process.

Installation

pip install supertable                # core + local storage
pip install "supertable[s3]"          # AWS S3
pip install "supertable[minio]"       # MinIO
pip install "supertable[azure]"       # Azure Blob
pip install "supertable[gcp]"         # Google Cloud Storage
pip install "supertable[all]"         # everything

Requirements: Python 3.10+, a reachable Redis 6+, and a configured storage backend (or local disk for development). See docs/02_configuration.md for environment variables.


Architecture

┌──────────────────────────────────────────────────┐
│                Python application                 │
│   (notebooks, ETL jobs, FastAPI handlers, etc.)   │
└──────────┬─────────────────────────┬──────────────┘
           │ DataWriter / DataReader │
           ▼                         ▼
   ┌───────────────┐        ┌────────────────────┐
   │  RedisCatalog │        │  StorageInterface  │
   │  metadata     │        │  Parquet files     │
   │  locks        │        │  S3 / MinIO /      │
   │  audit chain  │        │  Azure / GCP /     │
   └───────────────┘        │  Local             │
                            └────────────────────┘

Data is organised as Organization → SuperTable → SimpleTable. Each SimpleTable is a versioned, append-only collection of Parquet files backed by a snapshot linked list — every write produces a new immutable snapshot whose previous_snapshot points at the predecessor.
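
The snapshot chain described above can be pictured with a small in-memory model. This is only a sketch of the linked-list idea, not SuperTable's actual storage layout: the Snapshot class, the commit and history helpers, and every field other than previous_snapshot are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical immutable snapshot record (illustrative fields only)."""
    snapshot_id: str
    files: Tuple[str, ...]            # Parquet files visible in this version
    previous_snapshot: Optional[str]  # id of the predecessor, None for the first


def commit(store: Dict[str, Snapshot], head: Optional[str],
           snapshot_id: str, new_files: List[str]) -> str:
    """Create a new snapshot on top of `head` and return its id."""
    parent = store[head] if head is not None else None
    files = (parent.files if parent else ()) + tuple(new_files)
    # Existing snapshots are never modified; a write only adds a new record.
    store[snapshot_id] = Snapshot(snapshot_id, files, head)
    return snapshot_id


def history(store: Dict[str, Snapshot], head: Optional[str]) -> List[str]:
    """Walk previous_snapshot pointers from newest to oldest."""
    out = []
    while head is not None:
        out.append(head)
        head = store[head].previous_snapshot
    return out


store: Dict[str, Snapshot] = {}
head = None
head = commit(store, head, "snap-1", ["part-0001.parquet"])
head = commit(store, head, "snap-2", ["part-0002.parquet"])
head = commit(store, head, "snap-3", ["part-0003.parquet"])

print(history(store, head))
```

Because each record is frozen and only points backwards, reading an older version amounts to following the chain to the wanted snapshot and using its file list.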

Layer                   Technology
Language                Python 3.10+
Metadata store          Redis 6+ (standalone or Sentinel HA)
Query engine (primary)  DuckDB
Query engine (large)    Spark SQL via Thrift
Data format             Apache Parquet
Object storage          MinIO / S3 / Azure / GCP / local
Mirror formats          Delta Lake, Apache Iceberg, Parquet
Audit storage           Redis Streams + Parquet

Quick example

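import pyarrow as pa

from supertable import SuperTable, DataWriter, DataReader, engine

# Sample input for the write below (illustrative values)
arrow_table = pa.table({
    "day": ["2024-01-01", "2024-01-02"],
    "client": ["acme", "acme"],
    "value": [10, 20],
})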

# Bootstrap catalogue + storage
SuperTable(super_name="example", organization="my-org")

# Write
dw = DataWriter(super_name="example", organization="my-org")
columns, rows, inserted, deleted = dw.write(
    role_name="superadmin",
    simple_name="facts",
    data=arrow_table,
    overwrite_columns=["day", "client"],
    lineage={"source_type": "manual", "source_id": "my-job"},
)

# Read
dr = DataReader(
    super_name="example",
    organization="my-org",
    query="SELECT day, sum(value) FROM facts GROUP BY day LIMIT 10",
)
df, status, message = dr.execute(role_name="superadmin", engine=engine.AUTO)
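
The write call above returns inserted and deleted counts and takes overwrite_columns as the upsert key. One plausible reading, sketched below in plain Python, is that incoming rows replace any existing rows that share the same values in the key columns. This is an assumption made for illustration, not the library's documented algorithm, and the upsert function is a hypothetical helper rather than part of SuperTable.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def upsert(existing: List[Row], incoming: List[Row],
           key_columns: Sequence[str]) -> Tuple[List[Row], int, int]:
    """Return (merged_rows, inserted, deleted) for a key-based overwrite."""
    incoming_keys = {tuple(row[c] for c in key_columns) for row in incoming}
    # Keep only the existing rows whose key is not being overwritten.
    kept = [row for row in existing
            if tuple(row[c] for c in key_columns) not in incoming_keys]
    deleted = len(existing) - len(kept)
    return kept + list(incoming), len(incoming), deleted


existing = [
    {"day": "2024-01-01", "client": "a", "value": 1},
    {"day": "2024-01-01", "client": "b", "value": 2},
]
incoming = [
    {"day": "2024-01-01", "client": "a", "value": 10},  # replaces a row
    {"day": "2024-01-02", "client": "a", "value": 3},   # new key
]

merged, inserted, deleted = upsert(existing, incoming, ["day", "client"])
print(inserted, deleted, len(merged))
```

Under this reading, a re-run of the same job for the same day and client replaces its earlier rows instead of duplicating them.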

Demos

The package ships two runnable demos under supertable.demo:

# Numbered tutorial — runs the full lifecycle end-to-end.
supertable-demo-quickstart
# or
python -m supertable.demo.quickstart

# Synthetic webshop dataset.
supertable-demo-webshop-generate    # build ~1.2M rows on disk
supertable-demo-webshop-load        # load them into SuperTable
supertable-demo-webshop-topup       # continuous incremental refresh

Both demos are also runnable as module steps. Examples:

python -m supertable.demo.quickstart.s01_01_01_create_super_table
python -m supertable.demo.quickstart.s03_08_read_snapshot_history
python -m supertable.demo.webshop.generate

See supertable/demo/README.md for the full script index.


What's included

  • Versioned tables with snapshot isolation, upsert (overwrite_columns), soft deletes (delete_only=True), schema evolution, and staleness filtering
  • DuckDB query engine — embedded, zero-copy reads from object storage
  • Spark SQL via Thrift — for queries exceeding DuckDB memory limits
  • RBAC — role types (superadmin, admin, writer, reader, meta) with row-level and column-level security enforced through view chains
  • Audit logging — tamper-evident SHA-256 hash chain in Redis Streams with Parquet export
  • Monitoring — MonitoringWriter pushes read/write/metric payloads to Redis lists; structured JSON logging with correlation IDs
  • Ingestion — staging areas (Staging) and automated ingestion pipes (SuperPipe)
  • Mirroring — optional Delta Lake / Iceberg / Parquet export after every write
  • Snapshot history — every write chains to previous_snapshot, enabling point-in-time inspection without separate historical tables
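
To make the "tamper-evident SHA-256 hash chain" item concrete, here is a minimal standard-library sketch of the general technique. It is not SuperTable's audit implementation: the entry layout, the GENESIS constant, and the append and verify helpers are all hypothetical, and a Python list stands in for the Redis Stream.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder predecessor hash for the first entry


def entry_hash(prev_hash: str, payload: Dict[str, str]) -> str:
    """Hash the predecessor's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: List[dict], payload: Dict[str, str]) -> None:
    """Append an entry that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain: List[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain: List[dict] = []
append(chain, {"action": "write", "table": "facts", "role": "superadmin"})
append(chain, {"action": "read", "table": "facts", "role": "reader"})

intact = verify(chain)                 # chain as written
chain[0]["payload"]["role"] = "admin"  # simulate tampering with an old entry
tampered_ok = verify(chain)            # verification should now fail
print(intact, tampered_ok)
```

Since each hash covers the previous one, changing an old entry invalidates it and every entry after it, which is what makes the log tamper-evident rather than tamper-proof.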

Documentation

See docs/00_index.md for the full table of contents.

#   Document               Description
01  Platform Overview      Architecture, package layout, deployment, data flow
02  Configuration          Environment variables and runtime settings
03  Data Model             Organization → SuperTable → SimpleTable hierarchy
04  Storage Backends       StorageInterface, S3, MinIO, Azure, GCP, local
05  Redis Catalog          Metadata store, key naming, operations, CAS
06  Data Writer            Write pipeline, locking, dedup, tombstones
07  Ingestion & Pipes      Staging areas, automated ingestion pipes
08  Distributed Locking    Redis locks, file locks, deadlock prevention
09  Query Engine           DuckDB Lite/Pro, Spark SQL, auto selection
10  Data Reader            Read facade, snapshot history, view chain
11  RBAC & Access Control  Roles, users, row/column security
12  Audit Logging          SHA-256 hash chain, DORA/SOC 2, SIEM
13  Table Mirroring        Delta Lake, Iceberg, Parquet export
14  Monitoring             Metrics writer, structured logging
15  Python SDK             Core classes, demos, example index

License

Super Table Public Use License (STPUL) v1.0 — see LICENSE.

Copyright © Kladna Soft Kft. All rights reserved.
