SuperTable — versioned data lake library for SQL analytics on Parquet + Redis.

SuperTable

SuperTable stores structured data as immutable Parquet snapshots on object storage (S3, MinIO, Azure Blob, Google Cloud Storage, or local disk), keeps metadata, locks, and audit state in Redis, and queries everything through DuckDB (embedded) or Spark SQL. It is a Python library; there is no separate server process.

Installation

pip install supertable                # core + local storage
pip install "supertable[s3]"          # AWS S3
pip install "supertable[minio]"       # MinIO
pip install "supertable[azure]"       # Azure Blob
pip install "supertable[gcp]"         # Google Cloud Storage
pip install "supertable[all]"         # everything

Requirements: Python 3.10+, a reachable Redis 6+, and a configured storage backend (or local disk for development). See docs/02_configuration.md for environment variables.


Architecture

┌──────────────────────────────────────────────────┐
│                Python application                 │
│   (notebooks, ETL jobs, FastAPI handlers, etc.)   │
└──────────┬─────────────────────────┬──────────────┘
           │ DataWriter / DataReader │
           ▼                         ▼
   ┌───────────────┐        ┌────────────────────┐
   │  RedisCatalog │        │  StorageInterface  │
   │  metadata     │        │  Parquet files     │
   │  locks        │        │  S3 / MinIO /      │
   │  audit chain  │        │  Azure / GCP /     │
   └───────────────┘        │  Local             │
                            └────────────────────┘

Data is organised as Organization → SuperTable → SimpleTable. Each SimpleTable is a versioned, append-only collection of Parquet files backed by a snapshot linked list — every write produces a new immutable snapshot whose previous_snapshot points at the predecessor.
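The snapshot chain can be pictured with a small in-memory model. This is only a sketch of the idea, not SuperTable's implementation: the Snapshot and SnapshotChain classes, the "v1"/"v2" ids, and the file names are hypothetical, and only the previous_snapshot pointer comes from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified snapshot record (not SuperTable's real schema)."""
    snapshot_id: str
    files: tuple[str, ...]              # Parquet files visible in this version
    previous_snapshot: Optional[str]    # id of the predecessor, None for the first


class SnapshotChain:
    """Toy in-memory stand-in for the catalog entries of one SimpleTable."""

    def __init__(self) -> None:
        self._snapshots: dict[str, Snapshot] = {}
        self.head: Optional[str] = None

    def commit(self, new_files: list[str]) -> Snapshot:
        """Create a new immutable snapshot that points at the current head."""
        inherited = self._snapshots[self.head].files if self.head else ()
        snap = Snapshot(
            snapshot_id=f"v{len(self._snapshots) + 1}",
            files=inherited + tuple(new_files),
            previous_snapshot=self.head,
        )
        self._snapshots[snap.snapshot_id] = snap
        self.head = snap.snapshot_id
        return snap

    def history(self) -> Iterator[Snapshot]:
        """Walk from the newest snapshot back to the first one."""
        current = self.head
        while current is not None:
            snap = self._snapshots[current]
            yield snap
            current = snap.previous_snapshot


chain = SnapshotChain()
chain.commit(["facts/part-0001.parquet"])
chain.commit(["facts/part-0002.parquet"])
print([s.snapshot_id for s in chain.history()])  # ['v2', 'v1']
```

Because earlier snapshots are never mutated, any older version can be inspected by following previous_snapshot from the head.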

Layer                    Technology
Language                 Python 3.10+
Metadata store           Redis 6+ (standalone or Sentinel HA)
Query engine (primary)   DuckDB
Query engine (large)     Spark SQL via Thrift
Data format              Apache Parquet
Object storage           MinIO / S3 / Azure / GCP / local
Mirror formats           Delta Lake, Apache Iceberg, Parquet
Audit storage            Redis Streams + Parquet

Quick example

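import pyarrow as pa

from supertable import SuperTable, DataWriter, DataReader, engine

# Example input (illustrative values): an Arrow table with the columns used below
arrow_table = pa.table({
    "day": ["2024-01-01", "2024-01-01"],
    "client": ["acme", "globex"],
    "value": [10, 7],
})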

# Bootstrap catalogue + storage
SuperTable(super_name="example", organization="my-org")

# Write
dw = DataWriter(super_name="example", organization="my-org")
columns, rows, inserted, deleted = dw.write(
    role_name="superadmin",
    simple_name="facts",
    data=arrow_table,
    overwrite_columns=["day", "client"],
    lineage={"source_type": "manual", "source_id": "my-job"},
)

# Read
dr = DataReader(
    super_name="example",
    organization="my-org",
    query="SELECT day, sum(value) FROM facts GROUP BY day LIMIT 10",
)
df, status, message = dr.execute(role_name="superadmin", engine=engine.AUTO)
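One way to read the overwrite_columns argument in the write call above is as an upsert key: incoming rows replace stored rows that share the same values in those columns. The sketch below models that idea on plain Python dicts. It is an assumption about the semantics, not the library's code; the upsert helper and the way it counts inserted and deleted rows are hypothetical.

```python
def upsert(existing, incoming, overwrite_columns):
    """Replace existing rows whose key matches an incoming row, then append.

    Assumption: rows are plain dicts and the key is the tuple of values in
    overwrite_columns. Returns (merged_rows, inserted_count, deleted_count).
    """
    def key(row):
        return tuple(row[c] for c in overwrite_columns)

    incoming_keys = {key(r) for r in incoming}
    # Keep only the stored rows that no incoming row overrides.
    kept = [r for r in existing if key(r) not in incoming_keys]
    deleted = len(existing) - len(kept)
    return kept + list(incoming), len(incoming), deleted


existing = [
    {"day": "2024-01-01", "client": "acme", "value": 10},
    {"day": "2024-01-01", "client": "globex", "value": 7},
]
incoming = [
    {"day": "2024-01-01", "client": "acme", "value": 12},  # replaces a stored row
    {"day": "2024-01-02", "client": "acme", "value": 3},   # brand-new key
]
merged, inserted, deleted = upsert(existing, incoming, ["day", "client"])
print(inserted, deleted)  # 2 1
```

In this model the ("2024-01-01", "acme") row is superseded, so the merged result holds three rows and a query sees value 12 for that key.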

Demos

The package ships two runnable demos under supertable.demo:

# Numbered tutorial — runs the full lifecycle end-to-end.
supertable-demo-quickstart
# or
python -m supertable.demo.quickstart

# Synthetic webshop dataset.
supertable-demo-webshop-generate    # build ~1.2M rows on disk
supertable-demo-webshop-load        # load them into SuperTable
supertable-demo-webshop-topup       # continuous incremental refresh

Both demos are also runnable as module steps. Examples:

python -m supertable.demo.quickstart.s01_01_01_create_super_table
python -m supertable.demo.quickstart.s03_08_read_snapshot_history
python -m supertable.demo.webshop.generate

See supertable/demo/README.md for the full script index.


What's included

  • Versioned tables with snapshot isolation, upsert (overwrite_columns), soft deletes (delete_only=True), schema evolution, and staleness filtering
  • DuckDB query engine — embedded, zero-copy reads from object storage
  • Spark SQL via Thrift — for queries exceeding DuckDB memory limits
  • RBAC — role types (superadmin, admin, writer, reader, meta) with row-level and column-level security enforced through view chains
  • Audit logging — tamper-evident SHA-256 hash chain in Redis Streams with Parquet export
  • Monitoring — MonitoringWriter pushes read/write/metric payloads to Redis lists; structured JSON logging with correlation IDs
  • Ingestion — staging areas (Staging) and automated ingestion pipes (SuperPipe)
  • Mirroring — optional Delta Lake / Iceberg / Parquet export after every write
  • Snapshot history — every write chains to previous_snapshot, enabling point-in-time inspection without separate historical tables
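The tamper-evident audit chain listed above rests on a general technique: each entry stores the hash of its predecessor, so changing an old entry invalidates every later hash. The sketch below shows that technique with the standard library only. The entry layout, the GENESIS placeholder, and the helper names are assumptions for illustration, not SuperTable's actual audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash, event):
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain, event):
    """Add an event, linking it to the hash of the last entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "event": event,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, event),
    })


def verify(chain):
    """Recompute every hash in order; any edit to an earlier entry is detected."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"action": "write", "table": "facts", "role": "writer"})
append(log, {"action": "read", "table": "facts", "role": "reader"})
print(verify(log))  # True

log[0]["event"]["role"] = "superadmin"  # tamper with an earlier entry
print(verify(log))  # False
```

Verification only needs the entries themselves, which is why such a chain can be checked again after it has been exported to another store.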

Documentation

See docs/00_index.md for the full table of contents.

#   Document               Description
01  Platform Overview      Architecture, package layout, deployment, data flow
02  Configuration          Environment variables and runtime settings
03  Data Model             Organization → SuperTable → SimpleTable hierarchy
04  Storage Backends       StorageInterface, S3, MinIO, Azure, GCP, local
05  Redis Catalog          Metadata store, key naming, operations, CAS
06  Data Writer            Write pipeline, locking, dedup, tombstones
07  Ingestion & Pipes      Staging areas, automated ingestion pipes
08  Distributed Locking    Redis locks, file locks, deadlock prevention
09  Query Engine           DuckDB Lite/Pro, Spark SQL, auto selection
10  Data Reader            Read facade, snapshot history, view chain
11  RBAC & Access Control  Roles, users, row/column security
12  Audit Logging          SHA-256 hash chain, DORA/SOC 2, SIEM
13  Table Mirroring        Delta Lake, Iceberg, Parquet export
14  Monitoring             Metrics writer, structured logging
15  Python SDK             Core classes, demos, example index

License

Super Table Public Use License (STPUL) v1.0 — see LICENSE.

Copyright © Kladna Soft Kft. All rights reserved.
