Quantum-Safe Columnar Storage Format with row-granular lazy decryption

QPQT - Quantum-Safe Columnar Storage Format

A purpose-built binary columnar file format (.qpqt) with native post-quantum cryptography and row-granular lazy decryption, a capability no existing columnar format offers.

Cryptographic stack: ML-KEM-768 (FIPS 203) + HKDF-SHA-256 + AES-256-GCM (AES: FIPS 197; GCM: SP 800-38D)


Quick Start

pip install qpqt

import qpqt

# Generate a quantum-safe keypair
pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# Encrypt - ssn column is ML-KEM-768 + AES-256-GCM protected
w = qpqt.Writer("customers.qpqt",
                column_names=["id", "state", "ssn"],
                column_types=["int32", "string", "string"],
                pqc_columns=["ssn"],
                public_key=pub, key_id=kid)
w.write_batch({"id":[1,2,3], "state":["CA","NY","TX"], "ssn":["111","222","333"]}, 3)
w.close()

# Read - lazy decryption, only matching rows decrypted
r = qpqt.Reader("customers.qpqt")
r.set_secret_key(sec)
data = r.query(where={"id": 2})

The wheel bundles liboqs and OpenSSL. No system dependencies needed.


The Problem

Enterprises face a dual mandate: regulatory pressure to adopt post-quantum cryptography (CNSA 2.0, NIST FIPS 203, deadline 2035) and the need to maintain query performance on large-scale columnar data warehouses.

The naive approach - applying ML-KEM-768 at the row level - costs 9,600ms for 1M rows even with 4-core parallelization. That establishes the upper bound of the problem: PQC done wrong is unusable at analytical query scale.

The Solution

QPQT redesigns the storage format around PQC cost:

  1. Hybrid KEM construction - ML-KEM-768 is used once per 4,096-row page to encapsulate an AES-256-GCM page key. This reduces KEM operations from 1M to 250 per million rows.

  2. Fully separated column sections - structural (unencrypted) and PQC columns are physically isolated on disk at 4KB OS page boundaries. Predicates run on structural columns without loading the PQC section into CPU cache.

  3. Row-granular lazy decryption - predicates execute on cheap structural columns first. Only the individual rows that survive the predicate trigger KEM decapsulation and AES-GCM decryption.

  4. O(1) manifest lookup - a flat crypto manifest in the footer maps any row to its page key via pointer arithmetic.
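The four steps above can be sketched in plain Python. This is a toy model, not QPQT's implementation: `decapsulate_page` and `decrypt_row` are hypothetical stand-ins that only count calls, and sharing one decapsulation across all surviving rows of a page is an assumption that follows from the per-page key in step 1.

```python
from collections import defaultdict

PAGE_ROWS = 4096  # rows per page, from the format description


def lazy_decrypt(structural, predicate, decapsulate_page, decrypt_row):
    """Filter on a structural column, then decrypt only the surviving rows.

    structural       : list of plaintext values, one per row
    predicate        : function(value) -> bool
    decapsulate_page : hypothetical function(page_index) -> page key
    decrypt_row      : hypothetical function(page_key, row_index) -> plaintext
    """
    # Run the predicate on the cheap structural column only.
    survivors = [i for i, v in enumerate(structural) if predicate(v)]

    # Group surviving rows by page so each page key is derived once.
    by_page = defaultdict(list)
    for row in survivors:
        by_page[row // PAGE_ROWS].append(row)

    # One KEM decapsulation per touched page, one AES-GCM call per row.
    out = {}
    for page, rows in by_page.items():
        key = decapsulate_page(page)
        for row in rows:
            out[row] = decrypt_row(key, row)
    return out


# Toy stand-ins that count how much crypto work was done.
calls = {"kem": 0, "aes": 0}


def fake_decapsulate(page):
    calls["kem"] += 1
    return f"key-{page}"


def fake_decrypt(key, row):
    calls["aes"] += 1
    return f"secret-{row}"


ids = list(range(10_000))
result = lazy_decrypt(ids, lambda v: v % 1000 == 0, fake_decapsulate, fake_decrypt)
print(calls, sorted(result))
```

In this toy run, 10 of 10,000 rows survive the predicate and they fall in 3 distinct pages, so the counters record 3 decapsulations and 10 row decryptions rather than work proportional to the full column.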

Performance - Honest Three-Way Comparison

Benchmarked on Kaggle Xeon CPU (4 cores), 1M rows, real ML-KEM-768 + AES-256-GCM.

Two baselines are measured, not estimated:

  • Naive per-row PQC - row-level ML-KEM encapsulation. Establishes the upper bound of the problem. This is what a quick liboqs integration produces.
  • Competent per-page PQC - the correct hybrid KEM construction (per-page ML-KEM + AES-GCM, exactly like QPQT) but stored in a plain layout with no column separation and no lazy decryption. Decrypts every row in the queried column because decryption is chunk-granular. This isolates QPQT's actual contribution.

Selectivity  Naive per-row  Competent per-page  QPQT     QPQT vs competent
1%           9,600ms        2,150ms             78ms     27.6x
5%           9,600ms        2,111ms             163ms    12.9x
10%          9,600ms        2,113ms             264ms    8.0x
25%          9,600ms        2,103ms             557ms    3.8x
50%          9,600ms        2,148ms             1,055ms  2.0x
100%         9,600ms        2,147ms             2,098ms  1.02x (no advantage)

QPQT's contribution is row-granular lazy decryption. At low selectivity (1-10%), the common case for analytical queries, it decrypts far fewer rows than a competent but layout-unaware implementation and runs 8-27x faster. As selectivity approaches 100%, the advantage shrinks to parity: when every row survives the predicate, there is nothing to skip.
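As a quick check, the speedup column can be recomputed from the measured timings. The figures below are copied from the table; the per-row estimate assumes that 1% selectivity on 1M rows means 10,000 surviving rows, which the text implies but does not state.

```python
# (selectivity %, competent per-page ms, QPQT ms, reported speedup)
rows = [
    (1, 2150, 78, 27.6),
    (5, 2111, 163, 12.9),
    (10, 2113, 264, 8.0),
    (25, 2103, 557, 3.8),
    (50, 2148, 1055, 2.0),
    (100, 2147, 2098, 1.02),
]

# Speedup is simply competent time divided by QPQT time.
speedups = {sel: competent / qpqt for sel, competent, qpqt, _ in rows}

for sel, competent, qpqt, reported in rows:
    print(f"{sel:>3}%  {speedups[sel]:6.2f}x  (reported {reported}x)")

# Marginal QPQT cost per surviving row between the 1% and 100% runs.
extra_rows = 1_000_000 - 10_000
us_per_row = (2098 - 78) * 1000 / extra_rows
print(f"~{us_per_row:.2f} microseconds per additional decrypted row")
```

The recomputed ratios agree with the reported column to within rounding, and the QPQT timings grow roughly linearly with the number of surviving rows.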

Metric                            Value
Write throughput (1M rows)        534K rows/sec (1,871ms)
Structural scan (no crypto)       5ms, 188M rows/sec
File size (1M rows)               80MB
Storage vs naive per-row ML-KEM   80MB vs ~1,084MB (92% reduction)

Cryptographic Design

ML-KEM-768 keypair  ->  secret key stored in KMS (file holds only key_id)
                                |
                        Per page (4,096 rows):
                        ML-KEM-768 encapsulate(public_key)
                            |-- kem_ciphertext  ->  CRYPTO MANIFEST
                            +-- shared_secret (32 bytes)
                                        |
                                HKDF-SHA-256(shared_secret, page_context)
                                        +-- aes_page_key (32 bytes, unique per page)
                                                    |
                                            AES-256-GCM per row
                                            |-- IV (12B, deterministic)
                                            |-- ciphertext (= plaintext length)
                                            +-- auth_tag (16B, tamper detection)
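The key-derivation step in the diagram can be illustrated with a standard-library HKDF-SHA-256 (RFC 5869). This is a minimal sketch, not QPQT's code: the `page_context` layout, the label `qpqt-page-key`, and the placeholder shared secret are all assumptions made for illustration, since the actual context bytes are not specified here.

```python
import hashlib
import hmac
import struct


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if not salt:
        salt = b"\x00" * hashlib.sha256().digest_size
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract

    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def page_context(file_uuid: bytes, row_group: int, page: int) -> bytes:
    """Hypothetical context layout; the real page_context is not specified here."""
    return b"qpqt-page-key" + file_uuid + struct.pack("<II", row_group, page)


# Placeholder for the 32-byte ML-KEM-768 shared secret (normally from decapsulation).
shared_secret = bytes(range(32))
file_uuid = bytes(16)

key_a = hkdf_sha256(shared_secret, page_context(file_uuid, 0, 0))
key_b = hkdf_sha256(shared_secret, page_context(file_uuid, 0, 1))
print(len(key_a), key_a != key_b)
```

Binding the page position into the HKDF `info` input means that even an identical shared secret would yield different AES keys for different pages.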

IV construction and GCM nonce safety

QPQT uses deterministic AES-GCM IVs built from the file_uuid and the (row_index, column_index) position of each value. This is safe because GCM only requires that an IV never repeat under the same key, and that holds within every key scope. Each 4,096-row page derives its own AES-256 key via ML-KEM encapsulation + HKDF-SHA-256, and within a single page key the (row_index, column_index) tuple is unique by construction. The file_uuid component prevents cross-file collisions in the event a page key is ever reused across files. No IV is reused under any single key, so the failure mode that breaks GCM does not occur.
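To make the uniqueness argument concrete, here is one way a 12-byte deterministic IV could be packed. The byte layout is hypothetical (4 bytes of file_uuid, 4-byte row index, 4-byte column index); the actual QPQT layout is not documented in this description.

```python
import struct


def make_iv(file_uuid: bytes, row_index: int, column_index: int) -> bytes:
    """Pack a 12-byte IV: 4 bytes of file_uuid + row + column (hypothetical layout)."""
    if len(file_uuid) != 16:
        raise ValueError("file_uuid must be 16 bytes")
    return file_uuid[:4] + struct.pack("<II", row_index, column_index)


file_uuid = bytes.fromhex("00112233445566778899aabbccddeeff")
PAGE_ROWS = 4096
N_COLUMNS = 3

# Every (row, column) position inside one page maps to a distinct IV.
ivs = {
    make_iv(file_uuid, row, col)
    for row in range(PAGE_ROWS)
    for col in range(N_COLUMNS)
}
print(len(ivs) == PAGE_ROWS * N_COLUMNS)
```

Because the row and column indices are packed at fixed width, distinct positions always give distinct IVs under one page key. In this particular sketch the truncated file_uuid prefix gives only probabilistic separation between files, so it is the per-page key, not the prefix, that carries the safety argument.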

All components are NIST-approved and quantum-safe:

  • ML-KEM-768: FIPS 203 (replaces RSA/ECDH for key establishment)
  • AES-256-GCM: FIPS 197 (AES) and SP 800-38D (GCM mode); quantum-safe as a symmetric primitive, since Grover's algorithm at most halves the effective key strength, leaving roughly 128-bit security
  • HKDF-SHA-256: RFC 5869 / SP 800-56C

Why a Separate Format (and not Parquet)?

Parquet already has Modular Encryption - why not derive its AES key from ML-KEM and get quantum-safe Parquet today?

For encryption alone, you could. The encryption is not the contribution.

The contribution is row-granular lazy decryption. Parquet supports predicate pushdown and can skip entire encrypted column chunks via footer statistics. What it cannot do is decrypt only the surviving rows within a chunk that the predicate did not eliminate wholesale. Parquet decrypts at chunk granularity, not surviving-row granularity. Closing that gap requires physically separated structural columns and a per-row-addressable key manifest - a different file layout.

The three conditions no existing format satisfies simultaneously:

  1. Structural columns physically separated from encrypted columns at OS-page boundaries, so the filter never pages the encrypted section into cache.
  2. Every row's decryption key addressable in O(1) without decrypting anything first - the flat manifest in the footer.
  3. Decryption expressible at single-row granularity within a page. Parquet treats the chunk as an atomic encrypted unit.
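Condition 1 comes down to simple offset arithmetic. The sketch below shows how a writer might compute where the encrypted section starts; the helper names and the example offset are hypothetical and only the 4KB boundary comes from the text.

```python
OS_PAGE = 4096  # 4KB OS page size used in the layout description


def align_up(offset: int, boundary: int = OS_PAGE) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def padding_needed(offset: int, boundary: int = OS_PAGE) -> int:
    """Number of pad bytes to write so the next section starts on a boundary."""
    return align_up(offset, boundary) - offset


structural_end = 1_234_567  # hypothetical end of the structural section
pqc_start = align_up(structural_end)
print(pqc_start, padding_needed(structural_end))
```

With the sections aligned this way, a scan over the structural bytes never shares an OS page with the encrypted section.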

The idea is simple. The format that makes it executable is the contribution.

File Format

+-----------------------------------------------------+
| FILE HEADER (48 bytes)                              |
| magic + version + file_uuid + total_rows + offsets  |
+-----------------------------------------------------+
| SCHEMA BLOCK (variable)                             |
+-----------------------------------------------------+
| KEY REFERENCE BLOCK (32 bytes) - key_id, not the key|
+-----------------------------------------------------+
| ROW GROUP 0  (100,000 rows)                         |
|  |-- SECTION 1: Structural columns (unencrypted)    |
|  |   [tightly packed, padded to 4KB boundary]       |
|  +-- SECTION 2: PQC columns (AES-256-GCM per row)   |
|      [starts on 4KB OS page boundary]               |
+-----------------------------------------------------+
| ROW GROUP 1 ... N                                   |
+-----------------------------------------------------+
| FILE FOOTER                                         |
|  |-- Row group offset table                         |
|  |-- CRYPTO MANIFEST (flat array, O(1) lookup)      |
|  +-- FOOTER HEADER (40 bytes) + CRC32               |
+-----------------------------------------------------+
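The O(1) manifest lookup can be illustrated with index arithmetic over the layout above. This is a sketch under stated assumptions: pages are assumed not to span row groups, and each manifest entry is assumed to hold exactly one 1,088-byte ML-KEM-768 ciphertext; the real entry layout may carry additional fields.

```python
ROW_GROUP_ROWS = 100_000  # rows per row group, from the layout above
PAGE_ROWS = 4_096         # rows per page
KEM_CT_BYTES = 1_088      # ML-KEM-768 ciphertext size (FIPS 203)

# Assumption: pages do not span row groups, so each group holds
# ceil(100000 / 4096) pages.
PAGES_PER_GROUP = -(-ROW_GROUP_ROWS // PAGE_ROWS)


def manifest_entry_offset(row: int, manifest_base: int,
                          entry_size: int = KEM_CT_BYTES) -> int:
    """Byte offset of the manifest entry for the page containing `row`."""
    group, row_in_group = divmod(row, ROW_GROUP_ROWS)
    page_index = group * PAGES_PER_GROUP + row_in_group // PAGE_ROWS
    return manifest_base + page_index * entry_size


total_pages = (1_000_000 // ROW_GROUP_ROWS) * PAGES_PER_GROUP
print(PAGES_PER_GROUP, total_pages)
print(manifest_entry_offset(0, 0),
      manifest_entry_offset(4_096, 0),
      manifest_entry_offset(100_000, 0))
```

Under these assumptions a 1M-row file has 10 row groups of 25 pages each, which gives the 250 KEM operations per million rows quoted earlier.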

Key Management

# Python
pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# CLI (build from source)
./qpqt keygen --out-pub pub.bin --out-sec sec.bin

  • Public key (1184 bytes) - safe to share with writers.
  • Secret key (2400 bytes) - never share, never commit.
  • Key ID (16 bytes) - stored in the file header, not the key itself.

If you lose the secret key, data encrypted with its public key is permanently unrecoverable.

Environment   Recommended key storage
Local dev     Outside repo, e.g. ~/.qpqt/keys/
AWS           AWS KMS + Secrets Manager
Azure         Azure Key Vault
GCP           Cloud KMS
Databricks    dbutils.secrets
On-premise    HashiCorp Vault or HSM

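Key rotation never requires rewriting existing data files - QPQT stores a key_id reference in the header, not the key itself. New files can be written under a new keypair and key_id, while each existing file stays readable as long as the secret key its key_id refers to is retained.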
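The key_id indirection can be pictured as a lookup table kept outside the data files. `KeyRegistry` below is a hypothetical in-memory stand-in for a KMS, not part of the qpqt API, and the key bytes are placeholders.

```python
class KeyRegistry:
    """Toy in-memory stand-in for a KMS: maps key_id -> secret key bytes."""

    def __init__(self):
        self._secrets = {}
        self.active_key_id = None

    def add(self, key_id: bytes, secret_key: bytes, make_active: bool = True) -> None:
        self._secrets[key_id] = secret_key
        if make_active:
            self.active_key_id = key_id  # new files are written under this key

    def secret_for(self, key_id: bytes) -> bytes:
        try:
            return self._secrets[key_id]
        except KeyError:
            raise KeyError(f"no secret key for key_id {key_id.hex()}") from None


registry = KeyRegistry()
registry.add(b"\x01" * 16, b"old-secret")  # original key
registry.add(b"\x02" * 16, b"new-secret")  # rotation: new key becomes active

# A file written before rotation still resolves through its stored key_id.
old_file_key_id = b"\x01" * 16
print(registry.secret_for(old_file_key_id), registry.active_key_id.hex())
```

Rotation in this model only changes which key_id new writers use; old files are untouched, but their secret keys must stay in the registry for as long as the files need to be read.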

Build from Source

For CLI usage or contributing:

# Prerequisites: Ubuntu 22.04+, CMake 3.16+, C++17, OpenSSL dev headers
bash scripts/install_deps.sh   # builds liboqs from source
mkdir build && cd build
cmake .. && make -j$(nproc)
./qpqt_tests                   # run all 39 tests
./qpqt_bench                   # reproduce the benchmark table

Ecosystem Integration

Tool                      How
Python / pandas           pip install qpqt
CLI                       qpqt encrypt/decrypt/inspect on CSV or Parquet (build from source)
DuckDB / Polars / Spark   qpqt_arrow export produces structural columns as Arrow IPC

Roadmap

  • v0.1 (current): PyPI wheel, full crypto stack, CLI, Python bindings, Arrow export, 39 tests
  • v0.2: pandas read_qpqt / to_qpqt one-liners, Parquet read/write in CLI, DuckDB recipe
  • v1.0: Spark DataSource connector, ML-DSA-65 metadata signatures, threat model doc
  • v2.0: Distributed operation, S3/Azure direct integration

License

MIT

Author

Rohan Prabhakar

Download files

No source distribution is available for this release.

Built Distributions

File                                               Size     Python        Platform
qpqt-0.3.0-cp312-cp312-manylinux_2_28_x86_64.whl   4.1 MB   CPython 3.12  manylinux (glibc 2.28+), x86-64
qpqt-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl   4.1 MB   CPython 3.11  manylinux (glibc 2.28+), x86-64
qpqt-0.3.0-cp310-cp310-manylinux_2_28_x86_64.whl   4.1 MB   CPython 3.10  manylinux (glibc 2.28+), x86-64
qpqt-0.3.0-cp39-cp39-manylinux_2_28_x86_64.whl     4.1 MB   CPython 3.9   manylinux (glibc 2.28+), x86-64
qpqt-0.3.0-cp38-cp38-manylinux_2_28_x86_64.whl     4.1 MB   CPython 3.8   manylinux (glibc 2.28+), x86-64
