Quantum-Safe Columnar Storage Format with row-granular lazy decryption

QPQT - Quantum-Safe Columnar Storage Format

A purpose-built binary columnar file format (.qpqt) with native post-quantum cryptography and row-granular lazy decryption, a capability no existing columnar format offers.

Cryptographic stack: ML-KEM-768 (FIPS 203) + HKDF-SHA-256 + AES-256-GCM (FIPS 197, SP 800-38D)


Quick Start

pip install qpqt

import qpqt

# Generate a quantum-safe keypair
pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# Encrypt - ssn column is ML-KEM-768 + AES-256-GCM protected
w = qpqt.Writer("customers.qpqt",
                column_names=["id", "state", "ssn"],
                column_types=["int32", "string", "string"],
                pqc_columns=["ssn"],
                public_key=pub, key_id=kid)
w.write_batch({"id":[1,2,3], "state":["CA","NY","TX"], "ssn":["111","222","333"]}, 3)
w.close()

# Read - lazy decryption, only matching rows decrypted
r = qpqt.Reader("customers.qpqt")
r.set_secret_key(sec)
data = r.query(where={"id": 2})

The wheel bundles liboqs and OpenSSL. No system dependencies needed.


The Problem

Enterprises face a dual mandate: regulatory pressure to adopt post-quantum cryptography (CNSA 2.0, NIST FIPS 203, deadline 2035) and the need to maintain query performance on large-scale columnar data warehouses.

The naive approach - applying ML-KEM-768 at the row level - costs 9,600ms for 1M rows even with 4-core parallelization. That establishes the upper bound of the problem: PQC done wrong is unusable at analytical query scale.

The Solution

QPQT redesigns the storage format around PQC cost:

  1. Hybrid KEM construction - ML-KEM-768 is used once per 4,096-row page to encapsulate an AES-256-GCM page key. This reduces KEM operations from 1M to 250 per million rows.

  2. Fully separated column sections - structural (unencrypted) and PQC columns are physically isolated on disk at 4KB OS page boundaries. Predicates run on structural columns without loading the PQC section into CPU cache.

  3. Row-granular lazy decryption - predicates execute on cheap structural columns first. Only the individual rows that survive the predicate trigger KEM decapsulation and AES-GCM decryption.

  4. O(1) manifest lookup - a flat crypto manifest in the footer maps any row to its page key via pointer arithmetic.
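The page arithmetic behind points 1 and 4 can be sketched as follows. This is a minimal illustration, not the library's code: it assumes pages never span a row-group boundary (row groups of 100,000 rows, as in the File Format diagram) and that each manifest entry is exactly one 1,088-byte ML-KEM-768 ciphertext. The function names and the entry layout are hypothetical; real entries may carry additional fields.

```python
ROWS_PER_PAGE = 4_096      # page size from the text
ROWS_PER_GROUP = 100_000   # row-group size from the File Format diagram
KEM_CT_BYTES = 1_088       # ML-KEM-768 ciphertext size (FIPS 203)

# Assumption: pages do not span row groups, so each group rounds up.
PAGES_PER_GROUP = -(-ROWS_PER_GROUP // ROWS_PER_PAGE)  # ceiling division


def page_index(row: int) -> int:
    """Map a global row number to a global page number."""
    group, local = divmod(row, ROWS_PER_GROUP)
    return group * PAGES_PER_GROUP + local // ROWS_PER_PAGE


def manifest_offset(row: int, manifest_base: int = 0) -> int:
    """Byte offset of the row's page entry in a flat, fixed-stride manifest
    (hypothetical layout: one KEM ciphertext per entry)."""
    return manifest_base + page_index(row) * KEM_CT_BYTES


def total_pages(total_rows: int) -> int:
    """Number of pages, and therefore KEM operations, for a file."""
    full_groups, rest = divmod(total_rows, ROWS_PER_GROUP)
    return full_groups * PAGES_PER_GROUP + -(-rest // ROWS_PER_PAGE)


print(PAGES_PER_GROUP)          # 25
print(total_pages(1_000_000))   # 250
```

Under these assumptions, 10 row groups of 25 pages each give the 250 KEM operations per million rows quoted above, and the lookup is a division and a multiplication with no search.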

Performance - Honest Three-Way Comparison

Benchmarked on Kaggle Xeon CPU (4 cores), 1M rows, real ML-KEM-768 + AES-256-GCM.

Two baselines are measured, not estimated:

  • Naive per-row PQC - row-level ML-KEM encapsulation. Establishes the upper bound of the problem. This is what a quick liboqs integration produces.
  • Competent per-page PQC - the correct hybrid KEM construction (per-page ML-KEM + AES-GCM, exactly like QPQT) but stored in a plain layout with no column separation and no lazy decryption. Decrypts every row in the queried column because decryption is chunk-granular. This isolates QPQT's actual contribution.

| Selectivity | Naive per-row | Competent per-page | QPQT | QPQT vs competent |
|---|---|---|---|---|
| 1% | 9,600ms | 2,150ms | 78ms | 27.6x |
| 5% | 9,600ms | 2,111ms | 163ms | 12.9x |
| 10% | 9,600ms | 2,113ms | 264ms | 8.0x |
| 25% | 9,600ms | 2,103ms | 557ms | 3.8x |
| 50% | 9,600ms | 2,148ms | 1,055ms | 2.0x |
| 100% | 9,600ms | 2,147ms | 2,098ms | 1.02x (no advantage) |

QPQT's contribution is row-granular lazy decryption. At low selectivity - the common case for analytical queries - it decrypts far fewer rows than the competent per-page baseline, which yields an 8x to 27.6x speedup at 1-10% selectivity. The advantage shrinks as selectivity rises and reaches parity at 100%: when every row survives the predicate, there is nothing to skip.
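The predicate-first pipeline can be sketched in a few lines. The code below is a toy model under stated assumptions, not QPQT's implementation: `decapsulate` and `decrypt_row` are hypothetical stand-ins that only count calls (no real ML-KEM or AES-GCM is performed), and row groups are ignored so that a page is simply `row // 4096`.

```python
from collections import defaultdict

ROWS_PER_PAGE = 4_096  # page size from the text

counts = {"decapsulations": 0, "row_decrypts": 0}


def decapsulate(page: int) -> str:
    """Stand-in for ML-KEM decapsulation + HKDF: returns a fake page key."""
    counts["decapsulations"] += 1
    return f"key-{page}"


def decrypt_row(page_key: str, row: int, blob: str) -> str:
    """Stand-in for per-row AES-256-GCM decryption: strips a fake prefix."""
    counts["row_decrypts"] += 1
    return blob[len("enc:"):]


def lazy_query(structural, encrypted, predicate):
    # 1. Filter on the cheap, unencrypted column only.
    survivors = [i for i, v in enumerate(structural) if predicate(v)]
    # 2. Group surviving rows by page so each page key is recovered once.
    by_page = defaultdict(list)
    for i in survivors:
        by_page[i // ROWS_PER_PAGE].append(i)
    # 3. One decapsulation per touched page, one decryption per surviving row.
    result = {}
    for page, rows in by_page.items():
        key = decapsulate(page)
        for i in rows:
            result[i] = decrypt_row(key, i, encrypted[i])
    return result


ids = list(range(10_000))
ssn = [f"enc:{i:09d}" for i in ids]
hits = lazy_query(ids, ssn, lambda v: v % 100 == 0)  # 1% selectivity
print(len(hits), counts)  # 100 {'decapsulations': 3, 'row_decrypts': 100}
```

In this model the work scales with the number of surviving rows rather than the number of stored rows, which is the effect the table measures.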

| Metric | Value |
|---|---|
| Write throughput (1M rows) | 534K rows/sec (1,871ms) |
| Structural scan (no crypto) | 5ms, 188M rows/sec |
| File size (1M rows) | 80MB |
| Storage vs naive per-row ML-KEM | 80MB vs ~1,084MB (92% reduction) |

Cryptographic Design

ML-KEM-768 keypair  ->  secret key stored in KMS (file holds only key_id)
                                |
                        Per page (4,096 rows):
                        ML-KEM-768 encapsulate(public_key)
                            |-- kem_ciphertext  ->  CRYPTO MANIFEST
                            +-- shared_secret (32 bytes)
                                        |
                                HKDF-SHA-256(shared_secret, page_context)
                                        +-- aes_page_key (32 bytes, unique per page)
                                                    |
                                            AES-256-GCM per row
                                            |-- IV (12B, deterministic)
                                            |-- ciphertext (= plaintext length)
                                            +-- auth_tag (16B, tamper detection)
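The HKDF step in the diagram can be illustrated with the standard library alone. This is a sketch of RFC 5869 HKDF-SHA-256, not QPQT's code: the `page_context` encoding (a label, the file UUID, and an 8-byte page number) is an assumption made for the example, and the shared secret is random bytes standing in for the output of ML-KEM-768 decapsulation.

```python
import hashlib
import hmac
import os


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if length > 255 * 32:
        raise ValueError("HKDF-SHA-256 output is limited to 8160 bytes")
    # Extract: an absent salt defaults to a string of 32 zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def page_context(file_uuid: bytes, page: int) -> bytes:
    """Hypothetical context encoding; the real format may differ."""
    return b"qpqt-page" + file_uuid + page.to_bytes(8, "big")


shared_secret = os.urandom(32)  # stand-in for the 32-byte ML-KEM shared secret
file_uuid = os.urandom(16)

key0 = hkdf_sha256(shared_secret, page_context(file_uuid, 0))
key1 = hkdf_sha256(shared_secret, page_context(file_uuid, 1))
print(len(key0), key0 != key1)  # 32 True
```

Binding the page context into the `info` input means that even an identical shared secret would yield a different AES key for a different page.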

IV construction and GCM nonce safety

QPQT uses deterministic AES-GCM IVs. This is safe because nonce uniqueness is guaranteed within every key scope. Each 4,096-row page derives its own unique AES-256 key via ML-KEM encapsulation + HKDF-SHA-256. The IV only needs to be unique under a given key, and within a single page key the (row_index, column_index) tuple is unique by construction. The file_uuid component prevents cross-file collision in the event a page key is ever reused across files. There is no nonce reuse under any single key - the failure mode that breaks GCM does not occur.
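One way to build such an IV is sketched below. The byte layout is an assumption for illustration only (the text names the components but not their encoding): a 4-byte tag derived from file_uuid, followed by the row index and column index as 32-bit big-endian integers, for 12 bytes in total.

```python
import hashlib
import struct
import uuid


def deterministic_iv(file_uuid: bytes, row_index: int, column_index: int) -> bytes:
    """Hypothetical 12-byte IV: file tag (4B) | row index (4B) | column index (4B)."""
    file_tag = hashlib.sha256(file_uuid).digest()[:4]
    return struct.pack(">4sII", file_tag, row_index, column_index)


file_uuid = uuid.uuid4().bytes

# Every (row, column) pair in one 4,096-row page with 3 encrypted columns.
ivs = {deterministic_iv(file_uuid, r, c) for r in range(4_096) for c in range(3)}
print(len(ivs) == 4_096 * 3)  # True: no IV repeats under this page key
```

Because distinct (row_index, column_index) pairs always pack to distinct bytes, uniqueness within a page key follows from the encoding itself rather than from randomness. A 4-byte file tag is only a weak guard against cross-file collisions; it is used here to fit the sketch into 12 bytes.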

All components are NIST-approved and quantum-safe:

  • ML-KEM-768: FIPS 203 (replaces RSA/ECDH for key establishment)
  • AES-256-GCM: FIPS 197 (AES) with SP 800-38D (GCM); quantum-safe as a symmetric cipher, since Grover's algorithm at most halves the effective key strength, leaving 128-bit security
  • HKDF-SHA-256: RFC 5869 / SP 800-56C

Why a Separate Format (and not Parquet)?

Parquet already has Modular Encryption - why not derive its AES key from ML-KEM and get quantum-safe Parquet today?

For encryption alone, you could. The encryption is not the contribution.

The contribution is row-granular lazy decryption. Parquet supports predicate pushdown and can skip entire encrypted column chunks via footer statistics. What it cannot do is decrypt only the surviving rows within a chunk that the predicate did not eliminate wholesale. Parquet decrypts at chunk granularity, not surviving-row granularity. Closing that gap requires physically separated structural columns and a per-row-addressable key manifest - a different file layout.

The three conditions no existing format satisfies simultaneously:

  1. Structural columns physically separated from encrypted columns at OS-page boundaries, so the filter never pages the encrypted section into cache.
  2. Every row's decryption key addressable in O(1) without decrypting anything first - the flat manifest in the footer.
  3. Decryption expressible at single-row granularity within a page. Parquet treats the chunk as an atomic encrypted unit.

The idea is simple. The format that makes it executable is the contribution.

File Format

+-----------------------------------------------------+
| FILE HEADER (48 bytes)                              |
| magic + version + file_uuid + total_rows + offsets  |
+-----------------------------------------------------+
| SCHEMA BLOCK (variable)                             |
+-----------------------------------------------------+
| KEY REFERENCE BLOCK (32 bytes) - key_id, not the key|
+-----------------------------------------------------+
| ROW GROUP 0  (100,000 rows)                         |
|  |-- SECTION 1: Structural columns (unencrypted)    |
|  |   [tightly packed, padded to 4KB boundary]       |
|  +-- SECTION 2: PQC columns (AES-256-GCM per row)   |
|      [starts on 4KB OS page boundary]               |
+-----------------------------------------------------+
| ROW GROUP 1 ... N                                   |
+-----------------------------------------------------+
| FILE FOOTER                                         |
|  |-- Row group offset table                         |
|  |-- CRYPTO MANIFEST (flat array, O(1) lookup)      |
|  +-- FOOTER HEADER (40 bytes) + CRC32               |
+-----------------------------------------------------+
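The 4KB alignment and the fixed-size header can be sketched with `struct`. Everything about the field order, widths, magic value, and endianness below is hypothetical; it is chosen only so that the fields named in the diagram add up to 48 bytes. The alignment helper is the generic round-up used to start a section on an OS page boundary.

```python
import struct
import uuid

OS_PAGE = 4_096

# Hypothetical layout: magic(4) | version(4) | file_uuid(16) |
# total_rows(8) | schema_offset(8) | footer_offset(8) = 48 bytes.
HEADER = struct.Struct("<4sI16sQQQ")


def align_up(offset: int, alignment: int = OS_PAGE) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


header = HEADER.pack(b"QPQT", 1, uuid.uuid4().bytes, 1_000_000, HEADER.size, 0)
magic, version, file_uuid, total_rows, schema_off, footer_off = HEADER.unpack(header)

structural_end = HEADER.size + 1_234_567   # illustrative end of Section 1
pqc_start = align_up(structural_end)       # Section 2 begins on a 4KB boundary
padding = pqc_start - structural_end

print(HEADER.size, pqc_start % OS_PAGE, padding < OS_PAGE)  # 48 0 True
```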

Key Management

# Python
pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# CLI (build from source)
./qpqt keygen --out-pub pub.bin --out-sec sec.bin

  • Public key (1184 bytes) - safe to share with writers.
  • Secret key (2400 bytes) - never share, never commit.
  • Key ID (16 bytes) - stored in the file header, not the key itself.

If you lose the secret key, data encrypted with its public key is permanently unrecoverable.
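For local development, a secret key can be written with owner-only permissions and without overwriting an existing key. The helper below is a sketch using only the standard library; `save_secret_key` is a hypothetical name, not part of qpqt, and the demo writes random bytes of the right length into a temporary directory rather than a real key.

```python
import os
import tempfile
from pathlib import Path

SECRET_KEY_BYTES = 2400  # ML-KEM-768 secret key size, from the list above


def save_secret_key(sec: bytes, path: Path) -> None:
    """Write a secret key readable only by the current user."""
    if len(sec) != SECRET_KEY_BYTES:
        raise ValueError(f"expected {SECRET_KEY_BYTES} bytes, got {len(sec)}")
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # O_EXCL refuses to overwrite an existing key; 0o600 = owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(sec)


with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / "keys" / "customers.sec"
    save_secret_key(os.urandom(SECRET_KEY_BYTES), key_path)
    saved_size = key_path.stat().st_size
    saved_mode = key_path.stat().st_mode & 0o777

print(saved_size)  # 2400
```

In real use the path would sit outside the repository, for example under ~/.qpqt/keys/, and the bytes would be the secret key returned by keygen (assuming it is returned as a bytes object).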

| Environment | Recommended key storage |
|---|---|
| Local dev | Outside repo, e.g. ~/.qpqt/keys/ |
| AWS | AWS KMS + Secrets Manager |
| Azure | Azure Key Vault |
| GCP | Cloud KMS |
| Databricks | dbutils.secrets |
| On-premise | HashiCorp Vault or HSM |

Key rotation never requires rewriting existing data files - QPQT stores a key_id reference in the header, not the key itself.
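Because files carry only a key_id, a reader can resolve the right secret key at open time. The class below is a hypothetical sketch of that lookup, not a qpqt API; in production the dictionary would be replaced by calls to a KMS or secrets manager.

```python
import os


class Keyring:
    """Maps 16-byte key IDs to secret keys."""

    def __init__(self) -> None:
        self._secrets = {}

    def add(self, key_id: bytes, secret_key: bytes) -> None:
        if len(key_id) != 16:
            raise ValueError("key_id must be 16 bytes")
        self._secrets[key_id] = secret_key

    def secret_for(self, key_id: bytes) -> bytes:
        try:
            return self._secrets[key_id]
        except KeyError:
            raise KeyError(f"no secret key for key_id {key_id.hex()}") from None


ring = Keyring()
old_id, new_id = os.urandom(16), os.urandom(16)
ring.add(old_id, b"old-secret-key-bytes")   # files written before rotation
ring.add(new_id, b"new-secret-key-bytes")   # files written after rotation

# Old and new files stay readable side by side; nothing is rewritten.
print(ring.secret_for(old_id) != ring.secret_for(new_id))  # True
```

Retiring an old key remains a policy decision: once its entry is removed, files that reference that key_id can no longer be read.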

Build from Source

For CLI usage or contributing:

# Prerequisites: Ubuntu 22.04+, CMake 3.16+, C++17, OpenSSL dev headers
bash scripts/install_deps.sh   # builds liboqs from source
mkdir build && cd build
cmake .. && make -j$(nproc)
./qpqt_tests                   # run all 39 tests
./qpqt_bench                   # reproduce the benchmark table

Ecosystem Integration

| Tool | How |
|---|---|
| Python / pandas | pip install qpqt |
| CLI | qpqt encrypt/decrypt/inspect on CSV or Parquet (build from source) |
| DuckDB / Polars / Spark | qpqt_arrow export produces structural columns as Arrow IPC |

Roadmap

  • v0.1 (current): PyPI wheel, full crypto stack, CLI, Python bindings, Arrow export, 39 tests
  • v0.2: pandas read_qpqt / to_qpqt one-liners, Parquet read/write in CLI, DuckDB recipe
  • v1.0: Spark DataSource connector, ML-DSA-65 metadata signatures, threat model doc
  • v2.0: Distributed operation, S3/Azure direct integration

License

MIT

Author

Rohan Prabhakar

Download files

No source distribution files are available for this release.

Built Distributions

All wheels are 4.1 MB and target manylinux (glibc 2.28+) on x86-64.

  • qpqt-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl (CPython 3.12)
  • qpqt-0.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (CPython 3.11)
  • qpqt-0.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (CPython 3.10)
  • qpqt-0.2.0-cp39-cp39-manylinux_2_28_x86_64.whl (CPython 3.9)
  • qpqt-0.2.0-cp38-cp38-manylinux_2_28_x86_64.whl (CPython 3.8)

File hashes

qpqt-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl
  • SHA256: a6b90e2d5a7c32550abf8e5b532dfeeb7f9e5abe91c4365e46c90fed0f46c931
  • MD5: e02ab4b2c4b7ed9d5ffa5a595e70851e
  • BLAKE2b-256: 7fd22521daa10c5d42d3a8a18dce766510915d315018a9ea4aaecad66dd44253

qpqt-0.2.0-cp311-cp311-manylinux_2_28_x86_64.whl
  • SHA256: bfd733d83515bb2e97ab919d994fe9265df0b657b6170ed13c82d5940c387405
  • MD5: 4b3e535e2a695603da3e2b79d02d6c63
  • BLAKE2b-256: 5619fef1805e011bb265be5c22278a0eecb62761465923ca8220982bc67c2ba6

qpqt-0.2.0-cp310-cp310-manylinux_2_28_x86_64.whl
  • SHA256: 115731aa1cb621054b15999b2a0a8d8c35d0dc9a5c333a854426dd295e24da00
  • MD5: 597e822eb147a3310eb31c30fa4b0d1e
  • BLAKE2b-256: bd6eee0071cb72a950b88840f0f7d4ee3920968984b3f6db8278917e326c6d0c

qpqt-0.2.0-cp39-cp39-manylinux_2_28_x86_64.whl
  • SHA256: 688995d2c72a9f7b27a5a58f6485cf7428778fb16928119bd6ba406ad7c7da57
  • MD5: 3fda0bcffbfe19220e48c23da6ef992d
  • BLAKE2b-256: 0b5e5290f46a437b3dcc6a5fff1a7a0b238780e22c84c08dfecb01d0c9c639a5

qpqt-0.2.0-cp38-cp38-manylinux_2_28_x86_64.whl
  • SHA256: 281b11cf9de5cfd41abe6abf26c61a251a73ea134c57485b06c8589c097d0ed0
  • MD5: a82e1efe587eb4d7d60f89c53e2a58fe
  • BLAKE2b-256: de9b484d544a6a3a5b65549ec6eb72b866d6a64dbcdf15c1238c02a7bc9ceced

Upload metadata (as listed for the CPython 3.9 and 3.8 wheels): uploaded using Trusted Publishing, via twine/6.1.0 CPython/3.13.12.

Provenance

Attestation bundles were made for each of the five wheels above. Publisher: wheels.yml on Rohan-Prabhakar/QPQT. Values shown here reflect the state when the release was signed and may no longer be current.
