Quantum-Safe Columnar Storage Format with row-granular lazy decryption

QPQT - Quantum-Safe Columnar Storage Format

A purpose-built binary columnar file format (.qpqt) with native post-quantum cryptography and row-granular lazy decryption, a capability no existing columnar format offers.

Cryptographic stack: ML-KEM-768 (FIPS 203) + HKDF-SHA-256 + AES-256-GCM (FIPS 197 / SP 800-38D)


Quick Start

pip install qpqt

import qpqt

# Generate a quantum-safe keypair
pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# Encrypt - ssn column is ML-KEM-768 + AES-256-GCM protected
w = qpqt.Writer("customers.qpqt",
                column_names=["id", "state", "ssn"],
                column_types=["int32", "string", "string"],
                pqc_columns=["ssn"],
                public_key=pub, key_id=kid)
w.write_batch({"id":[1,2,3], "state":["CA","NY","TX"], "ssn":["111","222","333"]}, 3)
w.close()

# Read - lazy decryption, only matching rows decrypted
r = qpqt.Reader("customers.qpqt")
r.set_secret_key(sec)
data = r.query(where={"id": 2})

The wheel bundles liboqs and OpenSSL. No system dependencies needed.


The Problem

Enterprises face a dual mandate: regulatory pressure to adopt post-quantum cryptography (CNSA 2.0, NIST FIPS 203, deadline 2035) and the need to maintain query performance on large-scale columnar data warehouses.

The naive approach - applying ML-KEM-768 at the row level - costs 9,600ms for 1M rows even with 4-core parallelization. That establishes the upper bound of the problem: PQC done wrong is unusable at analytical query scale.

The Solution

QPQT redesigns the storage format around PQC cost:

  1. Hybrid KEM construction - ML-KEM-768 is used once per 4,096-row page to encapsulate an AES-256-GCM page key. This reduces KEM operations from 1M to 250 per million rows.

  2. Fully separated column sections - structural (unencrypted) and PQC columns are physically isolated on disk at 4KB OS page boundaries. Predicates run on structural columns without loading the PQC section into CPU cache.

  3. Row-granular lazy decryption - predicates execute on cheap structural columns first. Only the individual rows that survive the predicate trigger KEM decapsulation and AES-GCM decryption.

  4. O(1) manifest lookup - a flat crypto manifest in the footer maps any row to its page key via pointer arithmetic.
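
The four steps above can be sketched as a small planning routine: filter on a plaintext column first, then bucket the surviving rows by the page whose key they need. This is a minimal sketch, not QPQT's reader. The names `page_of`, `total_pages` and `plan_decryption` are hypothetical, and it assumes that pages restart at each 100,000-row row group (one reading that reproduces the 250 KEM operations per million rows quoted above) and that a reader decapsulates once per touched page.

```python
from collections import defaultdict

PAGE_ROWS = 4096          # rows per page key, from the text
ROW_GROUP_ROWS = 100_000  # rows per row group, from the File Format section


def page_of(row):
    """Return (row_group, page_in_group) for a global row index.

    ASSUMPTION: pages restart at every row-group boundary.
    """
    group, offset = divmod(row, ROW_GROUP_ROWS)
    return group, offset // PAGE_ROWS


def total_pages(total_rows):
    """Number of page keys (one KEM encapsulation each) needed at write time."""
    full_groups, tail = divmod(total_rows, ROW_GROUP_ROWS)
    pages_per_full_group = -(-ROW_GROUP_ROWS // PAGE_ROWS)  # ceiling division
    return full_groups * pages_per_full_group + -(-tail // PAGE_ROWS)


def plan_decryption(structural_column, predicate):
    """Filter on a plaintext column, then bucket surviving rows by page."""
    plan = defaultdict(list)
    for row, value in enumerate(structural_column):
        if predicate(value):
            plan[page_of(row)].append(row)
    return dict(plan)


# Two surviving rows in two different pages: two page keys, two row decryptions.
ids = list(range(10_000))
plan = plan_decryption(ids, lambda v: v in (2, 5000))
print(total_pages(1_000_000), plan)
```

Only the pages that appear as keys in `plan` would need a KEM decapsulation, and only the listed rows would need AES-GCM decryption; everything else in the PQC section stays untouched.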

Performance - Honest Three-Way Comparison

Benchmarked on Kaggle Xeon CPU (4 cores), 1M rows, real ML-KEM-768 + AES-256-GCM.

QPQT is compared against two baselines, both measured rather than estimated:

  • Naive per-row PQC - row-level ML-KEM encapsulation. Establishes the upper bound of the problem. This is what a quick liboqs integration produces.
  • Competent per-page PQC - the correct hybrid KEM construction (per-page ML-KEM + AES-GCM, exactly like QPQT) but stored in a plain layout with no column separation and no lazy decryption. Decrypts every row in the queried column because decryption is chunk-granular. This isolates QPQT's actual contribution.
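
| Selectivity | Naive per-row | Competent per-page | QPQT    | QPQT vs competent     |
|-------------|---------------|--------------------|---------|-----------------------|
| 1%          | 9,600ms       | 2,150ms            | 78ms    | 27.6x                 |
| 5%          | 9,600ms       | 2,111ms            | 163ms   | 12.9x                 |
| 10%         | 9,600ms       | 2,113ms            | 264ms   | 8.0x                  |
| 25%         | 9,600ms       | 2,103ms            | 557ms   | 3.8x                  |
| 50%         | 9,600ms       | 2,148ms            | 1,055ms | 2.0x                  |
| 100%        | 9,600ms       | 2,147ms            | 2,098ms | 1.02x (no advantage)  |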

QPQT's contribution is row-granular lazy decryption. At low selectivity - the common case for analytical queries - it decrypts far fewer rows than a competent but layout-unaware implementation, giving an 8-27x speedup at 1-10% selectivity. The advantage shrinks as selectivity rises and reaches parity at 100%: when every row survives the predicate, there is nothing to skip.
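
The speedup column can be recomputed from the measured timings, and the same numbers give a rough marginal cost per decrypted row. The snippet below is only a reading of the published table, not a model of the implementation; the function names are made up for illustration.

```python
# (selectivity %, competent per-page ms, QPQT ms), copied from the table above
RESULTS = [
    (1, 2150, 78),
    (5, 2111, 163),
    (10, 2113, 264),
    (25, 2103, 557),
    (50, 2148, 1055),
    (100, 2147, 2098),
]
TOTAL_ROWS = 1_000_000


def speedups(results):
    """Ratio of competent per-page time to QPQT time, per selectivity."""
    return {sel: competent / qpqt for sel, competent, qpqt in results}


def marginal_cost_us_per_row(results, total_rows=TOTAL_ROWS):
    """Slope between the first and last rows, in microseconds per decrypted row."""
    (sel_lo, _, ms_lo), (sel_hi, _, ms_hi) = results[0], results[-1]
    extra_rows = (sel_hi - sel_lo) / 100 * total_rows
    return (ms_hi - ms_lo) * 1000 / extra_rows


for sel, ratio in speedups(RESULTS).items():
    print(f"{sel:>3}%  {ratio:5.2f}x")
print(f"{marginal_cost_us_per_row(RESULTS):.2f} us per additional decrypted row")
```

Read this way, QPQT's query time grows roughly linearly with the number of surviving rows, at about 2 microseconds per row, which is why the advantage over the per-page baseline disappears at 100% selectivity.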

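| Metric                          | Value                            |
|---------------------------------|----------------------------------|
| Write throughput (1M rows)      | 534K rows/sec (1,871ms)          |
| Structural scan (no crypto)     | 5ms, 188M rows/sec               |
| File size (1M rows)             | 80MB                             |
| Storage vs naive per-row ML-KEM | 80MB vs ~1,084MB (92% reduction) |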
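
The figures in this table can be cross-checked with a few lines of arithmetic. The only value below that does not come from this page is the ML-KEM-768 ciphertext size of 1,088 bytes (from FIPS 203). How the ~1,084MB naive figure was accounted for is not stated here, so the last two lines are an order-of-magnitude comparison rather than a reproduction.

```python
ROWS = 1_000_000
WRITE_MS = 1_871
QPQT_MB = 80
NAIVE_MB = 1_084
PAGES = 250                        # KEM operations per million rows, from "The Solution"
MLKEM768_CIPHERTEXT_BYTES = 1_088  # ciphertext size for ML-KEM-768 (FIPS 203)

write_rows_per_sec = ROWS / (WRITE_MS / 1000)
reduction = 1 - QPQT_MB / NAIVE_MB

# KEM ciphertext volume alone, in decimal units
naive_kem_mb = ROWS * MLKEM768_CIPHERTEXT_BYTES / 1e6   # one ciphertext per row
paged_kem_kb = PAGES * MLKEM768_CIPHERTEXT_BYTES / 1e3  # one ciphertext per page

print(f"write throughput: {write_rows_per_sec:,.0f} rows/sec")
print(f"storage reduction: {reduction:.1%}")
print(f"KEM ciphertexts: {naive_kem_mb:,.0f} MB per-row vs {paged_kem_kb:,.0f} KB per-page")
```

The per-row KEM ciphertexts dominate the naive file size, while the per-page construction keeps them to a few hundred kilobytes.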

Cryptographic Design

ML-KEM-768 keypair  ->  secret key stored in KMS (file holds only key_id)
                                |
                        Per page (4,096 rows):
                        ML-KEM-768 encapsulate(public_key)
                            |-- kem_ciphertext  ->  CRYPTO MANIFEST
                            +-- shared_secret (32 bytes)
                                        |
                                HKDF-SHA-256(shared_secret, page_context)
                                        +-- aes_page_key (32 bytes, unique per page)
                                                    |
                                            AES-256-GCM per row
                                            |-- IV (12B, deterministic)
                                            |-- ciphertext (= plaintext length)
                                            +-- auth_tag (16B, tamper detection)

IV construction and GCM nonce safety

QPQT uses deterministic AES-GCM IVs. This is safe because GCM only requires that an IV never repeat under the same key, and that holds within every key scope. Each 4,096-row page derives its own AES-256 key via ML-KEM encapsulation + HKDF-SHA-256, and within a single page key the (row_index, column_index) tuple is unique by construction. The file_uuid component prevents cross-file collisions should a page key ever be reused across files. Nonce reuse under a single key, the failure mode that breaks GCM, therefore does not occur.
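
To make the key scope and IV uniqueness argument concrete, here is a stdlib-only sketch of the derivation chain. HKDF follows RFC 5869, but the `page_context` encoding and the 12-byte IV layout are hypothetical: this page names the inputs (file_uuid, row index, column index) without specifying how QPQT packs them. The random `shared_secret` stands in for the output of an ML-KEM-768 encapsulation.

```python
import hashlib
import hmac
import os
import struct
import uuid


def hkdf_sha256(ikm, salt, info, length=32):
    """HKDF (RFC 5869) extract-then-expand with SHA-256."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def page_context(file_uuid, row_group, page):
    """HYPOTHETICAL context string binding a key to one page of one file."""
    return b"qpqt-page" + file_uuid.bytes + struct.pack(">II", row_group, page)


def row_iv(file_uuid, row_in_page, column):
    """HYPOTHETICAL 12-byte IV: 4-byte file tag | 4-byte row | 2-byte column | 2 zero bytes."""
    file_tag = hashlib.sha256(file_uuid.bytes).digest()[:4]
    return struct.pack(">4sIHH", file_tag, row_in_page, column, 0)


file_id = uuid.uuid4()
shared_secret = os.urandom(32)  # stand-in for the ML-KEM-768 shared secret
page_key = hkdf_sha256(shared_secret, b"", page_context(file_id, 0, 0))

# Every (row, column) cell in one page gets a distinct IV under that page key.
ivs = {row_iv(file_id, r, c) for r in range(4096) for c in range(3)}
print(len(page_key), len(ivs))
```

Because the IV is a pure function of position, it is unique within a page key as long as each (row, column) pair is encrypted only once under that key.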

All components are NIST-approved and quantum-safe:

  • ML-KEM-768: FIPS 203 (replaces RSA/ECDH for key establishment)
  • AES-256-GCM: FIPS 197 (AES) with SP 800-38D (GCM mode); quantum-safe as a symmetric primitive, since Grover's algorithm at best halves the effective key strength, leaving 128-bit security
  • HKDF-SHA-256: RFC 5869 / SP 800-56C

Why a Separate Format (and not Parquet)?

Parquet already has Modular Encryption - why not derive its AES key from ML-KEM and get quantum-safe Parquet today?

For encryption alone, you could. The encryption is not the contribution.

The contribution is row-granular lazy decryption. Parquet supports predicate pushdown and can skip entire encrypted column chunks via footer statistics. What it cannot do is decrypt only the surviving rows within a chunk that the predicate did not eliminate wholesale. Parquet decrypts at chunk granularity, not surviving-row granularity. Closing that gap requires physically separated structural columns and a per-row-addressable key manifest - a different file layout.

The three conditions no existing format satisfies simultaneously:

  1. Structural columns physically separated from encrypted columns at OS-page boundaries, so the filter never pages the encrypted section into cache.
  2. Every row's decryption key addressable in O(1) without decrypting anything first - the flat manifest in the footer.
  3. Decryption expressible at single-row granularity within a page. Parquet treats the chunk as an atomic encrypted unit.

The idea is simple. The format that makes it executable is the contribution.

File Format

+-----------------------------------------------------+
| FILE HEADER (48 bytes)                              |
| magic + version + file_uuid + total_rows + offsets  |
+-----------------------------------------------------+
| SCHEMA BLOCK (variable)                             |
+-----------------------------------------------------+
| KEY REFERENCE BLOCK (32 bytes) - key_id, not the key|
+-----------------------------------------------------+
| ROW GROUP 0  (100,000 rows)                         |
|  |-- SECTION 1: Structural columns (unencrypted)    |
|  |   [tightly packed, padded to 4KB boundary]       |
|  +-- SECTION 2: PQC columns (AES-256-GCM per row)   |
|      [starts on 4KB OS page boundary]               |
+-----------------------------------------------------+
| ROW GROUP 1 ... N                                   |
+-----------------------------------------------------+
| FILE FOOTER                                         |
|  |-- Row group offset table                         |
|  |-- CRYPTO MANIFEST (flat array, O(1) lookup)      |
|  +-- FOOTER HEADER (40 bytes) + CRC32               |
+-----------------------------------------------------+
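
The two layout properties in the diagram, 4KB alignment of the PQC section and O(1) manifest lookup, both reduce to integer arithmetic. The sketch below assumes fixed-size manifest entries consisting only of the 1,088-byte ML-KEM-768 ciphertext and pages that restart at each row group; the real entry layout is not specified on this page, and `align_up` and `manifest_entry_offset` are illustrative names.

```python
OS_PAGE = 4096             # 4KB alignment, from the diagram
PAGE_ROWS = 4096           # rows per page key
ROW_GROUP_ROWS = 100_000   # rows per row group
KEM_CT_BYTES = 1088        # ML-KEM-768 ciphertext size
ENTRY_BYTES = KEM_CT_BYTES # ASSUMPTION: a manifest entry is just the KEM ciphertext


def align_up(offset, boundary=OS_PAGE):
    """Round an offset up to the next multiple of `boundary`."""
    return (offset + boundary - 1) // boundary * boundary


def manifest_entry_offset(manifest_start, row):
    """Byte offset of the manifest entry covering a global row index."""
    group, offset_in_group = divmod(row, ROW_GROUP_ROWS)
    pages_per_group = -(-ROW_GROUP_ROWS // PAGE_ROWS)  # ceiling division
    page_index = group * pages_per_group + offset_in_group // PAGE_ROWS
    return manifest_start + page_index * ENTRY_BYTES


# A structural section ending at byte 10,000 pushes the PQC section to 12,288.
print(align_up(10_000))
# Row 100,000 is the first row of row group 1, i.e. flat page index 25.
print(manifest_entry_offset(0, 100_000) // ENTRY_BYTES)
```

With fixed-size entries there is no index to search: the entry for any row is found with a division and a multiplication, which is what makes per-row key lookup constant time.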

Key Management

# Python
pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# CLI (build from source)
./qpqt keygen --out-pub pub.bin --out-sec sec.bin

  • Public key (1184 bytes) - safe to share with writers.
  • Secret key (2400 bytes) - never share, never commit.
  • Key ID (16 bytes) - stored in the file's key reference block; the key itself is never written to the file.

If you lose the secret key, data encrypted with its public key is permanently unrecoverable.

| Environment | Recommended key storage          |
|-------------|----------------------------------|
| Local dev   | Outside repo, e.g. ~/.qpqt/keys/ |
| AWS         | AWS KMS + Secrets Manager        |
| Azure       | Azure Key Vault                  |
| GCP         | Cloud KMS                        |
| Databricks  | dbutils.secrets                  |
| On-premise  | HashiCorp Vault or HSM           |

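Key rotation never requires rewriting existing data files: each file stores only a key_id reference, not the key itself, so files written under an older key stay readable as long as that key's secret remains retrievable by its key_id.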
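
One way to picture rotation by key_id is a small key ring that maps each 16-byte key_id to its secret key and tracks which key new files should use. `KeyRing` is a hypothetical in-memory stand-in for a KMS lookup, not part of the qpqt package, and the byte strings below are placeholders rather than real key material.

```python
class KeyRing:
    """In-memory stand-in for a KMS lookup keyed by the 16-byte key_id."""

    def __init__(self):
        self._secrets = {}
        self.active_key_id = None  # key_id that new files should be written with

    def add(self, key_id, secret_key, make_active=True):
        if len(key_id) != 16:
            raise ValueError("key_id must be 16 bytes")
        self._secrets[key_id] = secret_key
        if make_active:
            self.active_key_id = key_id

    def secret_for(self, key_id):
        """Resolve the secret key for the key_id recorded in a file."""
        try:
            return self._secrets[key_id]
        except KeyError:
            raise LookupError(f"no secret key for key_id {key_id.hex()}") from None


ring = KeyRing()
old_id, new_id = b"\x01" * 16, b"\x02" * 16
ring.add(old_id, b"old-secret-placeholder")
ring.add(new_id, b"new-secret-placeholder")  # rotation: new writes use new_id

# Files written before the rotation still resolve through their recorded key_id.
print(ring.active_key_id == new_id, ring.secret_for(old_id))
```

Rotation then only changes which key_id writers use; retiring an old secret is a separate decision, since removing it makes the files that reference it unreadable.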

Build from Source

For CLI usage or contributing:

# Prerequisites: Ubuntu 22.04+, CMake 3.16+, C++17, OpenSSL dev headers
bash scripts/install_deps.sh   # builds liboqs from source
mkdir build && cd build
cmake .. && make -j$(nproc)
./qpqt_tests                   # run all 39 tests
./qpqt_bench                   # reproduce the benchmark table

Ecosystem Integration

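| Tool                    | How                                                                |
|-------------------------|--------------------------------------------------------------------|
| Python / pandas         | pip install qpqt                                                   |
| CLI                     | qpqt encrypt/decrypt/inspect on CSV or Parquet (build from source) |
| DuckDB / Polars / Spark | qpqt_arrow export produces structural columns as Arrow IPC         |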

Roadmap

  • v0.1 (current): PyPI wheel, full crypto stack, CLI, Python bindings, Arrow export, 39 tests
  • v0.2: pandas read_qpqt / to_qpqt one-liners, Parquet read/write in CLI, DuckDB recipe
  • v1.0: Spark DataSource connector, ML-DSA-65 metadata signatures, threat model doc
  • v2.0: Distributed operation, S3/Azure direct integration

License

MIT

Author

Rohan Prabhakar

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • qpqt-0.1.1-cp312-cp312-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.12, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.1-cp311-cp311-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.11, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.10, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.1-cp39-cp39-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.9, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.1-cp38-cp38-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.8, manylinux: glibc 2.28+ x86-64
