Quantum-Safe Columnar Storage Format with row-granular lazy decryption

QPQT - Quantum-Safe Columnar Storage Format

A purpose-built binary columnar file format (.qpqt) with native post-quantum cryptography and row-granular lazy decryption, a capability no existing columnar format offers.

Cryptographic stack: ML-KEM-768 (FIPS 203) + HKDF-SHA3-256 + AES-256-GCM (FIPS 197, SP 800-38D)


Quick Start

# 1. Install dependencies (liboqs + OpenSSL)
bash scripts/install_deps.sh

# 2. Build
mkdir build && cd build && cmake .. && make -j$(nproc) && cd ..

# 3. Generate a quantum-safe keypair
./build/qpqt keygen --out-pub pub.bin --out-sec sec.bin

# 4. Encrypt a CSV - ssn and dob become quantum-safe encrypted columns
./build/qpqt encrypt \
    --input customers.csv \
    --pqc-columns ssn,dob \
    --pub-key pub.bin \
    --output customers.qpqt

# 5. Inspect the file (no keys required - safe to run anywhere)
./build/qpqt inspect --input customers.qpqt

# 6. Decrypt for authorized users (lazy - only matching rows decrypted)
./build/qpqt decrypt \
    --input customers.qpqt \
    --sec-key sec.bin \
    --where "customer_id=12345" \
    --output result.csv

Python:

import pandas as pd
import qpqt

pub, sec = qpqt.keygen()
kid = qpqt.generate_key_id()

# Write
w = qpqt.Writer("customers.qpqt",
                column_names=["id", "state", "ssn"],
                column_types=["int32", "string", "string"],
                pqc_columns=["ssn"],
                public_key=pub, key_id=kid)
w.write_batch({"id":[1,2,3], "state":["CA","NY","TX"], "ssn":["111","222","333"]}, 3)
w.close()

# Read - lazy decryption, only matching rows decrypted
r = qpqt.Reader("customers.qpqt")
r.set_secret_key(sec)
df = pd.DataFrame(r.query(where={"id": 2}))

The Problem

Enterprises face a dual mandate: regulatory pressure to adopt post-quantum cryptography (CNSA 2.0, NIST FIPS 203, deadline 2035) and the need to maintain query performance on large-scale columnar data warehouses.

The naive approach - applying ML-KEM-768 at the row level - costs 9,600ms for 1M rows even with 4-core parallelization. That establishes the upper bound of the problem: PQC done wrong is unusable at analytical query scale.

The Solution

QPQT redesigns the storage format around PQC cost:

  1. Hybrid KEM construction - ML-KEM-768 is used once per 4,096-row page to encapsulate an AES-256-GCM page key. This reduces KEM operations from 1M to 250 per million rows.

  2. Fully separated column sections - structural (unencrypted) and PQC columns are physically isolated on disk at 4KB OS page boundaries. Predicates run on structural columns without loading the PQC section into CPU cache.

  3. Row-granular lazy decryption - predicates execute on cheap structural columns first. Only the individual rows that survive the predicate trigger KEM decapsulation and AES-GCM decryption.

  4. O(1) manifest lookup - a flat crypto manifest in the footer maps any row to its page key via pointer arithmetic.
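
The ordering in steps 3 and 4 can be sketched in a few lines. This is an illustrative model only, not the library's reader: the function names `plan_lazy_decrypt` and `crypto_cost` are hypothetical, and no real cryptography is performed. It shows that the predicate touches only structural values, and that the crypto work is then one KEM decapsulation per touched page plus one AES-GCM decryption per surviving row.

```python
from collections import defaultdict

PAGE_ROWS = 4096  # rows covered by one ML-KEM encapsulation, per the design above


def plan_lazy_decrypt(structural_values, predicate, page_rows=PAGE_ROWS):
    """Return {page_index: [row_index, ...]} for rows that pass the predicate.

    Only structural (unencrypted) values are inspected here; no key
    material is needed until the plan is executed.
    """
    plan = defaultdict(list)
    for row_index, value in enumerate(structural_values):
        if predicate(value):
            plan[row_index // page_rows].append(row_index)
    return dict(plan)


def crypto_cost(plan):
    """Count the crypto operations needed for one PQC column under this model."""
    return {
        "kem_decapsulations": len(plan),  # one per touched page
        "aes_gcm_decryptions": sum(len(rows) for rows in plan.values()),  # one per surviving row
    }


# Usage: 10,000 rows, and the predicate keeps a single customer_id.
customer_ids = list(range(10_000))
plan = plan_lazy_decrypt(customer_ids, lambda cid: cid == 5000)
print(plan)               # {1: [5000]}
print(crypto_cost(plan))  # {'kem_decapsulations': 1, 'aes_gcm_decryptions': 1}
```

A chunk-granular reader would decrypt every row of the queried column for the same query, which is the gap the benchmark section measures.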

Performance - Honest Three-Way Comparison

Benchmarked on Kaggle Xeon CPU (4 cores), 1M rows, real ML-KEM-768 + AES-256-GCM.

Two baselines are measured, not estimated:

  • Naive per-row PQC - row-level ML-KEM encapsulation. Establishes the upper bound of the problem. This is what a quick liboqs integration produces.
  • Competent per-page PQC - the correct hybrid KEM construction (per-page ML-KEM + AES-GCM, exactly like QPQT) but stored in a plain layout with no column separation and no lazy decryption. It decrypts every row in the queried column because decryption is chunk-granular. This isolates QPQT's actual contribution.

Selectivity   Naive per-row   Competent per-page   QPQT      QPQT vs competent
1%            9,600ms         2,150ms              78ms      27.6x
5%            9,600ms         2,111ms              163ms     12.9x
10%           9,600ms         2,113ms              264ms     8.0x
25%           9,600ms         2,103ms              557ms     3.8x
50%           9,600ms         2,148ms              1,055ms   2.0x
100%          9,600ms         2,147ms              2,098ms   1.02x (no advantage)

Reading this table honestly:

QPQT's contribution is row-granular lazy decryption. When few rows match the predicate (1-10% selectivity here), which is the common case for analytical queries, it decrypts 10-100x fewer rows than a competent implementation without column separation, giving an 8-27x speedup.

As selectivity approaches 100%, the advantage shrinks to parity: when every row survives the predicate, QPQT and the competent baseline do identical work. At 100% selectivity QPQT offers no advantage over competent per-page PQC - and that is expected, because there is nothing to skip.

The win is real precisely where real queries live: selective filters on large tables. It is not a universal speedup, and the methodology isolates exactly what QPQT adds versus what any competent PQC implementation would already do.
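
The speedup column can be recomputed directly from the measured latencies. The snippet below is a small check using only the numbers in the table; the per-row figure at the end is a rough estimate derived from the 1% and 100% end points, not a separately measured value.

```python
# Measured latencies in milliseconds, copied from the table above:
# selectivity % -> (competent per-page, QPQT)
measurements = {
    1: (2150, 78),
    5: (2111, 163),
    10: (2113, 264),
    25: (2103, 557),
    50: (2148, 1055),
    100: (2147, 2098),
}

# Speedup of QPQT over the competent per-page baseline at each selectivity.
speedups = {sel: competent / qpqt for sel, (competent, qpqt) in measurements.items()}

for sel, factor in speedups.items():
    print(f"{sel:>3}% selectivity: {factor:.2f}x")

# Rough marginal cost per additional decrypted row, from the 1% and 100% rows.
total_rows = 1_000_000
extra_rows = total_rows * (100 - 1) // 100           # 990,000 more rows decrypted
per_row_us = (2098 - 78) * 1000 / extra_rows         # ms -> microseconds per row
print(f"~{per_row_us:.2f} microseconds per additional decrypted row")
```

QPQT's latency grows roughly linearly with the number of surviving rows, which is why the advantage shrinks steadily toward parity at 100%.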

Other measured numbers:

Metric                            Value
Write throughput (1M rows)        534K rows/sec (1,871ms)
Structural scan (no crypto)       ~5ms, 188M rows/sec
File size (1M rows)               80MB
Storage vs naive per-row ML-KEM   80MB vs ~1,084MB (92% reduction)
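
The derived figures in this table follow from simple arithmetic on the measured values. The sketch below only re-derives them; it introduces no new measurements.

```python
rows = 1_000_000

# Write throughput: rows divided by elapsed seconds.
write_ms = 1871
write_rows_per_sec = rows / (write_ms / 1000)

# Storage reduction relative to the naive per-row ML-KEM layout.
qpqt_mb, naive_mb = 80, 1084
reduction_pct = (1 - qpqt_mb / naive_mb) * 100

# Scan time implied by the reported 188M rows/sec (the table's 5ms is rounded).
scan_ms_implied = rows / 188e6 * 1000

print(f"write throughput: {write_rows_per_sec / 1000:.0f}K rows/sec")
print(f"storage reduction: {reduction_pct:.1f}%")
print(f"implied scan time: {scan_ms_implied:.2f}ms")
```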

Cryptographic Design

ML-KEM-768 keypair  ->  secret key stored in KMS (file holds only key_id)
                                |
                        Per page (4,096 rows):
                        ML-KEM-768 encapsulate(public_key)
                            |-- kem_ciphertext  ->  CRYPTO MANIFEST
                            +-- shared_secret (32 bytes)
                                        |
                                HKDF-SHA3-256(shared_secret, page_context)
                                        +-- aes_page_key (32 bytes, unique per page)
                                                    |
                                            AES-256-GCM per row
                                            |-- IV (12B, deterministic)
                                            |-- ciphertext (= plaintext length)
                                            +-- auth_tag (16B, tamper detection)
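
The key-derivation step in the diagram can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not QPQT's implementation: the `page_context` layout (a label, the file_uuid, a row group number, and a page index) is hypothetical, and the shared secret is a random stand-in because ML-KEM is not available in the standard library. The HKDF itself follows the RFC 5869 extract-then-expand construction with HMAC-SHA3-256.

```python
import hashlib
import hmac
import os
import struct

HASH = hashlib.sha3_256
HASH_LEN = HASH().digest_size  # 32 bytes


def hkdf_sha3_256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869 extract-then-expand) instantiated with HMAC-SHA3-256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested length too large for HKDF")
    # Extract: an empty salt defaults to HASH_LEN zero bytes, as in RFC 5869.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, HASH).digest()
    # Expand: chain HMAC blocks until enough output is produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


def page_context(file_uuid: bytes, row_group: int, page_index: int) -> bytes:
    """Hypothetical context layout: label | file_uuid | row group | page index."""
    return b"qpqt-page-key" + file_uuid + struct.pack(">II", row_group, page_index)


# Stand-in for the 32-byte ML-KEM-768 shared secret. A real writer would obtain
# this from KEM encapsulation, not from os.urandom.
shared_secret = os.urandom(32)
file_uuid = os.urandom(16)

key_page0 = hkdf_sha3_256(shared_secret, page_context(file_uuid, 0, 0))
key_page1 = hkdf_sha3_256(shared_secret, page_context(file_uuid, 0, 1))
print(len(key_page0), key_page0 != key_page1)  # 32 True
```

Binding the page context into the derivation means that even an identical shared secret would yield a different AES key for a different page; in the design above each page also gets its own encapsulation, so the two protections stack.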

IV construction and GCM nonce safety

QPQT uses deterministic AES-GCM IVs built from the file_uuid, row_index, and column_index. This is safe because nonce uniqueness is guaranteed within every key scope: each 4,096-row page derives its own unique AES-256 key via ML-KEM encapsulation + HKDF-SHA3-256, the IV only needs to be unique under a given key, and within a single page key the (row_index, column_index) tuple is unique by construction. The file_uuid component prevents cross-file collisions in the event a page key is ever reused across files. No nonce is reused under any single key, so the failure mode that breaks GCM does not occur.
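
To make the uniqueness argument concrete, here is a sketch of one way to pack such an IV into 12 bytes. The byte layout (4-byte file tag, 6-byte row index, 2-byte column index) is an assumption chosen for illustration; the actual layout is not specified in this document. The check at the end confirms that every (row, column) pair within one page yields a distinct IV.

```python
import uuid


def make_iv(file_uuid: bytes, row_index: int, column_index: int) -> bytes:
    """Hypothetical 12-byte IV: 4-byte file tag | 6-byte row index | 2-byte column index."""
    if len(file_uuid) != 16:
        raise ValueError("file_uuid must be 16 bytes")
    # int.to_bytes raises OverflowError if an index does not fit its field,
    # so an out-of-range index can never silently wrap into a repeated IV.
    return (
        file_uuid[:4]
        + row_index.to_bytes(6, "big")
        + column_index.to_bytes(2, "big")
    )


file_uuid = uuid.uuid4().bytes
page_rows, pqc_columns = 4096, 2

# Every (row, column) pair under one page key must map to a distinct IV.
ivs = {
    make_iv(file_uuid, row, col)
    for row in range(page_rows)
    for col in range(pqc_columns)
}
print(len(ivs) == page_rows * pqc_columns)  # True
```

Because the mapping from (row_index, column_index) to IV bytes is injective, uniqueness does not depend on randomness or on any state kept by the writer.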

All components are NIST-approved and quantum-safe:

  • ML-KEM-768: FIPS 203 (replaces RSA/ECDH for key establishment)
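  • AES-256-GCM: FIPS 197 (AES) with SP 800-38D (GCM mode); quantum-safe as a symmetric primitive, since Grover's algorithm only halves the effective key strength, leaving 128-bit security
  • HKDF-SHA3-256: SP 800-56C (key derivation), with SHA3-256 specified in FIPS 202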

Why a Separate Format (and not Parquet)?

A reasonable question: Parquet already has Modular Encryption - why not derive its AES key from ML-KEM and get quantum-safe Parquet today?

For encryption alone, you could. Parquet Modular Encryption does per-column AES-GCM and you could wrap the key with ML-KEM. The encryption is not the contribution.

The contribution is row-granular lazy decryption. Parquet does support predicate pushdown and can skip entire encrypted column chunks or row groups via footer statistics - that is real and valuable. What it cannot do is decrypt only the surviving rows within a chunk that the predicate did not eliminate wholesale. Parquet decrypts at chunk granularity, not surviving-row granularity. Closing that specific gap is what requires a format where structural columns are physically separated (so the predicate runs before any decryption) and where a manifest addresses individual rows' page keys.

QPQT is a purpose-built format for organizations that need PQC-protected columnar data with row-granular lazy decryption. Existing tools integrate via the CLI, Python bindings, and Arrow export rather than reading .qpqt natively.

File Format

+-----------------------------------------------------+
| FILE HEADER (48 bytes)                              |
| magic + version + file_uuid + total_rows + offsets  |
+-----------------------------------------------------+
| SCHEMA BLOCK (variable)                             |
+-----------------------------------------------------+
| KEY REFERENCE BLOCK (32 bytes) - key_id, not the key|
+-----------------------------------------------------+
| ROW GROUP 0  (100,000 rows)                         |
|  |-- SECTION 1: Structural columns (unencrypted)    |
|  |   [tightly packed, padded to 4KB boundary]       |
|  +-- SECTION 2: PQC columns (AES-256-GCM per row)   |
|      [starts on 4KB OS page boundary]               |
+-----------------------------------------------------+
| ROW GROUP 1 ... N                                   |
+-----------------------------------------------------+
| FILE FOOTER                                         |
|  |-- Row group offset table                         |
|  |-- CRYPTO MANIFEST (flat array, O(1) lookup)      |
|  +-- FOOTER HEADER (40 bytes) + CRC32               |
+-----------------------------------------------------+
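
The O(1) manifest lookup and the 4KB alignment in this layout both reduce to integer arithmetic. The sketch below assumes that pages do not span row groups (which matches the 250 KEM operations per million rows quoted earlier) and that each manifest entry is a fixed-size record holding only the 1,088-byte ML-KEM-768 ciphertext. Both are assumptions for illustration; the real entry layout may carry more fields.

```python
import math

ROW_GROUP_ROWS = 100_000
PAGE_ROWS = 4_096
OS_PAGE = 4_096
PAGES_PER_ROW_GROUP = math.ceil(ROW_GROUP_ROWS / PAGE_ROWS)  # 25

# Hypothetical fixed-size manifest entry: one ML-KEM-768 ciphertext (1,088 bytes).
MANIFEST_ENTRY_BYTES = 1_088


def align_up(offset: int, boundary: int = OS_PAGE) -> int:
    """Round an offset up to the next multiple of `boundary`."""
    return (offset + boundary - 1) // boundary * boundary


def locate_page(row: int) -> int:
    """Map a global row number to its global page index in constant time."""
    row_group, row_in_group = divmod(row, ROW_GROUP_ROWS)
    return row_group * PAGES_PER_ROW_GROUP + row_in_group // PAGE_ROWS


def manifest_entry_offset(manifest_base: int, row: int) -> int:
    """Byte offset of the manifest entry that holds this row's page key material."""
    return manifest_base + locate_page(row) * MANIFEST_ENTRY_BYTES


total_pages = (1_000_000 // ROW_GROUP_ROWS) * PAGES_PER_ROW_GROUP
print(total_pages)            # 250
print(locate_page(999_999))   # 249
print(align_up(4_097))        # 8192
```

With fixed-size entries the lookup needs no search or index structure, which is what keeps the per-row overhead of lazy decryption small.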

Key Management

./qpqt keygen --out-pub pub.bin --out-sec sec.bin
  • pub.bin - ML-KEM-768 public key (1184 bytes). Safe to share with writers.
  • sec.bin - ML-KEM-768 secret key (2400 bytes). Never share. Never commit.
  • pub.bin.keyid - 16-byte key ID. Pass to --key-id when encrypting.

Environment   Recommended key storage
Local dev     Outside repo, e.g. ~/.qpqt/keys/
AWS           AWS KMS + Secrets Manager
Azure         Azure Key Vault
GCP           Cloud KMS
Databricks    dbutils.secrets
On-premise    HashiCorp Vault or HSM

QPQT stores a key_id reference in the file (in the key reference block), not the key itself, so key rotation never requires rewriting existing data files.
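
Since the key files have fixed sizes, a loader can cheaply reject a swapped or truncated file before it reaches the crypto layer. The helper below is a hypothetical sketch, not part of the qpqt package, and it assumes the files contain raw key bytes with no header, which the sizes listed above suggest.

```python
import tempfile
from pathlib import Path

# Sizes stated in this section: ML-KEM-768 public key, secret key, and key ID.
EXPECTED_SIZES = {"public": 1184, "secret": 2400, "key_id": 16}


def load_key(path, kind):
    """Read a key file and reject it if its size does not match `kind`."""
    data = Path(path).read_bytes()
    expected = EXPECTED_SIZES[kind]
    if len(data) != expected:
        raise ValueError(
            f"{path}: expected {expected} bytes for a {kind} key, got {len(data)}"
        )
    return data


# Demo with a dummy file; real keys come from `qpqt keygen`.
with tempfile.TemporaryDirectory() as tmp:
    dummy_pub = Path(tmp) / "pub.bin"
    dummy_pub.write_bytes(bytes(1184))
    public_key = load_key(dummy_pub, "public")
    try:
        load_key(dummy_pub, "secret")  # wrong kind: 1184 bytes, not 2400
    except ValueError as err:
        print(err)
```

A size check like this catches the common mistake of passing pub.bin where sec.bin is expected, but it does not authenticate the key; that remains the job of the KMS or vault holding it.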

Build

Prerequisites

  • Ubuntu 22.04 or Debian 12
  • CMake 3.16+, OpenSSL 3.x, C++17 compiler with OpenMP

Steps

bash scripts/install_deps.sh        # installs liboqs from source
mkdir build && cd build
cmake .. && make -j$(nproc)
./qpqt_tests

Ecosystem Integration

Tool                      How
CLI                       qpqt encrypt/decrypt/inspect on CSV (Parquet with Arrow build)
Python / pandas           pip install . then import qpqt
DuckDB / Polars / Spark   qpqt_arrow export produces structural columns as Arrow IPC

License

MIT

Author

Rohan Prabhakar

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • qpqt-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.12, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.11, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.10, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.9, manylinux: glibc 2.28+ x86-64
  • qpqt-0.1.0-cp38-cp38-manylinux_2_28_x86_64.whl (4.0 MB) - CPython 3.8, manylinux: glibc 2.28+ x86-64
