
dk-test-suite

DataKrypto FHEnom for AI™ — Automated POC Head-to-Head Test Suite

dk-test-suite is the official validation framework for FHEnom for AI™ deployments. It runs automated, side-by-side tests comparing a plaintext (clear) model against its FHE-encrypted counterpart — proving that encryption introduces zero degradation to inference quality while providing full confidentiality at rest, in transit, and in use.

The suite produces a self-contained HTML report with DataKrypto-branded tables, charts, and per-prompt response comparisons.


Features

  • 30 automated tests across 6 categories (PERF, ACC, SCALE, SEC, SERV, T)
  • Sequential single-port architecture — clear and encrypted models share port 8000; no 2× VRAM requirement
  • Local or remote execution — run directly on the GPU server or from a separate machine via SSH
  • HTML report — DataKrypto-styled comparison with pass/fail badges, expandable details, metric explainers, and Chart.js visualizations
  • Deterministic prompts — seeded prompt generation for reproducible runs
  • Streaming inference — TTFT measurement through SSE streaming endpoints
  • lm-eval-harness integration — MMLU, HellaSwag, GSM8K, HumanEval benchmarks
  • BERTScore semantic similarity — optional deep comparison via bert-score
  • 3-way encryption proof — SERV-4/5/6 prove encryption is real (TEE coherent, bypass garbled, fidelity maintained)

Installation

From PyPI (recommended):

pip install dk-test-suite

With optional accuracy dependencies:

pip install "dk-test-suite[accuracy]"

This adds lm-eval, bert-score, and deepeval for ACC-2 (lm-eval benchmarks) and ACC-3 (BERTScore semantic similarity). Without these, the tool falls back to word-level similarity measures and marks lm-eval tests as skipped.

From source:

git clone https://github.com/datakrypto/dk-test-suite.git
cd dk-test-suite
pip install -e ".[all]"

Quick Start

1. Create a Configuration File

Create config.yaml with your deployment details:

# Run directly on the GPU machine (most common)
local_mode: true

# TEE machine
tee_host: "10.0.0.2"
tee_admin_token: "<admin-token>"
tee_user_token: "<user-token>"

# Model identifiers
model_name: "Llama-3.2-1B-Instruct"
encrypted_model_name: "Llama-3.2-1B-Instruct-encrypted"
encrypted_model_id: "<UUID from fhenomai model list>"

# Model paths on the GPU machine
model_path_clear: "/home/user/models/Llama-3.2-1B-Instruct"
model_path_encrypted: "/home/user/models/Llama-3.2-1B-Instruct-encrypted"

# FHEnom venv and home directory
fhenomai_venv: "/home/user/venv"
gpu_home: "/home/user"

2. Run the Full Suite

dk-test run -c config.yaml

3. View the Report

ls results/poc_report_*.html

Open the HTML file in a browser to see the full comparison report.


CLI Reference

The dk-test command provides four subcommands:

dk-test run

Run the full POC test suite or a subset of categories.

dk-test run [OPTIONS]
Option Description
-c, --config PATH Path to a YAML configuration file
--gpu-host HOST Override GPU machine IP
--tee-host HOST Override TEE machine IP
--model NAME Override model name
--output DIR Output directory for results (default: ./results)
-t, --categories CAT Test categories to run (repeatable). Choices: performance, accuracy, scalability, security, serving, training. Default: all
--num-prompts N Number of benchmark prompts (default: from config)
--skip-clear-vllm Assume vLLM is already running with the clear model
--skip-encrypted-vllm Assume vLLM is already running with the encrypted model and TEE is serving
--local Run on this machine (no SSH to GPU)
--gpu-ssh-key PATH Override SSH private key for the GPU
--tee-ssh-key PATH Override SSH private key for the TEE
-v, --verbose Enable debug logging

Examples:

# Full suite with runner-managed vLLM lifecycle
dk-test run -c config.yaml

# Quick smoke test: serving + security only (5-10 min)
dk-test run -c config.yaml -t serving -t security --skip-clear-vllm --skip-encrypted-vllm

# Performance and accuracy only, 50 prompts, verbose
dk-test run -c config.yaml -t performance -t accuracy --num-prompts 50 -v

# Use pre-running vLLM instances
dk-test run -c config.yaml --skip-clear-vllm --skip-encrypted-vllm

dk-test scale4-sustained

Run the full 24-hour sustained operation test (SCALE-4).

dk-test scale4-sustained [OPTIONS]
Option Description
-c, --config PATH Path to a YAML configuration file
--hours N Duration in hours (default: 24)
--local Run locally (no SSH to GPU)
-v, --verbose Enable debug logging

This command runs continuous inference against both the clear and encrypted models for the specified duration, measuring throughput degradation over time. It is designed for overnight or multi-day soak testing.

dk-test scale4-sustained -c config.yaml --hours 24
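The degradation metric this test reports can be illustrated with a small helper (a sketch with a hypothetical function name; the suite's actual aggregation logic may differ):

```python
def throughput_degradation(window_tps: list) -> float:
    """Percent throughput drop from the first measurement window to the
    last. window_tps holds tokens/sec averaged per window over the run."""
    if len(window_tps) < 2 or window_tps[0] == 0:
        return 0.0
    return (window_tps[0] - window_tps[-1]) / window_tps[0] * 100.0
```

A flat curve (close to 0%) over the full duration indicates the encrypted path sustains load as well as the clear one.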

dk-test report

Regenerate the HTML report from existing JSON result files.

dk-test report RESULTS_DIR [OPTIONS]
Argument / Option Description
RESULTS_DIR Directory containing *_results.json files from a previous run
-c, --config PATH Path to a YAML configuration file

Useful when you want to regenerate the report with updated formatting or after manually editing result files.

dk-test report ./results -c config.yaml

dk-test info

Display the current configuration and test connectivity to GPU and TEE machines.

dk-test info [OPTIONS]
Option Description
-c, --config PATH Path to a YAML configuration file
--local Show local configuration (no SSH connectivity check)

dk-test info -c config.yaml
dk-test info --local

dk-test --version

Print the installed version.

dk-test --version

Test Categories

Performance (PERF-1 — PERF-5)

Measures the computational overhead introduced by FHE encryption. All tests are measurement-only (always PASS) and document the delta between the clear and encrypted models.

ID Metric Method
PERF-1 Time to First Token (TTFT) Streaming SSE, first data: chunk timestamp
PERF-2 Throughput (tokens/sec) Total tokens ÷ wall time from streaming responses
PERF-3 End-to-End Response Time Wall-clock time for full non-streaming completion
PERF-4 Model Footprint (disk) du -sb on clear vs encrypted model directories
PERF-5 GPU Memory Utilization nvidia-smi VRAM snapshot during inference

Includes a warmup request before measurement to avoid CUDA JIT and KV-cache allocation artifacts.
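The PERF-1 measurement pattern can be sketched as follows; the SSE wire format shown is the standard OpenAI-compatible one that vLLM streams, and `measure_ttft` is an illustrative helper, not the suite's internal API:

```python
import time
from typing import Iterator, Optional

def measure_ttft(stream_lines: Iterator[str]) -> Optional[float]:
    """Seconds from iteration start to the first SSE 'data:' chunk that
    carries content, i.e. the Time to First Token. stream_lines is the
    line iterator of a streaming /v1/completions response."""
    start = time.perf_counter()
    for line in stream_lines:
        payload = line[5:].strip() if line.startswith("data:") else None
        if payload and payload != "[DONE]":
            return time.perf_counter() - start
    return None  # stream ended without a content chunk
```

In practice the iterator would come from a streaming HTTP client (httpx, one of the suite's core dependencies, exposes `iter_lines()` on a streamed response).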

Accuracy (ACC-1 — ACC-5)

Validates that FHE encryption introduces no degradation to model output quality.

ID Metric Method
ACC-1 Deterministic Equivalence Token-level LCS similarity using the model's tokenizer. Falls back to word-level if tokenizer is unavailable. Threshold: configurable (default ≥ 0.20)
ACC-2 Functional Equivalence lm-eval-harness benchmarks (MMLU, HellaSwag, GSM8K, HumanEval for clear; TriviaQA for TEE HTTPS endpoint). Requires pip install dk-test-suite[accuracy]
ACC-3 Semantic Consistency BERTScore F1 between clear and encrypted outputs. Falls back to word-level SequenceMatcher when bert-score is not installed
ACC-4 Response Length Consistency Two-sample t-test on response lengths. Fails if p-value < 0.01 (statistically significant length difference)
ACC-5 Perplexity / Log-Likelihood Average perplexity from logprobs. TEE does not support logprobs — marked as N/A (known limitation)
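The similarity measures behind ACC-1 and the ACC-3 fallback are standard algorithms; a minimal sketch (these helpers are illustrative, not the suite's internals):

```python
from difflib import SequenceMatcher

def lcs_similarity(a: list, b: list) -> float:
    """ACC-1-style metric: longest common subsequence of two token lists,
    normalized by the longer list's length, giving a score in [0, 1]."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 1.0 if m == n else 0.0
    prev = [0] * (n + 1)  # rolling-row LCS dynamic program
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            cur[j] = prev[j - 1] + 1 if a[i - 1] == b[j - 1] else max(prev[j], cur[j - 1])
        prev = cur
    return prev[n] / max(m, n)

def word_similarity(a_text: str, b_text: str) -> float:
    """ACC-3 fallback when bert-score is absent: word-level ratio."""
    return SequenceMatcher(None, a_text.split(), b_text.split()).ratio()
```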

Scalability (SCALE-1 — SCALE-4)

Assesses whether FHE encryption introduces constraints on scaling behavior.

ID Metric Method
SCALE-1 Concurrent User Load Parallel requests at 1, 5, 10, 20, 50 concurrent users via ThreadPoolExecutor
SCALE-2 Context Length Scaling Inference at 512, 1024, 2048, 4096, 8192 token inputs (3 iterations averaged)
SCALE-3 Batch Processing Sequential throughput at batch sizes 1, 4, 8, 16, 32
SCALE-4 Sustained Operation Abbreviated: 3-minute continuous run. Full: 24h via dk-test scale4-sustained
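SCALE-1's fan-out is a plain thread-pool map; a sketch with an injectable request function (hypothetical helper names):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

def run_concurrency_level(send_request: Callable[[str], str],
                          n_users: int,
                          prompts: List[str]) -> Tuple[float, List[str]]:
    """Fire n_users requests in parallel and return (wall_seconds, responses).
    send_request performs one blocking inference call for one prompt."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        responses = list(pool.map(send_request, prompts[:n_users]))
    return time.perf_counter() - start, responses
```

Running this at each level in `concurrent_levels` against the clear and encrypted endpoints and comparing wall times gives the SCALE-1 comparison.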

Security (SEC-1 — SEC-6)

Automated adversarial testing of the encrypted deployment. These tests have hard pass/fail criteria.

ID Test Pass Condition
SEC-1 Encryption at Rest No plaintext weight patterns in safetensors data sections; binary entropy ≥ 7.5 bits/byte
SEC-2 Encrypted Execution No plaintext weight patterns in process memory maps (/proc/PID/maps)
SEC-3 Secure Transport TEE endpoint uses HTTPS; tcpdump captures show no plaintext model data
SEC-4 Model Binding Encrypted model cannot be loaded with transformers.AutoModelForCausalLM outside FHEnom, or produces non-meaningful output
SEC-5 Key Isolation No FHEnom key material in files, environment variables, or Docker configuration on the GPU host
SEC-6 Logging Safety No plaintext weight data or sensitive information in vLLM container logs or system journals

Serving (SERV-1 — SERV-7)

End-to-end validation of the FHE-encrypted serving pipeline. These tests have hard pass/fail criteria and implement a 3-way encryption proof.

ID Test Pass Condition
SERV-1 Encrypted Model File Integrity config.json, tokenizer files, ≥1 .safetensors file, total size ≥ 1.5 GB
SERV-2 vLLM Server Health HTTP 200 on /health endpoint
SERV-3 TEE Serving Status Encrypted model visible via admin API, user /v1/models, or inference probe
SERV-4 TEE Inference Coherence TEE output has space ratio ≥ 0.08 (coherent English text)
SERV-5 Encryption Reality Proof Direct bypass of TEE produces garbled output (space ratio < 0.04)
SERV-6 Output Fidelity SequenceMatcher ratio ≥ 0.70 between TEE and clear model for all probe prompts
SERV-7 FHE Overhead Encryption overhead < 10ms, decryption overhead < 5ms per request

3-Way Proof Architecture:

[Path 1 — TEE]      User → TEE:9999 → Encrypt → vLLM:8000 → TEE → Decrypt → User   ✅ Coherent
[Path 2 — Clear]    User → vLLM:8000 (clear weights) → User                          ✅ Coherent
[Path 3 — Bypass]   User → vLLM:8000 (encrypted weights, no TEE) → User              ❌ Garbled

SERV-4 proves Path 1 works. SERV-5 proves Path 3 fails. SERV-6 proves Path 1 ≈ Path 2. Together, they demonstrate that encryption is real and the TEE correctly handles encryption/decryption.
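The space-ratio heuristic behind SERV-4/5 is easy to reproduce (a sketch; the thresholds are the ones documented above, the function names are illustrative):

```python
def space_ratio(text: str) -> float:
    """Fraction of characters that are spaces; coherent English prose
    sits well above ciphertext-like garble."""
    return text.count(" ") / len(text) if text else 0.0

def classify(text: str) -> str:
    """SERV-4/5-style verdict: >= 0.08 coherent, < 0.04 garbled."""
    r = space_ratio(text)
    if r >= 0.08:
        return "coherent"
    if r < 0.04:
        return "garbled"
    return "inconclusive"
```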

Training (T-1 — T-3)

Validates secure fine-tuning of encrypted models (Extended POC). These tests require training.enabled: true and a training.dataset_path in the configuration. Skipped by default.

ID Test Description
T-1 Secure Checkpoints Verifies encrypted fine-tuning checkpoints exist and have high entropy
T-2 Convergence Compares loss curves between clear and encrypted fine-tuning
T-3 Inference Quality Tests inference on fine-tuned checkpoints

Configuration Reference

Configuration is loaded in the following order of precedence (highest first):

  1. CLI flags (--gpu-host, --tee-host, --model, --num-prompts, etc.)
  2. Custom YAML file (-c my_config.yaml)
  3. Environment variables (DK_GPU_HOST, DK_TEE_HOST, DK_MODEL_NAME, DK_OUTPUT_DIR)
  4. Built-in defaults (config/default.yaml)
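The precedence rule amounts to a layered dictionary merge; a minimal sketch (hypothetical helper, not the tool's actual loader):

```python
def resolve_config(cli: dict, yaml_cfg: dict, env: dict, defaults: dict) -> dict:
    """Merge config layers so that CLI > YAML > env vars > defaults.
    None values in a layer are treated as 'not set'."""
    merged = dict(defaults)
    for layer in (env, yaml_cfg, cli):  # lowest to highest precedence
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```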

Required Parameters

Parameter Description
tee_host IP address of the TEE machine
tee_admin_token Admin authentication token for the TEE (provided by DataKrypto)
tee_user_token User/inference authentication token for the TEE (provided by DataKrypto)
model_name Clear model name (e.g., Llama-3.2-1B-Instruct)
model_path_clear Absolute path to the plaintext model weights on the GPU machine
model_path_encrypted Absolute path to the encrypted model weights on the GPU machine
encrypted_model_id Model ID assigned by FHEnom (from fhenomai model list)
encrypted_model_name Served model name for the encrypted vLLM instance
fhenomai_venv Path to the Python venv containing the fhenomai CLI
gpu_home HOME directory for fhenomai commands on the GPU machine

Remote Mode Parameters

Required only when local_mode: false (running from a separate machine):

Parameter Description
gpu_host IP address of the GPU machine
gpu_ssh_user SSH username for the GPU machine
gpu_ssh_key Path to the SSH private key for the GPU machine

Optional Parameters

Parameter Default Description
local_mode true Run GPU commands locally (no SSH). Set false for remote execution
tee_admin_port 9099 TEE admin API port
tee_user_port 9999 TEE user/inference API port
tee_ssh_key (empty) SSH key for TEE host (enables getpwuid fix)
vllm_image vllm/vllm-openai:latest Docker image for vLLM
vllm_port 8000 Port shared sequentially by clear and encrypted instances
vllm_gpu_memory_utilization 0.5 GPU memory fraction for vLLM
vllm_max_model_len 8192 Maximum sequence length
vllm_tensor_parallel_size 1 Number of GPUs for tensor parallelism
vllm_startup_timeout 300 Seconds to wait for vLLM to become ready
vllm_request_timeout 120 Timeout per inference request
temperature 0 Inference temperature (0 = deterministic)
num_prompts 10 Number of benchmark prompts
seed 42 Random seed for prompt generation
fhe_enc_overhead_threshold_ms 10.0 SERV-7 encryption overhead threshold
fhe_dec_overhead_threshold_ms 5.0 SERV-7 decryption overhead threshold
output_dir ./results Output directory for JSON results and HTML report
customer_name (empty) Customer name shown in the report header
poc_id (empty) POC identifier shown in the report header
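The `seed` and `num_prompts` parameters drive deterministic prompt generation; the pattern is a seeded RNG, sketched here with hypothetical topics (the suite's real prompt pool differs):

```python
import random

TOPICS = ["photosynthesis", "TCP handshakes", "binary search",
          "the water cycle", "public-key cryptography", "plate tectonics"]

def generate_prompts(seed: int = 42, num_prompts: int = 10) -> list:
    """Same seed -> same prompt list, so the clear and encrypted phases
    (and any re-run) see identical inputs."""
    rng = random.Random(seed)
    return [f"Explain {rng.choice(TOPICS)} in two sentences."
            for _ in range(num_prompts)]
```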

Scalability Parameters

scalability:
  concurrent_levels: [1, 5, 10, 20, 50]
  context_lengths: [512, 1024, 2048, 4096, 8192]
  batch_sizes: [1, 4, 8, 16, 32]
  sustained_duration_hours: 24
  sustained_abbreviated_minutes: 3

Accuracy Benchmarks

accuracy_benchmarks:
  - "mmlu"
  - "hellaswag"
  - "gsm8k"
  - "humaneval"

Execution Flow

dk-test run
  │
  ├── Phase 1: Clear Model
  │     ├── Start vLLM container (clear model, port 8000)
  │     ├── Wait for /v1/models readiness
  │     ├── Run PERF / ACC / SCALE clear-side tests
  │     ├── Collect SERV clear-model reference probes
  │     ├── Save intermediate results
  │     └── Stop vLLM container
  │
  ├── Phase 2: Encrypted Model
  │     ├── Start vLLM container (encrypted model, port 8000)
  │     ├── Wait for /v1/models readiness
  │     ├── Initialize fhenomai CLI config
  │     ├── Start TEE serving (fhenomai serve start)
  │     ├── Verify model is ONLINE (fhenomai model list --show-status)
  │     ├── Run PERF / ACC / SCALE encrypted-side tests (via TEE)
  │     └── Save intermediate results
  │
  ├── Phase 3: Security Validation
  │     └── Run SEC-1 through SEC-6
  │
  ├── Phase 4: Serving Workflow
  │     └── Run SERV-1 through SERV-7 (encrypted probes + 3-way comparison)
  │
  ├── Phase 5: Training (Extended POC)
  │     └── Run T-1 through T-3 (if enabled)
  │
  ├── Teardown
  │     ├── Stop vLLM container
  │     ├── Stop TEE serving (fhenomai serve stop)
  │     └── Verify model is OFFLINE
  │
  └── Report Generation
        ├── Side-by-side comparison tables
        ├── Pass/fail badges per test
        ├── Per-prompt response viewer
        └── Chart.js performance visualizations

Environment Variables

Variable Maps To
DK_GPU_HOST gpu_host
DK_TEE_HOST tee_host
DK_MODEL_NAME model_name
DK_OUTPUT_DIR output_dir

Dependencies

Core (installed automatically):

click · pyyaml · httpx · aiohttp · numpy · scipy · jinja2 · rich · paramiko · psutil · tenacity

Optional — accuracy (pip install dk-test-suite[accuracy]):

lm-eval · bert-score · deepeval

Optional — development (pip install dk-test-suite[dev]):

pytest · black · flake8 · mypy · isort

Runtime requirements:

  • Python ≥ 3.10
  • Docker (for vLLM container management)
  • NVIDIA GPU with CUDA drivers (for inference)
  • Network access to the TEE machine (HTTPS)
  • fhenomai CLI installed in a Python venv on the GPU machine

Output

Each run produces:

File Description
results/poc_report_YYYYMMDD_HHMMSS.html Self-contained HTML report
results/performance_results.json PERF-1 through PERF-5 raw data
results/accuracy_results.json ACC-1 through ACC-5 raw data
results/scalability_results.json SCALE-1 through SCALE-4 raw data
results/security_results.json SEC-1 through SEC-6 raw data
results/serving_results.json SERV-1 through SERV-7 raw data
results/training_results.json T-1 through T-3 raw data
results/prompt_set.json Deterministic prompt set used for the run

License

MIT License — Copyright © 2025 DataKrypto.

