dk-test-suite
DataKrypto FHEnom for AI™ — Automated POC Head-to-Head Test Suite for encrypted vs. plaintext model validation
dk-test-suite is the official validation framework for FHEnom for AI™ deployments.
It runs automated, side-by-side tests comparing a plaintext (clear) model against its
FHE-encrypted counterpart — proving that encryption introduces zero degradation to
inference quality while providing full confidentiality at rest, in transit, and in use.
The suite produces a self-contained HTML report with DataKrypto-branded tables, charts, and per-prompt response comparisons.
Features
- 30 automated tests across 6 categories (PERF, ACC, SCALE, SEC, SERV, T)
- Sequential single-port architecture — clear and encrypted models share port 8000; no 2× VRAM requirement
- Local or remote execution — run directly on the GPU server or from a separate machine via SSH
- HTML report — DataKrypto-styled comparison with pass/fail badges, expandable details, metric explainers, and Chart.js visualizations
- Deterministic prompts — seeded prompt generation for reproducible runs
- Streaming inference — TTFT measurement through SSE streaming endpoints
- lm-eval-harness integration — MMLU, HellaSwag, GSM8K, HumanEval benchmarks
- BERTScore semantic similarity — optional deep comparison via bert-score
- 3-way encryption proof — SERV-4/5/6 prove encryption is real (TEE coherent, bypass garbled, fidelity maintained)
Installation
From PyPI (recommended):
pip install dk-test-suite
With optional accuracy dependencies:
pip install "dk-test-suite[accuracy]"
This adds lm-eval, bert-score, and deepeval for ACC-2 (lm-eval benchmarks)
and ACC-3 (BERTScore semantic similarity). Without these, the tool falls back to
word-level similarity measures and marks lm-eval tests as skipped.
From source:
git clone https://github.com/datakrypto/dk-test-suite.git
cd dk-test-suite
pip install -e ".[all]"
Quick Start
1. Create a Configuration File
Create config.yaml with your deployment details:
# Run directly on the GPU machine (most common)
local_mode: true
# TEE machine
tee_host: "10.0.0.2"
tee_admin_token: "<admin-token>"
tee_user_token: "<user-token>"
# Model identifiers
model_name: "Llama-3.2-1B-Instruct"
encrypted_model_name: "Llama-3.2-1B-Instruct-encrypted"
encrypted_model_id: "<UUID from fhenomai model list>"
# Model paths on the GPU machine
model_path_clear: "/home/user/models/Llama-3.2-1B-Instruct"
model_path_encrypted: "/home/user/models/Llama-3.2-1B-Instruct-encrypted"
# FHEnom venv and home directory
fhenomai_venv: "/home/user/venv"
gpu_home: "/home/user"
2. Run the Full Suite
dk-test run -c config.yaml
3. View the Report
ls results/poc_report_*.html
Open the HTML file in a browser to see the full comparison report.
CLI Reference
The dk-test command provides four subcommands:
dk-test run
Run the full POC test suite or a subset of categories.
dk-test run [OPTIONS]
| Option | Description |
|---|---|
| -c, --config PATH | Path to a YAML configuration file |
| --gpu-host HOST | Override GPU machine IP |
| --tee-host HOST | Override TEE machine IP |
| --model NAME | Override model name |
| --output DIR | Output directory for results (default: ./results) |
| -t, --categories CAT | Test categories to run (repeatable). Choices: performance, accuracy, scalability, security, serving, training. Default: all |
| --num-prompts N | Number of benchmark prompts (default: from config) |
| --skip-clear-vllm | Assume vLLM is already running with the clear model |
| --skip-encrypted-vllm | Assume vLLM is already running with the encrypted model and TEE is serving |
| --local | Run on this machine (no SSH to GPU) |
| --gpu-ssh-key PATH | Override SSH private key for the GPU |
| --tee-ssh-key PATH | Override SSH private key for the TEE |
| -v, --verbose | Enable debug logging |
Examples:
# Full suite with runner-managed vLLM lifecycle
dk-test run -c config.yaml
# Quick smoke test: serving + security only (5-10 min)
dk-test run -c config.yaml -t serving -t security --skip-clear-vllm --skip-encrypted-vllm
# Performance and accuracy only, 50 prompts, verbose
dk-test run -c config.yaml -t performance -t accuracy --num-prompts 50 -v
# Use pre-running vLLM instances
dk-test run -c config.yaml --skip-clear-vllm --skip-encrypted-vllm
dk-test scale4-sustained
Run the full 24-hour sustained operation test (SCALE-4).
dk-test scale4-sustained [OPTIONS]
| Option | Description |
|---|---|
| -c, --config PATH | Path to a YAML configuration file |
| --hours N | Duration in hours (default: 24) |
| --local | Run locally (no SSH to GPU) |
| -v, --verbose | Enable debug logging |
This command runs continuous inference against both the clear and encrypted models for the specified duration, measuring throughput degradation over time. It is designed for overnight or multi-day soak testing.
dk-test scale4-sustained -c config.yaml --hours 24
dk-test report
Regenerate the HTML report from existing JSON result files.
dk-test report RESULTS_DIR [OPTIONS]
| Argument / Option | Description |
|---|---|
| RESULTS_DIR | Directory containing *_results.json files from a previous run |
| -c, --config PATH | Path to a YAML configuration file |
Useful when you want to regenerate the report with updated formatting or after manually editing result files.
dk-test report ./results -c config.yaml
dk-test info
Display the current configuration and test connectivity to GPU and TEE machines.
dk-test info [OPTIONS]
| Option | Description |
|---|---|
| -c, --config PATH | Path to a YAML configuration file |
| --local | Show local configuration (no SSH connectivity check) |
dk-test info -c config.yaml
dk-test info -c config.yaml --local
dk-test info --local # shows defaults + lists missing required values
dk-test --version
Print the installed version.
dk-test --version
Test Categories
Performance (PERF-1 — PERF-5)
Measures the computational overhead introduced by FHE encryption. All tests are measurement-only (always PASS) and document the delta between the clear and encrypted models.
| ID | Metric | Method |
|---|---|---|
| PERF-1 | Time to First Token (TTFT) | Streaming SSE, first data: chunk timestamp |
| PERF-2 | Throughput (tokens/sec) | Total tokens ÷ wall time from streaming responses |
| PERF-3 | End-to-End Response Time | Wall-clock time for full non-streaming completion |
| PERF-4 | Model Footprint (disk) | du -sb on clear vs encrypted model directories |
| PERF-5 | GPU Memory Utilization | nvidia-smi VRAM snapshot during inference |
Includes a warmup request before measurement to avoid CUDA JIT and KV-cache allocation artifacts.
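PERF-1 depends on reading the streaming response incrementally and taking the timestamp of the first token-bearing chunk. As a minimal sketch (not the suite's actual internals; the helper name and the event-list shape are illustrative), TTFT can be computed from a list of timestamped SSE lines like this:

```python
def ttft_from_sse(events, start_time):
    """Time to first token: timestamp of the first SSE 'data:' chunk
    that carries token content, minus the request start time.

    `events` is a list of (timestamp, raw_line) pairs as read from a
    streaming response. Keep-alive comments and the final [DONE]
    sentinel do not count as tokens.
    """
    for ts, line in events:
        if line.startswith("data:") and line.strip() != "data: [DONE]":
            return ts - start_time
    return None

# Synthetic stream: a keep-alive at t=0.10, first token chunk at t=0.35.
events = [
    (0.10, ": ping"),
    (0.35, 'data: {"choices": [{"text": "Hello"}]}'),
    (0.90, "data: [DONE]"),
]
print(ttft_from_sse(events, start_time=0.0))  # → 0.35
```

In a real run the `(timestamp, line)` pairs would come from iterating over the streaming HTTP response as lines arrive, which is why a non-streaming request cannot measure TTFT at all.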
Accuracy (ACC-1 — ACC-5)
Validates that FHE encryption introduces no degradation to model output quality.
| ID | Metric | Method |
|---|---|---|
| ACC-1 | Deterministic Equivalence | Token-level LCS similarity using the model's tokenizer. Falls back to word-level if tokenizer is unavailable. Threshold: configurable (default ≥ 0.20) |
| ACC-2 | Functional Equivalence | lm-eval-harness benchmarks (MMLU, HellaSwag, GSM8K, HumanEval for clear; TriviaQA for TEE HTTPS endpoint). Requires pip install dk-test-suite[accuracy] |
| ACC-3 | Semantic Consistency | BERTScore F1 between clear and encrypted outputs. Falls back to word-level SequenceMatcher when bert-score is not installed |
| ACC-4 | Response Length Consistency | Two-sample t-test on response lengths. Fails if p-value < 0.01 (statistically significant length difference) |
| ACC-5 | Perplexity / Log-Likelihood | Average perplexity from logprobs. TEE does not support logprobs — marked as N/A (known limitation) |
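The ACC-4 criterion can be sketched in a few lines, assuming `scipy.stats.ttest_ind` as the two-sample test (the function name and sample data below are illustrative, not the suite's code):

```python
from scipy import stats

def length_consistency(clear_lengths, encrypted_lengths, alpha=0.01):
    """ACC-4-style check (illustrative): two-sample t-test on response
    lengths. A p-value below `alpha` means a statistically significant
    length difference, which fails the test."""
    t_stat, p_value = stats.ttest_ind(clear_lengths, encrypted_lengths)
    return bool(p_value >= alpha), p_value

clear = [118, 102, 131, 95, 124, 110, 99, 127]   # token counts, clear model
same  = [120, 104, 129, 97, 122, 112, 101, 125]  # near-identical distribution
short = [40, 35, 52, 31, 44, 38, 29, 47]         # systematically shorter

ok, p_ok = length_consistency(clear, same)     # passes: p well above 0.01
bad, p_bad = length_consistency(clear, short)  # fails: p far below 0.01
```

A low threshold like 0.01 is deliberately conservative: it only flags length differences that are very unlikely to be sampling noise.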
Scalability (SCALE-1 — SCALE-4)
Assesses whether FHE encryption introduces constraints on scaling behavior.
| ID | Metric | Method |
|---|---|---|
| SCALE-1 | Concurrent User Load | Parallel requests at 1, 5, 10, 20, 50 concurrent users via ThreadPoolExecutor |
| SCALE-2 | Context Length Scaling | Inference at 512, 1024, 2048, 4096, 8192 token inputs (3 iterations averaged) |
| SCALE-3 | Batch Processing | Sequential throughput at batch sizes 1, 4, 8, 16, 32 |
| SCALE-4 | Sustained Operation | Abbreviated: 3-minute continuous run. Full: 24h via dk-test scale4-sustained |
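The SCALE-1 load pattern can be sketched with the same primitive the table names, `ThreadPoolExecutor`. This is a hedged illustration (the helper names and the stand-in request function are assumptions, not the suite's internals):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent_level(send_request, num_users):
    """SCALE-1-style load step (illustrative): fire `num_users`
    requests in parallel and report total wall time plus the
    per-request latencies."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        latencies = list(pool.map(lambda _: send_request(), range(num_users)))
    wall = time.perf_counter() - start
    return wall, latencies

# Stand-in for a real inference call: sleeps 50 ms, returns its latency.
def fake_request():
    t0 = time.perf_counter()
    time.sleep(0.05)
    return time.perf_counter() - t0

for level in (1, 5, 10):
    wall, lats = run_concurrent_level(fake_request, level)
    # With true parallelism, wall time stays near one request's latency
    # even as the concurrency level grows.
```

Comparing how `wall` grows with `level` on the clear endpoint versus the TEE endpoint is what reveals whether encryption constrains concurrency.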
Security (SEC-1 — SEC-6)
Automated adversarial testing of the encrypted deployment. These tests have hard pass/fail criteria.
| ID | Test | Pass Condition |
|---|---|---|
| SEC-1 | Encryption at Rest | No plaintext weight patterns in safetensors data sections; binary entropy ≥ 7.5 bits/byte |
| SEC-2 | Encrypted Execution | No plaintext weight patterns in process memory maps (/proc/PID/maps) |
| SEC-3 | Secure Transport | TEE endpoint uses HTTPS; tcpdump captures show no plaintext model data |
| SEC-4 | Model Binding | Encrypted model cannot be loaded with transformers.AutoModelForCausalLM outside FHEnom, or produces non-meaningful output |
| SEC-5 | Key Isolation | No FHEnom key material in files, environment variables, or Docker configuration on the GPU host |
| SEC-6 | Logging Safety | No plaintext weight data or sensitive information in vLLM container logs or system journals |
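The entropy heuristic behind SEC-1 is straightforward: well-encrypted bytes look nearly uniform, so their Shannon entropy approaches the 8 bits/byte maximum, while structured plaintext weights score lower. A minimal sketch of the measurement (illustrative, not the suite's implementation):

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0).
    SEC-1-style heuristic: encrypted data should score near 8;
    the suite's pass threshold is 7.5."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = bytes(range(256)) * 16   # every byte value equally frequent
constant = b"\x00" * 4096          # a single repeated byte value

print(entropy_bits_per_byte(uniform))   # → 8.0
print(entropy_bits_per_byte(constant))  # → 0.0 (far below the 7.5 cutoff)
```

Real safetensors files mix low-entropy headers with data sections, which is why the table specifies checking the data sections rather than the whole file.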
Serving (SERV-1 — SERV-7)
End-to-end validation of the FHE-encrypted serving pipeline. These tests have hard pass/fail criteria and implement a 3-way encryption proof.
| ID | Test | Pass Condition |
|---|---|---|
| SERV-1 | Encrypted Model File Integrity | config.json, tokenizer files, ≥1 .safetensors file, total size ≥ 1.5 GB |
| SERV-2 | vLLM Server Health | HTTP 200 on /health endpoint |
| SERV-3 | TEE Serving Status | Encrypted model visible via admin API, user /v1/models, or inference probe |
| SERV-4 | TEE Inference Coherence | TEE output has space ratio ≥ 0.08 (coherent English text) |
| SERV-5 | Encryption Reality Proof | Direct bypass of TEE produces garbled output (space ratio < 0.04) |
| SERV-6 | Output Fidelity | SequenceMatcher ratio ≥ 0.70 between TEE and clear model for all probe prompts |
| SERV-7 | FHE Overhead | Encryption overhead < 10ms, decryption overhead < 5ms per request |
3-Way Proof Architecture:
[Path 1 — TEE] User → TEE:9999 → Encrypt → vLLM:8000 → TEE → Decrypt → User ✅ Coherent
[Path 2 — Clear] User → vLLM:8000 (clear weights) → User ✅ Coherent
[Path 3 — Bypass] User → vLLM:8000 (encrypted weights, no TEE) → User ❌ Garbled
SERV-4 proves Path 1 works. SERV-5 proves Path 3 fails. SERV-6 proves Path 1 ≈ Path 2. Together, they demonstrate that encryption is real and the TEE correctly handles encryption/decryption.
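The space-ratio heuristic used by SERV-4 and SERV-5 can be sketched in a few lines (the function and sample strings are illustrative; only the 0.08 and 0.04 thresholds come from the tests above):

```python
def space_ratio(text: str) -> float:
    """Fraction of characters that are spaces: a cheap coherence proxy.
    Normal English prose sits well above 0.08; garbled byte-soup
    output from decrypting-free inference has almost none."""
    return text.count(" ") / max(len(text), 1)

def classify(output: str) -> str:
    # Thresholds mirror SERV-4 (>= 0.08 coherent) and SERV-5 (< 0.04 garbled).
    r = space_ratio(output)
    if r >= 0.08:
        return "coherent"
    if r < 0.04:
        return "garbled"
    return "inconclusive"

tee_out = "The capital of France is Paris, a major European city."
bypass_out = "x7Qf9LmZk2VwP0yT3bN8cRd1eHs6gJa5uI4oK"  # no spaces at all

print(classify(tee_out))     # → coherent
print(classify(bypass_out))  # → garbled
```

The gap between the two thresholds leaves a deliberate "inconclusive" band, so borderline outputs do not flip a hard pass/fail verdict on noise.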
Training (T-1 — T-3)
Validates secure fine-tuning of encrypted models (Extended POC). These tests
require training.enabled: true and a training.dataset_path in the configuration.
Skipped by default.
| ID | Test | Description |
|---|---|---|
| T-1 | Secure Checkpoints | Verifies encrypted fine-tuning checkpoints exist and have high entropy |
| T-2 | Convergence | Compares loss curves between clear and encrypted fine-tuning |
| T-3 | Inference Quality | Tests inference on fine-tuned checkpoints |
Configuration Reference
Configuration is loaded in the following order of precedence (highest first):
1. CLI flags (--gpu-host, --tee-host, --model, --num-prompts, etc.)
2. Custom YAML file (-c my_config.yaml)
3. Environment variables (DK_GPU_HOST, DK_TEE_HOST, DK_MODEL_NAME, DK_OUTPUT_DIR)
4. Built-in defaults (config/default.yaml)
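This kind of layered precedence is naturally modeled with `collections.ChainMap`, where the first mapping wins. The sketch below is illustrative (the layer contents are invented examples, not the suite's internals):

```python
from collections import ChainMap

defaults = {"vllm_port": 8000, "num_prompts": 10, "output_dir": "./results"}
env_vars = {"output_dir": "/tmp/results"}   # e.g. from DK_OUTPUT_DIR
yaml_cfg = {"num_prompts": 25}              # e.g. from -c config.yaml
cli_flags = {"num_prompts": 50}             # e.g. from --num-prompts 50

# First map wins, so order the layers highest-precedence first.
config = ChainMap(cli_flags, yaml_cfg, env_vars, defaults)

print(config["num_prompts"])  # → 50 (CLI beats YAML beats env)
print(config["output_dir"])   # → /tmp/results (env beats default)
print(config["vllm_port"])    # → 8000 (built-in default)
```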
Required Parameters
| Parameter | Description |
|---|---|
| tee_host | IP address of the TEE machine |
| tee_admin_token | Admin authentication token for the TEE (provided by DataKrypto) |
| tee_user_token | User/inference authentication token for the TEE (provided by DataKrypto) |
| model_name | Clear model name (e.g., Llama-3.2-1B-Instruct) |
| model_path_clear | Absolute path to the plaintext model weights on the GPU machine |
| model_path_encrypted | Absolute path to the encrypted model weights on the GPU machine |
| encrypted_model_id | Model ID assigned by FHEnom (from fhenomai model list) |
| encrypted_model_name | Served model name for the encrypted vLLM instance |
| fhenomai_venv | Path to the Python venv containing the fhenomai CLI |
| gpu_home | HOME directory for fhenomai commands on the GPU machine |
Remote Mode Parameters
Required only when local_mode: false (running from a separate machine):
| Parameter | Description |
|---|---|
| gpu_host | IP address of the GPU machine |
| gpu_ssh_user | SSH username for the GPU machine |
| gpu_ssh_key | Path to the SSH private key for the GPU machine |
Optional Parameters
| Parameter | Default | Description |
|---|---|---|
| local_mode | true | Run GPU commands locally (no SSH). Set false for remote execution |
| tee_admin_port | 9099 | TEE admin API port |
| tee_user_port | 9999 | TEE user/inference API port |
| tee_ssh_key | (empty) | SSH key for TEE host (enables getpwuid fix) |
| vllm_image | vllm/vllm-openai:latest | Docker image for vLLM |
| vllm_port | 8000 | Port shared sequentially by the clear and encrypted instances |
| vllm_gpu_memory_utilization | 0.5 | GPU memory fraction for vLLM |
| vllm_max_model_len | 8192 | Maximum sequence length |
| vllm_tensor_parallel_size | 1 | Number of GPUs for tensor parallelism |
| vllm_startup_timeout | 300 | Seconds to wait for vLLM to become ready |
| vllm_request_timeout | 120 | Timeout per inference request (seconds) |
| temperature | 0 | Inference temperature (0 = deterministic) |
| num_prompts | 10 | Number of benchmark prompts |
| seed | 42 | Random seed for prompt generation |
| fhe_enc_overhead_threshold_ms | 10.0 | SERV-7 encryption overhead threshold (ms) |
| fhe_dec_overhead_threshold_ms | 5.0 | SERV-7 decryption overhead threshold (ms) |
| output_dir | ./results | Output directory for JSON results and the HTML report |
| customer_name | (empty) | Customer name shown in the report header |
| poc_id | (empty) | POC identifier shown in the report header |
Scalability Parameters
scalability:
concurrent_levels: [1, 5, 10, 20, 50]
context_lengths: [512, 1024, 2048, 4096, 8192]
batch_sizes: [1, 4, 8, 16, 32]
sustained_duration_hours: 24
sustained_abbreviated_minutes: 3
Accuracy Benchmarks
accuracy_benchmarks:
- "mmlu"
- "hellaswag"
- "gsm8k"
- "humaneval"
Execution Flow
dk-test run
│
├── Phase 1: Clear Model
│ ├── Start vLLM container (clear model, port 8000)
│ ├── Wait for /v1/models readiness
│ ├── Run PERF / ACC / SCALE clear-side tests
│ ├── Collect SERV clear-model reference probes
│ ├── Save intermediate results
│ └── Stop vLLM container
│
├── Phase 2: Encrypted Model
│ ├── Start vLLM container (encrypted model, port 8000)
│ ├── Wait for /v1/models readiness
│ ├── Initialize fhenomai CLI config
│ ├── Start TEE serving (fhenomai serve start)
│ ├── Verify model is ONLINE (fhenomai model list --show-status)
│ ├── Run PERF / ACC / SCALE encrypted-side tests (via TEE)
│ └── Save intermediate results
│
├── Phase 3: Security Validation
│ └── Run SEC-1 through SEC-6
│
├── Phase 4: Serving Workflow
│ └── Run SERV-1 through SERV-7 (encrypted probes + 3-way comparison)
│
├── Phase 5: Training (Extended POC)
│ └── Run T-1 through T-3 (if enabled)
│
├── Teardown
│ ├── Stop vLLM container
│ ├── Stop TEE serving (fhenomai serve stop)
│ └── Verify model is OFFLINE
│
└── Report Generation
├── Side-by-side comparison tables
├── Pass/fail badges per test
├── Per-prompt response viewer
└── Chart.js performance visualizations
Environment Variables
| Variable | Maps To |
|---|---|
| DK_GPU_HOST | gpu_host |
| DK_TEE_HOST | tee_host |
| DK_MODEL_NAME | model_name |
| DK_OUTPUT_DIR | output_dir |
Dependencies
Core (installed automatically):
click · pyyaml · httpx · aiohttp · numpy · scipy · jinja2 · rich ·
paramiko · psutil · tenacity
Optional — accuracy (pip install dk-test-suite[accuracy]):
lm-eval · bert-score · deepeval
Optional — development (pip install dk-test-suite[dev]):
pytest · black · flake8 · mypy · isort
Runtime requirements:
- Python ≥ 3.10
- Docker (for vLLM container management)
- NVIDIA GPU with CUDA drivers (for inference)
- Network access to the TEE machine (HTTPS)
- fhenomai CLI installed in a Python venv on the GPU machine
Output
Each run produces:
| File | Description |
|---|---|
| results/poc_report_YYYYMMDD_HHMMSS.html | Self-contained HTML report |
| results/performance_results.json | PERF-1 through PERF-5 raw data |
| results/accuracy_results.json | ACC-1 through ACC-5 raw data |
| results/scalability_results.json | SCALE-1 through SCALE-4 raw data |
| results/security_results.json | SEC-1 through SEC-6 raw data |
| results/serving_results.json | SERV-1 through SERV-7 raw data |
| results/training_results.json | T-1 through T-3 raw data |
| results/prompt_set.json | Deterministic prompt set used for the run |
License
MIT License — Copyright © 2025 DataKrypto.
Links
- Website: datakrypto.ai
- Documentation: docs.datakrypto.ai
- LinkedIn: linkedin.com/company/datakrypto