
dk-test-suite

DataKrypto FHEnom for AI™ — Automated POC Head-to-Head Test Suite

dk-test-suite is the official validation framework for FHEnom for AI™ deployments. It runs automated, side-by-side tests comparing a plaintext (clear) model against its FHE-encrypted counterpart — proving that encryption introduces zero degradation to inference quality while providing full confidentiality at rest, in transit, and in use.

The suite produces a self-contained HTML report with DataKrypto-branded tables, charts, and per-prompt response comparisons.


Features

  • 30 automated tests across 6 categories (PERF, ACC, SCALE, SEC, SERV, T)
  • Sequential single-port architecture — clear and encrypted models share port 8000; no 2× VRAM requirement
  • Local or remote execution — run directly on the GPU server or from a separate machine via SSH
  • HTML report — DataKrypto-styled comparison with pass/fail badges, expandable details, metric explainers, and Chart.js visualizations
  • Deterministic prompts — seeded prompt generation for reproducible runs
  • Streaming inference — TTFT measurement through SSE streaming endpoints
  • lm-eval-harness integration — MMLU, HellaSwag, GSM8K, HumanEval benchmarks
  • BERTScore semantic similarity — optional deep comparison via bert-score
  • 3-way encryption proof — SERV-4/5/6 prove encryption is real (TEE coherent, bypass garbled, fidelity maintained)

Installation

From PyPI (recommended):

pip install dk-test-suite

With optional accuracy dependencies:

pip install "dk-test-suite[accuracy]"

This adds lm-eval, bert-score, and deepeval for ACC-2 (lm-eval benchmarks) and ACC-3 (BERTScore semantic similarity). Without these, the tool falls back to word-level similarity measures and marks lm-eval tests as skipped.

From source:

git clone https://github.com/datakrypto/dk-test-suite.git
cd dk-test-suite
pip install -e ".[all]"

Quick Start

1. Create a Configuration File

Create config.yaml with your deployment details:

# Run directly on the GPU machine (most common)
local_mode: true

# TEE machine
tee_host: "10.0.0.2"
tee_admin_token: "<admin-token>"
tee_user_token: "<user-token>"

# Model identifiers
model_name: "Llama-3.2-1B-Instruct"
encrypted_model_name: "Llama-3.2-1B-Instruct-encrypted"
encrypted_model_id: "<UUID from fhenomai model list>"

# Model paths on the GPU machine
model_path_clear: "/home/user/models/Llama-3.2-1B-Instruct"
model_path_encrypted: "/home/user/models/Llama-3.2-1B-Instruct-encrypted"

# FHEnom venv and home directory
fhenomai_venv: "/home/user/venv"
gpu_home: "/home/user"

2. Run the Full Suite

dk-test run -c config.yaml

3. View the Report

ls results/poc_report_*.html

Open the HTML file in a browser to see the full comparison report.


CLI Reference

The dk-test command provides four subcommands:

dk-test run

Run the full POC test suite or a subset of categories.

dk-test run [OPTIONS]
Option Description
-c, --config PATH Path to a YAML configuration file
--gpu-host HOST Override GPU machine IP
--tee-host HOST Override TEE machine IP
--model NAME Override model name
--output DIR Output directory for results (default: ./results)
-t, --categories CAT Test categories to run (repeatable). Choices: performance, accuracy, scalability, security, serving, training. Default: all
--num-prompts N Number of benchmark prompts (default: from config)
--skip-clear-vllm Assume vLLM is already running with the clear model
--skip-encrypted-vllm Assume vLLM is already running with the encrypted model and TEE is serving
--local Run on this machine (no SSH to GPU)
--gpu-ssh-key PATH Override SSH private key for the GPU
--tee-ssh-key PATH Override SSH private key for the TEE
-v, --verbose Enable debug logging

Examples:

# Full suite with runner-managed vLLM lifecycle
dk-test run -c config.yaml

# Quick smoke test: serving + security only (5-10 min)
dk-test run -c config.yaml -t serving -t security --skip-clear-vllm --skip-encrypted-vllm

# Performance and accuracy only, 50 prompts, verbose
dk-test run -c config.yaml -t performance -t accuracy --num-prompts 50 -v

# Use pre-running vLLM instances
dk-test run -c config.yaml --skip-clear-vllm --skip-encrypted-vllm

dk-test scale4-sustained

Run the full 24-hour sustained operation test (SCALE-4).

dk-test scale4-sustained [OPTIONS]
Option Description
-c, --config PATH Path to a YAML configuration file
--hours N Duration in hours (default: 24)
--local Run locally (no SSH to GPU)
-v, --verbose Enable debug logging

This command runs continuous inference against both the clear and encrypted models for the specified duration, measuring throughput degradation over time. It is designed for overnight or multi-day soak testing.

dk-test scale4-sustained -c config.yaml --hours 24
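The degradation metric this test reports can be illustrated with a small helper (a sketch with a hypothetical function name; the suite's actual aggregation logic may differ):

```python
def throughput_degradation(window_tps: list) -> float:
    """Percent throughput drop from the first measurement window to the
    last. window_tps holds tokens/sec averaged per window over the run."""
    if len(window_tps) < 2 or window_tps[0] == 0:
        return 0.0
    return (window_tps[0] - window_tps[-1]) / window_tps[0] * 100.0
```

A flat curve (close to 0%) over the full duration indicates the encrypted path sustains load as well as the clear one.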

dk-test report

Regenerate the HTML report from existing JSON result files.

dk-test report RESULTS_DIR [OPTIONS]
Argument / Option Description
RESULTS_DIR Directory containing *_results.json files from a previous run
-c, --config PATH Path to a YAML configuration file

Useful when you want to regenerate the report with updated formatting or after manually editing result files.

dk-test report ./results -c config.yaml

dk-test info

Display the current configuration and test connectivity to GPU and TEE machines.

dk-test info [OPTIONS]
Option Description
-c, --config PATH Path to a YAML configuration file
--local Show local configuration (no SSH connectivity check)

dk-test info -c config.yaml
dk-test info --local

dk-test --version

Print the installed version.

dk-test --version

Test Categories

Performance (PERF-1 — PERF-5)

Measures the computational overhead introduced by FHE encryption. All tests are measurement-only (always PASS) and document the delta between the clear and encrypted models.

ID Metric Method
PERF-1 Time to First Token (TTFT) Streaming SSE, first data: chunk timestamp
PERF-2 Throughput (tokens/sec) Total tokens ÷ wall time from streaming responses
PERF-3 End-to-End Response Time Wall-clock time for full non-streaming completion
PERF-4 Model Footprint (disk) du -sb on clear vs encrypted model directories
PERF-5 GPU Memory Utilization nvidia-smi VRAM snapshot during inference

Includes a warmup request before measurement to avoid CUDA JIT and KV-cache allocation artifacts.
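The PERF-1 measurement pattern can be sketched as follows; the SSE wire format shown is the standard OpenAI-compatible one that vLLM streams, and `measure_ttft` is an illustrative helper, not the suite's internal API:

```python
import time
from typing import Iterator, Optional

def measure_ttft(stream_lines: Iterator[str]) -> Optional[float]:
    """Seconds from iteration start to the first SSE 'data:' chunk that
    carries content, i.e. the Time to First Token. stream_lines is the
    line iterator of a streaming /v1/completions response."""
    start = time.perf_counter()
    for line in stream_lines:
        payload = line[5:].strip() if line.startswith("data:") else None
        if payload and payload != "[DONE]":
            return time.perf_counter() - start
    return None  # stream ended without a content chunk
```

In practice the iterator would come from a streaming HTTP client (httpx, one of the suite's core dependencies, exposes `iter_lines()` on a streamed response).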

Accuracy (ACC-1 — ACC-5)

Validates that FHE encryption introduces no degradation to model output quality.

ID Metric Method
ACC-1 Deterministic Equivalence Token-level LCS similarity using the model's tokenizer. Falls back to word-level if tokenizer is unavailable. Threshold: configurable (default ≥ 0.20)
ACC-2 Functional Equivalence lm-eval-harness benchmarks (MMLU, HellaSwag, GSM8K, HumanEval for clear; TriviaQA for TEE HTTPS endpoint). Requires pip install dk-test-suite[accuracy]
ACC-3 Semantic Consistency BERTScore F1 between clear and encrypted outputs. Falls back to word-level SequenceMatcher when bert-score is not installed
ACC-4 Response Length Consistency Two-sample t-test on response lengths. Fails if p-value < 0.01 (statistically significant length difference)
ACC-5 Perplexity / Log-Likelihood Average perplexity from logprobs. TEE does not support logprobs — marked as N/A (known limitation)
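The similarity measures behind ACC-1 and the ACC-3 fallback are standard algorithms; a minimal sketch (these helpers are illustrative, not the suite's internals):

```python
from difflib import SequenceMatcher

def lcs_similarity(a: list, b: list) -> float:
    """ACC-1-style metric: longest common subsequence of two token lists,
    normalized by the longer list's length, giving a score in [0, 1]."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 1.0 if m == n else 0.0
    prev = [0] * (n + 1)  # rolling-row LCS dynamic program
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            cur[j] = prev[j - 1] + 1 if a[i - 1] == b[j - 1] else max(prev[j], cur[j - 1])
        prev = cur
    return prev[n] / max(m, n)

def word_similarity(a_text: str, b_text: str) -> float:
    """ACC-3 fallback when bert-score is absent: word-level ratio."""
    return SequenceMatcher(None, a_text.split(), b_text.split()).ratio()
```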

Scalability (SCALE-1 — SCALE-4)

Assesses whether FHE encryption introduces constraints on scaling behavior.

ID Metric Method
SCALE-1 Concurrent User Load Parallel requests at 1, 5, 10, 20, 50 concurrent users via ThreadPoolExecutor
SCALE-2 Context Length Scaling Inference at 512, 1024, 2048, 4096, 8192 token inputs (3 iterations averaged)
SCALE-3 Batch Processing Sequential throughput at batch sizes 1, 4, 8, 16, 32
SCALE-4 Sustained Operation Abbreviated: 3-minute continuous run. Full: 24h via dk-test scale4-sustained
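SCALE-1's fan-out is a plain thread-pool map; a sketch with an injectable request function (hypothetical helper names):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

def run_concurrency_level(send_request: Callable[[str], str],
                          n_users: int,
                          prompts: List[str]) -> Tuple[float, List[str]]:
    """Fire n_users requests in parallel and return (wall_seconds, responses).
    send_request performs one blocking inference call for one prompt."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        responses = list(pool.map(send_request, prompts[:n_users]))
    return time.perf_counter() - start, responses
```

Running this at each level in `concurrent_levels` against the clear and encrypted endpoints and comparing wall times gives the SCALE-1 comparison.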

Security (SEC-1 — SEC-6)

Automated adversarial testing of the encrypted deployment. These tests have hard pass/fail criteria.

ID Test Pass Condition
SEC-1 Encryption at Rest No plaintext weight patterns in safetensors data sections; binary entropy ≥ 7.5 bits/byte
SEC-2 Encrypted Execution No plaintext weight patterns in process memory maps (/proc/PID/maps)
SEC-3 Secure Transport TEE endpoint uses HTTPS; tcpdump captures show no plaintext model data
SEC-4 Model Binding Encrypted model cannot be loaded with transformers.AutoModelForCausalLM outside FHEnom, or produces non-meaningful output
SEC-5 Key Isolation No FHEnom key material in files, environment variables, or Docker configuration on the GPU host
SEC-6 Logging Safety No plaintext weight data or sensitive information in vLLM container logs or system journals

Serving (SERV-1 — SERV-7)

End-to-end validation of the FHE-encrypted serving pipeline. These tests have hard pass/fail criteria and implement a 3-way encryption proof.

ID Test Pass Condition
SERV-1 Encrypted Model File Integrity config.json, tokenizer files, ≥1 .safetensors file, total size ≥ 1.5 GB
SERV-2 vLLM Server Health HTTP 200 on /health endpoint
SERV-3 TEE Serving Status Encrypted model visible via admin API, user /v1/models, or inference probe
SERV-4 TEE Inference Coherence TEE output has space ratio ≥ 0.08 (coherent English text)
SERV-5 Encryption Reality Proof Direct bypass of TEE produces garbled output (space ratio < 0.04)
SERV-6 Output Fidelity SequenceMatcher ratio ≥ 0.70 between TEE and clear model for all probe prompts
SERV-7 FHE Overhead Encryption overhead < 10ms, decryption overhead < 5ms per request

3-Way Proof Architecture:

[Path 1 — TEE]      User → TEE:9999 → Encrypt → vLLM:8000 → TEE → Decrypt → User   ✅ Coherent
[Path 2 — Clear]    User → vLLM:8000 (clear weights) → User                          ✅ Coherent
[Path 3 — Bypass]   User → vLLM:8000 (encrypted weights, no TEE) → User              ❌ Garbled

SERV-4 proves Path 1 works. SERV-5 proves Path 3 fails. SERV-6 proves Path 1 ≈ Path 2. Together, they demonstrate that encryption is real and the TEE correctly handles encryption/decryption.
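The space-ratio heuristic behind SERV-4/5 is easy to reproduce (a sketch; the thresholds are the ones documented above, the function names are illustrative):

```python
def space_ratio(text: str) -> float:
    """Fraction of characters that are spaces; coherent English prose
    sits well above ciphertext-like garble."""
    return text.count(" ") / len(text) if text else 0.0

def classify(text: str) -> str:
    """SERV-4/5-style verdict: >= 0.08 coherent, < 0.04 garbled."""
    r = space_ratio(text)
    if r >= 0.08:
        return "coherent"
    if r < 0.04:
        return "garbled"
    return "inconclusive"
```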

Training (T-1 — T-3)

Validates secure fine-tuning of encrypted models (Extended POC). These tests require training.enabled: true and a training.dataset_path in the configuration. Skipped by default.

ID Test Description
T-1 Secure Checkpoints Verifies encrypted fine-tuning checkpoints exist and have high entropy
T-2 Convergence Compares loss curves between clear and encrypted fine-tuning
T-3 Inference Quality Tests inference on fine-tuned checkpoints

Configuration Reference

Configuration is loaded in the following order of precedence (highest first):

  1. CLI flags (--gpu-host, --tee-host, --model, --num-prompts, etc.)
  2. Custom YAML file (-c my_config.yaml)
  3. Environment variables (DK_GPU_HOST, DK_TEE_HOST, DK_MODEL_NAME, DK_OUTPUT_DIR)
  4. Built-in defaults (config/default.yaml)
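The precedence rule amounts to a layered dictionary merge; a minimal sketch (hypothetical helper, not the tool's actual loader):

```python
def resolve_config(cli: dict, yaml_cfg: dict, env: dict, defaults: dict) -> dict:
    """Merge config layers so that CLI > YAML > env vars > defaults.
    None values in a layer are treated as 'not set'."""
    merged = dict(defaults)
    for layer in (env, yaml_cfg, cli):  # lowest to highest precedence
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```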

Required Parameters

Parameter Description
tee_host IP address of the TEE machine
tee_admin_token Admin authentication token for the TEE (provided by DataKrypto)
tee_user_token User/inference authentication token for the TEE (provided by DataKrypto)
model_name Clear model name (e.g., Llama-3.2-1B-Instruct)
model_path_clear Absolute path to the plaintext model weights on the GPU machine
model_path_encrypted Absolute path to the encrypted model weights on the GPU machine
encrypted_model_id Model ID assigned by FHEnom (from fhenomai model list)
encrypted_model_name Served model name for the encrypted vLLM instance
fhenomai_venv Path to the Python venv containing the fhenomai CLI
gpu_home HOME directory for fhenomai commands on the GPU machine

Remote Mode Parameters

Required only when local_mode: false (running from a separate machine):

Parameter Description
gpu_host IP address of the GPU machine
gpu_ssh_user SSH username for the GPU machine
gpu_ssh_key Path to the SSH private key for the GPU machine

Optional Parameters

Parameter Default Description
local_mode true Run GPU commands locally (no SSH). Set false for remote execution
tee_admin_port 9099 TEE admin API port
tee_user_port 9999 TEE user/inference API port
tee_ssh_key (empty) SSH key for TEE host (enables getpwuid fix)
vllm_image vllm/vllm-openai:latest Docker image for vLLM
vllm_port 8000 Port shared sequentially by clear and encrypted instances
vllm_gpu_memory_utilization 0.5 GPU memory fraction for vLLM
vllm_max_model_len 8192 Maximum sequence length
vllm_tensor_parallel_size 1 Number of GPUs for tensor parallelism
vllm_startup_timeout 300 Seconds to wait for vLLM to become ready
vllm_request_timeout 120 Timeout per inference request
temperature 0 Inference temperature (0 = deterministic)
num_prompts 10 Number of benchmark prompts
seed 42 Random seed for prompt generation
fhe_enc_overhead_threshold_ms 10.0 SERV-7 encryption overhead threshold
fhe_dec_overhead_threshold_ms 5.0 SERV-7 decryption overhead threshold
output_dir ./results Output directory for JSON results and HTML report
customer_name (empty) Customer name shown in the report header
poc_id (empty) POC identifier shown in the report header
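The `seed` and `num_prompts` parameters drive deterministic prompt generation; the pattern is a seeded RNG, sketched here with hypothetical topics (the suite's real prompt pool differs):

```python
import random

TOPICS = ["photosynthesis", "TCP handshakes", "binary search",
          "the water cycle", "public-key cryptography", "plate tectonics"]

def generate_prompts(seed: int = 42, num_prompts: int = 10) -> list:
    """Same seed -> same prompt list, so the clear and encrypted phases
    (and any re-run) see identical inputs."""
    rng = random.Random(seed)
    return [f"Explain {rng.choice(TOPICS)} in two sentences."
            for _ in range(num_prompts)]
```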

Scalability Parameters

scalability:
  concurrent_levels: [1, 5, 10, 20, 50]
  context_lengths: [512, 1024, 2048, 4096, 8192]
  batch_sizes: [1, 4, 8, 16, 32]
  sustained_duration_hours: 24
  sustained_abbreviated_minutes: 3

Accuracy Benchmarks

accuracy_benchmarks:
  - "mmlu"
  - "hellaswag"
  - "gsm8k"
  - "humaneval"

Execution Flow

dk-test run
  │
  ├── Phase 1: Clear Model
  │     ├── Start vLLM container (clear model, port 8000)
  │     ├── Wait for /v1/models readiness
  │     ├── Run PERF / ACC / SCALE clear-side tests
  │     ├── Collect SERV clear-model reference probes
  │     ├── Save intermediate results
  │     └── Stop vLLM container
  │
  ├── Phase 2: Encrypted Model
  │     ├── Start vLLM container (encrypted model, port 8000)
  │     ├── Wait for /v1/models readiness
  │     ├── Initialize fhenomai CLI config
  │     ├── Start TEE serving (fhenomai serve start)
  │     ├── Verify model is ONLINE (fhenomai model list --show-status)
  │     ├── Run PERF / ACC / SCALE encrypted-side tests (via TEE)
  │     └── Save intermediate results
  │
  ├── Phase 3: Security Validation
  │     └── Run SEC-1 through SEC-6
  │
  ├── Phase 4: Serving Workflow
  │     └── Run SERV-1 through SERV-7 (encrypted probes + 3-way comparison)
  │
  ├── Phase 5: Training (Extended POC)
  │     └── Run T-1 through T-3 (if enabled)
  │
  ├── Teardown
  │     ├── Stop vLLM container
  │     ├── Stop TEE serving (fhenomai serve stop)
  │     └── Verify model is OFFLINE
  │
  └── Report Generation
        ├── Side-by-side comparison tables
        ├── Pass/fail badges per test
        ├── Per-prompt response viewer
        └── Chart.js performance visualizations

Environment Variables

Variable Maps To
DK_GPU_HOST gpu_host
DK_TEE_HOST tee_host
DK_MODEL_NAME model_name
DK_OUTPUT_DIR output_dir

Dependencies

Core (installed automatically):

click · pyyaml · httpx · aiohttp · numpy · scipy · jinja2 · rich · paramiko · psutil · tenacity

Optional — accuracy (pip install dk-test-suite[accuracy]):

lm-eval · bert-score · deepeval

Optional — development (pip install dk-test-suite[dev]):

pytest · black · flake8 · mypy · isort

Runtime requirements:

  • Python ≥ 3.10
  • Docker (for vLLM container management)
  • NVIDIA GPU with CUDA drivers (for inference)
  • Network access to the TEE machine (HTTPS)
  • fhenomai CLI installed in a Python venv on the GPU machine

Output

Each run produces:

File Description
results/poc_report_YYYYMMDD_HHMMSS.html Self-contained HTML report
results/performance_results.json PERF-1 through PERF-5 raw data
results/accuracy_results.json ACC-1 through ACC-5 raw data
results/scalability_results.json SCALE-1 through SCALE-4 raw data
results/security_results.json SEC-1 through SEC-6 raw data
results/serving_results.json SERV-1 through SERV-7 raw data
results/training_results.json T-1 through T-3 raw data
results/prompt_set.json Deterministic prompt set used for the run

License

MIT License — Copyright © 2025 DataKrypto.

