Open-source synthetic data generator with cryptographic proof receipts.
VeriSynth Core
VeriSynth Core is a lightweight, privacy-preserving synthetic data generation CLI. It transforms sensitive tabular datasets into statistically realistic synthetic data — with cryptographic proof receipts that verify integrity and reproducibility.
✨ Features
- 🔐 Privacy-safe synthesis — no real records are ever exposed
- 📊 Statistical realism using Gaussian Copula modeling
- 🗂️ Schema configuration — explicit field mapping and exclusion via YAML
- 🧾 Proof receipts (proof.json) include hashes, Merkle roots, metrics & seed
- 🧠 Deterministic generation via reproducible random seeds
- ⚡ Runs locally — no cloud or GPUs required
- 🧩 Extensible — drop-in engine architecture for future models (CTGAN, TVAE, etc.)
🧰 Quick Start
1. Create a virtual environment
python3 -m venv venv
source venv/bin/activate
2. Install dependencies
pip install -r requirements.txt
3. Run VeriSynth
python -m verisynth.cli --input data/sample_patients.csv --output out/ --rows 1000
This command:
- Loads data/sample_patients.csv
- Learns its structure and correlations
- Generates 1,000 synthetic rows
- Saves:
  - out/synthetic.csv → synthetic dataset
  - out/proof.json → verifiable proof receipt
4. Optional: Use Schema Configuration
# Create example schema configuration
python -m verisynth.cli --create-schema-example config.yaml
# Run with schema (excludes patient_id, maps types)
python -m verisynth.cli --input data/sample_patients.csv --output out/ --schema config.yaml
🧪 Running Tests
VeriSynth includes a comprehensive test suite to ensure reliability and correctness.
Prerequisites
Make sure you have the development dependencies installed:
pip install pytest pytest-cov
Running Tests
# Run all tests
python -m pytest tests/ -v
# Run tests with coverage report
python -m pytest tests/ --cov=verisynth --cov-report=term-missing
# Run a specific test file
python -m pytest tests/test_cli.py -v
# Run tests in verbose mode with coverage
python -m pytest tests/ -v --cov=verisynth --cov-report=term-missing
Test Structure
The test suite includes:
- test_cli.py - Tests the command-line interface functionality
- test_proof.py - Tests Merkle root consistency and proof generation
- test_synth.py - Tests synthetic data generation and shape validation
- test_schema.py - Tests schema configuration functionality (25 test cases)
Schema Test Coverage
The test_schema.py file covers the schema feature in the following areas:
- Configuration Tests: YAML file loading, validation, error handling
- Field Operations: Exclusion, type conversion (int, float, bool, str)
- CLI Integration: Schema file creation, command-line usage
- Synthesis Integration: Schema application during data generation
- Edge Cases: Empty dataframes, invalid configurations, missing fields
- Backward Compatibility: Ensures existing functionality still works
Continuous Integration
Tests are automatically run on every push and pull request via GitHub Actions, ensuring code quality and preventing regressions.
🧾 Example Proof Receipt
{
  "verisynth_version": "core-0.1.0",
  "license": "MIT",
  "metrics": { "corr_mean_abs_delta": 0.12, "naive_reid_risk": 0.01 },
  "input": { "rows": 10, "sha256": "…82b7" },
  "output": { "rows": 1000000, "sha256": "…acb9" },
  "model": { "engine": "GaussianCopula", "seed": 42 },
  "proof": "merkle_root: …c31"
}
Each receipt supports integrity and reproducibility checks: the same input and the same seed produce identical output and an identical Merkle root.
Verify Sample Proof
python verisynth/verify.py
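For readers who want to check a receipt by hand, the following is a minimal sketch of the file-hash part of that check. It assumes the receipt stores the full hex SHA-256 digest under `output.sha256`, as the field names in the example receipt suggest (the example shows the digests only in elided form). The function names `file_sha256` and `output_matches_receipt` are hypothetical and are not taken from `verisynth/verify.py`.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_matches_receipt(receipt_path, synthetic_path):
    """Compare a file's digest with receipt["output"]["sha256"].

    Assumption: the receipt holds the full hex digest under that key.
    """
    receipt = json.loads(Path(receipt_path).read_text(encoding="utf-8"))
    return receipt["output"]["sha256"] == file_sha256(synthetic_path)
```

With the Quick Start paths, a call would look like `output_matches_receipt("out/proof.json", "out/synthetic.csv")`. A `False` result means the synthetic file no longer matches what the receipt recorded.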
⚙️ CLI Reference
python -m verisynth.cli --input <path> --output <dir> [--rows N] [--seed SEED] [--schema SCHEMA]
| Flag | Description |
|---|---|
| --input | Path to input CSV file |
| --output | Output directory for synthetic data and proof |
| --rows | Number of synthetic rows to generate (default: 1000) |
| --seed | Random seed for deterministic reproducibility |
| --schema | Path to YAML schema configuration file (optional) |
Examples:
# Basic synthesis
python -m verisynth.cli --input data/finance.csv --output out/ --rows 500000 --seed 1337
# With schema configuration
python -m verisynth.cli --input data/patients.csv --output out/ --schema config.yaml
# Create example schema configuration
python -m verisynth.cli --create-schema-example config.yaml
🗂️ Schema Configuration
VeriSynth supports explicit field mapping and exclusion through YAML schema configuration files. This gives you fine-grained control over which fields to synthesize and how to handle data types.
Schema Configuration Format
exclude: ["patient_id", "address"]
types:
  age: int
  bmi: float
  smoker: bool
  hba1c: float
model:
  engine: GaussianCopula
  seed: 42
Configuration Options
- exclude: List of field names to exclude from synthesis (e.g., IDs, addresses)
- types: Explicit type mappings for fields (supports: int, float, bool, str)
- model: Model configuration including engine and seed
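To make the effect of `exclude` and `types` concrete, here is a rough sketch of how such a configuration could be applied to rows read from a CSV. It is an illustration only, not VeriSynth's implementation: the helper `apply_schema`, the `CASTERS` table, the rule for parsing booleans, and the sample row are all assumptions. The schema is written as a Python dict that mirrors the YAML keys, so no YAML parser is needed.

```python
# Hypothetical casters keyed by the type names the schema supports.
CASTERS = {
    "int": int,
    "float": float,
    "str": str,
    # Assumption: common truthy spellings map to True, anything else to False.
    "bool": lambda value: str(value).strip().lower() in {"1", "true", "yes", "y"},
}


def apply_schema(rows, schema):
    """Drop excluded fields and cast typed fields in a list of dict rows."""
    excluded = set(schema.get("exclude", []))
    types = schema.get("types", {})
    result = []
    for row in rows:
        cleaned = {}
        for field, value in row.items():
            if field in excluded:
                continue  # excluded fields never reach the model
            type_name = types.get(field)
            cleaned[field] = CASTERS[type_name](value) if type_name else value
        result.append(cleaned)
    return result


# Dict form of the YAML keys `exclude` and `types`.
schema = {
    "exclude": ["patient_id", "address"],
    "types": {"age": "int", "bmi": "float", "smoker": "bool", "hba1c": "float"},
}
# Made-up row, as csv.DictReader would yield it (all values are strings).
rows = [{"patient_id": "p-001", "age": "54", "bmi": "27.1", "smoker": "yes", "hba1c": "6.2"}]
print(apply_schema(rows, schema))
# → [{'age': 54, 'bmi': 27.1, 'smoker': True, 'hba1c': 6.2}]
```

Fields that appear in neither `exclude` nor `types` pass through unchanged in this sketch, which corresponds to leaving them to automatic detection.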
Benefits
- Privacy: Exclude sensitive identifiers and PII
- Control: Explicit type handling instead of automatic detection
- Reproducibility: Schema configuration is included in proof receipts
- Validation: Built-in validation ensures configuration correctness
🔬 What's Under the Hood
VeriSynth uses a Gaussian Copula model to learn the joint distribution of all numeric and categorical variables. Instead of randomizing data, it captures real-world correlations (e.g., age ↔ blood pressure) and samples new records consistent with the original dataset.
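The idea can be sketched for two numeric columns using only the standard library. This is a simplified illustration of a Gaussian copula, not the engine VeriSynth ships: it handles two columns only, ignores ties and categorical fields, and models each marginal by resampling observed values rather than fitting a smooth distribution. The names `normal_scores`, `pearson`, and `sample_copula` and the sample data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Map values to standard-normal scores through their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, index in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[index] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sample_copula(col_a, col_b, rows, seed):
    """Fit a two-column Gaussian copula and draw `rows` synthetic pairs."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    def to_value(z, sorted_col):
        # Invert the empirical CDF: normal score -> probability -> observed value.
        position = int(STD_NORMAL.cdf(z) * len(sorted_col))
        return sorted_col[min(position, len(sorted_col) - 1)]

    pairs = []
    for _ in range(rows):
        z_a = rng.gauss(0.0, 1.0)
        # Mix in independent noise so the two scores have correlation rho.
        noise_scale = math.sqrt(max(0.0, 1.0 - rho * rho))
        z_b = rho * z_a + noise_scale * rng.gauss(0.0, 1.0)
        pairs.append((to_value(z_a, sorted_a), to_value(z_b, sorted_b)))
    return pairs


# Made-up age and systolic blood pressure values, for illustration only.
ages = [34, 45, 52, 61, 29, 70, 48, 57]
pressures = [118, 131, 126, 140, 115, 150, 135, 128]
synthetic = sample_copula(ages, pressures, rows=5, seed=42)
```

The pairs in `synthetic` are new combinations whose dependence follows the correlation learned from the normal scores, and calling `sample_copula` again with the same seed returns the same pairs.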
🔒 Proof System
Each run produces a verifiable audit trail:
- SHA-256 file hashes of input/output
- Merkle roots linking dataset lineage
- Model seed & parameters for deterministic replay
- Privacy & fidelity metrics
🧩 This system provides verifiable lineage and reproducibility — a foundation for future zero-knowledge (ZK) verification.
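As a rough illustration of how a Merkle root can summarize a dataset, the sketch below hashes each row and folds the hashes pairwise into a single root. The receipt format does not say what VeriSynth uses as leaves or how it handles an odd number of nodes, so both choices here (one leaf per serialized row, duplicating the last node on odd levels) are assumptions, and `merkle_root` is a hypothetical name.

```python
import hashlib


def merkle_root(leaves):
    """Fold a list of byte strings into one hex-encoded root hash."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: duplicate the last node on odd-sized levels.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Made-up serialized rows; any change to any row changes the root.
rows = [b"54,27.1,True,6.2", b"41,23.0,False,5.4", b"67,30.2,False,7.1"]
root = merkle_root(rows)
```

Because the root depends on every leaf and on leaf order, regenerating the data with the same input and seed should reproduce the same root, while a single altered row produces a different one.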
🧠 Roadmap
- Add differential privacy metrics (ε, δ)
- Add support for CTGAN / TVAE models
- Add signed receipts (Ed25519)
- Add proof viewer
📜 License
MIT © VeriSynth.ai
File details
Details for the file verisynth_core-0.1.0.tar.gz.
File metadata
- Download URL: verisynth_core-0.1.0.tar.gz
- Upload date:
- Size: 23.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | feab174a6327a33e54eca0749d7df5447863cf947ef19320572b8df60e09cfe5 |
| MD5 | 26fc871eb45057fabb1cadbd3ae96f49 |
| BLAKE2b-256 | b53d1e5a91542a47f5be59b0dc90368af8a705ba7bd7fd49548b6644f3b3a1e1 |
File details
Details for the file verisynth_core-0.1.0-py3-none-any.whl.
File metadata
- Download URL: verisynth_core-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3ce7226ef55aa4bc8e80a8d8f9aed9934b108167796deb720c4ef5a93e10e95 |
| MD5 | b1c3121fac15c0e3fb727ab62387fa0d |
| BLAKE2b-256 | bce417d3ed2f0036c73fa128eae405490f833faf0572a68c0fdcb69b9c63c7ed |