Skip to main content

Open-source synthetic data generator with cryptographic proof receipts.

Project description

VeriSynth Logo

VeriSynth Core

Tests

VeriSynth Core is a lightweight, privacy-preserving synthetic data generation CLI. It transforms sensitive tabular datasets into statistically realistic synthetic data — with cryptographic proof receipts that verify integrity and reproducibility.


✨ Features

  • 🔐 Privacy-safe synthesis — no real records are ever exposed
  • 📊 Statistical realism using Gaussian Copula modeling
  • 🗂️ Schema configuration — explicit field mapping and exclusion via YAML
  • 🧾 Proof receipts (proof.json) include hashes, Merkle roots, metrics & seed
  • 🧠 Deterministic generation via reproducible random seeds
  • Runs locally — no cloud or GPUs required
  • 🧩 Extensible — drop-in engine architecture for future models (CTGAN, TVAE, etc.)

🧰 Quick Start

1. Create a virtual environment

python3 -m venv venv
source venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3. Run VeriSynth

python -m verisynth.cli --input data/sample_patients.csv --output out/ --rows 1000

This command:

  • Loads data/sample_patients.csv

  • Learns its structure and correlations

  • Generates 1,000 synthetic rows

  • Saves:

    • out/synthetic.csv → synthetic dataset
    • out/proof.json → verifiable proof receipt

4. Optional: Use Schema Configuration

# Create example schema configuration
python -m verisynth.cli --create-schema-example config.yaml

# Run with schema (excludes patient_id, maps types)
python -m verisynth.cli --input data/sample_patients.csv --output out/ --schema config.yaml

🧪 Running Tests

VeriSynth includes a comprehensive test suite to ensure reliability and correctness.

Prerequisites

Make sure you have the development dependencies installed:

pip install pytest pytest-cov

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run tests with coverage report
python -m pytest tests/ --cov=verisynth --cov-report=term-missing

# Run a specific test file
python -m pytest tests/test_cli.py -v

# Run tests in verbose mode with coverage
python -m pytest tests/ -v --cov=verisynth --cov-report=term-missing

Test Structure

The test suite includes:

  • test_cli.py - Tests the command-line interface functionality
  • test_proof.py - Tests Merkle root consistency and proof generation
  • test_synth.py - Tests synthetic data generation and shape validation
  • test_schema.py - Tests schema configuration functionality (25 comprehensive test cases)

Schema Test Coverage

The test_schema.py file provides comprehensive testing for the schema feature:

  • Configuration Tests: YAML file loading, validation, error handling
  • Field Operations: Exclusion, type conversion (int, float, bool, str)
  • CLI Integration: Schema file creation, command-line usage
  • Synthesis Integration: Schema application during data generation
  • Edge Cases: Empty dataframes, invalid configurations, missing fields
  • Backward Compatibility: Ensures existing functionality still works

Continuous Integration

Tests are automatically run on every push and pull request via GitHub Actions, ensuring code quality and preventing regressions.


🧾 Example Proof Receipt

{
  "verisynth_version": "core-0.1.0",
  "license": "MIT",
  "metrics": { "corr_mean_abs_delta": 0.12, "naive_reid_risk": 0.01 },
  "input":  { "rows": 10, "sha256": "…82b7" },
  "output": { "rows": 1000000, "sha256": "…acb9" },
  "model":  { "engine": "GaussianCopula", "seed": 42 },
  "proof":  "merkle_root: …c31"
}

Each proof ensures integrity and reproducibility: same input + same seed = identical output and Merkle proof.

Verify Sample Proof

python verisynth/verify.py

⚙️ CLI Reference

python -m verisynth.cli --input <path> --output <dir> [--rows N] [--seed SEED] [--schema SCHEMA]
Flag Description
--input Path to input CSV file
--output Output directory for synthetic data and proof
--rows Number of synthetic rows to generate (default: 1000)
--seed Random seed for deterministic reproducibility
--schema Path to YAML schema configuration file (optional)

Examples:

# Basic synthesis
python -m verisynth.cli --input data/finance.csv --output out/ --rows 500000 --seed 1337

# With schema configuration
python -m verisynth.cli --input data/patients.csv --output out/ --schema config.yaml

# Create example schema configuration
python -m verisynth.cli --create-schema-example config.yaml

🗂️ Schema Configuration

VeriSynth supports explicit field mapping and exclusion through YAML schema configuration files. This gives you fine-grained control over which fields to synthesize and how to handle data types.

Schema Configuration Format

exclude: ["patient_id", "address"]
types:
  age: int
  bmi: float
  smoker: bool
  hba1c: float
model:
  engine: GaussianCopula
  seed: 42

Configuration Options

  • exclude: List of field names to exclude from synthesis (e.g., IDs, addresses)
  • types: Explicit type mappings for fields (supports: int, float, bool, str)
  • model: Model configuration including engine and seed

Benefits

  • Privacy: Exclude sensitive identifiers and PII
  • Control: Explicit type handling instead of automatic detection
  • Reproducibility: Schema configuration is included in proof receipts
  • Validation: Built-in validation ensures configuration correctness

🔬 What's Under the Hood

VeriSynth uses a Gaussian Copula model to learn the joint distribution of all numeric and categorical variables. Instead of randomizing data, it captures real-world correlations (e.g., age ↔ blood pressure) and samples new records consistent with the original dataset.


🔒 Proof System

Each run produces a verifiable audit trail:

  • SHA-256 file hashes of input/output
  • Merkle roots linking dataset lineage
  • Model seed & parameters for deterministic replay
  • Privacy & fidelity metrics

🧩 This system provides verifiable lineage and reproducibility — a foundation for future zero-knowledge (ZK) verification.


🧠 Roadmap

  • Add differential privacy metrics (ε, δ)
  • Add support for CTGAN / TVAE models
  • Add signed receipts (Ed25519)
  • Add proof viewer

📜 License

MIT © VeriSynth.ai


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

verisynth_core-0.1.0.tar.gz (23.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

verisynth_core-0.1.0-py3-none-any.whl (13.9 kB view details)

Uploaded Python 3

File details

Details for the file verisynth_core-0.1.0.tar.gz.

File metadata

  • Download URL: verisynth_core-0.1.0.tar.gz
  • Upload date:
  • Size: 23.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for verisynth_core-0.1.0.tar.gz
Algorithm Hash digest
SHA256 feab174a6327a33e54eca0749d7df5447863cf947ef19320572b8df60e09cfe5
MD5 26fc871eb45057fabb1cadbd3ae96f49
BLAKE2b-256 b53d1e5a91542a47f5be59b0dc90368af8a705ba7bd7fd49548b6644f3b3a1e1

See more details on using hashes here.

File details

Details for the file verisynth_core-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: verisynth_core-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for verisynth_core-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b3ce7226ef55aa4bc8e80a8d8f9aed9934b108167796deb720c4ef5a93e10e95
MD5 b1c3121fac15c0e3fb727ab62387fa0d
BLAKE2b-256 bce417d3ed2f0036c73fa128eae405490f833faf0572a68c0fdcb69b9c63c7ed

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page