Python SDK for Synthetic Data REST API - remote synthetic data generation

Synthetic Data SDK

Python client library for the Synthetic Data REST API. Generate high-fidelity synthetic data remotely without running heavy ML models locally.

Requirements

  • A running Synthetic Data Server, which provides the API endpoint

Installation

# Using pip, from the root of a source checkout (editable install)
pip install -e .

# Using uv, after adding the SDK as a project dependency
uv sync

Features

  • Remote synthesis: No local GPU or heavy dependencies required
  • Multiple models: VineCopula, TabularGAN, TabDiff, SMOTE
  • Multi-table support: Preserve foreign key relationships
  • Privacy evaluation: Built-in privacy attack simulations (MIA, SARP, linkage attacks)
  • Causal fidelity evaluation: Validate treatment effects, decision consistency, fairness preservation
  • Pandas integration: Works seamlessly with DataFrames
  • Quality evaluation: Built-in metrics for fidelity assessment (KS test, correlation, TSTR/TRTR)
  • Model inspection: Access model metadata and configuration via summary()
  • Unified API: Consistent interface across all synthesizers
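
The KS test named in the quality metrics compares the distribution of one column in the real data against the same column in the synthetic data. The sketch below is a plain-Python illustration of the two-sample KS statistic, included only to show what the metric measures. The function name `ks_statistic` and the sample values are hypothetical; this README does not state how the server computes the metric.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs.

    Hypothetical helper for illustration; not part of the SDK.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")

    n_real, n_synth = len(real_sorted), len(synth_sorted)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n_real
        cdf_synth = bisect_right(synth_sorted, x) / n_synth
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Made-up sample values, for illustration only.
real_ages = [23, 35, 41, 52, 60]
identical = ks_statistic(real_ages, real_ages)       # same sample twice
disjoint = ks_statistic(real_ages, [70, 75, 80])     # no overlap at all
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # partial overlap
```

A value near 0 means the two columns are distributed alike; a value near 1 means they barely overlap.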

Quick Start

Synthetic Data Generation

from synthetic_data_sdk import RemoteVineCopula
import pandas as pd

# Load your data
data = pd.read_csv("customers.csv")

# Initialize remote synthesizer
synth = RemoteVineCopula(
    endpoint="http://localhost:8000",
    model_version="customer-synth-v1"
)

# Fit model (runs on remote server)
synth.fit(data)

# Generate synthetic data
synthetic_data = synth.transform(n=1000)

# Inspect model configuration
summary = synth.summary()
print(f"Fitted: {summary['fitted']}")
print(f"Continuous cols: {summary['n_continuous']}")

# Evaluate synthetic data quality
metrics = synth.evaluate(
    real_data=data,
    synthetic_data=synthetic_data,
    categorical_cols=['city'],
    target_col='churn',
    task_type='classification'
)
print(f"Correlation Error: {metrics['correlation_error']:.3f}")
print(f"TSTR Score: {metrics['tstr_score']:.3f}")
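
The `correlation_error` value printed above summarizes how far the pairwise correlations in the synthetic data drift from those in the real data. The exact formula used by the server is not given in this README. One common definition, assumed here, is the mean absolute difference between Pearson correlations over all column pairs. The names `pearson` and `correlation_error` and the sample columns below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_error(real, synthetic, columns):
    """Mean absolute gap between real and synthetic pairwise correlations.

    Assumed definition for illustration; the SDK's formula may differ.
    `real` and `synthetic` map column names to lists of numbers.
    """
    pairs = list(combinations(columns, 2))
    gaps = [
        abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in pairs
    ]
    return sum(gaps) / len(gaps)


# Made-up columns: the synthetic data flips the age/income relationship.
real_cols = {"age": [1, 2, 3, 4], "income": [2, 4, 6, 8]}
flipped_cols = {"age": [1, 2, 3, 4], "income": [8, 6, 4, 2]}

perfect = correlation_error(real_cols, real_cols, ["age", "income"])
worst = correlation_error(real_cols, flipped_cols, ["age", "income"])
```

With pandas DataFrames, `DataFrame.corr()` yields the same pairwise Pearson correlations without the manual loop.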

Privacy Risk Evaluation

from synthetic_data_sdk import PrivacyEvaluator
import pandas as pd

# Load datasets
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
synthetic_data = pd.read_csv("synthetic.csv")

# Initialize privacy evaluator
privacy = PrivacyEvaluator(endpoint="http://localhost:8000")

# Run privacy attack simulations
results = privacy.evaluate(
    train_real_data=train_data,
    test_real_data=test_data,
    synthetic_data=synthetic_data,
    sensitive_columns=['ssn', 'salary', 'diagnosis']
)

# Check results
print(f"Overall Privacy Risk: {results['overall_risk']}")
print(f"Successful Attacks: {results['summary']['successful_attacks']}/{results['summary']['total_attacks']}")

# Review individual attacks
for attack in results['attacks']:
    print(f"{attack['attack_type']}: {attack['risk_level']} risk")
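
The result keys used above (`overall_risk`, `summary`, `attacks`) lend themselves to a small post-processing step, for example to gate a release on the outcome. The following sketch assumes only those keys. The helper name `flag_risky_attacks`, the risk level strings, and the sample payload are hypothetical, since this README does not list the values the server returns.

```python
def flag_risky_attacks(results, flagged_levels=("high",)):
    """Condense a privacy evaluation result into a short report.

    Hypothetical helper; the risk level strings are assumptions.
    """
    flagged = [
        attack["attack_type"]
        for attack in results["attacks"]
        if str(attack["risk_level"]).lower() in flagged_levels
    ]
    summary = results["summary"]
    total = summary["total_attacks"]
    # Guard against an empty evaluation to avoid dividing by zero.
    success_rate = summary["successful_attacks"] / total if total else 0.0
    return {
        "overall_risk": results["overall_risk"],
        "success_rate": success_rate,
        "flagged": flagged,
    }


# Made-up payload mirroring the keys used in the example above.
sample_results = {
    "overall_risk": "medium",
    "summary": {"successful_attacks": 1, "total_attacks": 3},
    "attacks": [
        {"attack_type": "membership_inference", "risk_level": "high"},
        {"attack_type": "linkage", "risk_level": "low"},
        {"attack_type": "attribute_inference", "risk_level": "low"},
    ],
}

report = flag_risky_attacks(sample_results)
```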

Causal Fidelity Evaluation

from synthetic_data_sdk import CausalEvaluator
import pandas as pd

# Load datasets
real_data = pd.read_csv("real.csv")
synthetic_data = pd.read_csv("synthetic.csv")

# Initialize causal evaluator
causal = CausalEvaluator(endpoint="http://localhost:8000")

# Evaluate treatment effect preservation
results = causal.evaluate(
    real_data=real_data,
    synthetic_data=synthetic_data,
    treatment_col='treatment_assigned',
    outcome_col='outcome_value',
    covariates=['age', 'income'],
    evaluation_type='treatment_effect'
)

# Check results
print(f"Overall Preserved: {results['overall_preserved']}")
print(f"ATE Real: {results['evaluations'][0]['metrics']['ate_real']:.3f}")
print(f"ATE Synthetic: {results['evaluations'][0]['metrics']['ate_synth']:.3f}")
print(f"Relative Error: {results['evaluations'][0]['metrics']['ate_relative_error']:.1%}")
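
To make the printed metrics concrete, the sketch below computes an average treatment effect (ATE) as a naive difference in mean outcomes and then the relative error between the real and synthetic estimates. This is an assumption for illustration: the estimator the server uses is not described in this README, and it may adjust for the `covariates`, which this sketch ignores. The helper names and the rows are hypothetical.

```python
def naive_ate(rows, treatment_col, outcome_col):
    """Difference in mean outcome between treated (1) and control (0) rows."""
    treated = [row[outcome_col] for row in rows if row[treatment_col] == 1]
    control = [row[outcome_col] for row in rows if row[treatment_col] == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control row")
    return sum(treated) / len(treated) - sum(control) / len(control)


def ate_relative_error(ate_real, ate_synth):
    """Absolute gap between the two estimates, relative to the real one."""
    if ate_real == 0:
        raise ValueError("relative error is undefined when the real ATE is 0")
    return abs(ate_synth - ate_real) / abs(ate_real)


# Made-up rows using the column names from the example above.
real_rows = [
    {"treatment_assigned": 1, "outcome_value": 10},
    {"treatment_assigned": 1, "outcome_value": 12},
    {"treatment_assigned": 0, "outcome_value": 6},
    {"treatment_assigned": 0, "outcome_value": 8},
]
synth_rows = [
    {"treatment_assigned": 1, "outcome_value": 9},
    {"treatment_assigned": 1, "outcome_value": 11},
    {"treatment_assigned": 0, "outcome_value": 7},
    {"treatment_assigned": 0, "outcome_value": 7},
]

ate_real = naive_ate(real_rows, "treatment_assigned", "outcome_value")
ate_synth = naive_ate(synth_rows, "treatment_assigned", "outcome_value")
rel_error = ate_relative_error(ate_real, ate_synth)
```

Here the real ATE is 4.0, the synthetic ATE is 3.0, and the relative error is 0.25, which the `:.1%` format in the example above would print as 25.0%.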

Available Synthesizers

| Class | Model Type | Best For | Methods |
|---|---|---|---|
| RemoteVineCopula | Vine Copula | General tabular data | fit, transform, fit_transform, summary, evaluate |
| RemoteMultiTableVineCopula | Multi-table Copula | Relational databases | fit, transform, fit_transform, summary, evaluate, validate_relationships, relational_score, get_table_order |
| RemoteTabularGAN | GAN | Complex distributions | fit, transform, fit_transform, summary, evaluate |
| RemoteTabDiff | Diffusion Model | High-fidelity synthesis | fit, transform, fit_transform, summary, evaluate |
| RemoteSMOTE | Oversampling | Imbalanced datasets | fit, transform, fit_transform, summary, evaluate |
| PrivacyEvaluator | Privacy Attacks | Privacy risk assessment | evaluate |
| CausalEvaluator | Causal Fidelity | Treatment effects, fairness | evaluate |
| CertificationClient | Quality Grading | A+ to F scoring | certify |
| SynthesisClient | Low-level HTTP client | Direct API interaction | request, get_job_status |
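
Because the synthesizer classes listed above share `fit` and `transform`, calling code can be written once against that shape and reused across models. The sketch below expresses this with a `typing.Protocol`. The `Synthesizer` protocol, the `fit_and_sample` helper, and the `EchoSynthesizer` stand-in are all hypothetical and not part of the SDK; the stand-in exists only so the example runs without a server.

```python
from typing import Any, Protocol


class Synthesizer(Protocol):
    """Structural type for anything exposing fit() and transform(n=...)."""

    def fit(self, data: Any) -> Any: ...

    def transform(self, n: int) -> Any: ...


def fit_and_sample(synth: Synthesizer, data: Any, n: int) -> Any:
    """Fit any conforming synthesizer, then draw n synthetic rows."""
    synth.fit(data)
    return synth.transform(n=n)


class EchoSynthesizer:
    """Hypothetical local stand-in that repeats the fitted rows in order."""

    def __init__(self) -> None:
        self._rows: list = []

    def fit(self, data: Any) -> None:
        self._rows = list(data)

    def transform(self, n: int) -> list:
        # Cycle through the stored rows until n rows have been produced.
        return [self._rows[i % len(self._rows)] for i in range(n)]


sample = fit_and_sample(EchoSynthesizer(), ["a", "b"], n=5)
```

Under this assumption, swapping in `RemoteVineCopula` or `RemoteTabularGAN` would only change the object passed as `synth`, not the calling code.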

Documentation

The API reference is available via docstrings; see also the online documentation.

License

MIT License
