Python SDK for Synthetic Data REST API - remote synthetic data generation
Synthetic Data SDK
Python client library for the Synthetic Data REST API. Generate high-fidelity synthetic data remotely without running heavy ML models locally.
Requirements
- A running Synthetic Data Server that exposes the API endpoint
Installation
# Using pip (editable install from a local checkout of the SDK)
pip install -e .
# Using uv (after adding the SDK as a dependency of your project)
uv sync
Features
- Remote synthesis: No local GPU or heavy dependencies required
- Multiple models: VineCopula, TabularGAN, TabDiff, SMOTE
- Multi-table support: Preserve foreign key relationships
- Privacy evaluation: Built-in privacy attack simulations (MIA, SARP, linkage attacks)
- Causal fidelity evaluation: Validate treatment effects, decision consistency, fairness preservation
- Pandas integration: Works seamlessly with DataFrames
- Quality evaluation: Built-in metrics for fidelity assessment (KS test, correlation, TSTR/TRTR)
- Model inspection: Access model metadata and configuration via `summary()`
- Unified API: Consistent interface across all synthesizers
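To make "preserve foreign key relationships" concrete, here is a minimal local sketch of a referential-integrity check on generated tables. It is not the SDK's `validate_relationships` implementation; the helper `orphaned_rows` and the table and column names are hypothetical, and rows are plain dictionaries to keep the example dependency-free.

```python
# Illustrative only: a local referential-integrity check, not the SDK's
# validate_relationships. All table and column names here are made up.

def orphaned_rows(parent_rows, child_rows, parent_key, foreign_key):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # no such customer: an orphan
]

orphans = orphaned_rows(customers, orders, "customer_id", "customer_id")
print(f"Orphaned orders: {[row['order_id'] for row in orphans]}")
```

A synthetic multi-table dataset that preserves its relationships should produce an empty list from a check like this.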
Quick Start
Synthetic Data Generation
from synthetic_data_sdk import RemoteVineCopula
import pandas as pd
# Load your data
data = pd.read_csv("customers.csv")
# Initialize remote synthesizer
synth = RemoteVineCopula(
    endpoint="http://localhost:8000",
    model_version="customer-synth-v1"
)
# Fit model (runs on remote server)
synth.fit(data)
# Generate synthetic data
synthetic_data = synth.transform(n=1000)
# Inspect model configuration
summary = synth.summary()
print(f"Fitted: {summary['fitted']}")
print(f"Continuous cols: {summary['n_continuous']}")
# Evaluate synthetic data quality
metrics = synth.evaluate(
    real_data=data,
    synthetic_data=synthetic_data,
    categorical_cols=['city'],
    target_col='churn',
    task_type='classification'
)
print(f"Correlation Error: {metrics['correlation_error']:.3f}")
print(f"TSTR Score: {metrics['tstr_score']:.3f}")
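For intuition about the KS test listed among the quality metrics, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for one numeric column in pure Python. It only illustrates the idea (the largest gap between two empirical CDFs); how the server computes and reports its metrics is not specified here, and `ks_statistic` is a hypothetical helper, not an SDK function.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # Both CDFs are step functions, so the gap only changes at observed values.
    for value in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Made-up values standing in for one column of real and synthetic data.
real_ages = [1, 2, 3, 4]
synthetic_ages = [3, 4, 5, 6]
print(f"KS statistic: {ks_statistic(real_ages, synthetic_ages):.2f}")
```

A value near 0 means the two columns have similar distributions; a value near 1 means they barely overlap.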
Privacy Risk Evaluation
from synthetic_data_sdk import PrivacyEvaluator
import pandas as pd
# Load datasets
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
synthetic_data = pd.read_csv("synthetic.csv")
# Initialize privacy evaluator
privacy = PrivacyEvaluator(endpoint="http://localhost:8000")
# Run privacy attack simulations
results = privacy.evaluate(
    train_real_data=train_data,
    test_real_data=test_data,
    synthetic_data=synthetic_data,
    sensitive_columns=['ssn', 'salary', 'diagnosis']
)
# Check results
print(f"Overall Privacy Risk: {results['overall_risk']}")
print(f"Successful Attacks: {results['summary']['successful_attacks']}/{results['summary']['total_attacks']}")
# Review individual attacks
for attack in results['attacks']:
    print(f"{attack['attack_type']}: {attack['risk_level']} risk")
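One way to act on these results is to flag attacks at or above a chosen risk level before releasing a dataset. The sketch below runs on a hand-written dictionary that mirrors only the keys used in the example above. The attack names, the `low`/`medium`/`high` labels and their ordering, and the helper `attacks_at_or_above` are assumptions for illustration; check the actual values returned by your server.

```python
# Hypothetical response, shaped like the fields printed in the example above.
# Attack names and the "low"/"medium"/"high" labels are assumptions.
results = {
    "overall_risk": "medium",
    "summary": {"successful_attacks": 1, "total_attacks": 3},
    "attacks": [
        {"attack_type": "membership_inference", "risk_level": "low"},
        {"attack_type": "linkage", "risk_level": "high"},
        {"attack_type": "attribute_inference", "risk_level": "low"},
    ],
}


def attacks_at_or_above(results, threshold, order=("low", "medium", "high")):
    """Return attack types whose risk level meets or exceeds the threshold."""
    minimum = order.index(threshold)
    return [
        attack["attack_type"]
        for attack in results["attacks"]
        if order.index(attack["risk_level"]) >= minimum
    ]


flagged = attacks_at_or_above(results, "medium")
if flagged:
    print(f"Review before release: {', '.join(flagged)}")
```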
Causal Fidelity Evaluation
from synthetic_data_sdk import CausalEvaluator
import pandas as pd
# Load datasets
real_data = pd.read_csv("real.csv")
synthetic_data = pd.read_csv("synthetic.csv")
# Initialize causal evaluator
causal = CausalEvaluator(endpoint="http://localhost:8000")
# Evaluate treatment effect preservation
results = causal.evaluate(
    real_data=real_data,
    synthetic_data=synthetic_data,
    treatment_col='treatment_assigned',
    outcome_col='outcome_value',
    covariates=['age', 'income'],
    evaluation_type='treatment_effect'
)
# Check results
print(f"Overall Preserved: {results['overall_preserved']}")
print(f"ATE Real: {results['evaluations'][0]['metrics']['ate_real']:.3f}")
print(f"ATE Synthetic: {results['evaluations'][0]['metrics']['ate_synth']:.3f}")
print(f"Relative Error: {results['evaluations'][0]['metrics']['ate_relative_error']:.1%}")
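To show what an ATE comparison measures, the sketch below computes a naive average treatment effect (difference in mean outcome between treated and control rows) on real and synthetic data, then their relative error. This is a simplification: the server may adjust for the covariates or use a different estimator. The helpers `naive_ate` and `relative_error`, the 0/1 treatment coding, and the data values are assumptions.

```python
from statistics import mean


def naive_ate(rows, treatment_col, outcome_col):
    """Difference in mean outcome between treated (1) and control (0) rows."""
    treated = [r[outcome_col] for r in rows if r[treatment_col] == 1]
    control = [r[outcome_col] for r in rows if r[treatment_col] == 0]
    return mean(treated) - mean(control)


def relative_error(real_value, synth_value):
    """Absolute difference as a fraction of the real value."""
    if real_value == 0:
        raise ValueError("relative error is undefined when the real ATE is 0")
    return abs(synth_value - real_value) / abs(real_value)


# Made-up rows using the column names from the example above.
real_rows = [
    {"treatment_assigned": 1, "outcome_value": 12.0},
    {"treatment_assigned": 1, "outcome_value": 14.0},
    {"treatment_assigned": 0, "outcome_value": 8.0},
    {"treatment_assigned": 0, "outcome_value": 10.0},
]
synthetic_rows = [
    {"treatment_assigned": 1, "outcome_value": 12.5},
    {"treatment_assigned": 1, "outcome_value": 14.5},
    {"treatment_assigned": 0, "outcome_value": 9.0},
    {"treatment_assigned": 0, "outcome_value": 11.0},
]

ate_real = naive_ate(real_rows, "treatment_assigned", "outcome_value")
ate_synth = naive_ate(synthetic_rows, "treatment_assigned", "outcome_value")
print(f"ATE Real: {ate_real:.3f}")
print(f"ATE Synthetic: {ate_synth:.3f}")
print(f"Relative Error: {relative_error(ate_real, ate_synth):.1%}")
```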
Available Synthesizers
| Class | Model Type | Best For | Methods |
|---|---|---|---|
| `RemoteVineCopula` | Vine Copula | General tabular data | fit, transform, fit_transform, summary, evaluate |
| `RemoteMultiTableVineCopula` | Multi-table Copula | Relational databases | fit, transform, fit_transform, summary, evaluate, validate_relationships, relational_score, get_table_order |
| `RemoteTabularGAN` | GAN | Complex distributions | fit, transform, fit_transform, summary, evaluate |
| `RemoteTabDiff` | Diffusion Model | High-fidelity synthesis | fit, transform, fit_transform, summary, evaluate |
| `RemoteSMOTE` | Oversampling | Imbalanced datasets | fit, transform, fit_transform, summary, evaluate |
| `PrivacyEvaluator` | Privacy Attacks | Privacy risk assessment | evaluate |
| `CausalEvaluator` | Causal Fidelity | Treatment effects, fairness | evaluate |
| `CertificationClient` | Quality Grading | A+ to F scoring | certify |
| `SynthesisClient` | Low-level HTTP client | Direct API interaction | request, get_job_status |
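Because the synthesizer classes share `fit`, `transform`, and `summary`, calling code can be written once against that shape and reused across models. The sketch below illustrates the pattern with a `typing.Protocol` and a local stand-in class. `Synthesizer`, `BootstrapSynth`, and `generate` are hypothetical names that are not part of the SDK, and the stand-in simply resamples rows rather than calling a server.

```python
import random
from typing import Protocol


class Synthesizer(Protocol):
    """The shared method shape assumed by the helper below."""

    def fit(self, data): ...
    def transform(self, n): ...
    def summary(self): ...


class BootstrapSynth:
    """Local stand-in that resamples rows; it is NOT part of the SDK."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._rows = []

    def fit(self, data):
        self._rows = list(data)
        return self

    def transform(self, n):
        return [self._rng.choice(self._rows) for _ in range(n)]

    def summary(self):
        return {"fitted": bool(self._rows)}


def generate(synth: Synthesizer, data, n):
    """Fit any object with the shared interface, then draw n rows from it."""
    synth.fit(data)
    return synth.transform(n=n)


rows = [{"city": "Oslo"}, {"city": "Lima"}]
sample = generate(BootstrapSynth(), rows, n=5)
print(f"Generated {len(sample)} rows")
```

With the SDK installed, a helper like `generate` could presumably be handed any of the `Remote*` synthesizers instead of the stand-in, since they expose the same method names.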
Documentation
The API reference is available through the docstrings. See also the Online Documentation.
Support
- Issues: Report bugs and request features via the issue tracker
- Email: info@protegrity.com
License
MIT License