Research-grade synthetic data for finance & econometrics

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Nitya-hapani

These details have not been verified by PyPI

Project description

syndatakit

A synthetic data generator for finance & econometrics.

pip install syndatakit
syndatakit generate fred_macro --rows 1000000 --output training_data.csv

One command. One million synthetic macroeconomic observations. Statistically realistic. No real individuals. No data use agreement.

What syndatakit is

syndatakit is a synthetic data generator — not a collection of stored datasets.

When you run syndatakit generate hmda --rows 1000000, it creates one million brand new synthetic mortgage application records that never existed. They are statistically similar to real HMDA data but contain zero real people. The package does not store or ship those rows — it generates them on demand, every time, in seconds.

The package ships with 10 built-in dataset profiles — statistical models learned from published government aggregate statistics (CFPB reports, Federal Reserve bulletins, BLS data releases, etc.). Each profile captures the distributions, correlations, and structure of a real public dataset without containing any actual records from it.

You can also bring your own data:

import pandas as pd
from syndatakit.generators import GaussianCopulaGenerator

# Load your own real data
real = pd.read_csv("my_loan_book.csv")

# Fit the generator on it
gen = GaussianCopulaGenerator()
gen.fit(real)

# Generate unlimited synthetic versions
synthetic = gen.sample(100_000)
synthetic.to_csv("synthetic_loan_book.csv", index=False)

A bank feeds in their actual loan portfolio, fits the generator, and produces synthetic training data that matches their specific book — not a generic average. The 10 built-in profiles are the out-of-the-box experience so you can start immediately without any data of your own.

Why syndatakit

Training ML models in finance requires realistic data. But real data is locked behind NDAs, data use agreements, and privacy regulations.

syndatakit learns the statistical structure of financial datasets — distributions, correlations, temporal dynamics, stylized facts — and generates unlimited synthetic records that preserve that structure without exposing any real individuals.

No data use agreements. No PII. No legal review.

Install

pip install syndatakit                # core — Copula + VAR + Panel generators
pip install syndatakit[api]           # + Flask REST API server
pip install syndatakit[io]            # + Parquet, Arrow, Stata, SAS, Excel
pip install syndatakit[deep]          # + CTGAN deep generator (requires PyTorch)
pip install syndatakit[all]           # everything

Requires Python 3.9+. Core dependencies: pandas, numpy, scipy. No GPU required.

Quick start

# See all 10 built-in dataset profiles
syndatakit list

# Generate 5,000 synthetic mortgage applications
syndatakit generate hmda --rows 5000 --output mortgages.csv

# Generate with filters — high-DTI denied applicants in CA/TX
syndatakit generate hmda \
  --rows 1000 \
  --filter state:CA,TX \
  --filter dti_min:45 \
  --filter action_taken:2,3,7 \
  --output highrisk.csv

# Generate macro data under a recession scenario
syndatakit generate fred_macro \
  --rows 2000 \
  --scenario recession \
  --intensity 0.8 \
  --output recession_macro.csv

# Full fidelity report — how well does the synthetic data match the real distribution?
syndatakit evaluate real.csv synthetic.csv --type time_series

# Full privacy audit — is there any risk of re-identification?
syndatakit audit real.csv synthetic.csv

# Validate your own data before fitting
syndatakit validate my_data.csv

# Start REST API
syndatakit serve --port 8080

Python API

from syndatakit.generators import GaussianCopulaGenerator
from syndatakit.generators.time_series import VARGenerator
from syndatakit.catalog import load_seed
from syndatakit.fidelity import fidelity_report
from syndatakit.privacy import privacy_audit
from syndatakit.calibration import apply_scenario

# ── Using a built-in profile ──────────────────────────────────────────────────

gen = GaussianCopulaGenerator()
gen.fit(load_seed("hmda"))

df = gen.sample(10_000)
df_highrisk = gen.sample(
    1000,
    filters={"state": ["CA", "TX"], "dti_min": 45, "action_taken": ["2", "3"]},
)

# ── Using your own data ───────────────────────────────────────────────────────

import pandas as pd
real = pd.read_csv("my_data.csv")

gen = GaussianCopulaGenerator()
gen.fit(real)
synthetic = gen.sample(100_000)

# ── Time series ───────────────────────────────────────────────────────────────

gen_ts = VARGenerator(lags=2, time_col="year")
gen_ts.fit(load_seed("fred_macro"))
df_macro = gen_ts.sample(500)

# Conditioned on recession scenario
df_recession = apply_scenario(df_macro, "recession", intensity=0.9)

# ── Fidelity report ───────────────────────────────────────────────────────────

real = load_seed("fred_macro")
report = fidelity_report(real, df_macro, dataset_type="time_series")
print(f"Overall fidelity: {report['summary']['overall_fidelity']}%")
print(f"Stationarity:     {report['temporal']['stationarity']['_summary']['agreement_rate']}%")
print(f"Cointegration:    {report['temporal']['cointegration']['_summary']['agreement_rate']}%")

# ── Privacy audit ─────────────────────────────────────────────────────────────

audit = privacy_audit(real, df_macro, n_attacks=500)
print(f"Risk level:       {audit['verdict']['overall_risk']}")
print(f"Recommendation:   {audit['verdict']['recommendation']}")

Built-in dataset profiles

These 10 profiles are included so you can generate synthetic data immediately, without any source data of your own. Each profile was built from published government aggregate statistics — not from individual records.

ID	Name	Vertical	Columns	Source
`hmda`	HMDA Mortgage Applications	Credit & Lending	7	CFPB 2022
`fdic`	FDIC Bank Call Reports	Credit & Lending	12	FDIC SDI 2023
`credit_risk`	Consumer Credit Risk	Credit & Lending	10	CFPB derived
`edgar`	SEC EDGAR Financial Statements	Capital Markets	13	SEC XBRL 2023
`cftc`	CFTC Commitments of Traders	Capital Markets	10	CFTC COT 2023
`fred_macro`	FRED Macroeconomic Indicators	Macro & Central Bank	15	Federal Reserve
`bls`	BLS Employment & Wages	Macro & Central Bank	9	BLS QCEW 2022
`world_bank`	World Bank Development Indicators	Macro & Central Bank	12	WDI 2022
`irs_soi`	IRS Statistics of Income	Tax & Income	11	IRS SOI 2021
`census_acs`	Census ACS Income & Housing	Tax & Income	11	Census ACS 2022

You are not limited to these 10. Any tabular dataset can be used with gen.fit(your_df).

Generators

Generator	Best for	Auto-selected for
`GaussianCopulaGenerator`	Cross-sectional tabular data	hmda, fdic, credit_risk, edgar, cftc, irs_soi, census_acs
`VARGenerator`	Multivariate time series	fred_macro, bls
`FixedEffectsGenerator`	Panel data (entity × time)	world_bank, fdic
`CTGANGenerator`	Complex non-linear relationships	any (`pip install syndatakit[deep]`)

The CLI auto-selects the right generator for each built-in profile. When using your own data, pick the one that matches your data type.

Fidelity metrics

Every generated dataset can be evaluated across six dimensions:

Marginal — per-column distributional similarity (KS test for numeric, TVD for categorical)

Joint — Spearman correlation matrix distance between real and synthetic

Temporal — stationarity (ADF), cointegration (Engle-Granger), structural breaks (Chow scan), Granger causality

Stylized facts — fat tails, skewness sign, autocorrelation structure, ARCH effects

Downstream (TSTR) — train a model on synthetic data, test on real held-out data, compare to training on real data

Privacy — exact copy check, membership inference, singling-out risk, linkability risk

Privacy

from syndatakit.privacy import privacy_audit, format_audit
from syndatakit.privacy.dp import PrivacyBudget, laplace_mechanism

# Full privacy audit
audit = privacy_audit(real_df, synthetic_df, n_attacks=500)
print(format_audit(audit))

# Differential privacy — add calibrated noise to statistics
budget = PrivacyBudget(epsilon=1.0)
noisy_mean = laplace_mechanism(
    real_df["loan_amount"].mean(),
    sensitivity=1_000_000,
    epsilon=0.5,
    budget=budget,
)

Calibration

from syndatakit.calibration import apply_scenario, match_moments
from syndatakit.calibration.priors import get_priors

# 5 built-in economic scenarios
df_recession      = apply_scenario(df, "recession",        intensity=0.9)
df_severe         = apply_scenario(df, "severe_recession",  intensity=1.0)
df_rate_shock     = apply_scenario(df, "rate_shock",        intensity=1.0)
df_credit_crisis  = apply_scenario(df, "credit_crisis",     intensity=0.8)
df_expansion      = apply_scenario(df, "expansion",         intensity=0.5)

# Post-hoc moment calibration
calibrated = match_moments(real_df, synthetic_df)

# Bayesian priors — stabilises generators on small datasets (< 500 rows)
priors = get_priors("hmda")
gen = GaussianCopulaGenerator(priors=priors)
gen.fit(small_df)   # works well even with 50 rows

REST API

syndatakit serve --port 8080
# Interactive docs: http://localhost:8080/docs

import requests

# Generate synthetic data
r = requests.post("http://localhost:8080/generate", json={
    "dataset":   "fred_macro",
    "rows":      2000,
    "scenario":  "recession",
    "intensity": 0.9,
})
df = pd.DataFrame(r.json()["data"])

# Fidelity evaluation
r = requests.post("http://localhost:8080/evaluate",
    files={"real": open("real.csv"), "synthetic": open("syn.csv")},
    data={"type": "time_series"},
)

# Privacy audit
r = requests.post("http://localhost:8080/audit",
    files={"real": open("real.csv"), "synthetic": open("syn.csv")},
    data={"attacks": "500"},
)

All endpoints: GET /datasets · GET /datasets/{id} · GET /datasets/{id}/sample · POST /generate · POST /evaluate · POST /audit · GET /scenarios · POST /scenario/apply · POST /validate · GET /health

IO formats

from syndatakit.io import write, read

write(df, "output.csv")        # CSV
write(df, "output.parquet")    # Parquet  (pip install syndatakit[io])
write(df, "output.dta")        # Stata
write(df, "output.feather")    # Apache Arrow
write(df, "output.xlsx")       # Excel
write(df, "output.json")       # JSON

df = read("my_data.dta")       # reads Stata, SAS, CSV, Parquet, Arrow, JSON, Excel

Architecture

syndatakit/
├── generators/
│   ├── base.py                  BaseGenerator ABC — fit(), sample(), fit_sample()
│   ├── cross_sectional/         GaussianCopulaGenerator
│   ├── time_series/             VARGenerator
│   ├── panel/                   FixedEffectsGenerator
│   └── deep/                    CTGANGenerator (pip install syndatakit[deep])
├── fidelity/
│   ├── marginal.py              KS + TVD per-column scores
│   ├── joint.py                 Correlation matrix distance
│   ├── temporal/                Stationarity, cointegration, breaks, causality
│   ├── stylized_facts.py        Fat tails, ARCH, autocorrelation
│   ├── downstream.py            TSTR train/test evaluation
│   └── report.py                Unified report assembler
├── calibration/
│   ├── moment_matching.py       Post-hoc moment calibration
│   ├── priors.py                Bayesian priors + MAP blending
│   └── scenario.py              Economic scenario calibration
├── privacy/
│   ├── dp.py                    Laplace + Gaussian mechanisms, budget tracker
│   ├── disclosure.py            Membership inference attack
│   ├── singling_out.py          Quasi-identifier singling-out attack
│   ├── linkability.py           Nearest-neighbour linkability attack
│   └── audit.py                 Full privacy audit — runs all four tests
├── catalog/
│   └── loader.py                10 built-in dataset profiles + seed builders
└── io/
    ├── formats.py               CSV, Parquet, Arrow, Stata, SAS, Excel, JSON
    └── validators.py            Schema validation before fitting

Contributing

Adding a new dataset profile — implement _build_<id>() in catalog/loader.py using published aggregate statistics (not individual records), add a PriorSet in calibration/priors.py, and add tests. See CONTRIBUTING.md for the full walkthrough.

Adding a new generator — subclass BaseGenerator, implement fit() and sample(), export from the relevant __init__.py. The interface is minimal by design.

License

Business Source License 1.1 (BSL-1.1). Free to use for any purpose including commercial, except offering syndatakit itself as a hosted managed service without a commercial agreement. Self-hosting always free.

Cloud version

Need higher row counts, managed infrastructure, compliance documentation, or team access? A hosted cloud version is coming. Join the waitlist →

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Nitya-hapani

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

2.0.1

Mar 14, 2026

2.0.0

Mar 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

syndatakit-2.0.1.tar.gz (92.4 kB view details)

Uploaded Mar 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

syndatakit-2.0.1-py3-none-any.whl (98.2 kB view details)

Uploaded Mar 14, 2026 Python 3

File details

Details for the file syndatakit-2.0.1.tar.gz.

File metadata

Download URL: syndatakit-2.0.1.tar.gz
Upload date: Mar 14, 2026
Size: 92.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for syndatakit-2.0.1.tar.gz
Algorithm	Hash digest
SHA256	`ccd57968d717a6716ec18157dfcff5b38f03abf4c2415a930e9fec75c8169184`
MD5	`4970cfb45f6dded549e909b3bca39a79`
BLAKE2b-256	`15b9211f370235130cf1eb4aa2340e172606ab10e4ede00751f8795a128d286a`

See more details on using hashes here.

Provenance

The following attestation bundles were made for syndatakit-2.0.1.tar.gz:

Publisher: publish.yml on Nityahapani/syndatakit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: syndatakit-2.0.1.tar.gz
- Subject digest: ccd57968d717a6716ec18157dfcff5b38f03abf4c2415a930e9fec75c8169184
- Sigstore transparency entry: 1102547260
- Sigstore integration time: Mar 14, 2026
Source repository:
- Permalink: Nityahapani/syndatakit@7665398c911ee3082404a8a436ecb81e0774f0f5
- Branch / Tag: refs/tags/v2.0.1
- Owner: https://github.com/Nityahapani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7665398c911ee3082404a8a436ecb81e0774f0f5
- Trigger Event: push

File details

Details for the file syndatakit-2.0.1-py3-none-any.whl.

File metadata

Download URL: syndatakit-2.0.1-py3-none-any.whl
Upload date: Mar 14, 2026
Size: 98.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for syndatakit-2.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d413e6532c4cfc94019803821dbae1b681dce305428411251ffe660311439d67`
MD5	`e818411487449203a5f2f06a70c0ba6c`
BLAKE2b-256	`89d19143332c60c5ccff14fa4676b4ce98c18fd67492aa40dcdf00366b18425a`

See more details on using hashes here.

Provenance

The following attestation bundles were made for syndatakit-2.0.1-py3-none-any.whl:

Publisher: publish.yml on Nityahapani/syndatakit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: syndatakit-2.0.1-py3-none-any.whl
- Subject digest: d413e6532c4cfc94019803821dbae1b681dce305428411251ffe660311439d67
- Sigstore transparency entry: 1102547281
- Sigstore integration time: Mar 14, 2026
Source repository:
- Permalink: Nityahapani/syndatakit@7665398c911ee3082404a8a436ecb81e0774f0f5
- Branch / Tag: refs/tags/v2.0.1
- Owner: https://github.com/Nityahapani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7665398c911ee3082404a8a436ecb81e0774f0f5
- Trigger Event: push

syndatakit 2.0.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

syndatakit

What syndatakit is

Why syndatakit

Install

Quick start

Python API

Built-in dataset profiles

Generators

Fidelity metrics

Privacy

Calibration

REST API

IO formats

Architecture

Contributing

License

Cloud version

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance