# plotsim
Generate multi-table synthetic datasets with behavioral trajectories, correlations, and causal lags. Config-driven, deterministic, no real data required.
Most synthetic data tools generate columns independently. A customer's revenue is random. Their engagement is random. Their churn is random. The numbers fill a schema, but they don't behave like real data — because in real data, these things move together.
plotsim generates multi-table relational datasets where every metric tells the same story. Each simulated entity follows a behavioral trajectory — a mathematical curve that evolves over time. Revenue, engagement, churn risk, and support tickets all derive from the same trajectory position. When engagement rises, revenue follows. When it declines, churn events fire. Across every table, every foreign key, every time period.
The result is synthetic test data with shape — not just structure.
```bash
pip install plotsim
```
## Quick start

Generate a synthetic dataset from a bundled template:

```bash
plotsim template saas -o config.yaml
plotsim run config.yaml -o ./output --validate
```
Or from Python:

```python
from plotsim import load_config, generate_tables, write_tables

config = load_config("config.yaml")
tables = generate_tables(config)
write_tables(tables, config)
```
A single config produces a complete star schema:

```text
output/
├── dim_date.csv              # complete date spine
├── dim_company.csv           # entity attributes
├── dim_user.csv              # sub-entity attributes
├── dim_plan.csv              # reference lookup
├── fct_engagement.csv        # entity × period metrics
├── fct_revenue.csv           # entity × period metrics
├── fct_support_tickets.csv   # entity × period metrics
├── evt_login.csv             # behavioral events
├── evt_churn.csv             # threshold-triggered events
└── validation_report.txt     # integrity checks
```
If a company's engagement trajectory declines, its login events decrease in `evt_login.csv` and churn events appear in `evt_churn.csv` — because both tables read from the same underlying trajectory, not from separate random generators.
## What makes plotsim different
**Trajectory-driven generation.** Each entity is assigned an archetype — a curve built from segments like sigmoid, exponential decay, plateau, or oscillation. At every time step, the engine reads the entity's position on that curve (a value between 0 and 1) and derives all metrics from it. Positive-polarity metrics rise when the trajectory rises. Negative-polarity metrics fall.
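As an illustrative sketch — not plotsim's actual internals; the function names and parameters here are hypothetical — a sigmoid archetype and polarity-based metric derivation might look like:

```python
import numpy as np

def sigmoid_trajectory(n_periods: int, midpoint: float = 0.5,
                       steepness: float = 10.0) -> np.ndarray:
    """Position on a sigmoid curve, one value in [0, 1] per time period."""
    t = np.linspace(0.0, 1.0, n_periods)
    return 1.0 / (1.0 + np.exp(-steepness * (t - midpoint)))

def derive_metric(position: np.ndarray, base: float, spread: float,
                  polarity: int) -> np.ndarray:
    """Center a metric on the trajectory position; polarity -1 inverts it."""
    p = position if polarity > 0 else 1.0 - position
    return base + spread * p

pos = sigmoid_trajectory(12)
engagement = derive_metric(pos, base=20.0, spread=80.0, polarity=+1)  # rises
churn_risk = derive_metric(pos, base=0.05, spread=0.40, polarity=-1)  # falls
```

Because both metrics read the same `pos` array, they move together by construction — the property the paragraph above describes.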
**Cross-metric correlation.** Configure the correlation strength between any pair of metrics. plotsim uses a Gaussian copula to inject the configured correlation, regardless of the underlying distribution pairing. Set engagement and revenue to covary at r=0.8 while support tickets move inversely at r=-0.5, and observe those values in the output within a measured tolerance (±0.10 for most distribution pairings; see statistical fidelity for the per-pair numbers).
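The copula idea can be sketched in a few lines of NumPy (illustrative only, not plotsim's implementation): correlate standard normals via a Cholesky factor, then push them through monotone transforms to non-normal marginals — the rank correlation survives the transform.

```python
import numpy as np

rng = np.random.default_rng(42)
target_r = 0.8

# Correlated standard normals via the Cholesky factor of the target matrix.
corr = np.array([[1.0, target_r], [target_r, 1.0]])
z = rng.standard_normal((20_000, 2)) @ np.linalg.cholesky(corr).T

# Monotone transforms give non-normal marginals without destroying the
# dependence structure: both columns below are exactly lognormal.
engagement = np.exp(3.0 + 0.5 * z[:, 0])
revenue    = np.exp(7.0 + 1.0 * z[:, 1])

def rank_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman-style correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(a.argsort().argsort(), b.argsort().argsort())[0, 1])
```

The Pearson correlation of the transformed columns drifts from 0.8 (lognormal marginals compress it), which is exactly why a copula-based engine has to work in normal space and why the README quotes a per-pair tolerance.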
**Causal lag with composable chains.** One metric can trail another by N periods, blended at a configurable weight against the metric's own trajectory. Configure engagement to drive revenue with a 3-period lag and the engine implements that shift faithfully at the metric-generator level. Small lags (1–2) are also recoverable in output-level cross-correlation; larger lags on smooth-archetype drivers require non-cross-correlation detection methods (see statistical fidelity). Lags compose through chains: if A drives B with lag 2 and B drives C with lag 3, C reflects A's signal at lag 5.
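The composition claim is easy to check numerically. A toy sketch with full blend weight (so each driven metric is a pure shifted copy of its driver; `lagged` is a hypothetical helper, not plotsim API):

```python
import numpy as np

def lagged(driver: np.ndarray, lag: int) -> np.ndarray:
    """Shift a series right by `lag` periods, back-filling the start."""
    out = np.empty_like(driver)
    out[:lag] = driver[0]
    out[lag:] = driver[:-lag]
    return out

a = np.sin(np.linspace(0.0, 6.0, 50))   # driver metric A
b = lagged(a, 2)                        # A drives B with lag 2
c = lagged(b, 3)                        # B drives C with lag 3
# After the warm-up periods, C reproduces A's signal at lag 2 + 3 = 5.
```

With a blend weight below 1.0 the shifted driver would be mixed with the metric's own trajectory, attenuating (but not relocating) the composed lag.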
**Star schema output.** plotsim generates dimensional models — date dimensions, entity dimensions, fact tables, event tables — with referential integrity enforced. Every foreign key resolves. Zero orphans.
**Deterministic output.** Same config + same seed = byte-identical CSVs. Always. The seeded NumPy random state flows through every layer of generation.
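This is the same contract NumPy's seeded generator provides; a generic sketch (not plotsim's code path) of the seed-in, identical-output-out guarantee:

```python
import numpy as np

def generate(seed: int) -> np.ndarray:
    # All randomness flows from one seeded generator state.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=100.0, scale=15.0, size=1000)

run_a = generate(seed=7)
run_b = generate(seed=7)   # same seed: bit-for-bit identical array
run_c = generate(seed=8)   # different seed: different data
```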
**Config-time validation.** Pydantic V2 cross-validates your entire config before generation starts. Circular causal dependencies, non-positive-semi-definite correlation matrices, broken FK references, empty entity lists — all caught at parse time with clear error messages.
**Six distribution families.** Normal, lognormal, beta, Poisson, gamma, and Weibull — each configurable per metric. The engine samples from the distribution you specify and preserves marginal fidelity through the correlation injection.
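All six families map directly onto NumPy's `Generator` API; a sketch of per-metric sampling (parameter values are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

samples = {
    "normal":    rng.normal(loc=50.0, scale=10.0, size=n),
    "lognormal": rng.lognormal(mean=3.0, sigma=0.5, size=n),
    "beta":      rng.beta(a=2.0, b=5.0, size=n),       # support [0, 1]
    "poisson":   rng.poisson(lam=4.0, size=n),         # non-negative integers
    "gamma":     rng.gamma(shape=2.0, scale=2.0, size=n),
    "weibull":   rng.weibull(a=1.5, size=n),
}
```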
## When to use plotsim
- **Analytics portfolios** — showcase dbt models, dashboards, or SQL analysis with data that has real temporal patterns and cross-metric relationships, not random noise.
- **Data engineering pipelines** — test with relational input where referential integrity holds, metrics correlate, and temporal ordering is causal. No production data access needed.
- **Dashboard prototyping** — build with synthetic data that trends, correlates, and responds to filters the way real data would, before production access exists.
- **Data science practice** — explore datasets with known ground truth. The correlations, trajectories, and causal lags are all configured — so you can verify whether your analysis recovers them.
- **ML training data** — generate labeled datasets with controlled statistical properties for classification, regression, or causal discovery benchmarks.
- **Teaching and courses** — give students multi-table schemas that behave like production data, where joins reveal actual business patterns.
## Templates

Five domain configs ship with the package:
| Template | Domain | Entities | Tables |
|---|---|---|---|
| `saas` | B2B SaaS | accounts with users | 10 |
| `hr` | HR analytics | employees in departments | 7 |
| `ecommerce` | E-commerce | customer segments | 8 |
| `education` | University | student cohorts | 7 |
| `healthcare` | Clinic | patient groups | 8 |
```bash
plotsim list-templates            # see all available
plotsim template hr -o hr.yaml    # export one to edit
```
Each template is a YAML file you can modify. Or describe what you need to any LLM:
"Change this SaaS config to model a food delivery service with restaurants, orders, delivery times, and customer ratings."
## How it works

plotsim's generation pipeline:
1. **Config** — YAML defines entity types, metrics, distributions, archetypes, tables, correlations, causal lags, and noise levels. Pydantic V2 validates everything at load time.
2. **Trajectories** — each entity is assigned an archetype curve. The trajectory engine computes a position between 0 and 1 for every time period.
3. **Metrics** — processed in causal-dependency order (topologically sorted). Each metric's distribution is sampled at the trajectory-derived center. Causal lags propagate through the dependency chain.
4. **Correlations** — a Gaussian copula transforms independent samples through CDF → standard normal space, applies the Cholesky factor, and inverse-transforms back. Configured correlations are preserved regardless of distribution pairing.
5. **Noise** — Gaussian noise, outliers, and missing-completely-at-random nulls are injected after correlation, so they don't contaminate the configured statistical properties.
6. **Tables** — dimension, fact, and event tables are assembled with enforced referential integrity. Output is deterministic CSV.
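The dependency ordering in the metrics step can be sketched with the standard library's `graphlib` (illustrative; the metric names below are examples, not plotsim code):

```python
from graphlib import TopologicalSorter

# metric -> the set of metrics it causally depends on (its drivers)
causal_deps = {
    "engagement": set(),
    "revenue": {"engagement"},          # e.g. engagement drives revenue, lag 3
    "support_tickets": {"engagement"},
    "churn_risk": {"revenue", "engagement"},
}

# static_order() yields each metric only after all of its drivers,
# and raises CycleError on a circular dependency — the same class of
# problem plotsim rejects at config-load time.
order = list(TopologicalSorter(causal_deps).static_order())
```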
## Config overview

A plotsim config has these sections:
- `domain` — name and entity label
- `time_window` — start, end, granularity (monthly / weekly / daily)
- `seed` — integer controlling all randomness
- `metrics` — name, distribution (normal, lognormal, beta, poisson, gamma, weibull), polarity, optional causal lag with configurable blend weight
- `archetypes` — named trajectory shapes built from curve segments (sigmoid, decay, step, plateau, oscillation, compound, linear)
- `entities` — groups assigned to archetype distributions
- `tables` — dim / fact / event schemas with typed columns and FK references
- `correlations` — metric-pair coefficients delivered via Gaussian copula
- `noise` — gaussian sigma, outlier rate, MCAR rate, temporal jitter
- `stages` — optional lifecycle sequence with enforceable ordering
Full schema with type annotations: `plotsim/config.py`
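A hypothetical fragment showing how those sections might fit together — the field names below are invented for illustration; consult `plotsim/config.py` and the bundled templates for the real schema:

```yaml
domain:
  name: saas
  entity_label: company
time_window:
  start: 2023-01-01
  end: 2024-12-31
  granularity: monthly
seed: 42
metrics:
  - name: engagement
    distribution: beta
    polarity: positive
  - name: revenue
    distribution: lognormal
    polarity: positive
    causal_lag:
      driver: engagement
      periods: 3
      weight: 0.7
correlations:
  - pair: [engagement, revenue]
    r: 0.8
```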
## CLI reference
```text
plotsim run <config>            Generate dataset from config
  -o, --output-dir <path>         Output directory (default: from config)
  -s, --seed <int>                Override seed
  -v, --validate                  Run validation after generation
  --strict                        Fail on validation warnings
  -q, --quiet                     Suppress output

plotsim validate <config>       Check config without generating
plotsim info <config>           Preview tables, rows, entities
plotsim list-templates          Show bundled templates
plotsim template <name>         Print template YAML to stdout
  -o, --output <path>             Write to file instead
```
## Post-generation validation

The engine runs these checks after generation:
- **FK integrity** — every foreign key resolves to a parent row (0 orphans across all templates)
- **PK uniqueness** — no duplicate primary keys
- **Date spine** — no gaps or duplicates in the date dimension
- **Causal coherence** — lagged metrics inflect after their drivers
- **Null policy** — no unexpected nulls outside configured MCAR rates
- **Correlation PSD** — correlation matrix is positive semi-definite (checked at config load)
```bash
plotsim run config.yaml --validate
```
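The FK-integrity check reduces to a set-membership test over the child table's foreign-key column; a generic sketch of the idea (not plotsim's validator — the table and column names are examples):

```python
def orphan_rows(child_rows: list[dict], fk_column: str,
                parent_keys: set) -> list[dict]:
    """Rows whose foreign key does not resolve to any parent primary key."""
    return [row for row in child_rows if row[fk_column] not in parent_keys]

company_pks = {"c1", "c2", "c3"}                 # e.g. dim_company primary keys
logins = [
    {"event_id": 1, "company_id": "c1"},
    {"event_id": 2, "company_id": "c2"},
    {"event_id": 3, "company_id": "c9"},         # orphan: no such company
]

bad = orphan_rows(logins, "company_id", company_pks)
```

A validator would report `len(bad)` per table and fail (or warn, absent `--strict`) when it is non-zero.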
For the empirical bounds these guarantees hold within — measured per-pair correlation tolerance, the recoverable-lag boundary, the trajectory-first cell-level envelope, and the determinism contract — see `docs/statistical-fidelity.md`. The smoke test `tests/test_fidelity_smoke.py` re-checks the headline tolerances on every CI run.
## Ecosystem positioning
plotsim sits between tools like Faker / plaitpy (random values from templates) and SDV (machine learning from real data).
**Unlike Faker:** plotsim produces multi-table relational datasets with cross-metric correlations, causal lags, and temporal trajectories — not independent random columns.

**Unlike SDV:** plotsim doesn't need real data. You specify the statistical properties you want in a YAML config, and the engine generates data matching that specification. No training, no privacy concerns, no seed data required.
## Generated data and PII
plotsim uses Faker for string-valued columns (names, companies, emails). Faker output is realistic-looking but not globally unique — a generated name can coincidentally match a real person. Treat Faker output as synthetic, not anonymized.
Mark a column with `pii_note: "<description>"` in your config to flag it as producing realistic-sounding data about people or organizations. plotsim threads the note through schema introspection so downstream consumers (data catalogs, governance tools, documentation generators) can identify those fields. It is metadata only and does not change generation behavior.
## Contributing
See CONTRIBUTING.md for dev setup, test commands, and how to add templates or curve types.
## License