Python wrapper for DataSynth synthetic data generation
datasynth-py
Python wrapper for the DataSynth synthetic data generator.
Installation
From PyPI
pip install "datasynth-py[all]"
Or install specific extras (the quotes keep shells such as zsh from expanding the brackets):
pip install datasynth-py # Core only (no dependencies)
pip install "datasynth-py[cli]" # CLI generation (PyYAML)
pip install "datasynth-py[memory]" # In-memory tables (pandas)
pip install "datasynth-py[streaming]" # Streaming (websockets)
pip install "datasynth-py[all]" # All optional dependencies
From Source
cd python
pip install -e ".[all]"
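To confirm which extras are usable in the current environment, a small standard-library check can help. This is a sketch, not part of the package: the `available_extras` helper is hypothetical, and the mapping from extras to importable modules is inferred from the dependency names listed above.

```python
from importlib.util import find_spec

# Extra name -> top-level module it is expected to provide.
# Assumption: inferred from the dependency names in the install list above.
EXTRA_MODULES = {
    "cli": "yaml",            # PyYAML is imported as `yaml`
    "memory": "pandas",
    "streaming": "websockets",
}


def available_extras() -> dict:
    """Report which optional dependencies can be imported right now."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```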
Quick Start
from datasynth_py import DataSynth, CompanyConfig, Config, GlobalSettings, ChartOfAccountsSettings
config = Config(
    global_settings=GlobalSettings(
        industry="retail",
        start_date="2024-01-01",
        period_months=12,
    ),
    companies=[
        CompanyConfig(code="C001", name="Retail Corp", currency="USD", country="US"),
    ],
    chart_of_accounts=ChartOfAccountsSettings(complexity="small"),
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)
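Once a CSV run has finished, it can be useful to see what was written. The helper below is a hypothetical sketch using only the standard library; it assumes the output directory contains `.csv` files with a header row, which this README does not spell out.

```python
import csv
from pathlib import Path


def summarize_csv_dir(output_dir) -> dict:
    """Map each CSV file under output_dir to its number of data rows."""
    root = Path(output_dir)
    summary = {}
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        # Assumption: the first row is a header; an empty file counts as 0.
        summary[path.relative_to(root).as_posix()] = max(rows - 1, 0)
    return summary
```

For example, `summarize_csv_dir(result.output_dir)` would return a dictionary keyed by relative file path.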
Using Blueprints
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "parquet", "sink": "path", "path": "./output"})
Statistical Distributions (v0.3.0+)
from datasynth_py.config.models import (
    Config,
    AdvancedDistributionSettings,
    MixtureDistributionConfig,
    MixtureComponentConfig,
    CorrelationConfig,
    CorrelationFieldConfig,
    RegimeChangeConfig,
    EconomicCycleConfig,
    StatisticalValidationConfig,
    StatisticalTestConfig,
)
config = Config(
    # ... other settings ...
    # Advanced statistical distributions
    distributions=AdvancedDistributionSettings(
        enabled=True,
        industry_profile="retail",
        # Mixture model for transaction amounts
        amounts=MixtureDistributionConfig(
            enabled=True,
            distribution_type="lognormal",
            components=[
                MixtureComponentConfig(weight=0.60, mu=6.0, sigma=1.5, label="routine"),
                MixtureComponentConfig(weight=0.30, mu=8.5, sigma=1.0, label="significant"),
                MixtureComponentConfig(weight=0.10, mu=11.0, sigma=0.8, label="major"),
            ],
            benford_compliance=True,
        ),
        # Cross-field correlations via copulas
        correlations=CorrelationConfig(
            enabled=True,
            copula_type="gaussian",  # gaussian, clayton, gumbel, frank, student_t
            fields=[
                CorrelationFieldConfig(name="amount", distribution_type="lognormal"),
                CorrelationFieldConfig(name="line_items", distribution_type="normal", min_value=1, max_value=20),
            ],
            matrix=[[1.0, 0.65], [0.65, 1.0]],
        ),
        # Economic regime changes
        regime_changes=RegimeChangeConfig(
            enabled=True,
            economic_cycle=EconomicCycleConfig(
                enabled=True,
                cycle_period_months=48,
                amplitude=0.15,
                recession_probability=0.1,
            ),
        ),
        # Statistical validation tests
        validation=StatisticalValidationConfig(
            enabled=True,
            tests=[
                StatisticalTestConfig(test_type="benford_first_digit", threshold_mad=0.015),
                StatisticalTestConfig(test_type="distribution_fit", target_distribution="lognormal", significance=0.05),
            ],
            fail_on_violation=False,
        ),
    ),
)
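To build intuition for the mixture and Benford settings above, the following standalone sketch samples from a three-component lognormal mixture and computes the mean absolute deviation (MAD) of first-digit frequencies from Benford's law. It is not the generator's implementation. It assumes `mu` and `sigma` are log-space parameters, as in `random.lognormvariate`, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

# (weight, mu, sigma) copied from the mixture components in the config above.
COMPONENTS = [(0.60, 6.0, 1.5), (0.30, 8.5, 1.0), (0.10, 11.0, 0.8)]


def sample_amounts(n: int, seed: int = 0) -> list:
    """Draw n amounts: pick a component by weight, then sample a lognormal."""
    rng = random.Random(seed)
    weights = [weight for weight, _, _ in COMPONENTS]
    amounts = []
    for _ in range(n):
        _, mu, sigma = rng.choices(COMPONENTS, weights=weights, k=1)[0]
        amounts.append(rng.lognormvariate(mu, sigma))
    return amounts


def first_digit(x: float) -> int:
    """Leading digit of a positive number, read from scientific notation."""
    return int(f"{x:e}"[0])


def benford_mad(values) -> float:
    """Mean absolute deviation of first-digit frequencies from Benford's law."""
    counts = Counter(first_digit(v) for v in values if v > 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one positive value")
    return sum(
        abs(counts[d] / total - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9


if __name__ == "__main__":
    mad = benford_mad(sample_amounts(10_000))
    print(f"first-digit MAD: {mad:.4f}")
```

The resulting MAD is the kind of statistic a `threshold_mad` setting would plausibly be compared against, though the exact test used by the generator is not described here.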
Distribution Blueprints
from datasynth_py.config import blueprints
# ML training with realistic distributions
config = blueprints.ml_training(with_distributions=True)
# Statistical validation preset
config = blueprints.statistical_validation()
# Add distributions to any existing Config (base_config is one you built earlier)
config = blueprints.with_distributions(base_config)
# Retail with realistic names
config = blueprints.retail_small(realistic_names=True)
Integration Features (v0.2.2+)
from datasynth_py import (
    Config,
    StreamingSettings,
    RateLimitSettings,
    TemporalAttributeSettings,
    RelationshipSettings,
    GraphExportSettings,
)
config = Config(
    # ... other settings ...
    # Streaming output with backpressure
    streaming=StreamingSettings(
        enabled=True,
        buffer_size=1000,
        backpressure="block",  # block, drop_oldest, drop_newest, buffer
    ),
    # Rate limiting for controlled throughput
    rate_limit=RateLimitSettings(
        enabled=True,
        entities_per_second=10000.0,
        burst_size=100,
    ),
    # Bi-temporal data support
    temporal_attributes=TemporalAttributeSettings(
        enabled=True,
        generate_version_chains=True,
        avg_versions_per_entity=1.5,
    ),
    # Relationship generation with cardinality rules
    relationships=RelationshipSettings(
        enabled=True,
        allow_orphans=True,
        orphan_probability=0.01,
    ),
    # Graph export including RustGraph format
    graph_export=GraphExportSettings(
        enabled=True,
        formats=["pytorch_geometric", "rustgraph"],
    ),
)
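The `entities_per_second` and `burst_size` pair resembles the parameters of a token bucket. The sketch below shows how such a limiter behaves in general. Whether DataSynth uses this algorithm internally is an assumption, and the `TokenBucket` class is purely illustrative.

```python
import time


class TokenBucket:
    """Allow short bursts up to `burst`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock          # injectable so tests can control time
        self.last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        """Take n tokens if available; return False instead of waiting."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=10000.0, burst=100)
    emitted = sum(bucket.try_acquire() for _ in range(1000))
    print(f"emitted {emitted} of 1000 entities without waiting")
```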
Requirements
The wrapper shells out to the datasynth-data CLI binary. Build it with:
cargo build --release
export DATASYNTH_BINARY=target/release/datasynth-data
Or pass binary_path when creating the client:
synth = DataSynth(binary_path="/path/to/datasynth-data")
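The two options above suggest a lookup order for the binary. The following sketch shows one plausible resolution strategy. The precedence (explicit argument, then `DATASYNTH_BINARY`, then a `PATH` search) and the `resolve_binary` helper are assumptions for illustration, not the wrapper's documented behavior.

```python
import os
import shutil
from typing import Optional


def resolve_binary(binary_path: Optional[str] = None) -> Optional[str]:
    """Return a candidate path to datasynth-data, or None if nothing is found."""
    # 1. An explicit argument wins (assumed precedence).
    if binary_path:
        return binary_path
    # 2. Fall back to the environment variable shown above.
    env_path = os.environ.get("DATASYNTH_BINARY")
    if env_path:
        return env_path
    # 3. Finally, search PATH (an assumption; the README does not mention this).
    return shutil.which("datasynth-data")


if __name__ == "__main__":
    print(resolve_binary() or "datasynth-data not found")
```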
Documentation
See the Python Wrapper Guide for complete documentation.
License
Apache 2.0 License - see the main project LICENSE file.
Download files
File details
Details for the file datasynth_py-1.0.0.tar.gz.
File metadata
- Download URL: datasynth_py-1.0.0.tar.gz
- Upload date:
- Size: 45.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca287047433cd2d2c3de52d10ad6277d927036e358101c15948d5edd3e14cd65 |
| MD5 | 0fbda36965561c887f74010dbae02df5 |
| BLAKE2b-256 | 8d03498a82387ad4edf4d0da301c6497614c3be4c41d73b2ca868b74be7efd3e |
File details
Details for the file datasynth_py-1.0.0-py3-none-any.whl.
File metadata
- Download URL: datasynth_py-1.0.0-py3-none-any.whl
- Upload date:
- Size: 48.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b1149c9ce64627b622382b4160047b9f2379865884991e5bf84da27a1712066 |
| MD5 | 3c7b3ee11538c30e622e7e2ff4dc1486 |
| BLAKE2b-256 | 9fd7972fa8ab762c821f20b0128d7a10bbe47a81815a799931e6f2bda8610864 |