CALM-Data-Generator: A Python library for synthetic data generation with support for drift injection and clinical data.
Project description
CALM-Data-Generator
🚀 Now available on PyPI! Install with:
pip install calm-data-generator
CALM-Data-Generator is a comprehensive Python library for synthetic data generation with advanced features for:
- Clinical/Medical Data - Generate realistic patient demographics, genes, proteins
- Tabular Data Synthesis - CTGAN, TVAE, CART, and more
- Time Series - TimeGAN, DGAN
- Single-Cell - scVI, GEARS (Perturbation Prediction)
- Drift Injection - Test ML model robustness with controlled drift
- Privacy Assessment - DCR metrics for re-identification risk
- Scenario Evolution - Feature evolution and target construction
Scope & Capabilities
Calm-Data-Generator is optimized for structured tabular data. It is designed to handle:
- ✅ Classification (Binary & Multi-class)
- ✅ Regression (Continuous variables)
- ✅ Multi-label problems
- ✅ Clustering (Preserving natural groupings)
- ✅ Time Series (Temporal correlations and patterns)
- ✅ Single-Cell / Genomics (scRNA-seq expression data)
[!IMPORTANT] This library is NOT designed for unstructured data such as Images, Videos, or Audio. It does not include Computer Vision or Signal Processing models.
What Makes This Library Unique?
CALM-Data-Generator is not just another synthetic data tool—it's a unified ecosystem that brings together the best open-source libraries under a single, consistent API:
🔗 Unified Multi-Library Integration
Instead of learning and managing multiple complex libraries separately, CALM-Data-Generator provides:
- One API for 15+ synthesis methods from different sources (Synthcity, scvi-tools, GEARS, imbalanced-learn, etc.)
- Seamless interoperability between tabular, time-series, streaming, and genomic data generators
- Consistent configuration across all methods with automatic parameter validation
- Integrated reporting with YData Profiling for all generation methods
🌊 Advanced Drift Injection (Industry-Leading)
The DriftInjector module is one of the most comprehensive drift simulation tools available:
- 14+ drift types: Feature drift (gradual, abrupt, incremental, recurrent), label drift, concept drift, correlation drift, outlier injection, and more
- Correlation-aware drift: Propagate realistic drift across correlated features (e.g., increase income → increase spending)
- Multi-modal drift profiles: Sigmoid, linear, cosine transitions for gradual drift
- Conditional drift: Apply drift only to specific data subsets based on business rules
- Integrated with generators: Inject drift directly during synthesis or post-hoc on existing data
- Perfect for MLOps: Test data drift monitoring, concept drift detection, and model robustness before production
In summary: While other tools focus on a single approach (e.g., just GANs, just statistical methods), CALM-Data-Generator unifies the ecosystem and adds production-grade drift simulation that most libraries don't offer.
Core Technologies
This library leverages and unifies best-in-class open-source tools to provide a seamless data generation experience:
- Synthcity: The core engine for tabular deep learning models (CTGAN, TVAE) and privacy. Included by default.
- River: Powers the streaming generation capabilities (
[stream]extra).. - YData Profiling: Generates comprehensive automated quality reports.
⚡ Presets (Templates)
Calm-Data-Generator includes a set of Presets designed to give you a head start. These are pre-configured settings for common use cases.
[!TIP] Presets are Baselines: They are intended to be used as a starting point. You should import a preset, modify it to fit your specific data needs (e.g., change columns, tweak thresholds), and then pass it to the generator.
Available Presets
Located in calm_data_generator/presets/, easy to import and iterate upon:
FastPrototypePreset: Optimized for speed (fewer epochs, simple models) to test pipelines quickly.HighFidelityPreset: Tuned for maximum quality (CTGAN/TVAE with more training) for production data.ClinicalDataGenerator(and presets): Specialized for healthcare data (demographics + omics).ImbalancePreset: Configured to handle and rebalance highly skewed datasets.TimeSeriesPreset: setup for sequential data generation.
Example Usage:
from calm_data_generator.presets import FastPrototypePreset
from calm_data_generator import RealGenerator
# 1. Load the preset config
config = FastPrototypePreset
# 2. Modify it for your specific case
config["epochs"] = 50 # Overwrite default
# 3. Use it
gen = RealGenerator()
gen.generate(data, **config)
Key Libraries & Ecosystem
| Library | Role | Usage in Calm-Data-Generator |
|---|---|---|
| Synthcity | Deep Learning Engine | Powers CTGAN, TVAE, DDPM, TimeGAN. Handling privacy & fidelity. |
| scvi-tools | Single-Cell Analysis | Powers scvi method for high-dimensional genomic/transcriptomic data. |
| GEARS | Graph Perturbation | Powers gears method for predicting single-cell perturbation effects. |
| River | Streaming ML | Powers StreamGenerator for concept drift simulation and real-time data flow. |
| YData Profiling | Reporting | Generates automated quality reports (QualityReporter). |
| Pydantic | Validation | Ensures strict type checking and configuration management. |
| PyTorch | Backend | Underlying tensor computation for all deep learning models. |
| Copulae | Statistical Modeling | Powers the copula method for multivariate dependence modeling. |
Safe Data Sharing
A key advantage of Calm-Data-Generator is enabling the use of private data in public or collaborative environments:
- Private Origin: You start with sensitive data (e.g., GDPR/HIPAA restricted) that cannot leave your secure environment.
- Synthetic Twin: The library generates a synthetic dataset that statistically mirrors the original but contains no real individuals.
- Safe Distribution: Once validated (using
QualityReporter's privacy checks), this synthetic dataset allows for risk-free sharing, model training, and testing without exposing confidential information.
Key Use Cases
- MLOps Monitoring Validation: Use StreamGenerator and DriftInjector to simulate data drift (gradual, abrupt) and verify if your monitoring alerts trigger correctly before deployment.
- Biomedical Research (HealthTech): Generate synthetic patient cohorts with ClinicalDataGenerator that preserve complex biological correlations (e.g., gene-age relationships) for collaborative studies without compromising patient privacy.
- Stress Testing ("What-If" Analysis): Use ScenarioInjector to simulate future scenarios (e.g., "What if the customer age base increases by 10 years?") and measure model performance degradation under stress.
- Correlation-Aware Drift: Inject drift that realistically propagates to correlated features (e.g., increasing income also proportionally increases spending) using the
correlations=Trueparameter. - Development Data: Provide developers with high-fidelity synthetic replicas of production databases, allowing them to build and test features safely without accessing sensitive real-world data.
Architecture & Design
Technical Architecture
Minimalist view of the system's core components and data flow.
Installation
[!WARNING] Heads Up: This library depends on heavy Deep Learning frameworks like
PyTorch,Synthcity, andCUDAlibraries. The installation might be heavy (~2-3 GB) and take a few minutes depending on your internet connection. We strongly recommend using a fresh virtual environment.
Standard Installation
The library is available on PyPI. For the most stable experience, we recommend using a virtual environment:
# 1. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
# 2. Upgrade pip, setuptools and wheel (Crucial for successful installation)
pip install --upgrade pip setuptools wheel
# 3. Install the core library (optimized for speed)
pip install calm-data-generator
Installation Extras
Depending on your needs, you can install additional capabilities:
# For Stream Generator (River)
pip install "calm-data-generator[stream]"
# Full suite
pip install "calm-data-generator[full]"
[!NOTE] Performance Note: We have optimized the dependency tree in version 1.0.0 by pinning specific versions of
pydantic,xgboost, andcloudpickle. This significantly reduces the initial installation time from ~40 minutes to just a few minutes. 🚀
From source:
git clone https://github.com/AlejandroBeldaFernandez/Calm-Data_Generator.git
cd Calm-Data_Generator
pip install .
Troubleshooting
Zsh shell (macOS/Linux): If brackets cause errors, use quotes:
pip install "calm-data-generator[stream]"
River compilation errors (Linux/macOS):
# Ubuntu/Debian
sudo apt install build-essential python3-dev
# macOS
xcode-select --install
# Then retry
pip install calm-data-generator
Windows users: Install Visual Studio Build Tools first:
- Download Visual Studio Build Tools
- Install "Desktop development with C++"
- Then retry installation
PyTorch CPU-only (no GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install calm-data-generator
Dependency conflicts: Use a clean virtual environment:
python -m venv venv
source venv/bin/activate # Linux/macOS
# or: venv\Scripts\activate # Windows
pip install calm-data-generator
Quick Start
Generate Synthetic Data from Real Dataset
from calm_data_generator import RealGenerator
import pandas as pd
# Your real dataset (can be a DataFrame, path to .csv, .h5, or .h5ad)
data = pd.read_csv("your_data.csv") # or "your_data.h5ad"
# Initialize generator
gen = RealGenerator()
# Generate 1000 synthetic samples using CTGAN
synthetic = gen.generate(
data=data,
n_samples=1000,
method='ctgan',
target_col='label',
epochs=300,
batch_size=500,
discriminator_steps=1,
)
print(f"Generated {len(synthetic)} samples")
GPU Acceleration
Methods with GPU support:
| Method | GPU Support | Parameter |
|---|---|---|
ctgan, tvae |
✅ CUDA/MPS | enable_gpu=True |
diffusion |
✅ PyTorch | Auto-detected |
ddpm |
✅ PyTorch + Synthcity | Auto-detected |
timegan |
✅ PyTorch + Synthcity | Auto-detected |
timevae |
✅ PyTorch + Synthcity | Auto-detected |
| smote, adasyn, cart, rf, lgbm, gmm, copula | ❌ CPU only | - |
synthetic = gen.generate(
data=data,
n_samples=1000,
method='ctgan',
epochs=300,
enable_gpu=True,
)
Note: Requires PyTorch with CUDA support:
pip install torch --index-url https://download.pytorch.org/whl/cu118
Generate Clinical Data
from calm_data_generator import ClinicalDataGenerator
from calm_data_generator.generators.configs import DateConfig
gen = ClinicalDataGenerator()
# Generate patient data with genes and proteins
result = gen.generate(
n_samples=100,
n_genes=500,
n_proteins=200,
date_config=DateConfig(start_date="2024-01-01")
)
demographics = result['demographics']
genes = result['genes']
proteins = result['proteins']
Inject Drift for ML Testing
Option 1: Directly from generate() (recommended)
from calm_data_generator import RealGenerator
gen = RealGenerator()
# Generate synthetic data WITH drift in one call
synthetic = gen.generate(
data=real_data,
n_samples=1000,
method='ctgan',
target_col='label',
drift_injection_config=[
{
"method": "inject_drift",
"params": {
"columns": ["age", "income", "label"],
"drift_mode": "gradual", # Auto-detects column types
"drift_magnitude": 0.3,
"center": 500,
"width": 200
}
}
]
)
Option 2: Standalone DriftInjector
from calm_data_generator import DriftInjector
injector = DriftInjector()
# Unified drift injection (auto-detects types)
drifted_data = injector.inject_drift(
df=data,
columns=['feature1', 'feature2', 'status'],
drift_mode='gradual',
drift_magnitude=0.5,
# Optional specific configs
numeric_operation='shift',
categorical_operation='frequency',
boolean_operation='flip'
)
**Available drift methods:** `inject_feature_drift`, `inject_feature_drift_gradual`, `inject_feature_drift_incremental`, `inject_feature_drift_recurrent`, `inject_label_drift`, `inject_concept_drift`, `inject_categorical_frequency_drift`, and more. See [DRIFT_INJECTOR_REFERENCE.md](calm_data_generator/docs/DRIFT_INJECTOR_REFERENCE.md).
### Single-Cell / Gene Expression Data
Generate synthetic single-cell RNA-seq-like data using specialized VAE models:
```python
from calm_data_generator import RealGenerator
gen = RealGenerator()
# scVI: Generate new cells from scratch
synthetic = gen.generate(
data="expression_data.h5ad", # Paths to .h5 or .h5ad are supported directly
n_samples=1000,
method='scvi',
target_col='cell_type',
epochs=100,
n_latent=10,
)
| Method | Use Case |
|---|---|
scvi |
Generate new cells from learned distribution |
Stream Data Generation
from calm_data_generator import StreamGenerator
stream_gen = StreamGenerator()
# Generate a data stream with Concept Drift
stream_data = stream_gen.generate(
n_chunks=10,
chunk_size=1000,
concept_drift=True, # Simulate concept drift over time
n_features=10
)
print(f"Generated stream with {len(stream_data)} total samples")
Quality Reporting
from calm_data_generator import QualityReporter
# Generate a quality report comparing real vs synthetic data
reporter = QualityReporter()
reporter.generate_report(
real_data=data,
synthetic_data=synthetic,
output_dir="./quality_report",
target_col="target"
)
# Report saved to ./quality_report/report.html
Modules
| Module | Import | Description |
|---|---|---|
| Tabular | generators.tabular |
RealGenerator, QualityReporter |
| Clinical | generators.clinical |
ClinicalDataGenerator, ClinicalDataGeneratorBlock |
| Stream | `generators.stream` | StreamGenerator, StreamBlockGenerator |
| Blocks | generators.tabular |
RealBlockGenerator |
| Drift | generators.drift |
DriftInjector |
| Dynamics | generators.dynamics |
ScenarioInjector |
| Reports | reports |
Visualizer |
Synthesis Methods
| Method | Type | Description | Requirements / Notes |
|---|---|---|---|
cart |
ML | CART-based iterative synthesis (fast) | Base installation |
rf |
ML | Random Forest synthesis | Base installation |
lgbm |
ML | LightGBM-based synthesis | Base installation (Requires lightgbm) |
ctgan |
DL | Conditional GAN for tabular data | Requires synthcity |
tvae |
DL | Variational Autoencoder | Requires synthcity |
diffusion |
DL | Tabular Diffusion (custom, fast) | Base installation (PyTorch) |
ddpm |
DL | Synthcity TabDDPM (advanced) | Requires synthcity |
timegan |
Time Series | TimeGAN for sequential data | Requires synthcity |
timevae |
Time Series | TimeVAE for sequential data | Requires synthcity |
smote |
Augmentation | SMOTE oversampling | Base installation |
adasyn |
Augmentation | ADASYN adaptive sampling | Base installation |
| copula | Copula | Copula-based synthesis | Base installation |
| gmm | Statistical | Gaussian Mixture Models | Base installation |
| scvi | Single-Cell | scVI (Variational Inference) for RNA-seq | Requires scvi-tools |
CLI Access
# List all tutorials
calm-data-generator tutorials
# Show a specific tutorial
calm-data-generator tutorials show 1
# Run a tutorial
calm-data-generator tutorials run 1
# Show version
calm-data-generator version
Tutorials
| # | Tutorial | Description |
|---|---|---|
| 1 | Real Generator | Tabular data synthesis |
| 2 | Clinical Generator | Clinical/medical data |
| 3 | Drift Injector | Drift injection for ML |
| 4 | Stream Generator | Stream-based generation |
| 5 | Scenario Injector | Feature evolution |
Documentation Index
Explore the full documentation in the calm_data_generator/docs/ directory:
| Document | Description |
|---|---|
| DOCUMENTATION.md | Main User Guide. Comprehensive manual covering all modules, concepts, and advanced usage. |
| REAL_GENERATOR_REFERENCE.md | API Reference for RealGenerator. Detailed parameters for all synthesis methods (ctgan, lgbm, scvi, etc.). |
| DRIFT_INJECTOR_REFERENCE.md | API Reference for DriftInjector. Guide to using inject_drift and specialized drift capabilities. |
| STREAM_GENERATOR_REFERENCE.md | API Reference for StreamGenerator. Details on stream simulation and drift integration. |
| CLINICAL_GENERATOR_REFERENCE.md | API Reference for ClinicalGenerator. Configuration for genes, proteins, and patient data. |
| API.md | Technical API Index. High-level index of classes and functions. |
License
MIT License - see LICENSE file
Acknowledgements & Credits
We stand on the shoulders of giants. This library is possible thanks to these amazing open-source projects:
- Synthcity (Apache 2.0) - The engine behind our deep learning models.
- River (BSD-3-Clause) - Powering our streaming capabilities.
- YData Profiling (MIT) - Providing comprehensive data reporting.
- scvi-tools (BSD-3-Clause) - Enabling single-cell analysis.
- GEARS (MIT) - Supporting graph-based perturbation prediction.
- Imbalanced-learn (MIT) - Providing SMOTE and ADASYN implementations.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file calm_data_generator-1.0.0.tar.gz.
File metadata
- Download URL: calm_data_generator-1.0.0.tar.gz
- Upload date:
- Size: 216.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cdab17e70f1b7d0e37b834d4d6ad12ab22b86fe1f057409f73c51309d75d04ec
|
|
| MD5 |
3e0ac9dd8e539e88775e57052de2cb81
|
|
| BLAKE2b-256 |
50d5e58fbfe19009e1bac1d2aaec77d110fe35715b0b7a2dbec87a9a95c77470
|
File details
Details for the file calm_data_generator-1.0.0-py3-none-any.whl.
File metadata
- Download URL: calm_data_generator-1.0.0-py3-none-any.whl
- Upload date:
- Size: 267.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
286f78695b18661a88158d1a012f6aa2f79ca915e020e01bcf2e1f81f0682321
|
|
| MD5 |
8eabddbf5e0c6fdf2dfcf8b8001f6d38
|
|
| BLAKE2b-256 |
66500fac0f320d49f1d3c04b7557736e141282516939e462de660dc61fc50c15
|