Synthetic Data Generator for Machine Learning Pipelines
Synthetic Generator
A comprehensive Python library for generating synthetic data with various distributions, correlations, and constraints for machine learning and data science applications.
Table of Contents
- Synthetic Generator
- Table of Contents
- Features
- Why Synthetic Generator?
- Quick Start
- Detailed Documentation
- Use Cases
- Advanced Features
- Available Templates
- Package Information
- Development
- Contributing
- License
- Getting Started
- Contact
- Acknowledgments
Features
Core Data Generation
- Multiple Distributions: Normal, Uniform, Exponential, Gamma, Beta, Weibull, Poisson, Binomial, Geometric, Categorical
- Data Types: Integer, Float, String, Boolean, Date, DateTime, Email, Phone, Address, Name
- Correlations: Define relationships between variables with correlation matrices
- Constraints: Value ranges, uniqueness, null probabilities, pattern matching
- Dependencies: Generate data based on other columns with conditional rules
Main Features
- Schema Inference: Automatically detect data types and constraints from existing data (no distribution inference)
- Templates: Pre-built schemas for common use cases (customer data, medical data, e-commerce, financial)
- Privacy: Basic anonymization support
- Validation: Check generated data against a schema (data types and constraints only)
- Export: Multiple format support (CSV, JSON, Parquet, Excel)
User Experience
- Easy-to-Use API: Simple, intuitive interface for data generation
- Web Interface: Modern, responsive web UI for interactive data generation
- Flexible Configuration: Support for both programmatic and configuration-based setup
- Reproducibility: Seed-based random generation for consistent results
- Performance: Optimized for large-scale data generation
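The reproducibility point above rests on a general property of pseudo-random generators: the same seed yields the same sequence. The following is a minimal standard-library sketch of that idea, not the library's own code; `draw_normals` is a hypothetical helper introduced only for illustration.

```python
import random


def draw_normals(n, seed):
    """Draw n standard-normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.normalvariate(0.0, 1.0) for _ in range(n)]


first = draw_normals(5, seed=42)
second = draw_normals(5, seed=42)
other = draw_normals(5, seed=43)

print(first == second)  # compares two runs that used the same seed
print(first == other)   # compares runs that used different seeds
```

Using a private `random.Random` instance rather than the module-level functions keeps one seeded run from disturbing another.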
Why Synthetic Generator?
Synthetic Generator is designed to make synthetic data generation simple and flexible. Typical uses include:
- Testing applications with realistic data
- Training machine learning models with diverse datasets
- Prototyping without sensitive information
- Augmenting data for research purposes
The library generates data from explicit schemas, built-in templates, or schemas inferred from existing data. Schema inference captures data types and constraints only, not distributions, and privacy support is limited to basic anonymization.
Quick Start
Installation
# Install from PyPI (Recommended)
pip install synthetic-generator
# Install from GitHub (Development)
git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
pip install -e .
Quick Generate (CLI)
# From a built-in template
synthetic-generator generate --template customer_data --rows 10000 --out customers.parquet
# From your real data (fit then sample)
synthetic-generator generate --in real.csv --rows 5000 --out synthetic.csv
Quick API (Python)
from synthetic_generator.quick import dataset, fit
import pandas as pd
# 1) From a template
df = dataset(template="customer_data", rows=1000, seed=42)
# 2) From your data (fit then sample)
# Create sample data or load from file
sample_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000]
})
model = fit(sample_data)
df2 = model.sample(500, seed=123)
Using Templates
from synthetic_generator import load_template, generate_data
# Load a pre-built template
schema = load_template("customer_data")
# Generate data
data = generate_data(schema, n_samples=500, seed=123)
print(data.head())
Schema Inference
import pandas as pd
from synthetic_generator import infer_schema, generate_data
# Create sample data (or load from file)
existing_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['IT', 'HR', 'Sales', 'IT', 'HR']
})
# Infer schema
schema = infer_schema(existing_data)
# Generate new data based on inferred schema
new_data = generate_data(schema, n_samples=1000, seed=456)
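To make "data types and constraints, not distributions" concrete, here is a rough standard-library sketch of what inferring a column's type and simple constraints could look like. It is an illustration under assumptions, not the implementation behind `infer_schema`; `infer_column` and the keys of its result are hypothetical.

```python
def infer_column(values):
    """Infer a coarse type and simple constraints for one column of values."""
    present = [v for v in values if v is not None]
    info = {
        "nullable": len(present) < len(values),
        "unique": len(set(present)) == len(present),
    }
    if not present:
        info["data_type"] = "STRING"  # nothing to go on; fall back to text
    elif all(isinstance(v, bool) for v in present):
        info["data_type"] = "BOOLEAN"
    elif all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        info["data_type"] = "INTEGER"
    elif all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        info["data_type"] = "FLOAT"
    else:
        info["data_type"] = "STRING"
    # Range constraints only make sense for numeric columns.
    if info["data_type"] in ("INTEGER", "FLOAT"):
        info["min_value"], info["max_value"] = min(present), max(present)
    return info


columns = {
    "age": [25, 30, 35, 40, 45],
    "salary": [50000, 60000, 70000, 80000, 90000],
    "department": ["IT", "HR", "Sales", "IT", "HR"],
}
inferred = {name: infer_column(values) for name, values in columns.items()}
print(inferred["age"])
```

Nothing here estimates a distribution: the sketch records only the type, observed range, uniqueness, and nullability of each column.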
Detailed Documentation
Data Types
Synthetic Generator supports various data types:
- Numeric: INTEGER, FLOAT
- Text: STRING, EMAIL, PHONE, ADDRESS, NAME
- Categorical: CATEGORICAL, BOOLEAN
- Temporal: DATE, DATETIME
Distributions
Available statistical distributions:
- Continuous: NORMAL, UNIFORM, EXPONENTIAL, GAMMA, BETA, WEIBULL
- Discrete: POISSON, BINOMIAL, GEOMETRIC
- Categorical: CATEGORICAL, CONSTANT
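For readers who want a feel for the continuous distributions listed above, each has a counterpart in Python's standard `random` module. The sketch below is an illustration only, not how the library samples internally. The `SAMPLERS` table and `sample` function are hypothetical, and the parameter names other than `mean`, `std`, `low`, and `high` are assumptions.

```python
import random

# Hypothetical mapping from distribution names to standard-library samplers.
# Parameter names such as "scale", "shape", "a", and "b" are assumed for illustration.
SAMPLERS = {
    "NORMAL": lambda rng, p: rng.normalvariate(p["mean"], p["std"]),
    "UNIFORM": lambda rng, p: rng.uniform(p["low"], p["high"]),
    "EXPONENTIAL": lambda rng, p: rng.expovariate(1.0 / p["scale"]),  # rate = 1 / mean
    "GAMMA": lambda rng, p: rng.gammavariate(p["shape"], p["scale"]),
    "BETA": lambda rng, p: rng.betavariate(p["a"], p["b"]),
    "WEIBULL": lambda rng, p: rng.weibullvariate(p["scale"], p["shape"]),
}


def sample(distribution, parameters, n, seed=None):
    """Draw n values from the named distribution using a seeded generator."""
    rng = random.Random(seed)
    draw = SAMPLERS[distribution]
    return [draw(rng, parameters) for _ in range(n)]


uniform_values = sample("UNIFORM", {"low": 0, "high": 10000}, n=1000, seed=1)
beta_values = sample("BETA", {"a": 2.0, "b": 5.0}, n=1000, seed=1)
print(min(uniform_values) >= 0, max(beta_values) <= 1)
```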
Correlations
Define relationships between variables:
schema = DataSchema(
    columns=[...],
    correlations={
        "height": {"weight": 0.7},  # Height and weight correlation
        "age": {"income": 0.4}      # Age and income correlation
    }
)
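As background on what a target correlation such as 0.7 means in practice, the sketch below builds two standard-normal variables with a chosen correlation by mixing independent draws. This is a textbook construction shown with the standard library only; whether the library uses this or another method is not stated in this document, and `correlated_normals` and `pearson` are hypothetical helpers.

```python
import math
import random


def correlated_normals(rho, n, seed=None):
    """Return two lists of standard normals whose expected correlation is rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mix the shared component z1 with independent noise z2.
        ys.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return xs, ys


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


height_z, weight_z = correlated_normals(rho=0.7, n=20000, seed=7)
print(round(pearson(height_z, weight_z), 2))
```

Shifting and scaling each variable to its own mean and standard deviation afterwards leaves the correlation unchanged, which is why the construction can work on standardized values.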
Constraints
Apply various constraints to your data:
ColumnSchema(
    name="salary",
    data_type=DataType.FLOAT,
    distribution=DistributionType.NORMAL,
    parameters={"mean": 50000, "std": 15000},
    min_value=30000,        # Minimum value
    max_value=100000,       # Maximum value
    unique=True,            # Unique values
    nullable=True,          # Allow null values
    null_probability=0.05   # 5% null probability
)
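The interaction between a distribution, a value range, and a null probability can be sketched in plain Python. The version below redraws out-of-range values (rejection sampling); the library may instead clip or use another strategy, which this document does not specify. `sample_constrained` is a hypothetical helper, and uniqueness is left out for brevity.

```python
import random


def sample_constrained(mean, std, min_value, max_value, null_probability, n, seed=None):
    """Draw n values: None with the given probability, else a normal value within range."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < null_probability:
            values.append(None)
            continue
        # Rejection sampling: redraw until the value falls inside [min_value, max_value].
        while True:
            candidate = rng.normalvariate(mean, std)
            if min_value <= candidate <= max_value:
                values.append(candidate)
                break
    return values


salaries = sample_constrained(
    mean=50000, std=15000, min_value=30000, max_value=100000,
    null_probability=0.05, n=2000, seed=3,
)
present = [v for v in salaries if v is not None]
null_share = 1 - len(present) / len(salaries)
print(f"{null_share:.3f}")
```

Redrawing keeps the shape of the distribution inside the range, whereas clipping would pile values up at the bounds.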
Dependencies
Generate data based on other columns:
ColumnSchema(
    name="bonus",
    data_type=DataType.FLOAT,
    distribution=DistributionType.UNIFORM,
    parameters={"low": 0, "high": 10000},
    depends_on=["salary"],
    conditional_rules={
        "rules": [
            {
                "condition": {"salary": {"operator": ">", "value": 70000}},
                "value": 5000
            }
        ],
        "default": 1000
    }
)
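One plausible reading of the `conditional_rules` structure above is "first matching rule wins, otherwise use the default". The sketch below evaluates rules that way for a single row. It is an assumption about the semantics, not the library's code: `resolve_conditional` is hypothetical, and only the `>` operator appears in this document, so the other operators in the table are illustrative.

```python
import operator

# Only ">" is shown in the documentation; the remaining entries are assumed.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def resolve_conditional(row, conditional_rules):
    """Return the value of the first rule whose conditions all hold, else the default."""
    for rule in conditional_rules["rules"]:
        matches = all(
            OPERATORS[check["operator"]](row[column], check["value"])
            for column, check in rule["condition"].items()
        )
        if matches:
            return rule["value"]
    return conditional_rules["default"]


bonus_rules = {
    "rules": [
        {
            "condition": {"salary": {"operator": ">", "value": 70000}},
            "value": 5000,
        }
    ],
    "default": 1000,
}

print(resolve_conditional({"salary": 80000}, bonus_rules))
print(resolve_conditional({"salary": 60000}, bonus_rules))
```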
Use Cases
Customer Data
Generate realistic customer profiles with demographics, contact information, and preferences.
Medical Data
Create synthetic patient data with health metrics, demographics, and medical conditions.
Financial Data
Generate transaction data with realistic amounts, categories, and temporal patterns.
E-commerce Data
Create order and product data with realistic relationships and business rules.
Advanced Features
Optional Web Interface
You can install and run the web UI if needed:
pip install "synthetic-generator[web]"
synthetic-generator web # http://localhost:8000
Web UI tips (v0.0.7+):
- Templates: clicking "Use Template" navigates to the Generator and auto-populates columns and parameters.
- Export: after generating data, export directly from the Generator page via the built-in Export panel (CSV, JSON, Excel, Parquet). There is no separate Export page.
- Schema Inference: Only infers data types and constraints, not distributions. Users can manually specify distributions in the Generator.
- Null Probability: A null probability of 100% is now applied correctly.
- JSON Serialization: NaN values in generated data are now serialized as null in JSON.
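The JSON serialization tip reflects a general Python behavior: `json.dumps` writes `float("nan")` as the token `NaN`, which is not valid JSON, so missing values need to be mapped to `None` first. A minimal standard-library sketch of that conversion follows; `nan_to_none` is a hypothetical helper, not part of the library.

```python
import json
import math


def nan_to_none(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


rows = [
    {"age": 25, "salary": float("nan")},
    {"age": 30, "salary": 60000.0},
]

raw = json.dumps(rows)                               # contains the non-standard token NaN
clean = json.dumps([nan_to_none(r) for r in rows])   # NaN is written as null instead
print(clean)
```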
Data Generation
# Generate data with custom parameters
from synthetic_generator import load_template, generate_data
schema = load_template("customer_data")
data = generate_data(schema, n_samples=1000, seed=42)
Data Validation
from synthetic_generator import validate_data
# Validate generated data
results = validate_data(data, schema)
print(f"Valid: {results['valid']}")
print(f"Errors: {results['errors']}")
print(f"Warnings: {results['warnings']}")
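To show the kind of checks that "data types and constraints" validation involves, here is a small standard-library sketch that returns a result shaped like the one printed above, with `valid`, `errors`, and `warnings` keys. It is an illustration under assumptions, not the implementation of `validate_data`; `validate_column` and its message formats are hypothetical.

```python
def validate_column(name, values, min_value=None, max_value=None, nullable=True):
    """Check one column against range and nullability constraints."""
    errors, warnings = [], []
    for index, value in enumerate(values):
        if value is None:
            if not nullable:
                errors.append(f"{name}[{index}]: null in non-nullable column")
            continue
        if min_value is not None and value < min_value:
            errors.append(f"{name}[{index}]: {value} is below minimum {min_value}")
        if max_value is not None and value > max_value:
            errors.append(f"{name}[{index}]: {value} is above maximum {max_value}")
    if values and all(value is None for value in values):
        warnings.append(f"{name}: column is entirely null")
    return {"valid": not errors, "errors": errors, "warnings": warnings}


report = validate_column(
    "salary", [30000, 120000, None],
    min_value=30000, max_value=100000, nullable=False,
)
print(report["valid"])
print(report["errors"])
```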
Data Export
from synthetic_generator.export import export_data
# Export to various formats
export_data(data, 'csv', filepath='data.csv')
export_data(data, 'json', filepath='data.json')
export_data(data, 'excel', filepath='data.xlsx')
export_data(data, 'parquet', filepath='data.parquet')
Available Templates
- customer_data: Customer information with demographics
- ecommerce_data: E-commerce transaction data
- medical_data: Medical patient data with health metrics
- financial_data: Financial transaction data
Package Information
- PyPI: https://pypi.org/project/synthetic-generator/
- Version: 0.0.8
- Python: 3.8+
- Dependencies: pandas, pydantic, numpy, scipy
Development
Installation for Development
git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
make install_dev
Running Tests
make test
Running Examples
python examples/basic_usage.py
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
License
Synthetic Generator is released under the MIT License. See LICENSE.txt for details.
Getting Started
For a quick start guide, see QUICKSTART.md.
For detailed examples, check the examples/ directory.
Contact
Vo Hoang Nhat Khang
Maintainer & Developer
Synthetic Generator - Python Package
Contact via:
- Email: nhatkhangcs@gmail.com
- GitHub: nhatkhangcs
- PyPI: synthetic-generator
Acknowledgments
Thanks to all contributors and the open-source community for making this project possible.
Happy coding with Synthetic Generator!
File details
Details for the file synthetic_generator-0.0.8.tar.gz.
File metadata
- Download URL: synthetic_generator-0.0.8.tar.gz
- Upload date:
- Size: 46.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dfddbdd141429c0f9d932247113123f0deda00368ad19ab4b98b151c3e52d3cf |
| MD5 | cb104dfc9ad42ae79a0af10b27e7fdd6 |
| BLAKE2b-256 | 13fff124dadf651ba467b79cbd57768d1142ae046b84550c169e658da16c8d39 |
File details
Details for the file synthetic_generator-0.0.8-py3-none-any.whl.
File metadata
- Download URL: synthetic_generator-0.0.8-py3-none-any.whl
- Upload date:
- Size: 43.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7459a1454ec4791c019c7e2bc29a0766362bc9926f53982da9c54797b712aef |
| MD5 | 4c8ef7f1f3c45d65ae15ebee34546a4a |
| BLAKE2b-256 | c2ad7dc964b6b55f9e341d0778fb3211a86141bb59f52613527387c90a9c9307 |