Synthetic Data Generator for Machine Learning Pipelines

Synthetic Generator

A comprehensive Python library for generating synthetic data with various distributions, correlations, and constraints for machine learning and data science applications.

Available on PyPI · Python 3.8+ · License: MIT

🌟 Features

Core Data Generation

  • Multiple Distributions: Normal, Uniform, Exponential, Gamma, Beta, Weibull, Poisson, Binomial, Geometric, Categorical
  • Data Types: Integer, Float, String, Boolean, Date, DateTime, Email, Phone, Address, Name
  • Correlations: Define relationships between variables with correlation matrices
  • Constraints: Value ranges, uniqueness, null probabilities, pattern matching
  • Dependencies: Generate data based on other columns with conditional rules

Main Features

  • Schema Inference: Automatically detect data types and constraints from existing data (no distribution inference)
  • Templates: Pre-built schemas for common use cases (customer data, medical data, e-commerce, financial)
  • Privacy: Basic anonymization support
  • Validation: Comprehensive data validation against schemas (data types and constraints only)
  • Export: Multiple format support (CSV, JSON, Parquet, Excel)

User Experience

  • Easy-to-Use API: Simple, intuitive interface for data generation
  • Web Interface: Modern, responsive web UI for interactive data generation
  • Flexible Configuration: Support for both programmatic and configuration-based setup
  • Reproducibility: Seed-based random generation for consistent results
  • Performance: Optimized for large-scale data generation

🎯 Why Synthetic Generator?

Synthetic Generator aims to make synthetic data generation simple and flexible. Typical uses include:

  • Testing applications with realistic data
  • Training machine learning models on diverse datasets
  • Prototyping without sensitive information
  • Augmenting data for research

The library lets you create synthetic data that follows the data types, distributions, correlations, and constraints you define, so you can work without exposing real records.

🚀 Quick Start

Installation

# Install from PyPI (Recommended)
pip install synthetic-generator

# Install from GitHub (Development)
git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
pip install -e .

Quick Generate (CLI)

# From a built-in template
synthetic-generator generate --template customer_data --rows 10000 --out customers.parquet

# From your real data (fit then sample)
synthetic-generator generate --in real.csv --rows 5000 --out synthetic.csv

Quick API (Python)

from synthetic_generator.quick import dataset, fit
import pandas as pd

# 1) From a template
df = dataset(template="customer_data", rows=1000, seed=42)

# 2) From your data (fit then sample)
# Create sample data or load from file
sample_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000]
})
model = fit(sample_data)
df2 = model.sample(500, seed=123)
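
The seed arguments above are what make a run repeatable. As a minimal sketch of the idea, using only the standard library and not the package's own code, a private generator seeded with the same value yields the same draws every time. The function name `sample_ages` and the age range are assumptions made for illustration.

```python
import random


def sample_ages(n, seed):
    """Draw n integer ages from a private generator seeded with `seed`."""
    # A private Random instance keeps the global random state untouched.
    rng = random.Random(seed)
    return [rng.randint(18, 90) for _ in range(n)]


first = sample_ages(5, seed=42)
second = sample_ages(5, seed=42)

print(first == second)  # True: same seed, same draws
```

Keeping the generator private to the call means other code that uses the global `random` functions cannot change the result.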

Using Templates

from synthetic_generator import load_template, generate_data

# Load a pre-built template
schema = load_template("customer_data")

# Generate data
data = generate_data(schema, n_samples=500, seed=123)
print(data.head())

Schema Inference

import pandas as pd
from synthetic_generator import infer_schema, generate_data

# Create sample data (or load from file)
existing_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['IT', 'HR', 'Sales', 'IT', 'HR']
})

# Infer schema
schema = infer_schema(existing_data)

# Generate new data based on inferred schema
new_data = generate_data(schema, n_samples=1000, seed=456)
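
To make "infers data types and constraints, not distributions" concrete, here is a standard-library sketch of that kind of inference. It is not the package's implementation: the function `infer_column` and the keys of the dictionary it returns are hypothetical, and the rule that repeated strings mean a categorical column is an assumption.

```python
def is_number(value):
    """True for ints and floats, but not for bools."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def infer_column(values):
    """Infer a type name and simple constraints from one column's values."""
    present = [v for v in values if v is not None]
    info = {"nullable": len(present) < len(values)}
    if present and all(isinstance(v, bool) for v in present):
        info["data_type"] = "BOOLEAN"
    elif present and all(is_number(v) for v in present):
        all_ints = all(isinstance(v, int) for v in present)
        info["data_type"] = "INTEGER" if all_ints else "FLOAT"
        info["min_value"] = min(present)
        info["max_value"] = max(present)
    elif present and len(set(present)) < len(present):
        # Assumption: repeated values suggest a categorical column.
        info["data_type"] = "CATEGORICAL"
        info["categories"] = sorted(set(present), key=str)
    else:
        info["data_type"] = "STRING"
    return info


columns = {
    "age": [25, 30, 35, 40, 45],
    "salary": [50000, 60000, 70000, 80000, 90000],
    "department": ["IT", "HR", "Sales", "IT", "HR"],
}
inferred = {name: infer_column(col) for name, col in columns.items()}
print(inferred["age"]["data_type"], inferred["department"]["categories"])
```

Note that nothing here estimates a distribution: the result records only types, ranges, and categories.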

📚 Detailed Documentation

Data Types

Synthetic Generator supports various data types:

  • Numeric: INTEGER, FLOAT
  • Text: STRING, EMAIL, PHONE, ADDRESS, NAME
  • Categorical: CATEGORICAL, BOOLEAN
  • Temporal: DATE, DATETIME

Distributions

Available statistical distributions:

  • Continuous: NORMAL, UNIFORM, EXPONENTIAL, GAMMA, BETA, WEIBULL
  • Discrete: POISSON, BINOMIAL, GEOMETRIC
  • Categorical: CATEGORICAL, CONSTANT
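
For readers who want a feel for what the continuous distribution names mean, the sketch below maps them to samplers from Python's standard `random` module. This is an illustration, not the package's code. The parameter names `mean`, `std`, `low`, and `high` appear elsewhere in this README; `scale`, `shape`, `a`, and `b` are assumptions.

```python
import random


def make_samplers(rng):
    """Map distribution names to stdlib samplers bound to one generator."""
    return {
        "NORMAL": lambda p: rng.gauss(p["mean"], p["std"]),
        "UNIFORM": lambda p: rng.uniform(p["low"], p["high"]),
        # expovariate takes a rate, which is 1 / scale.
        "EXPONENTIAL": lambda p: rng.expovariate(1.0 / p["scale"]),
        # gammavariate(alpha, beta): alpha is the shape, beta the scale.
        "GAMMA": lambda p: rng.gammavariate(p["shape"], p["scale"]),
        "BETA": lambda p: rng.betavariate(p["a"], p["b"]),
        # weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        "WEIBULL": lambda p: rng.weibullvariate(p["scale"], p["shape"]),
    }


def draw(distribution, parameters, n, seed=None):
    """Draw n values from the named distribution."""
    sample = make_samplers(random.Random(seed))[distribution]
    return [sample(parameters) for _ in range(n)]


salaries = draw("NORMAL", {"mean": 50000, "std": 15000}, n=5000, seed=0)
bonuses = draw("UNIFORM", {"low": 0, "high": 10000}, n=5000, seed=0)
print(round(sum(salaries) / len(salaries)))  # close to 50000
```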

Correlations

Define relationships between variables:

schema = DataSchema(
    columns=[...],
    correlations={
        "height": {"weight": 0.7},  # Height and weight correlation
        "age": {"income": 0.4}      # Age and income correlation
    }
)
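
A correlation such as 0.7 between height and weight can be produced in several ways, and this README does not say which one the package uses. One common construction for two normal variables, sketched here with the standard library only, mixes two independent standard normal draws. The function names `correlated_pairs` and `pearson` are hypothetical.

```python
import math
import random


def correlated_pairs(n, rho, seed=None):
    """Return n pairs of standard normal values with correlation near rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # The mixed variable has unit variance and covariance rho with z1.
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = correlated_pairs(20000, rho=0.7, seed=1)
# Shifting and scaling each column does not change the correlation.
height_weight = [(170 + 10 * x, 70 + 12 * y) for x, y in pairs]
print(round(pearson(height_weight), 2))  # close to 0.7
```

This construction covers a single pair; a full correlation matrix over many columns typically needs a matrix factorization such as Cholesky instead.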

Constraints

Apply various constraints to your data:

ColumnSchema(
    name="salary",
    data_type=DataType.FLOAT,
    distribution=DistributionType.NORMAL,
    parameters={"mean": 50000, "std": 15000},
    min_value=30000,        # Minimum value
    max_value=100000,       # Maximum value
    unique=True,            # Unique values
    nullable=True,          # Allow null values
    null_probability=0.05   # 5% null probability
)
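
How the constraints above interact is easier to see in code. The following standard-library sketch is one possible reading, not the package's implementation: it redraws out-of-range or duplicate values (rejection sampling) and then blanks out a fraction of entries. Whether the package redraws or clips out-of-range values is not stated in this README, and the function name `constrained_normal` is hypothetical.

```python
import random


def constrained_normal(n, mean, std, min_value, max_value,
                       null_probability, seed=None):
    """Draw n unique, bounded normal values, then null some of them."""
    rng = random.Random(seed)
    seen = set()
    values = []
    while len(values) < n:
        x = round(rng.gauss(mean, std), 2)
        if not (min_value <= x <= max_value):
            continue  # outside the allowed range: draw again
        if x in seen:
            continue  # duplicate: draw again to keep values unique
        seen.add(x)
        values.append(x)
    # Each entry independently becomes None with the given probability.
    return [None if rng.random() < null_probability else v for v in values]


salaries = constrained_normal(
    2000, mean=50000, std=15000,
    min_value=30000, max_value=100000,
    null_probability=0.05, seed=7,
)
present = [s for s in salaries if s is not None]
print(len(salaries), min(present) >= 30000, max(present) <= 100000)
```

Redrawing keeps the shape of the distribution inside the bounds, whereas clipping would pile values up at the minimum and maximum.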

Dependencies

Generate data based on other columns:

ColumnSchema(
    name="bonus",
    data_type=DataType.FLOAT,
    distribution=DistributionType.UNIFORM,
    parameters={"low": 0, "high": 10000},
    depends_on=["salary"],
    conditional_rules={
        "rules": [
            {
                "condition": {"salary": {"operator": ">", "value": 70000}},
                "value": 5000
            }
        ],
        "default": 1000
    }
)
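
The `conditional_rules` structure above can be read as "first matching rule wins, otherwise use the default". The sketch below evaluates that structure for a single row with the standard library. It is an interpretation, not the package's code: the function `apply_rules` is hypothetical, and only the `>` operator appears in this README, so the other operators in the table are assumptions.

```python
import operator

# Only ">" is shown in the README; the rest are assumed for illustration.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def apply_rules(row, conditional_rules):
    """Return the value of the first rule whose conditions all hold."""
    for rule in conditional_rules["rules"]:
        matches = all(
            OPERATORS[test["operator"]](row[column], test["value"])
            for column, test in rule["condition"].items()
        )
        if matches:
            return rule["value"]
    return conditional_rules["default"]


bonus_rules = {
    "rules": [
        {
            "condition": {"salary": {"operator": ">", "value": 70000}},
            "value": 5000,
        }
    ],
    "default": 1000,
}

print(apply_rules({"salary": 85000}, bonus_rules))  # 5000
print(apply_rules({"salary": 60000}, bonus_rules))  # 1000
```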

🎯 Use Cases

Customer Data

Generate realistic customer profiles with demographics, contact information, and preferences.

Medical Data

Create synthetic patient data with health metrics, demographics, and medical conditions.

Financial Data

Generate transaction data with realistic amounts, categories, and temporal patterns.

E-commerce Data

Create order and product data with realistic relationships and business rules.

🔧 Advanced Features

Optional Web Interface

You can install and run the web UI if needed:

pip install "synthetic-generator[web]"  # quotes keep shells such as zsh from expanding the brackets
synthetic-generator web  # http://localhost:8000


Web UI tips (v0.0.7+):

  • Templates: clicking "Use Template" navigates to the Generator and auto-populates columns and parameters.
  • Export: after generating data, export directly from the Generator page via the built-in Export panel (CSV, JSON, Excel, Parquet). There is no separate Export page.
  • Schema Inference: Only infers data types and constraints, not distributions. Users can manually specify distributions in the Generator.
  • Null Probability: Fixed issue where 100% null probability wasn't being applied correctly.
  • JSON Serialization: Fixed NaN values in generated data to properly serialize as null in JSON.
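
The JSON point matters because `NaN` is not valid JSON, yet Python's `json.dumps` emits it by default. A minimal standard-library sketch of the general fix, independent of how the package implements it, replaces NaN with `None` before serializing. The helper name `nan_to_none` is hypothetical.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None in nested lists and dicts."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(item) for item in value]
    return value


records = [
    {"age": 25, "salary": float("nan")},
    {"age": 30, "salary": 60000.0},
]

# allow_nan=False makes json.dumps raise if any NaN slipped through.
text = json.dumps(nan_to_none(records), allow_nan=False)
print(text)  # [{"age": 25, "salary": null}, {"age": 30, "salary": 60000.0}]
```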

Data Generation

# Generate 1,000 rows from a template with a fixed seed
from synthetic_generator import load_template, generate_data

schema = load_template("customer_data")
data = generate_data(schema, n_samples=1000, seed=42)

Data Validation

from synthetic_generator import validate_data

# Validate generated data
results = validate_data(data, schema)
print(f"Valid: {results['valid']}")
print(f"Errors: {results['errors']}")
print(f"Warnings: {results['warnings']}")
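
To illustrate what validating "data types and constraints only" can look like, here is a standard-library sketch that returns a dictionary with the same `valid`, `errors`, and `warnings` keys used above. It is not the package's validator: the function `validate_rows` and the rule format (`type`, `nullable`, `min_value`, `max_value`) are assumptions made for this example.

```python
def validate_rows(rows, rules):
    """Check each row against per-column type, null, and range rules."""
    errors, warnings = [], []
    for index, row in enumerate(rows):
        for column in row:
            if column not in rules:
                warnings.append(f"row {index}: unexpected column {column!r}")
        for column, rule in rules.items():
            value = row.get(column)
            if value is None:
                if not rule.get("nullable", False):
                    errors.append(f"row {index}: {column} is null")
                continue
            if not isinstance(value, rule["type"]):
                errors.append(f"row {index}: {column} has wrong type")
                continue
            if "min_value" in rule and value < rule["min_value"]:
                errors.append(f"row {index}: {column} below minimum")
            if "max_value" in rule and value > rule["max_value"]:
                errors.append(f"row {index}: {column} above maximum")
    return {"valid": not errors, "errors": errors, "warnings": warnings}


rules = {
    "age": {"type": int, "min_value": 18, "max_value": 90},
    "salary": {"type": float, "nullable": True, "min_value": 30000.0},
}
rows = [
    {"age": 25, "salary": 50000.0},
    {"age": 17, "salary": None},
    {"age": 40, "salary": 45000.0, "note": "extra"},
]
results = validate_rows(rows, rules)
print(results["valid"], results["errors"], results["warnings"])
```

Like the package's validation, this checks only types and constraints; it makes no attempt to test whether values follow a given distribution.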

Data Export

from synthetic_generator.export import export_data

# Export to various formats
export_data(data, 'csv', filepath='data.csv')
export_data(data, 'json', filepath='data.json')
export_data(data, 'excel', filepath='data.xlsx')
export_data(data, 'parquet', filepath='data.parquet')

📊 Available Templates

  • customer_data: Customer information with demographics
  • ecommerce_data: E-commerce transaction data
  • medical_data: Medical patient data with health metrics
  • financial_data: Financial transaction data

๐Ÿ› ๏ธ Development

Installation for Development

git clone https://github.com/nhatkhangcs/synthetic_generator.git
cd synthetic_generator
make install_dev

Running Tests

make test

Running Examples

python examples/basic_usage.py

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 License

Synthetic Generator is released under the MIT License. See LICENSE.txt for details.

🚀 Getting Started

For a quick start guide, see QUICKSTART.md.

For detailed examples, check the examples/ directory.

📞 Contact

Vo Hoang Nhat Khang
Maintainer & Developer
Synthetic Generator - Python Package

๐Ÿ™ Acknowledgments

Thanks to all contributors and the open-source community for making this project possible.


Happy coding with Synthetic Generator! 🚀
