A Python library for synthetic data generation, data blending, anomaly injection, and noise injection.

Ghosted: Synthetic Data Generation and Augmentation Library

Ghosted is a Python library for generating synthetic data, augmenting existing datasets, and visualizing complex data distributions. It’s designed for data scientists, researchers, and developers who need to generate realistic synthetic data or blend synthetic data with real data for privacy-preserving applications, prototyping, testing, and educational purposes.

Features

Ghosted covers synthetic data generation, blending, and augmentation. Its primary capabilities are:

  1. Data Generator: Generate synthetic data with various statistical distributions.
  2. Data Blender: Seamlessly blend synthetic data with real data to mimic the original distribution.
  3. Anomaly Injection: Add realistic anomalies to datasets for robust testing.
  4. Noise Injection: Inject noise into data for simulations and noise-tolerance testing.
  5. Data Templates: Utilize pre-built templates for synthetic data generation in popular domains like e-commerce, finance, and healthcare.
  6. Data Visualization: Visualize distributions and relationships in synthetic and real data with histograms, KDE plots, and pairwise plots.
  7. Data Summary: Easily generate summary statistics for both synthetic and real data.

Installation

Ghosted is compatible with Python 3.12+. Install the library using pip:

pip install ghosted

Usage

1. Data Generation

Ghosted's DataGenerator allows you to generate synthetic data from a variety of common statistical distributions. Supported distributions include, but are not limited to:

Distribution   Parameters
Uniform        min, max
Normal         mean, std
Binomial       n, p
Poisson        lambda
Geometric      p
Exponential    lambda
Categorical    categories, probabilities
Lognormal      mean, std
Beta           alpha, beta
Gamma          shape, rate
Multinomial    n, probabilities
Pareto         shape
Weibull        shape, scale
Triangular     low, mode, high
Bernoulli      p

Example

from ghosted.data_generator import DataGenerator

# Define the column specifications
column_spec = {
    'age': {'distribution': 'normal', 'mean': 35, 'std': 5},
    'income': {'distribution': 'lognormal', 'mean': 10, 'std': 2},
    'purchased': {'distribution': 'bernoulli', 'p': 0.3}
}

# Generate data
generator = DataGenerator()
synthetic_df = generator.generate_synthetic_data(column_spec, num_samples=1000)
print(synthetic_df.head())
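
Parameter names such as mean, std, and rate can follow different conventions from one library to another. The sketch below uses only Python's standard library, not Ghosted, to show two common ones: lognormal parameters given in log space, and a gamma rate converted to a scale. Whether Ghosted interprets its parameters this way is an assumption to check against its source; the variable names are illustrative.

```python
import math
import random
import statistics

rng = random.Random(0)  # seeded so the run is reproducible

# Lognormal: 'mean' and 'std' are treated here as parameters of the
# underlying normal (log space), the convention random.lognormvariate uses.
mu, sigma = 10, 2
incomes = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
median_income = statistics.median(incomes)
theoretical_median = math.exp(mu)  # about 22,026

# Gamma given as (shape, rate): the stdlib expects a scale, so scale = 1 / rate.
shape, rate = 2.0, 0.5
gammas = [rng.gammavariate(shape, 1.0 / rate) for _ in range(20_000)]
gamma_mean = statistics.fmean(gammas)
theoretical_gamma_mean = shape / rate  # 4.0

# Exponential with rate lambda: the mean is 1 / lambda.
lam = 0.25
expos = [rng.expovariate(lam) for _ in range(20_000)]
expo_mean = statistics.fmean(expos)

print(f"lognormal median: {median_income:,.0f} (theory {theoretical_median:,.0f})")
print(f"gamma mean: {gamma_mean:.2f} (theory {theoretical_gamma_mean:.2f})")
print(f"exponential mean: {expo_mean:.2f} (theory {1 / lam:.2f})")
```

Under the log-space reading, an 'income' column with mean 10 and std 2 has a median near 22,000 and a very heavy right tail, so it is worth inspecting the generated values before relying on them.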

2. Data Blending

The DataBlender class combines real and synthetic data, preserving the original distribution of the real dataset. This is especially useful for privacy-preserving applications.

Example

from ghosted.data_blender import DataBlender
import pandas as pd

# Real dataset
df = pd.DataFrame({'age': [25, 45, 30], 'income': [50000, 80000, 55000]})

# Blend data
blender = DataBlender()
blended_df = blender.blend_data(df, num_samples=100)
print(blended_df.head())
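
To make the idea of blending concrete, here is a minimal sketch that does not use Ghosted. It fits a normal distribution to each real column independently and appends sampled rows. The helper blend_columns is hypothetical, and the per-column normal fit is an assumption chosen for simplicity, not a description of how DataBlender works internally.

```python
import random
import statistics

def blend_columns(real, num_samples, seed=0):
    """Append num_samples synthetic values per column, drawn from a normal
    distribution fitted to that column (hypothetical helper, not Ghosted)."""
    rng = random.Random(seed)
    blended = {}
    for name, values in real.items():
        mean = statistics.fmean(values)
        std = statistics.stdev(values)  # sample standard deviation
        synthetic = [rng.gauss(mean, std) for _ in range(num_samples)]
        # Real rows come first, synthetic rows follow.
        blended[name] = list(values) + synthetic
    return blended

real = {'age': [25, 45, 30], 'income': [50000, 80000, 55000]}
blended = blend_columns(real, num_samples=100)
print(len(blended['age']), blended['age'][:5])
```

Because each column is sampled on its own, this sketch matches the marginal distributions only and discards any relationship between columns.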

Correlation Preservation

Ghosted preserves correlations between selected features, enabling better emulation of real-world data relationships.

# Example with correlation preservation
blended_df_with_corr = blender.blend_data(df, num_samples=100, columns_with_correlation=['age', 'income'])
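
One common way to preserve the correlation between two numeric columns is to sample from a bivariate normal whose means, standard deviations, and correlation match the real data. The sketch below does this with the standard library only. The helpers pearson and sample_correlated are hypothetical, and the bivariate normal is an assumed technique; the source does not say which method Ghosted uses.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

def sample_correlated(xs, ys, num_samples, seed=0):
    """Draw (x, y) pairs from a bivariate normal fitted to xs and ys."""
    rng = random.Random(seed)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    rho = pearson(xs, ys)
    # Guard against rho drifting slightly above 1 through rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    pairs = []
    for _ in range(num_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # 2x2 Cholesky factor: y shares z1 with x in proportion to rho.
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)
        pairs.append((x, y))
    return rho, pairs

ages = [25, 45, 30]
incomes = [50000, 80000, 55000]
rho, pairs = sample_correlated(ages, incomes, num_samples=2000)
sampled_rho = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"real correlation: {rho:.3f}, sampled correlation: {sampled_rho:.3f}")
```

With only three real rows the estimated correlation (roughly 0.996 here) is unreliable, so a real application would want a much larger reference dataset.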

3. Anomaly Injection

The AnomalyInjector class allows you to add anomalies into any dataframe, supporting both extreme values and pattern-breaking injections for numeric and categorical data.

from ghosted.anomaly_injector import AnomalyInjector

# Reuses df, the real dataset defined in the Data Blending example
injector = AnomalyInjector(anomaly_percentage=0.05, random_seed=42)
df_with_anomalies = injector.inject_anomalies(df, columns=['income'], anomaly_type="extreme_value", factor=3)
print(df_with_anomalies.head())
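
As an illustration of what an extreme-value injection could look like, the sketch below scales a random fraction of values by a factor. It is written against plain lists rather than Ghosted. The helper inject_extreme_values is hypothetical, and multiplying by factor is an assumed interpretation of the factor parameter, not confirmed behavior of AnomalyInjector.

```python
import random

def inject_extreme_values(values, anomaly_percentage, factor, seed=42):
    """Return a copy of values with a random fraction multiplied by factor,
    together with the indices that were changed (hypothetical helper)."""
    rng = random.Random(seed)
    # At least one anomaly for non-empty input, never more than len(values).
    n_anomalies = min(len(values), max(1, round(len(values) * anomaly_percentage)))
    indices = sorted(rng.sample(range(len(values)), n_anomalies))
    result = list(values)
    for i in indices:
        result[i] = values[i] * factor
    return result, indices

incomes = [50000 + 1000 * i for i in range(40)]
with_anomalies, changed = inject_extreme_values(
    incomes, anomaly_percentage=0.05, factor=3
)
print(changed, [with_anomalies[i] for i in changed])
```

Returning the changed indices alongside the data gives ground-truth labels for evaluating an anomaly detector.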

4. Noise Injection

The NoiseInjector class lets you add controlled noise to numeric or categorical data, ideal for testing model robustness to noise.

from ghosted.noise_injector import NoiseInjector

# Reuses df, the real dataset defined in the Data Blending example
noise_injector = NoiseInjector(noise_percentage=0.1, noise_intensity=0.2, random_seed=42)
noisy_df = noise_injector.inject_noise(df, columns=['income'], noise_type="gaussian")
print(noisy_df.head())
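
The following sketch shows one plausible reading of Gaussian noise injection: a fraction of the values receives zero-mean noise whose standard deviation is noise_intensity times the column's standard deviation. The helper inject_gaussian_noise is hypothetical and uses only the standard library; how NoiseInjector actually scales noise_intensity is an assumption here.

```python
import random
import statistics

def inject_gaussian_noise(values, noise_percentage, noise_intensity, seed=42):
    """Add zero-mean Gaussian noise to a random fraction of values.

    The noise scale is noise_intensity times the population standard
    deviation of the column (an assumed convention, not Ghosted's).
    """
    rng = random.Random(seed)
    scale = noise_intensity * statistics.pstdev(values)
    n_noisy = round(len(values) * noise_percentage)
    indices = sorted(rng.sample(range(len(values)), n_noisy))
    noisy = [float(v) for v in values]
    for i in indices:
        noisy[i] += rng.gauss(0, scale)
    return noisy, indices

incomes = [50000 + 1000 * i for i in range(40)]
noisy, touched = inject_gaussian_noise(
    incomes, noise_percentage=0.1, noise_intensity=0.2
)
for i in touched:
    print(i, incomes[i], round(noisy[i], 1))
```

Tying the noise scale to the column's own spread keeps the perturbation comparable across columns with very different units, such as age and income.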

5. Data Templates

Generate domain-specific synthetic datasets with pre-configured templates using GenerateDataFromTemplate.

from ghosted.generate_data_from_template import GenerateDataFromTemplate

# Instantiate the template generator
template_gen = GenerateDataFromTemplate()

# View available templates
template_gen.list_templates()

# Generate a dataset for e-commerce recommendations
e_commerce_df = template_gen.generate_data('e_commerce_recommendation', num_customers=100, num_products=50)
print(e_commerce_df.head())

6. Data Visualization

Ghosted provides built-in visualization features within the SynthDataFrame class, enabling you to explore distributions and relationships in synthetic and blended data.

Key Visualization Options

  • KDE Plot: .visualize(kind="kde") plots the estimated density of each numeric column.
  • Histogram: .visualize(kind="hist") plots a histogram of each numeric column.
  • Categorical Counts: .visualize() draws bar charts of value counts for categorical columns.
  • Pairwise Plot: .pairwise_plot(columns=[...]) plots the selected columns against each other to reveal correlations.

# Visualize distributions for all numerical columns
blended_df.visualize(kind="kde")

# Pairwise plot for specific columns (blended_df contains only 'age' and 'income')
blended_df.pairwise_plot(columns=["age", "income"])

7. Data Summary

Generate summary statistics for datasets, including synthetic and real data comparisons.

# Get summary statistics for blended data
blended_df.summarize()
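
For readers who want to see what a real-versus-synthetic comparison might contain, here is a small standard-library sketch. The helper summarize below is hypothetical and unrelated to the .summarize() method above, whose output format is not documented here; the synthetic_ages values are made up purely for illustration.

```python
import statistics

def summarize(values):
    """Basic summary statistics for one numeric column (hypothetical helper)."""
    return {
        'count': len(values),
        'mean': statistics.fmean(values),
        'std': statistics.stdev(values),
        'min': min(values),
        'median': statistics.median(values),
        'max': max(values),
    }

real_ages = [25, 45, 30]
synthetic_ages = [28, 41, 33, 36, 24]  # illustrative values, not library output

real_summary = summarize(real_ages)
synthetic_summary = summarize(synthetic_ages)

# Print the two summaries side by side for a quick visual comparison.
for key in real_summary:
    print(f"{key:>6}  real={real_summary[key]:>8.2f}  synthetic={synthetic_summary[key]:>8.2f}")
```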

Contributing

We welcome contributions! If you would like to improve or expand Ghosted, please submit a pull request or open an issue.

License

Ghosted is licensed under the MIT License.
