A Python library for synthetic data generation, data blending, anomaly injection, and noise injection.
Ghosted: Synthetic Data Generation and Augmentation Library
Ghosted is a Python library for generating synthetic data, augmenting existing datasets, and visualizing complex data distributions. It’s designed for data scientists, researchers, and developers who need to generate realistic synthetic data or blend synthetic data with real data for privacy-preserving applications, prototyping, testing, and educational purposes.
Features
Ghosted covers synthetic data generation, blending, and augmentation. Its primary capabilities are:
- Data Generator: Generate synthetic data with various statistical distributions.
- Data Blender: Seamlessly blend synthetic data with real data to mimic the original distribution.
- Anomaly Injection: Add realistic anomalies to datasets for robust testing.
- Noise Injection: Inject noise into data for simulations and noise-tolerance testing.
- Data Templates: Utilize pre-built templates for synthetic data generation in popular domains like e-commerce, finance, and healthcare.
- Data Visualization: Visualize distributions and relationships in synthetic and real data with histograms, KDE plots, and pairwise plots.
- Data Summary: Easily generate summary statistics for both synthetic and real data.
Installation
Ghosted is compatible with Python 3.12+. Install the library using pip:
pip install ghosted
Usage
1. Data Generation
Ghosted's DataGenerator allows you to generate synthetic data from a variety of common statistical distributions. Supported distributions include, but are not limited to:
| Distribution | Parameters |
|---|---|
| Uniform | min, max |
| Normal | mean, std |
| Binomial | n, p |
| Poisson | lambda |
| Geometric | p |
| Exponential | lambda |
| Categorical | categories, probabilities |
| Lognormal | mean, std |
| Beta | alpha, beta |
| Gamma | shape, rate |
| Multinomial | n, probabilities |
| Pareto | shape |
| Weibull | shape, scale |
| Triangular | low, mode, high |
| Bernoulli | p |
Example
from ghosted.data_generator import DataGenerator
# Define the column specifications
column_spec = {
'age': {'distribution': 'normal', 'mean': 35, 'std': 5},
'income': {'distribution': 'lognormal', 'mean': 10, 'std': 2},
'purchased': {'distribution': 'bernoulli', 'p': 0.3}
}
# Generate data
generator = DataGenerator()
synthetic_df = generator.generate_synthetic_data(column_spec, num_samples=1000)
print(synthetic_df.head())
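The parameter names in the table do not always match the conventions of other sampling libraries, so one possible mapping is sketched below. This is not Ghosted's implementation: it uses only Python's standard `random` module, and the `sample_column` helper is hypothetical. It assumes that the lognormal `mean` and `std` describe the underlying normal distribution and that the gamma `rate` is the inverse of the scale; neither is confirmed by the description above.

```python
import random

def sample_column(spec, num_samples, rng):
    """Draw num_samples values for one column spec (hypothetical helper)."""
    dist = spec["distribution"]
    if dist == "normal":
        return [rng.gauss(spec["mean"], spec["std"]) for _ in range(num_samples)]
    if dist == "lognormal":
        # Assumption: mean/std parameterize the underlying normal distribution.
        return [rng.lognormvariate(spec["mean"], spec["std"]) for _ in range(num_samples)]
    if dist == "exponential":
        # random.expovariate takes the rate (lambda) directly.
        return [rng.expovariate(spec["lambda"]) for _ in range(num_samples)]
    if dist == "gamma":
        # random.gammavariate takes (shape, scale); assume scale = 1 / rate.
        return [rng.gammavariate(spec["shape"], 1.0 / spec["rate"]) for _ in range(num_samples)]
    if dist == "bernoulli":
        return [1 if rng.random() < spec["p"] else 0 for _ in range(num_samples)]
    raise ValueError(f"unsupported distribution: {dist}")

column_spec = {
    "age": {"distribution": "normal", "mean": 35, "std": 5},
    "income": {"distribution": "lognormal", "mean": 10, "std": 2},
    "purchased": {"distribution": "bernoulli", "p": 0.3},
}

rng = random.Random(42)  # seeded so the sketch is reproducible
rows = {name: sample_column(spec, 1000, rng) for name, spec in column_spec.items()}
print({name: len(values) for name, values in rows.items()})
```

Because `lambda` is a reserved word in Python, it can appear as a dictionary key in a column spec but not as a keyword argument.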
2. Data Blending
The DataBlender class combines real and synthetic data, preserving the original distribution of the real dataset. This is especially useful for privacy-preserving applications.
Example
from ghosted.data_blender import DataBlender
import pandas as pd
# Real dataset
df = pd.DataFrame({'age': [25, 45, 30], 'income': [50000, 80000, 55000]})
# Blend data
blender = DataBlender()
blended_df = blender.blend_data(df, num_samples=100)
print(blended_df.head())
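To make the idea of blending concrete, here is a minimal sketch that fits a normal distribution to each real column and appends synthetic draws to the real rows. It is an illustration only, not how `DataBlender` works internally: the `blend_columns` helper is hypothetical, the per-column normal fit is an assumption, and it is also an assumption that the blended result contains the real rows alongside the synthetic ones.

```python
import random
import statistics

def blend_columns(real, num_samples, seed=0):
    """Append num_samples synthetic values per column, drawn from a fitted normal."""
    rng = random.Random(seed)
    blended = {}
    for name, values in real.items():
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)  # sample standard deviation
        synthetic = [rng.gauss(mu, sigma) for _ in range(num_samples)]
        blended[name] = list(values) + synthetic
    return blended

real = {"age": [25, 45, 30], "income": [50000, 80000, 55000]}
blended = blend_columns(real, num_samples=100)
print(len(blended["age"]))  # 103 values: 3 real + 100 synthetic
```

Sampling each column independently like this discards any relationship between columns, which is the gap that correlation preservation addresses.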
Correlation Preservation
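Ghosted can preserve correlations between selected features so that blended data better reflects relationships in the real data. Pass the relevant column names through the columns_with_correlation argument of blend_data.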
# Example with correlation preservation
blended_df_with_corr = blender.blend_data(df, num_samples=100, columns_with_correlation=['age', 'income'])
3. Anomaly Injection
The AnomalyInjector class allows you to add anomalies into any dataframe, supporting both extreme values and pattern-breaking injections for numeric and categorical data.
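from ghosted.anomaly_injector import AnomalyInjector
# df is the real dataset defined in the Data Blending example
# Inject anomalies into 5% of the rows, with a fixed seed for reproducibility
injector = AnomalyInjector(anomaly_percentage=0.05, random_seed=42)
df_with_anomalies = injector.inject_anomalies(df, columns=['income'], anomaly_type="extreme_value", factor=3)
print(df_with_anomalies.head())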
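The sketch below shows one plausible reading of an extreme-value injection: a fixed share of rows is replaced by values lying `factor` standard deviations from the column mean. The exact rule that `AnomalyInjector` applies is not documented above, so this definition is an assumption, and `inject_extreme_values` is a hypothetical helper built on the standard library.

```python
import random
import statistics

def inject_extreme_values(values, anomaly_percentage, factor, seed=None):
    """Return a copy of values with a share of entries replaced by outliers."""
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    n_anomalies = round(len(values) * anomaly_percentage)
    picked = sorted(rng.sample(range(len(values)), n_anomalies))
    result = list(values)
    for i in picked:
        # Assumption: an "extreme value" sits factor standard deviations
        # from the column mean, on a randomly chosen side.
        result[i] = mu + rng.choice((-1, 1)) * factor * sigma
    return result, picked

income = [50000 + 100 * i for i in range(200)]  # illustrative data
with_anomalies, anomaly_rows = inject_extreme_values(income, 0.05, factor=3, seed=42)
print(len(anomaly_rows))  # 10 of the 200 rows are replaced
```

Returning the affected row indices alongside the data makes it easy to check later whether a detector found the injected anomalies.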
4. Noise Injection
The NoiseInjector class lets you add controlled noise to numeric or categorical data, ideal for testing model robustness to noise.
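from ghosted.noise_injector import NoiseInjector
# df is the real dataset defined in the Data Blending example
# Add Gaussian noise to 10% of the rows, with a fixed seed for reproducibility
noise_injector = NoiseInjector(noise_percentage=0.1, noise_intensity=0.2, random_seed=42)
noisy_df = noise_injector.inject_noise(df, columns=['income'], noise_type="gaussian")
print(noisy_df.head())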
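As a rough illustration of how the two noise parameters might interact, the sketch below perturbs a chosen share of rows with zero-mean Gaussian noise whose standard deviation is a fraction of the column's own. This interpretation of noise_percentage and noise_intensity is an assumption rather than documented Ghosted behavior, and `inject_gaussian_noise` is a hypothetical helper.

```python
import random
import statistics

def inject_gaussian_noise(values, noise_percentage, noise_intensity, seed=None):
    """Return a copy of values with Gaussian noise added to a share of entries."""
    rng = random.Random(seed)
    # Assumption: noise scale is noise_intensity times the column's std.
    sigma = statistics.stdev(values) * noise_intensity
    n_noisy = round(len(values) * noise_percentage)
    picked = set(rng.sample(range(len(values)), n_noisy))
    return [v + rng.gauss(0, sigma) if i in picked else v
            for i, v in enumerate(values)]

income = [50000 + 100 * i for i in range(200)]  # illustrative data
noisy_income = inject_gaussian_noise(income, noise_percentage=0.1,
                                     noise_intensity=0.2, seed=42)
changed = sum(1 for a, b in zip(income, noisy_income) if a != b)
print(changed)
```

Unlike the extreme-value injection, the perturbed values here stay close to the originals, which suits robustness testing rather than anomaly detection.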
5. Data Templates
Generate domain-specific synthetic datasets with pre-configured templates using GenerateDataFromTemplate.
from ghosted.generate_data_from_template import GenerateDataFromTemplate
# Instantiate the template generator
template_gen = GenerateDataFromTemplate()
# View available templates
template_gen.list_templates()
# Generate a dataset for e-commerce recommendations
e_commerce_df = template_gen.generate_data('e_commerce_recommendation', num_customers=100, num_products=50)
print(e_commerce_df.head())
6. Data Visualization
Ghosted provides built-in visualization features within the SynthDataFrame class, enabling you to explore distributions and relationships in synthetic and blended data.
Key Visualization Options
- KDE Plot: .visualize(kind="kde") visualizes distribution densities.
- Histogram: .visualize(kind="hist") shows data distributions in histogram form.
- Categorical Counts: Visualize bar charts for categorical columns within .visualize().
- Pairwise Plot: Analyze correlations with .pairwise_plot(columns=[...]).
# Visualize distributions for all numerical columns
blended_df.visualize(kind="kde")
# Pairwise plot for specific columns
blended_df.pairwise_plot(columns=["age", "income"])
7. Data Summary
Generate summary statistics for datasets, including synthetic and real data comparisons.
# Get summary statistics for blended data
blended_df.summarize()
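The output format of summarize() is not shown above. As a rough guide to what a real-versus-synthetic comparison involves, the sketch below computes a few basic statistics per column with the standard library. The `summarize_column` helper is hypothetical and the synthetic values are made up for illustration.

```python
import statistics

def summarize_column(values):
    """Return basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

real_age = [25, 45, 30]
synthetic_age = [28, 41, 33, 36]  # made-up values for illustration

# Print the two summaries one after the other for a quick comparison.
for label, column in (("real", real_age), ("synthetic", synthetic_age)):
    print(label, summarize_column(column))
```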
Contributing
We welcome contributions! If you would like to improve or expand Ghosted, please submit a pull request or open an issue.
License
Ghosted is licensed under the MIT License.