RAISE Synthetic Data Generator
A Python package to generate shareable versions of tabular datasets ready to upload to the RAISE platform. The package can also be used beyond the scope of the RAISE project as a lightweight synthetic data generator.
Features
The package currently provides a single main function: generate_synthetic_data.
Main capabilities
- Input flexibility. Accepts either:
  - A CSV file path, or
  - A pandas.DataFrame.
- Automatic or manual model selection
  - "auto-select" (default) automatically chooses the best synthetic data generation model based on the input data.
  - You can also specify a model explicitly (for the moment, one of CTGAN, TVAE or Copulas).
- Synthetic data generation
  - Generates a synthetic dataset with the same properties as the input data.
  - The number of synthetic samples to generate can be specified via n_samples (defaults to the size of the input data).
- Results storage
  - Creates a run-specific folder under the desired output path.
  - Saves the generated synthetic dataset as synthetic_data.csv.
  - Stores model information (info.txt) inside the chosen output folder.
- Evaluation report (optional)
  - If evaluation_report=True (default), runs a quality assessment comparing original vs synthetic data.
  - Produces an evaluation report (evaluation_report.pdf) with figures and summary statistics.
- Logging and error handling
  - Provides informative log messages for each step (dataset loading, model selection, data generation, report creation).
  - Exceptions are logged with full traceback and re-raised for debugging.
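The logging and error-handling behaviour listed above follows a common Python pattern. The sketch below illustrates that pattern with the standard logging module only. It is not taken from the package's source: run_step, the logger name, and the step names are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("synthetic_data_demo")  # hypothetical logger name


def run_step(name, func, *args, **kwargs):
    """Run one pipeline step, logging its start, its end, and any failure."""
    logger.info("Starting step: %s", name)
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception logs at ERROR level and appends the full traceback
        logger.exception("Step failed: %s", name)
        raise  # re-raise so the caller can still debug or handle the error
    logger.info("Finished step: %s", name)
    return result
```

For example, run_step("dataset loading", load, path) would log both messages on success, while a failing step is logged with its traceback and the original exception still reaches the caller (load and path are placeholders here).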
Installation
You can install raise-synthetic-data-generator directly from PyPI using pip:
pip install raise-synthetic-data-generator
Usage
from raise_synthetic_data_generator import generate_synthetic_data
import pandas as pd

# Example input dataframe
df = pd.DataFrame(
    {"age": [23, 35, 44, 29, 31], "country": ["ES", "FR", "DE", "IT", "ES"]}
)

# Generate synthetic data (in memory + saved to disk)
generate_synthetic_data(
    dataset=df,  # a CSV file path can be given instead
    selected_model="auto-select",  # or explicitly: "CTGAN", "TVAE" or "Copulas"
    n_samples=10,  # number of synthetic samples to generate
    evaluation_report=True,  # if True, an evaluation PDF report is generated
    output_dir="results",  # base output directory (if None, a results path will be created)
    run_name="demo-run",  # optional run name: the subfolder where generated files are stored (if None, a subfolder will be created)
)
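The same call can also start from a CSV file with an explicitly chosen model. The following is a minimal sketch under stated assumptions: it uses only the parameters shown above, passes the path as a plain string (whether pathlib.Path objects are accepted is not stated here), and does not rely on any particular return value. The helper run_from_csv and the file name input.csv are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Write a tiny example CSV (same toy data as the DataFrame example above).
work_dir = Path(tempfile.mkdtemp())
csv_path = work_dir / "input.csv"
rows = [(23, "ES"), (35, "FR"), (44, "DE"), (29, "IT"), (31, "ES")]
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["age", "country"])
    writer.writerows(rows)


def run_from_csv(path, output_dir):
    """Generate synthetic data from a CSV path with an explicitly chosen model."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from raise_synthetic_data_generator import generate_synthetic_data

    return generate_synthetic_data(
        dataset=str(path),        # CSV file path instead of a DataFrame
        selected_model="TVAE",    # explicit model instead of "auto-select"
        evaluation_report=False,  # skip the PDF evaluation report
        output_dir=str(output_dir),
        run_name="csv-run",
    )
```

With the package installed, run_from_csv(csv_path, work_dir / "results") would run the generation. Since n_samples is omitted, the number of synthetic rows should default to the size of the input data.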
This will save the following in the specified output folder:
- The generated synthetic dataset (synthetic_data.csv).
- A text file with the applied model information (info.txt).
- (If selected) A folder with the resulting evaluation figures (evaluation_figures).
- (If selected) A PDF report with synthetic data quality evaluation results (evaluation_report.pdf).
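After a run, it can be useful to confirm that the expected outputs are present. The helper below is a hypothetical sketch, not part of the package: missing_outputs and its constants are illustrative names, and only the file and folder names come from the list above.

```python
from pathlib import Path

# Output names as listed above; the last two are only expected
# when evaluation_report=True.
ALWAYS_EXPECTED = ["synthetic_data.csv", "info.txt"]
REPORT_ONLY = ["evaluation_figures", "evaluation_report.pdf"]


def missing_outputs(run_dir, evaluation_report=True):
    """Return the expected output names that are absent from run_dir."""
    run_dir = Path(run_dir)
    expected = ALWAYS_EXPECTED + (REPORT_ONLY if evaluation_report else [])
    return [name for name in expected if not (run_dir / name).exists()]
```

For the usage example, missing_outputs(Path("results") / "demo-run") should return an empty list, assuming the run folder is located at output_dir/run_name.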
Usage Examples
Code examples demonstrating how to use the raise-synthetic-data-generator package are provided in the examples folder of the repository.
To get started, check the examples folder for scripts and notebooks, such as:
- generate_synthetic_data.ipynb: A Jupyter Notebook with step-by-step instructions for generating synthetic data from your dataset.
License
This project is licensed under the European Union Public License (EUPL) version 1.2. See the LICENSE file for more details.
Contributing
We welcome contributions! If you'd like to contribute, please fork the repository, make changes, and submit a pull request. Contributions are subject to the terms of the EUPL license.
Contact
For any inquiries, feel free to reach out via the following email: info@raise-science.eu. More about the project: https://raise-science.eu
Download files
File details
Details for the file raise_synthetic_data_generator-0.1.1.tar.gz.
File metadata
- Download URL: raise_synthetic_data_generator-0.1.1.tar.gz
- Upload date:
- Size: 167.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4eebaf2a5f4ae101793e4d674cd0095737ef46705da0f33dc5f98637fab69b4 |
| MD5 | 8fc92bc4aa2efd3f7b033d49e64164f1 |
| BLAKE2b-256 | 71a12a00e727fd84a2082ba6ae9f5fa4ef3da9a7202ee0cb445118dbfff67422 |
File details
Details for the file raise_synthetic_data_generator-0.1.1-py3-none-any.whl.
File metadata
- Download URL: raise_synthetic_data_generator-0.1.1-py3-none-any.whl
- Upload date:
- Size: 176.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 814d9a27659ba46d4121d6e6be38248d157ea95974e6230b99c273ae919c0429 |
| MD5 | 4317b89962c8b47cd044c702d4fe4c66 |
| BLAKE2b-256 | 5f83b40aec193f66b9c14cdfac6922689b21f8ecf7bcebed5e5a226da6b6940e |