MalDataGen - Tabular Data Generator

Project description

MalDataGen

Version 1.0.0 (Jellyfish)

MalDataGen is an advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models. Designed specifically for cybersecurity researchers and malware detection practitioners, it provides reproducible pipelines with fine-grained control over model configuration and integrated evaluation metrics for realistic data synthesis.

The framework supports state-of-the-art generative architectures including GANs (CGAN, WGAN, WGAN-GP), autoencoder-based models (Autoencoder, VAE, VQ-VAE), Diffusion Models (Denoising and Latent), and traditional methods like SMOTE. It also integrates with the Synthetic Data Vault (SDV) library to provide additional models such as TVAE, CTGAN, and Copula-based generators.

Installation

Install from source:

git clone https://github.com/SBSeg25/MalDataGen.git
cd MalDataGen
pip install -r requirements.txt

Or use pip directly:

pip install maldatagen

Requirements: Python 3.8+, pip. Optional: CUDA 11+ for GPU acceleration.
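
A quick way to confirm the pip route worked is to ask Python's packaging metadata which version is present. The sketch below uses only the standard library. It assumes the distribution name maldatagen from the pip command above and deliberately does not import the package, since the import name is not stated here. The helper name installed_version is illustrative.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name taken from `pip install maldatagen`.
    print("maldatagen version:", installed_version("maldatagen"))
```

A source checkout that only ran pip install -r requirements.txt does not register the distribution, so None is the expected result in that case.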

Docker execution is also supported via the run_demo_docker.sh and run_experiments_docker.sh scripts. Note that Docker execution requires sudo permissions for the Docker engine, whereas local execution runs without elevated privileges.

Features and Capabilities

MalDataGen provides a comprehensive toolkit for synthetic data generation and evaluation. The framework implements cross-validation with stratified k-fold splitting, fully customizable model configurations, and built-in metrics for assessing data quality. All models and experiments can be persisted for reproducibility, and the system includes graphing utilities for generating publication-ready visualizations including clustering plots, heatmaps comparing synthetic and real samples, confusion matrices, and performance bar graphs.
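
To make the stratified k-fold idea concrete, here is a minimal standard-library sketch, not the framework's own splitter. It groups sample indices by class, shuffles each group, and deals them round-robin into folds so every fold keeps roughly the original class proportions. The function name and parameters are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


if __name__ == "__main__":
    labels = [0] * 8 + [1] * 4  # toy labels with a 2:1 class imbalance
    for train_idx, test_idx in stratified_kfold(labels, n_splits=4):
        print(test_idx, [labels[i] for i in test_idx])
```

In practice a library splitter such as scikit-learn's StratifiedKFold is the usual choice; the version above only illustrates why each fold sees both classes even when the dataset is imbalanced.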

The evaluation strategy supports two complementary approaches: TS-TR (Train Synthetic, Test Real), which measures generalization ability by training on synthetic data and testing on real data, and TR-TS (Train Real, Test Synthetic), which assesses generative realism by training on real samples and testing on synthetic ones. Both methods report Accuracy, Precision, Recall, F1-score, Specificity (also called TNR), ROC-AUC, MSE, MAE, and FNR, as well as secondary metrics such as Euclidean Distance, Hellinger Distance, Log-Likelihood, and Manhattan Distance.
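
Whichever direction is used, the classification metrics reduce to counts from a confusion matrix: in TS-TR the predictions come from a classifier trained on synthetic data and scored against real labels, and in TR-TS the roles swap. The sketch below shows the textbook definitions for binary labels along with three of the distance measures. It is an illustration under those standard definitions, not the framework's implementation; in particular, how MalDataGen prepares feature vectors or distributions before measuring distance is not specified here.

```python
import math


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Report 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),  # same quantity as TNR
        "fnr": ratio(fn, fn + tp),
    }


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q)) / 2)
```

For example, binary_metrics(real_labels, predictions) would be called once per fold, with predictions produced by the classifier trained on the opposite kind of data.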

Supported Models

The framework includes nine native generative models and three third-party models via SDV integration. Native models include CGAN for conditional generation with class balancing, WGAN and WGAN-GP for stable training on imbalanced datasets using Wasserstein distance, standard and Variational Autoencoders for latent space learning, Denoising and Latent Diffusion models for high-quality sample generation, VQ-VAE for discrete latent representations, and SMOTE for traditional interpolation-based oversampling. Third-party models from SDV include TVAE optimized for tabular data, Copula for preserving statistical dependencies, and CTGAN with mode-specific normalization for mixed-type data.
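
As an illustration of the interpolation-based oversampling mentioned for SMOTE, the sketch below creates synthetic minority rows by picking a minority sample, choosing one of its k nearest minority neighbours, and placing a new point at a random position on the segment between them. This follows the general SMOTE idea only; the function name, parameters, and defaults are assumptions, and the framework's own SMOTE component may differ.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    `minority` is a list of numeric feature rows and needs at least two rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in smote_sample(minority_rows, k=2, n_new=3):
        print(row)
```

Because every new row lies between two existing minority rows, SMOTE cannot produce values outside the range already observed, which is one reason the learned generators listed above are offered alongside it.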

Output Structure

After execution, the framework generates a comprehensive output structure organized by model. Each model folder contains five subdirectories: Data Generated (synthetic datasets and partitioned real data subsets), Evaluation Results (clustering visualizations, heatmaps, confusion matrices, and metric bar graphs), Logs (execution logs), Monitor (raw monitoring data), and Models Saved (serialized models for each fold if saving is enabled). Additionally, a comparative PDF report for SVM classifier performance across all models is generated in the project root.
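
A small helper can confirm that a finished run produced the layout described above. The sketch below assumes the on-disk folder names match the labels in this description exactly, which may not hold (spacing or capitalization could differ), and the results root passed in is whatever directory holds the per-model folders. Both function names are illustrative.

```python
from pathlib import Path

# Subdirectory labels as listed in the description; on-disk spelling is assumed.
EXPECTED_SUBDIRS = (
    "Data Generated",
    "Evaluation Results",
    "Logs",
    "Monitor",
    "Models Saved",
)


def missing_subdirs(model_dir):
    """Return the expected subdirectory names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_SUBDIRS if not (model_dir / name).is_dir()]


def summarize_results(results_root):
    """Map each model folder under results_root to its missing subdirectories."""
    root = Path(results_root)
    return {p.name: missing_subdirs(p) for p in sorted(root.iterdir()) if p.is_dir()}
```

An empty list for a model means all five folders were found. A missing Models Saved entry is not necessarily an error, since models are only serialized when saving is enabled.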

System Requirements

The framework runs on Linux (Ubuntu 22.04+ preferred) with Python 3.8.10 or higher. Minimum requirements are any x86_64 CPU with 4 GB RAM and 10 GB storage. Recommended configuration includes a multi-core CPU (Intel i5 or AMD Ryzen 5+), 8 GB+ RAM, and 20 GB SSD storage. GPU acceleration via NVIDIA cards with CUDA 11+ is optional but recommended for faster training. Docker 27.2.1+ is optional for containerized execution.
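
The stated minimums can be checked before a long experiment with a short preflight script. The sketch below is an assumption-laden convenience, not part of MalDataGen: it covers the Python version, operating system, CPU architecture, and free disk space using the standard library only. RAM and GPU are left out because the standard library has no portable call for them.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 8, 10)  # minimum interpreter version from the requirements
MIN_FREE_GB = 10         # minimum storage from the requirements


def preflight(path="."):
    """Collect simple pass/fail checks against the stated minimum requirements."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python>=3.8.10": sys.version_info[:3] >= MIN_PYTHON,
        "linux": platform.system() == "Linux",
        "x86_64": platform.machine() in ("x86_64", "AMD64"),
        "free_disk>=10GB": free_gb >= MIN_FREE_GB,
    }


if __name__ == "__main__":
    for check, passed in preflight().items():
        print(f"{check}: {'ok' if passed else 'NOT MET'}")
```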

Documentation and Resources

Complete documentation is available in the repository. The Docs/ directory contains API reference documentation, Docs/Diagrams/ provides eight comprehensive architecture diagrams created with Mermaid notation, and Docs/Overview.md explains model architectures in detail. The project website at https://kayua.github.io/SyntheticDataGen.github.io/ provides additional resources, and demonstration videos are available at https://drive.google.com/file/d/1sbPZ1x5Np6zolhFvCBWoMzqNqrthlUe3/view (backup: https://youtu.be/t-AZtsLJUlQ).

Citation

If you use MalDataGen in your research, please cite:

@inproceedings{sbseg25_maldatagen,
 author = {Kayuã Paim and Angelo Nogueira and Diego Kreutz and Weverton Cordeiro and Rodrigo Mansilha},
 title = {MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection},
 booktitle = {Companion Proceedings of the 25th Brazilian Symposium on Cybersecurity},
 location = {Foz do Iguaçu/PR},
 year = {2025},
 pages = {38--47},
 publisher = {SBC},
 address = {Porto Alegre, RS, Brasil},
 doi = {10.5753/sbseg_estendido.2025.12113},
 url = {https://sol.sbc.org.br/index.php/sbseg_estendido/article/view/36739}
}

Awards and Recognition

MalDataGen received the Highlighted Artifact award at SBSEG 2025 and was recognized as the Best Tool of SBSEG 2025. Award details are available at https://doc-artefatos.github.io/sbseg2025/results.html and https://sbseg2025.ppgia.pucpr.br/wp-content/uploads/2025/09/PremiacaoSBSEG-2025.pdf.

Key References

The framework builds upon foundational work in generative modeling, including Kingma & Welling (2013) on Variational Autoencoders, Goodfellow et al. (2014) on Generative Adversarial Networks, Ho et al. (2020) on Denoising Diffusion Probabilistic Models, Arjovsky et al. (2017) on Wasserstein GANs, and van den Oord et al. (2017) on VQ-VAE. SDV integration is based on Patki et al. (2016) and Xu et al. (2019). Complete references are available in the repository documentation.

License

Distributed under the MIT License. See the LICENSE file for details.

Project details

This version: 0.1.1, released Nov 15, 2025

Download files

Source Distribution

maldatagen-0.1.1.tar.gz (359.9 kB)

Uploaded Nov 15, 2025 Source

Built Distribution

maldatagen-0.1.1-py3-none-any.whl (706.7 kB)

Uploaded Nov 15, 2025 Python 3

File details

Details for the file maldatagen-0.1.1.tar.gz.

File metadata

Download URL: maldatagen-0.1.1.tar.gz
Upload date: Nov 15, 2025
Size: 359.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for maldatagen-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`296819ac5b1ab0167146926e9eec3185d544b311b034df7179c8474a2baed713`
MD5	`e025c253fd47ee05767793c0cbf09eb1`
BLAKE2b-256	`5edd11922ee81d3deca3ec40baf919f0ac0d6a603806dd0600b77be660c6605e`

File details

Details for the file maldatagen-0.1.1-py3-none-any.whl.

File metadata

Download URL: maldatagen-0.1.1-py3-none-any.whl
Upload date: Nov 15, 2025
Size: 706.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for maldatagen-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`daa6717515cbaebb33e6e8341544dade3a33e534f5e0cf61315c6e475b21b25b`
MD5	`31ce696b467d177db54c2e2c1a97fbb2`
BLAKE2b-256	`8b69c54c2b53cbbd910a921afeb62cb058f115978b7d63c3b82c96580c4d40a6`
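
The published digests are useful only if a download is actually compared against them. The sketch below streams a local file through SHA256 with the standard library and compares the result with the SHA256 values listed in the file details. The helper names are illustrative, and the local path is whatever location the file was saved to.

```python
import hashlib

# SHA256 digests copied from the file details in this listing.
PUBLISHED_SHA256 = {
    "maldatagen-0.1.1.tar.gz":
        "296819ac5b1ab0167146926e9eec3185d544b311b034df7179c8474a2baed713",
    "maldatagen-0.1.1-py3-none-any.whl":
        "daa6717515cbaebb33e6e8341544dade3a33e534f5e0cf61315c6e475b21b25b",
}


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, filename):
    """True if the file at path has the digest published for filename."""
    return sha256_of(path) == PUBLISHED_SHA256[filename]
```

A call such as matches_published("downloads/maldatagen-0.1.1.tar.gz", "maldatagen-0.1.1.tar.gz") returning False means the file should not be installed.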
