MalDataGen - Tabular Data Generator
MalDataGen
Version 1.0.0 (Jellyfish)
MalDataGen is an advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models. Designed specifically for cybersecurity researchers and malware detection practitioners, it provides reproducible pipelines with fine-grained control over model configuration and integrated evaluation metrics for realistic data synthesis.
The framework supports state-of-the-art generative architectures including GANs (CGAN, WGAN, WGAN-GP), autoencoder-based models (Autoencoder, VAE, VQ-VAE), Diffusion Models (Denoising and Latent), and traditional methods like SMOTE. It also integrates with the Synthetic Data Vault (SDV) library to provide additional models such as TVAE, CTGAN, and Copula-based generators.
Installation
Install from source:
git clone https://github.com/SBSeg25/MalDataGen.git
cd MalDataGen
pip install -r requirements.txt
Or use pip directly:
pip install maldatagen
Requirements: Python 3.8+, pip. Optional: CUDA 11+ for GPU acceleration.
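After installing, a quick sanity check can confirm that the interpreter meets the stated Python 3.8+ requirement and that the package is visible to pip. The sketch below uses only the standard library; the helper name check_environment is hypothetical, and the distribution name maldatagen is taken from the pip command above.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # from the stated requirement "Python 3.8+"


def check_environment(dist_name="maldatagen", version_info=None):
    """Return (python_ok, installed_version_or_None).

    `check_environment` is a hypothetical helper, not part of MalDataGen.
    """
    info = sys.version_info if version_info is None else version_info
    python_ok = tuple(info[:2]) >= MIN_PYTHON
    try:
        # Looks up the installed distribution's metadata without importing it.
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        installed = None
    return python_ok, installed


python_ok, installed = check_environment()
print("Python version OK:", python_ok)
print("maldatagen version:", installed or "not installed")
```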
Docker execution is also supported via the run_demo_docker.sh and run_experiments_docker.sh scripts. Note that Docker execution requires sudo permissions for the Docker engine, whereas local execution needs no elevated privileges.
Features and Capabilities
MalDataGen provides a comprehensive toolkit for synthetic data generation and evaluation. The framework implements cross-validation with stratified k-fold splitting, fully customizable model configurations, and built-in metrics for assessing data quality. All models and experiments can be persisted for reproducibility, and the system includes graphing utilities for generating publication-ready visualizations including clustering plots, heatmaps comparing synthetic and real samples, confusion matrices, and performance bar graphs.
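To illustrate what stratified k-fold splitting does, here is a minimal standard-library sketch. It is not MalDataGen's implementation, and the function name stratified_kfold_indices is hypothetical; in practice a library routine such as scikit-learn's StratifiedKFold would typically be used. The idea is to deal the indices of each class round-robin into the folds, so every fold keeps roughly the original class ratio.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Example: an imbalanced dataset with 80 benign (0) and 20 malware (1) samples.
example_labels = [0] * 80 + [1] * 20
for train_idx, test_idx in stratified_kfold_indices(example_labels, n_splits=5):
    positives = sum(example_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

With this example every test fold holds 20 samples, 4 of them from the minority class, which matches the 80/20 ratio of the full set.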
The evaluation strategy supports two complementary approaches: TS-TR (Train Synthetic, Test Real), which measures generalization ability by training on synthetic data and testing on real data, and TR-TS (Train Real, Test Synthetic), which assesses generative realism by training on real samples and testing on synthetic ones. Both approaches report Accuracy, Precision, Recall, F1-score, Specificity (TNR), ROC-AUC, MSE, MAE, and FNR, along with secondary metrics such as Euclidean Distance, Hellinger Distance, Log-Likelihood, and Manhattan Distance.
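Several of the listed metrics follow directly from the binary confusion matrix and from elementwise vector comparisons. The sketch below shows the standard definitions using only the standard library; the function names are hypothetical and this is not the framework's own code. In a TS-TR run, y_pred would come from a classifier trained on synthetic samples and y_true from the real test fold; in TR-TS the roles of the two datasets are swapped.

```python
import math


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels in {0, 1} (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),  # same quantity as TNR
        "fnr": ratio(fn, fn + tp),
    }


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Assumes p and q are non-negative and each sums to 1.
    """
    total = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
print(euclidean([0, 0], [3, 4]), manhattan([0, 0], [3, 4]))
```

The Hellinger distance is bounded between 0 (identical distributions) and 1 (disjoint support), which makes it convenient for comparing per-feature histograms of real and synthetic data.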
Supported Models
The framework includes nine native generative models and three third-party models via SDV integration. Native models include CGAN for conditional generation with class balancing, WGAN and WGAN-GP for stable training on imbalanced datasets using Wasserstein distance, standard and Variational Autoencoders for latent space learning, Denoising and Latent Diffusion models for high-quality sample generation, VQ-VAE for discrete latent representations, and SMOTE for traditional interpolation-based oversampling. Third-party models from SDV include TVAE optimized for tabular data, Copula for preserving statistical dependencies, and CTGAN with mode-specific normalization for mixed-type data.
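As a concrete picture of interpolation-based oversampling, the following is a simplified SMOTE-style sketch in plain Python. It is not the framework's implementation; the function name smote_sketch and its parameters are hypothetical. Each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between neighbours.

    `minority` is a list of equal-length numeric feature vectors
    (at least two are required). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic


minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote_sketch(minority_samples, n_new=3, k=2):
    print(point)
```

Because every new point lies between two existing minority samples, the method cannot produce values outside the range already observed for that class, which is one reason learned generators such as GANs or diffusion models are offered alongside it.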
Output Structure
After execution, the framework generates a comprehensive output structure organized by model. Each model folder contains five subdirectories: Data Generated (synthetic datasets and partitioned real data subsets), Evaluation Results (clustering visualizations, heatmaps, confusion matrices, and metric bar graphs), Logs (execution logs), Monitor (raw monitoring data), and Models Saved (serialized models for each fold if saving is enabled). Additionally, a comparative PDF report for SVM classifier performance across all models is generated in the project root.
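A small script can confirm that a model's output folder has the layout described above. This is a sketch under assumptions: the subdirectory names are copied from the description and their exact on-disk spelling may differ, and the helper missing_output_dirs is hypothetical.

```python
from pathlib import Path

# Names as written in the description; the actual on-disk spelling is an assumption.
EXPECTED_SUBDIRS = (
    "Data Generated",
    "Evaluation Results",
    "Logs",
    "Monitor",
    "Models Saved",
)


def missing_output_dirs(model_dir):
    """Return the expected subdirectories that are absent under model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Note that a missing "Models Saved" entry is not necessarily an error, since models are only serialized when saving is enabled.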
System Requirements
The framework runs on Linux (Ubuntu 22.04+ preferred) with Python 3.8.10 or higher. Minimum requirements are any x86_64 CPU with 4 GB RAM and 10 GB storage. Recommended configuration includes a multi-core CPU (Intel i5 or AMD Ryzen 5+), 8 GB+ RAM, and 20 GB SSD storage. GPU acceleration via NVIDIA cards with CUDA 11+ is optional but recommended for faster training. Docker 27.2.1+ is optional for containerized execution.
Documentation and Resources
Complete documentation is available in the repository. The Docs/ directory contains API reference documentation, Docs/Diagrams/ provides eight comprehensive architecture diagrams created with Mermaid notation, and Docs/Overview.md explains model architectures in detail. The project website at https://kayua.github.io/SyntheticDataGen.github.io/ provides additional resources, and demonstration videos are available at https://drive.google.com/file/d/1sbPZ1x5Np6zolhFvCBWoMzqNqrthlUe3/view (backup: https://youtu.be/t-AZtsLJUlQ).
Citation
If you use MalDataGen in your research, please cite:
@inproceedings{sbseg25_maldatagen,
author = {Kayuã Paim and Angelo Nogueira and Diego Kreutz and Weverton Cordeiro and Rodrigo Mansilha},
title = {MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection},
booktitle = {Companion Proceedings of the 25th Brazilian Symposium on Cybersecurity},
location = {Foz do Iguaçu/PR},
year = {2025},
pages = {38--47},
publisher = {SBC},
address = {Porto Alegre, RS, Brasil},
doi = {10.5753/sbseg_estendido.2025.12113},
url = {https://sol.sbc.org.br/index.php/sbseg_estendido/article/view/36739}
}
Awards and Recognition
MalDataGen received the Highlighted Artifact award at SBSEG 2025 and was recognized as the Best Tool of SBSEG 2025. Award details are available at https://doc-artefatos.github.io/sbseg2025/results.html and https://sbseg2025.ppgia.pucpr.br/wp-content/uploads/2025/09/PremiacaoSBSEG-2025.pdf.
Key References
The framework builds upon foundational work in generative modeling, including Kingma & Welling (2013) on Variational Autoencoders, Goodfellow et al. (2014) on Generative Adversarial Networks, Ho et al. (2020) on Denoising Diffusion Probabilistic Models, Arjovsky et al. (2017) on Wasserstein GANs, and van den Oord et al. (2017) on VQ-VAE. SDV integration is based on Patki et al. (2016) and Xu et al. (2019). Complete references are available in the repository documentation.
License
Distributed under the MIT License. See LICENSE file for details.
Links
- Repository: https://github.com/SBSeg25/MalDataGen
- Documentation: https://github.com/SBSeg25/MalDataGen/tree/main/Docs
- Issues: https://github.com/SBSeg25/MalDataGen/issues
Download files
File details
Details for the file maldatagen-0.1.1.tar.gz.
File metadata
- Download URL: maldatagen-0.1.1.tar.gz
- Upload date:
- Size: 359.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 296819ac5b1ab0167146926e9eec3185d544b311b034df7179c8474a2baed713 |
| MD5 | e025c253fd47ee05767793c0cbf09eb1 |
| BLAKE2b-256 | 5edd11922ee81d3deca3ec40baf919f0ac0d6a603806dd0600b77be660c6605e |
File details
Details for the file maldatagen-0.1.1-py3-none-any.whl.
File metadata
- Download URL: maldatagen-0.1.1-py3-none-any.whl
- Upload date:
- Size: 706.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | daa6717515cbaebb33e6e8341544dade3a33e534f5e0cf61315c6e475b21b25b |
| MD5 | 31ce696b467d177db54c2e2c1a97fbb2 |
| BLAKE2b-256 | 8b69c54c2b53cbbd910a921afeb62cb058f115978b7d63c3b82c96580c4d40a6 |