MalDataGen - Tabular Data Generator

MalDataGen

Version 1.0.0 (Jellyfish)

MalDataGen is an advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models. Designed specifically for cybersecurity researchers and malware detection practitioners, it provides reproducible pipelines with fine-grained control over model configuration and integrated evaluation metrics for realistic data synthesis.

The framework supports state-of-the-art generative architectures including GANs (CGAN, WGAN, WGAN-GP), autoencoders (standard, VAE, VQ-VAE), diffusion models (Denoising and Latent), and traditional methods like SMOTE. It also integrates with the Synthetic Data Vault (SDV) library to provide additional models such as TVAE, CTGAN, and Copula-based generators.

Installation

Install from source:

git clone https://github.com/SBSeg25/MalDataGen.git
cd MalDataGen
pip install -r requirements.txt

Or use pip directly:

pip install maldatagen

Requirements: Python 3.8+, pip. Optional: CUDA 11+ for GPU acceleration.

Docker execution is also supported via the run_demo_docker.sh and run_experiments_docker.sh scripts. Note that Docker execution requires sudo permissions for the Docker engine, while local execution needs no elevated privileges.

Features and Capabilities

MalDataGen provides a comprehensive toolkit for synthetic data generation and evaluation. The framework implements cross-validation with stratified k-fold splitting, fully customizable model configurations, and built-in metrics for assessing data quality. All models and experiments can be persisted for reproducibility, and the system includes graphing utilities for generating publication-ready visualizations including clustering plots, heatmaps comparing synthetic and real samples, confusion matrices, and performance bar graphs.
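
To illustrate the stratified k-fold splitting mentioned above, here is a minimal sketch using only the Python standard library. It is not MalDataGen's implementation: the function name stratified_k_fold, its parameters, and the toy labels are assumptions made for illustration. The idea is that each class's samples are shuffled and dealt round-robin into k folds, so every test fold keeps roughly the same class ratio as the full dataset.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds preserve class ratios.

    Hypothetical helper for illustration; not part of the MalDataGen API.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy imbalanced labels: 20 benign (0) and 10 malware (1) samples.
labels = [0] * 20 + [1] * 10
splits = list(stratified_k_fold(labels, k=5))
for fold_id, (train_idx, test_idx) in enumerate(splits):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {fold_id}: test size={len(test_idx)}, malware in test={positives}")
```

In practice a library routine such as scikit-learn's StratifiedKFold would normally be used instead of a hand-written splitter; the sketch only shows what "stratified" means for an imbalanced malware dataset.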

The evaluation strategy supports two complementary approaches. TS-TR (Train Synthetic, Test Real) measures generalization ability by training a classifier on synthetic data and testing it on real data. TR-TS (Train Real, Test Synthetic) assesses generative realism by training on real samples and testing on synthetic ones. Both approaches report Accuracy, Precision, Recall, F1-score, Specificity (equivalently, the true negative rate, TNR), ROC-AUC, MSE, MAE, and FNR, as well as secondary metrics such as Euclidean Distance, Hellinger Distance, Log-Likelihood, and Manhattan Distance.
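
The two evaluation directions can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the framework's code: the nearest-centroid classifier is a toy stand-in for whatever classifier is actually configured, and the helper names (fit_centroids, predict, binary_metrics, make_blobs) as well as the Gaussian toy data are assumptions. Only a subset of the metrics listed above is computed.

```python
import random


def fit_centroids(X, y):
    """Toy stand-in classifier: one mean vector (centroid) per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X]


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels in {0, 1}; 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def safe(num, den):
        return num / den if den else 0.0

    precision, recall = safe(tp, tp + fp), safe(tp, tp + fn)
    return {
        "accuracy": safe(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe(2 * precision * recall, precision + recall),
        "specificity": safe(tn, tn + fp),  # same quantity as TNR
        "fnr": safe(fn, fn + tp),
    }


def make_blobs(n_per_class, seed):
    """Hypothetical data: two well-separated Gaussian clusters, 2 features."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y


X_real, y_real = make_blobs(50, seed=1)  # stands in for real samples
X_syn, y_syn = make_blobs(50, seed=2)    # stands in for generator output

# TS-TR: train on synthetic, test on real.
ts_tr = binary_metrics(y_real, predict(fit_centroids(X_syn, y_syn), X_real))
# TR-TS: train on real, test on synthetic.
tr_ts = binary_metrics(y_syn, predict(fit_centroids(X_real, y_real), X_syn))
print("TS-TR:", ts_tr)
print("TR-TS:", tr_ts)
```

The only difference between the two directions is which dataset is used for fitting and which for scoring. A large gap between the TS-TR and TR-TS scores suggests the synthetic data does not cover the real distribution well, or the reverse.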

Supported Models

The framework includes nine native generative models and three third-party models via SDV integration. Native models include CGAN for conditional generation with class balancing, WGAN and WGAN-GP for stable training on imbalanced datasets using Wasserstein distance, standard and Variational Autoencoders for latent space learning, Denoising and Latent Diffusion models for high-quality sample generation, VQ-VAE for discrete latent representations, and SMOTE for traditional interpolation-based oversampling. Third-party models from SDV include TVAE optimized for tabular data, Copula for preserving statistical dependencies, and CTGAN with mode-specific normalization for mixed-type data.
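
As a concrete illustration of the interpolation-based oversampling that SMOTE performs, the following sketch generates new minority-class rows by picking a sample, choosing one of its k nearest minority neighbours, and placing a new point at a random position on the segment between them. It is a simplified standard-library version written for this description; the function name smote, its parameters, and the toy data are assumptions and do not reflect MalDataGen's internal implementation.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Hypothetical helper for illustration; requires at least two minority rows.
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sq_dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]]
new_rows = smote(minority, n_new=5)
for row in new_rows:
    print([round(value, 3) for value in row])
```

Because every synthetic row is a convex combination of two existing rows, SMOTE never leaves the region spanned by the minority samples, which is the main contrast with the learned generative models listed above.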

Output Structure

After execution, the framework generates a comprehensive output structure organized by model. Each model folder contains five subdirectories: Data Generated (synthetic datasets and partitioned real data subsets), Evaluation Results (clustering visualizations, heatmaps, confusion matrices, and metric bar graphs), Logs (execution logs), Monitor (raw monitoring data), and Models Saved (serialized models for each fold if saving is enabled). Additionally, a comparative PDF report for SVM classifier performance across all models is generated in the project root.

System Requirements

The framework runs on Linux (Ubuntu 22.04+ preferred) with Python 3.8.10 or higher. Minimum requirements are any x86_64 CPU with 4 GB RAM and 10 GB storage. Recommended configuration includes a multi-core CPU (Intel i5 or AMD Ryzen 5+), 8 GB+ RAM, and 20 GB SSD storage. GPU acceleration via NVIDIA cards with CUDA 11+ is optional but recommended for faster training. Docker 27.2.1+ is optional for containerized execution.

Documentation and Resources

Complete documentation is available in the repository. The Docs/ directory contains API reference documentation, Docs/Diagrams/ provides eight comprehensive architecture diagrams created with Mermaid notation, and Docs/Overview.md explains model architectures in detail. The project website at https://kayua.github.io/SyntheticDataGen.github.io/ provides additional resources, and demonstration videos are available at https://drive.google.com/file/d/1sbPZ1x5Np6zolhFvCBWoMzqNqrthlUe3/view (backup: https://youtu.be/t-AZtsLJUlQ).

Citation

If you use MalDataGen in your research, please cite:

@inproceedings{sbseg25_maldatagen,
 author = {Kayuã Paim and Angelo Nogueira and Diego Kreutz and Weverton Cordeiro and Rodrigo Mansilha},
 title = {MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection},
 booktitle = {Companion Proceedings of the 25th Brazilian Symposium on Cybersecurity},
 location = {Foz do Iguaçu/PR},
 year = {2025},
 pages = {38--47},
 publisher = {SBC},
 address = {Porto Alegre, RS, Brasil},
 doi = {10.5753/sbseg_estendido.2025.12113},
 url = {https://sol.sbc.org.br/index.php/sbseg_estendido/article/view/36739}
}

Awards and Recognition

MalDataGen received the Highlighted Artifact award at SBSEG 2025 and was recognized as the Best Tool of SBSEG 2025. Award details are available at https://doc-artefatos.github.io/sbseg2025/results.html and https://sbseg2025.ppgia.pucpr.br/wp-content/uploads/2025/09/PremiacaoSBSEG-2025.pdf.

Key References

The framework builds upon foundational work in generative modeling, including Kingma & Welling (2013) on Variational Autoencoders, Goodfellow et al. (2014) on Generative Adversarial Networks, Ho et al. (2020) on Denoising Diffusion Probabilistic Models, Arjovsky et al. (2017) on Wasserstein GANs, and van den Oord et al. (2017) on VQ-VAE. SDV integration is based on Patki et al. (2016) and Xu et al. (2019). Complete references are available in the repository documentation.

License

Distributed under the MIT License. See LICENSE file for details.

Download files

Source Distribution

maldatagen-0.1.1.tar.gz (359.9 kB)

Built Distribution

maldatagen-0.1.1-py3-none-any.whl (706.7 kB)

File details

Details for the file maldatagen-0.1.1.tar.gz.

File metadata

  • Download URL: maldatagen-0.1.1.tar.gz
  • Size: 359.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for maldatagen-0.1.1.tar.gz
Algorithm Hash digest
SHA256 296819ac5b1ab0167146926e9eec3185d544b311b034df7179c8474a2baed713
MD5 e025c253fd47ee05767793c0cbf09eb1
BLAKE2b-256 5edd11922ee81d3deca3ec40baf919f0ac0d6a603806dd0600b77be660c6605e
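
The SHA256 digest above can be used to check a downloaded archive before installing it. The sketch below uses only hashlib from the Python standard library; the local file path is an assumption (adjust it to wherever the archive was saved), and the helper names sha256_of and verify are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for maldatagen-0.1.1.tar.gz.
EXPECTED_SHA256 = "296819ac5b1ab0167146926e9eec3185d544b311b034df7179c8474a2baed713"


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected


archive = Path("maldatagen-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    print("digest matches" if verify(archive) else "digest MISMATCH")
```

A mismatch means the file differs from the one that was uploaded and should not be installed.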

File details

Details for the file maldatagen-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: maldatagen-0.1.1-py3-none-any.whl
  • Size: 706.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for maldatagen-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 daa6717515cbaebb33e6e8341544dade3a33e534f5e0cf61315c6e475b21b25b
MD5 31ce696b467d177db54c2e2c1a97fbb2
BLAKE2b-256 8b69c54c2b53cbbd910a921afeb62cb058f115978b7d63c3b82c96580c4d40a6
