Generate datasets and models based on vulnerability data from Vulnerability-Lookup.
VulnTrain
VulnTrain offers a suite of commands to generate diverse AI datasets and train models using comprehensive vulnerability data from Vulnerability-Lookup. It harnesses over one million JSON records from all supported advisory sources (CVE, GitHub advisories, CSAF, PySecDB, CNVD) to build high-quality, domain-specific models.
Additionally, data from the vulnerability-lookup:meta container, including enrichment sources such as vulnrichment and Fraunhofer FKIE,
is incorporated to enhance model quality.
The datasets and models are available on Hugging Face.
For more information about the use of AI in Vulnerability-Lookup, please refer to the user manual.
Installation
pipx install VulnTrain
For development:
git clone https://github.com/vulnerability-lookup/VulnTrain.git
cd VulnTrain/
poetry install
Usage
Three types of commands are available:
- Dataset generation: Create and prepare datasets from vulnerability sources.
- Model training: Train models using the prepared datasets.
- Model validation: Assess the performance of trained models (validations, benchmarks, etc.).
CLI commands
| Command | Purpose |
|---|---|
| vulntrain-dataset-generation | Generate datasets from vulnerability sources |
| vulntrain-train-severity-classification | Train severity classifier (RoBERTa/DistilBERT) |
| vulntrain-train-severity-cnvd-classification | Train severity classifier for CNVD data |
| vulntrain-train-description-generation | Train GPT-2 vulnerability description generator |
| vulntrain-train-cwe-classification | Train CWE classifier from patches |
| vulntrain-validate-severity-classification | Validate severity model |
| vulntrain-validate-text-generation | Validate text generation model |
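After installation, it can be useful to confirm that the entry points listed above are actually on your PATH. The following is a minimal sketch using only the Python standard library. The helper name `find_commands` is hypothetical and is not part of VulnTrain. The command names are copied from the table, and no command-line options are assumed.

```python
import shutil

# Command names copied from the CLI commands table.
VULNTRAIN_COMMANDS = [
    "vulntrain-dataset-generation",
    "vulntrain-train-severity-classification",
    "vulntrain-train-severity-cnvd-classification",
    "vulntrain-train-description-generation",
    "vulntrain-train-cwe-classification",
    "vulntrain-validate-severity-classification",
    "vulntrain-validate-text-generation",
]


def find_commands(names=VULNTRAIN_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH.

    Hypothetical helper for illustration; it only inspects PATH and does not
    run any of the commands.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a command is reported as missing after a `pipx install`, the pipx binary directory is probably not on your PATH.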
Models
Distributed training on HPC clusters
VulnTrain supports distributed multi-GPU training via SLURM, making it suitable for EuroHPC-style GPU clusters. See the HPC documentation for Conda environment setup, single-node and multi-node SLURM job scripts, and NCCL configuration.
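As background for the SLURM job scripts, the sketch below shows how a process launched with `srun` can discover its place in a multi-GPU job from the environment variables SLURM sets for each task (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`). This is an illustration only, not VulnTrain's own code. The `SlurmContext` class and `read_slurm_context` function are hypothetical names, and the fallback values assume a single-process run outside SLURM.

```python
import os
from dataclasses import dataclass


@dataclass
class SlurmContext:
    rank: int        # global task index across all nodes
    world_size: int  # total number of tasks in the job step
    local_rank: int  # task index on the current node (often used as GPU index)


def read_slurm_context(env=None):
    """Read rank information from SLURM's per-task environment variables.

    Hypothetical helper: falls back to a single-process layout when the
    variables are absent, for example when running outside SLURM.
    """
    env = os.environ if env is None else env
    return SlurmContext(
        rank=int(env.get("SLURM_PROCID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
    )


if __name__ == "__main__":
    print(read_slurm_context())
```

Refer to the HPC documentation for the launch configuration the project actually uses.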
Documentation
Check out the full documentation for detailed usage instructions, dataset generation examples, and training recipes.
How to cite
Bonhomme, C., & Dulaunoy, A. (2025). VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification (Version 1.4.0) [Computer software]. https://doi.org/10.48550/arXiv.2507.03607
@misc{bonhomme2025vlai,
title={VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification},
author={Cédric Bonhomme and Alexandre Dulaunoy},
year={2025},
eprint={2507.03607},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
License
VulnTrain is licensed under the GNU General Public License version 3.
Copyright (c) 2025-2026 Computer Incident Response Center Luxembourg (CIRCL)
Copyright (C) 2025-2026 Cédric Bonhomme - https://github.com/cedricbonhomme
Copyright (C) 2025 Léa Ulusan - https://github.com/3LS3-1F