Skip to main content

Generate datasets amd models based on vulnerabilities data from Vulnerability-Lookup.

Project description

VulnTrain

Latest release License PyPi version

A tool for generating diverse datasets and models using vulnerability data from Vulnerability-Lookup.

It leverages all vulnerability advisory sources supported by Vulnerability-Lookup to train models, utilizing over one million JSON records.
Additionally, data from the vulnerability-lookup:meta container, including enrichment sources such as vulnrichment and Fraunhofer FKIE, is incorporated to enhance model quality.

Check out the datasets and models on Hugging Face:

Model on HF

For more information about the use of AI in Vulnerability-Lookup, please refer to the user manual.

Usage

Install VulnTrain:

$ pipx install VulnTrain

Three types of commands are available:

  • Dataset generation: Create and prepare datasets.
  • Model training: Train models using the prepared datasets.
    • Train a model for text generation to assist in writing vulnerability descriptions Model on HF
    • Train a model to classify vulnerabilities by severity. Model on HF
  • Model validation: Assess the performance of trained models.

Dataset generation

Make sure you have HuggingFace installed, if you don't, here is the command line to install it :

pip install huggingface_hub

Authenticate to HuggingFace:

huggingface-cli login

Then ensure that the kvrocks database of Vulnerability-Lookup is running.

Creation of datasets:

$ vulntrain-dataset-generation --sources cvelistv5 --repo-id CIRCL/vulnerability-dataset-10k --nb-rows 10000
Generating train split: 9999 examples [00:00, 177710.74 examples/s]
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'description', 'cpes', 'cvss_v4_0', 'cvss_v3_1', 'cvss_v3_0', 'cvss_v2_0'],
        num_rows: 8999
    })
    test: Dataset({
        features: ['id', 'title', 'description', 'cpes', 'cvss_v4_0', 'cvss_v3_1', 'cvss_v3_0', 'cvss_v2_0'],
        num_rows: 1000
    })
})
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 49.66ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.03s/it]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.36ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it]
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 2.34MB/s]

In this example, the source consists of vulnerability advisories available in the Kvrocks database of Vulnerability-Lookup. The script will automatically connect to the database using the Valkey client, generate two datasets (containing up to 10,000 rows), and upload the results to Hugging Face.

It is possible to add the GitHub Advisory Database (GHSA) as a source.

Model training

Training for text generation

For now we are using GPT-2 (AutoModelForCausalLM) or distilbert-base-uncased (AutoModelForMaskedLM). The goal is to generate text.

$ vulntrain-train-description-generation --base-model gpt2 --dataset-id CIRCL/vulnerability --repo-id CIRCL/vulnerability-description-generation-gpt2
Using CUDA (Nvidia GPU).
[codecarbon WARNING @ 13:28:13] Multiple instances of codecarbon are allowed to run at the same time.
[codecarbon INFO @ 13:28:13] [setup] RAM Tracking...
[codecarbon INFO @ 13:28:13] [setup] CPU Tracking...
[codecarbon WARNING @ 13:28:13] No CPU tracking mode found. Falling back on CPU constant mode. 
 Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU

[codecarbon WARNING @ 13:28:14] We saw that you have a AMD EPYC 9124 16-Core Processor but we don't know it. Please contact us.
[codecarbon INFO @ 13:28:14] CPU Model on constant consumption mode: AMD EPYC 9124 16-Core Processor
[codecarbon INFO @ 13:28:14] [setup] GPU Tracking...
[codecarbon INFO @ 13:28:14] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 13:28:14] >>> Tracker's metadata:
[codecarbon INFO @ 13:28:14]   Platform system: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
[codecarbon INFO @ 13:28:14]   Python version: 3.12.3
[codecarbon INFO @ 13:28:14]   CodeCarbon version: 2.8.3
[codecarbon INFO @ 13:28:14]   Available RAM : 251.586 GB
[codecarbon INFO @ 13:28:14]   CPU count: 64
[codecarbon INFO @ 13:28:14]   CPU model: AMD EPYC 9124 16-Core Processor
[codecarbon INFO @ 13:28:14]   GPU count: 2
[codecarbon INFO @ 13:28:14]   GPU model: 2 x NVIDIA L40S
[codecarbon INFO @ 13:28:18] Saving emissions data to file /home/cedric/VulnTrain/emissions.csv                                    | 1/2700 [00:07<5:45:36,  7.68s/it]
...
...
...

Training for classification

$ vulntrain-train-severity-classification --help
usage: vulntrain-train-severity-classification [-h] [--base-model {distilbert-base-uncased,roberta-base}] [--dataset-id DATASET_ID] --repo-id REPO_ID [--model-save-dir MODEL_SAVE_DIR]

Train a vulnerability classification model with a mapping on the severity.

options:
  -h, --help            show this help message and exit
  --base-model {distilbert-base-uncased,roberta-base}
                        Base model to use.
  --dataset-id DATASET_ID
                        Path of the dataset. Local dataset or repository on the HF hub.
  --repo-id REPO_ID     The name of the repository you want to push your object to. It should contain your organization name when pushing to a given organization.
  --model-save-dir MODEL_SAVE_DIR
                        The path to a directory where the tokenizer and the model will be saved.


$ vulntrain-train-severity-classification --base-model roberta-base --dataset-id CIRCL/vulnerability-scores --repo-id CIRCL/vulnerability-severity-classification-roberta-base
...
...
...

Validation

It is possible to send prompts to a model trained for text generation (descriptions of vulnerabilities).

$ vulntrain-validate-text-generation --help
usage: vulntrain-validate-text-generation [-h] [--model MODEL] [--prompt PROMPT]

Validate a text generation model for vulnerabilities.

options:
  -h, --help       show this help message and exit
  --model MODEL    The model to use.
  --prompt PROMPT  The prompt for the generator.

Example:

$ vulntrain-validate-text-generation --prompt "A new vulnerability in OpenSSL allows attackers to" --model CIRCL/vulnerability
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 907/907 [00:00<00:00, 6.70MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 498M/498M [00:12<00:00, 41.3MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 1.63MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 556/556 [00:00<00:00, 4.01MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 3.25MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 5.58MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.56M/3.56M [00:00<00:00, 10.3MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 470/470 [00:00<00:00, 3.51MB/s]
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

[{'generated_text': 'A new vulnerability in OpenSSL allows attackers to cause a Denial of Service (DoS) when receiving a specially crafted SIP message.\n\n\nThis issue affects: OpenSSL versions prior to 1.2.1\n\n\n\n *  OpenSSL 1.2.1 prior to 1.2.1-HF1, which fixes this issue.\n\n *  OpenSSL version 1.2.1 prior to 1.2.1-HF1 and OpenSSL 1.2.2 prior'}]

License

VulnTrain is licensed under GNU General Public License version 3

Copyright (c) 2025 Computer Incident Response Center Luxembourg (CIRCL)
Copyright (C) 2025 Cédric Bonhomme - https://github.com/cedricbonhomme

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vulntrain-1.4.0.tar.gz (28.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vulntrain-1.4.0-py3-none-any.whl (31.7 kB view details)

Uploaded Python 3

File details

Details for the file vulntrain-1.4.0.tar.gz.

File metadata

  • Download URL: vulntrain-1.4.0.tar.gz
  • Upload date:
  • Size: 28.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vulntrain-1.4.0.tar.gz
Algorithm Hash digest
SHA256 96604937cdab7c7a508312f0d37e51adabcedb7af6c55f6b70e95d5692f92e1f
MD5 4fc692f253ad2154d73795049f7612c5
BLAKE2b-256 323826f7d40a0940da41cd617a853115840fab37c99cb310466e8016c4f305be

See more details on using hashes here.

Provenance

The following attestation bundles were made for vulntrain-1.4.0.tar.gz:

Publisher: release.yml on vulnerability-lookup/VulnTrain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vulntrain-1.4.0-py3-none-any.whl.

File metadata

  • Download URL: vulntrain-1.4.0-py3-none-any.whl
  • Upload date:
  • Size: 31.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vulntrain-1.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 baf86eb7fe988a794a61c2f5b218e03fc5a3a15bd1d72141b848e1266988b701
MD5 e08e6eda1870b59e28b8ddd7a7755a2a
BLAKE2b-256 5dbab52b46f8896ab4ea51a9976be211d407bce51c438c57c833915e2a79ca72

See more details on using hashes here.

Provenance

The following attestation bundles were made for vulntrain-1.4.0-py3-none-any.whl:

Publisher: release.yml on vulnerability-lookup/VulnTrain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page