Generate datasets and models based on vulnerability descriptions from Vulnerability-Lookup.

Owner

CIRCL

VulnTrain

Generate datasets and models based on vulnerability descriptions from Vulnerability-Lookup.

Uses data from the vulnerability-lookup:meta container, such as vulnrichment and FKIE.

Datasets

The generated datasets are available on Hugging Face:

https://huggingface.co/datasets/circl/vulnerability-dataset

Usage

Generate datasets

Authenticate to Hugging Face:

$ huggingface-cli login

Install VulnTrain:

$ pipx install VulnTrain

Then ensure that the Kvrocks database of Vulnerability-Lookup is running.

Create a dataset:

$ vulntrain-create-dataset --nb-rows 10000 --upload --repo-id CIRCL/vulnerability-dataset-10k
Generating train split: 9999 examples [00:00, 177710.74 examples/s]
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'description', 'cpes'],
        num_rows: 8999
    })
    test: Dataset({
        features: ['id', 'title', 'description', 'cpes'],
        num_rows: 1000
    })
})
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 49.66ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.03s/it]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.36ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it]
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 2.34MB/s]
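
The split sizes in the output above (8999 train and 1000 test rows out of 9999) are consistent with a 90/10 split. The sketch below reproduces that arithmetic in plain Python. The 10% test fraction, the rounding rule, the placeholder rows, and the `split_rows` helper are all assumptions for illustration, not VulnTrain's actual code.

```python
import math
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle rows and split them into (train, test) lists.

    The test size is rounded up, so 9999 rows with a 0.1 fraction
    give 1000 test rows and 8999 train rows.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical placeholder rows using the feature names shown in the output.
rows = [
    {"id": f"row-{i}", "title": "", "description": "", "cpes": []}
    for i in range(9999)
]
train, test = split_rows(rows)
print(len(train), len(test))  # prints: 8999 1000
```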

Train

Training for text generation

For now, training uses either distilbert-base-uncased (loaded with AutoModelForMaskedLM) or gpt2 (loaded with AutoModelForCausalLM). The goal is to generate text.

$ vulntrain-train-dataset 
Using CUDA (Nvidia GPU).
[codecarbon WARNING @ 13:28:13] Multiple instances of codecarbon are allowed to run at the same time.
[codecarbon INFO @ 13:28:13] [setup] RAM Tracking...
[codecarbon INFO @ 13:28:13] [setup] CPU Tracking...
[codecarbon WARNING @ 13:28:13] No CPU tracking mode found. Falling back on CPU constant mode. 
 Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU

[codecarbon WARNING @ 13:28:14] We saw that you have a AMD EPYC 9124 16-Core Processor but we don't know it. Please contact us.
[codecarbon INFO @ 13:28:14] CPU Model on constant consumption mode: AMD EPYC 9124 16-Core Processor
[codecarbon INFO @ 13:28:14] [setup] GPU Tracking...
[codecarbon INFO @ 13:28:14] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 13:28:14] >>> Tracker's metadata:
[codecarbon INFO @ 13:28:14]   Platform system: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
[codecarbon INFO @ 13:28:14]   Python version: 3.12.3
[codecarbon INFO @ 13:28:14]   CodeCarbon version: 2.8.3
[codecarbon INFO @ 13:28:14]   Available RAM : 251.586 GB
[codecarbon INFO @ 13:28:14]   CPU count: 64
[codecarbon INFO @ 13:28:14]   CPU model: AMD EPYC 9124 16-Core Processor
[codecarbon INFO @ 13:28:14]   GPU count: 2
[codecarbon INFO @ 13:28:14]   GPU model: 2 x NVIDIA L40S
[codecarbon INFO @ 13:28:18] Saving emissions data to file /home/cedric/VulnTrain/emissions.csv                                    | 1/2700 [00:07<5:45:36,  7.68s/it]
...
...
...
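
The two model classes mentioned in this section correspond to different training objectives: a masked language model predicts hidden tokens using context on both sides, while a causal language model predicts each next token from the tokens before it. The toy sketch below builds both kinds of training examples from a whitespace-tokenized sentence. The helper names, the sample sentence, and the `[MASK]` placeholder are illustrative assumptions; real training relies on the model's own tokenizer and data collator.

```python
import random

MASK = "[MASK]"


def causal_pairs(tokens):
    """Return (context, next_token) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    Labels hold the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Made-up sentence, used only to show the shape of the examples.
tokens = "A buffer overflow allows remote code execution".split()
for context, target in causal_pairs(tokens)[:3]:
    print(" ".join(context), "->", target)
masked, labels = masked_example(tokens, mask_prob=0.3)
print(" ".join(masked))
```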

Training for classification

Classification training uses TF-IDF on the vulnerability descriptions.
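
As a rough illustration of what TF-IDF computes, the sketch below scores terms in a few made-up descriptions using only the standard library. The sample texts, the `tfidf` helper, and the plain `log(N / df)` weighting are assumptions for illustration; library implementations typically add smoothing and normalization, and this is not VulnTrain's own code.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up descriptions, not taken from any dataset.
docs = [
    "buffer overflow in image parser",
    "sql injection in login form",
    "buffer overflow in network daemon",
]
for doc, scores in zip(docs, tfidf(docs)):
    top = max(scores, key=scores.get)
    print(f"{doc!r}: top term = {top}")
```

Terms that appear in every document (here "in") get a weight of zero, while terms unique to one description score highest, which is what makes the weights usable as classification features.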

Validation

You can send prompts to a model trained to generate vulnerability descriptions.

$ vulntrain-validate-text-generation --help
usage: vulntrain-validate-text-generation [-h] [--model MODEL] [--prompt PROMPT]

Validate a text generation model for vulnerabilities.

options:
  -h, --help       show this help message and exit
  --model MODEL    The model to use.
  --prompt PROMPT  The prompt for the generator.

Example:

$ vulntrain-validate-text-generation --prompt "A new vulnerability in OpenSSL allows attackers to" --model CIRCL/vulnerability
config.json: 100%|████████████████████| 907/907 [00:00<00:00, 6.70MB/s]
model.safetensors: 100%|████████████████████| 498M/498M [00:12<00:00, 41.3MB/s]
generation_config.json: 100%|████████████████████| 119/119 [00:00<00:00, 1.63MB/s]
tokenizer_config.json: 100%|████████████████████| 556/556 [00:00<00:00, 4.01MB/s]
vocab.json: 100%|████████████████████| 798k/798k [00:00<00:00, 3.25MB/s]
merges.txt: 100%|████████████████████| 456k/456k [00:00<00:00, 5.58MB/s]
tokenizer.json: 100%|████████████████████| 3.56M/3.56M [00:00<00:00, 10.3MB/s]
special_tokens_map.json: 100%|████████████████████| 470/470 [00:00<00:00, 3.51MB/s]
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

[{'generated_text': 'A new vulnerability in OpenSSL allows attackers to cause a Denial of Service (DoS) when receiving a specially crafted SIP message.\n\n\nThis issue affects: OpenSSL versions prior to 1.2.1\n\n\n\n *  OpenSSL 1.2.1 prior to 1.2.1-HF1, which fixes this issue.\n\n *  OpenSSL version 1.2.1 prior to 1.2.1-HF1 and OpenSSL 1.2.2 prior'}]
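
The result above is a list of dictionaries with a `generated_text` key, and the text starts with the prompt. A small helper like the following can separate the model's continuation from the prompt for easier inspection. `extract_continuation` is a hypothetical name, not part of VulnTrain, and the sample string is shortened from the output shown above.

```python
def extract_continuation(outputs, prompt):
    """Return the generated text after the prompt for each result.

    `outputs` is expected to look like [{'generated_text': '...'}].
    """
    continuations = []
    for item in outputs:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        # Collapse the runs of newlines and spaces seen in raw output.
        continuations.append(" ".join(text.split()))
    return continuations


prompt = "A new vulnerability in OpenSSL allows attackers to"
outputs = [{
    "generated_text": prompt + " cause a Denial of Service (DoS) when "
    "receiving a specially crafted SIP message.\n\n\nThis issue affects: "
    "OpenSSL versions prior to 1.2.1"
}]
print(extract_continuation(outputs, prompt)[0])
```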

License

VulnTrain is licensed under the GNU General Public License version 3.

Copyright (c) 2025 Computer Incident Response Center Luxembourg (CIRCL)
Copyright (C) 2025 Cédric Bonhomme - https://github.com/cedricbonhomme

Release history

Version	Release date
3.1.0	Apr 6, 2026
3.0.0	Apr 3, 2026
2.2.0	Feb 19, 2026
2.1.0	Nov 18, 2025
2.0.0	Sep 5, 2025
1.5.0	Jul 25, 2025
1.4.0	Jul 1, 2025
1.3.1	Apr 28, 2025
1.3.0	Apr 28, 2025
1.2.0	Mar 11, 2025
1.1.0	Feb 27, 2025
1.0.0	Feb 25, 2025
0.5.1	Feb 21, 2025
0.5.0	Feb 21, 2025
0.4.0	Feb 21, 2025
0.3.0	Feb 20, 2025 (this version)
0.2.0	Feb 20, 2025
0.1.0	Feb 19, 2025

Download files

Source Distribution

vulntrain-0.3.0.tar.gz (8.4 kB)

Uploaded Feb 20, 2025 Source

Built Distribution

vulntrain-0.3.0-py3-none-any.whl (8.6 kB)

Uploaded Feb 20, 2025 Python 3

File details

Details for the file vulntrain-0.3.0.tar.gz.

File metadata

Download URL: vulntrain-0.3.0.tar.gz
Upload date: Feb 20, 2025
Size: 8.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vulntrain-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`bb697fb0cdfdbcbe38f993f9149e5e103bbeb8c18423aec9b92be4a5a834c477`
MD5	`ebcb12ea5d74200c222a8aa237f1e716`
BLAKE2b-256	`530093eece2d98b5c732c5efe9560173c1b641adbe169d4d57c59e635b160b39`
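
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: the `sha256_of` helper and the assumed local path of the archive are illustrative and not part of VulnTrain or PyPI tooling.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for vulntrain-0.3.0.tar.gz.
EXPECTED_SHA256 = "bb697fb0cdfdbcbe38f993f9149e5e103bbeb8c18423aec9b92be4a5a834c477"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed location of the downloaded archive (current directory).
archive = Path("vulntrain-0.3.0.tar.gz")
if archive.exists():
    matches = sha256_of(archive) == EXPECTED_SHA256
    print("digest matches" if matches else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for an 8.4 kB archive but carries over to larger downloads.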

Provenance

The following attestation bundles were made for vulntrain-0.3.0.tar.gz:

Publisher: release.yml on vulnerability-lookup/VulnTrain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vulntrain-0.3.0.tar.gz
- Subject digest: bb697fb0cdfdbcbe38f993f9149e5e103bbeb8c18423aec9b92be4a5a834c477
- Sigstore transparency entry: 173151541
- Sigstore integration time: Feb 20, 2025
Source repository:
- Permalink: vulnerability-lookup/VulnTrain@35918af5a8ba7ea3712f875095d0f0953af621d3
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/vulnerability-lookup
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@35918af5a8ba7ea3712f875095d0f0953af621d3
- Trigger Event: release

File details

Details for the file vulntrain-0.3.0-py3-none-any.whl.

File metadata

Download URL: vulntrain-0.3.0-py3-none-any.whl
Upload date: Feb 20, 2025
Size: 8.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vulntrain-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cb04557c1dc7a83da69a1aabb2f0db475c340b309847e0421fb87ed42bbfce3e`
MD5	`df4e73e8acbcd88860c56bd002bd5f55`
BLAKE2b-256	`a01e34961fd10a4e2099bc0bc54313d442704035bd8a6cf66a81fcf7ccc1d518`

Provenance

The following attestation bundles were made for vulntrain-0.3.0-py3-none-any.whl:

Publisher: release.yml on vulnerability-lookup/VulnTrain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vulntrain-0.3.0-py3-none-any.whl
- Subject digest: cb04557c1dc7a83da69a1aabb2f0db475c340b309847e0421fb87ed42bbfce3e
- Sigstore transparency entry: 173151545
- Sigstore integration time: Feb 20, 2025
Source repository:
- Permalink: vulnerability-lookup/VulnTrain@35918af5a8ba7ea3712f875095d0f0953af621d3
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/vulnerability-lookup
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@35918af5a8ba7ea3712f875095d0f0953af621d3
- Trigger Event: release
