GNN-based vulnerability detection for code — Final Project (Tugas Akhir)
gnn_vuln — Library API Reference
The installable model library behind the vulnerability-detection service. This is the complete public surface: what to import, the inputs, and the outputs.
The main API is not file-based: you pass a function's source as a string and get a result dict back. The only files involved are the model checkpoint and its config (weights and config normally live on disk) and the Joern CPG, which is created in a private temp directory and never exposed to you. In-memory in, in-memory out.
Install
# 1. torch + PyG sparse ext from their own indexes (PyPI can't resolve these alone)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu # or cu124
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.6.0+cpu.html
# 2. the library
pip install gnn-vuln
You also need Joern (for CPG generation) and JDK 21 on the host. Pass the path of the
joern-cli directory to the predictor through the joern_cli argument.
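Before loading a model, it can help to confirm that the prerequisites above are visible from the current interpreter. The following is a minimal sketch using only the standard library. The helper name `check_environment` is hypothetical and not part of gnn_vuln. It checks only that the packages can be located, that some `java` executable is on PATH (not its version), and that the joern-cli directory exists.

```python
import importlib.util
import shutil
from pathlib import Path


def check_environment(joern_cli_dir: str) -> dict:
    """Report which prerequisites are visible from this interpreter.

    Hypothetical helper, not part of gnn_vuln.
    """
    # Import names of the packages installed in the steps above.
    modules = ["torch", "torch_scatter", "torch_sparse", "gnn_vuln"]
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Only checks that some `java` is on PATH; it does not verify the JDK version.
    report["java_on_path"] = shutil.which("java") is not None
    # Only checks that the directory exists, not that it holds a working Joern.
    report["joern_cli_dir_exists"] = Path(joern_cli_dir).is_dir()
    return report


for item, ok in check_environment("C:/joern/joern-cli").items():
    print(f"{item:22} {'ok' if ok else 'MISSING'}")
```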
Inference — gnn_vuln.inference
VulnPredictor (high-level, recommended)
from gnn_vuln.inference import VulnPredictor
predictor = VulnPredictor.from_checkpoint(
    checkpoint="checkpoints/<run>/best_model.pt",  # trained weights (.pt file)
    config="configs/<arch>/config.yaml",           # its config (file, or pass a list)
    device="cuda",                                 # "cpu" | "cuda"
)
predictor.class_names = ["benign", "CWE-787", ...] # optional: override label names
| Method | Input | Output |
|---|---|---|
| predict_code(code, joern_cli, max_nodes=2500, top_k_lines=None) | function source string | result dict, or None if Joern produced no CPG |
| predict_codes(codes, joern_cli, max_nodes=2500, top_k_lines=None) | list[str] | list of result dicts (None per entry on Joern failure) |
| predict(data, top_k_lines=None) | a PyG Data object (already built) | result dict |
| predict_from_file(cpg_path, max_nodes=1000, top_k_lines=None) | path to a Joern CPG file | result dict, or None |
# the everyday call — string in, dict out (Joern handled internally)
result = predictor.predict_code(
    "void f(char *s){ char b[8]; strcpy(b, s); }",
    joern_cli="C:/joern/joern-cli",
    top_k_lines=5,
)
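Because predict_codes returns None for any entry where Joern produced no CPG, batch callers need to separate failures from scored entries. The sketch below shows one way to do that. The helper `split_batch` is hypothetical, the `results` list is a hand-written stand-in for what `predictor.predict_codes(...)` would return, and the code assumes results come back in the same order as the inputs.

```python
from typing import Optional


def split_batch(codes: list[str], results: list[Optional[dict]]):
    """Separate scored entries from entries where no result was produced.

    Hypothetical helper; assumes results[i] corresponds to codes[i].
    """
    if len(codes) != len(results):
        raise ValueError("codes and results must have the same length")
    scored, failed = [], []
    for index, result in enumerate(results):
        if result is None:
            failed.append(index)            # Joern produced no CPG for this input
        else:
            scored.append((index, result))
    return scored, failed


# Stand-in for the output of predictor.predict_codes(codes, joern_cli=...).
codes = ["void f(){}", "<input that failed>", "int g(){return 1;}"]
results = [
    {"prediction": "benign", "is_vulnerable": False, "confidence": 0.98},
    None,
    {"prediction": "CWE-120", "is_vulnerable": True, "confidence": 0.87},
]

scored, failed = split_batch(codes, results)
flagged = [index for index, result in scored if result["is_vulnerable"]]
print(flagged, failed)  # [2] [1]
```

Keeping the original indices makes it easy to retry or log the failed inputs without re-running the whole batch.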
Result dict (schema)
{
    "prediction": "CWE-120",          # predicted class name
    "class_id": 7,                    # predicted class index
    "is_vulnerable": True,            # class_id > 0
    "confidence": 0.87,               # softmax prob of the predicted class [0,1]
    "class_probabilities": {"benign": 0.01, "CWE-120": 0.87, ...},
    "suspicious_lines": [{"line": 3, "score": 0.92, "code": "strcpy(b, s);"}, ...],  # score-desc
    "cls_embedding": [0.013, -0.44, ...],  # pre-head function vector (for search/drift)
}
With the multiclass statement head, each suspicious_lines entry may also carry predicted_cwe
and per-line class_probabilities. cls_embedding is the representation fed to the output head.
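To make the schema concrete, here is a small sketch of a consumer that turns a result dict into a one-line report. The function `summarize` and its `min_score` threshold are hypothetical, and the sample dict is hand-written to match the documented fields rather than taken from a real model run.

```python
from typing import Optional


def summarize(result: Optional[dict], min_score: float = 0.5) -> str:
    """One-line report for a single prediction result (hypothetical helper)."""
    if result is None:
        return "no CPG produced"
    label = f"{result['prediction']} ({result['confidence']:.2f})"
    if not result["is_vulnerable"]:
        return label
    # Keep only lines at or above the threshold; the list is documented as score-descending.
    lines = [s for s in result.get("suspicious_lines", []) if s["score"] >= min_score]
    where = ", ".join(f"line {s['line']}" for s in lines) or "no line above threshold"
    return f"{label}: {where}"


# Hand-written sample shaped like the documented schema.
sample = {
    "prediction": "CWE-120",
    "class_id": 7,
    "is_vulnerable": True,
    "confidence": 0.87,
    "class_probabilities": {"benign": 0.01, "CWE-120": 0.87},
    "suspicious_lines": [
        {"line": 3, "score": 0.92, "code": "strcpy(b, s);"},
        {"line": 1, "score": 0.31, "code": "void f(char *s){"},
    ],
    "cls_embedding": [0.013, -0.44],
}

print(summarize(sample))  # CWE-120 (0.87): line 3
print(summarize(None))    # no CPG produced
```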
Module functions (lower-level)
from gnn_vuln.inference import load_model, predict, predict_from_file
model, class_names = load_model(checkpoint, config, device="cpu") # -> (nn.Module, list[str])
result = predict(model, data, class_names, device=None, top_k_lines=None) # PyG Data -> dict
result = predict_from_file(model, cpg_path, class_names, pretrained_lm=..., ...) # file -> dict
CPG generation — gnn_vuln.data.joern_runner
Only needed if you want the CPG file yourself; predict_code calls this for you.
from gnn_vuln.data.joern_runner import process_function
from pathlib import Path
cpg_path = process_function(
    code="int add(int a,int b){return a+b;}",  # source string
    idx=0,
    out_dir=Path("./out"),
    joern_cli_dir=Path("C:/joern/joern-cli"),
    fmt="graphml",   # "graphml" | "json"
    lang=None,       # None = auto-detect (c/cpp/java/js/py)
)  # -> Path to the written CPG, or None on failure
Config — gnn_vuln.config
from gnn_vuln.config import Config
cfg = Config.from_yaml("N48.yaml") # one monolithic file
cfg = Config.from_yamls(["data.yaml", "model.yaml", "train.yaml"]) # split, merged in order
# cfg.data, cfg.model, cfg.train, cfg.ewc, cfg.replay — dataclasses
cfg.data.mode # "binary" | "multiclass"
cfg.model.architecture # "lmgat_codebert" | "lmgat_seqgnn"
cfg.train.epochs # 100
from_yamls lets you split data / model / train configs into separate files; a single file
is just the one-element case (identical behaviour).
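The merge behaviour can be pictured with plain dictionaries standing in for parsed YAML files. This is only a sketch of one plausible reading of "merged in order": it assumes each file contributes top-level sections, and that a later file overrides keys an earlier file set in the same section. The function `merge_sections` is hypothetical and is not the library's implementation.

```python
def merge_sections(*configs: dict) -> dict:
    """Merge config dicts section by section, later files winning on conflicts.

    Hypothetical sketch; the override rule is an assumption.
    """
    merged: dict = {}
    for config in configs:
        for section, values in config.items():
            # Update the section in place so keys from earlier files survive
            # unless a later file sets the same key.
            merged.setdefault(section, {}).update(values)
    return merged


# Stand-ins for the parsed contents of data.yaml, model.yaml, and train.yaml.
data_cfg = {"data": {"mode": "multiclass", "train_ratio": 0.8}}
model_cfg = {"model": {"architecture": "lmgat_codebert"}}
train_cfg = {"train": {"epochs": 100, "seed": 42}, "data": {"train_ratio": 0.9}}

cfg = merge_sections(data_cfg, model_cfg, train_cfg)
print(cfg["data"])  # {'mode': 'multiclass', 'train_ratio': 0.9}

# A single file is the one-element case.
print(merge_sections(data_cfg) == data_cfg)  # True
```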
Train/val/test split
The split (dataset.get_splits, used by both train and evaluate) is seeded and deterministic.
Control it via config:
cfg.data.train_ratio # 0.8 — seeded split; test ratio = 1 - train - val
cfg.data.val_ratio # 0.1 — e.g. 0.9 / 0.1 → 90/10/0 (no test holdout, prod)
cfg.train.seed # 42 — shuffle seed (reproducible across runs/Python versions)
cfg.data.split_file # "" — path to {"train":[id],"val":[],"test":[]} keyed on parquet_id;
# OVERRIDES the ratios (bring-your-own / match-a-baseline split)
python -m gnn_vuln.train writes <results_dir>/<run>/split.json — the realized train/val/test
parquet_ids — next to training_summary.json, so the exact split is always recoverable.
A 0-ratio test split (e.g. 0.9 / 0.1 → no test) is supported: training + validation run
as usual and the end-of-training test evaluation is skipped (no crash, no test metrics).
Use it for a production model that should train on all labelled data without a holdout.
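The ratio arithmetic above can be sketched in a few lines. This is an illustration of a seeded ratio split, not the code behind dataset.get_splits: the function `make_split`, its rounding rule, and the use of `random.Random` are all assumptions. It does show the two documented properties: the test share is whatever remains after train and val, and a 0.9 / 0.1 configuration leaves the test set empty.

```python
import json
import random


def make_split(ids, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Seeded train/val/test split by ratio (hypothetical sketch)."""
    test_ratio = 1.0 - train_ratio - val_ratio
    if train_ratio < 0 or val_ratio < 0 or test_ratio < -1e-9:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    order = sorted(ids)                  # fixed starting order, so only the seed matters
    random.Random(seed).shuffle(order)   # private RNG; global random state is untouched
    n = len(order)
    n_train = round(n * train_ratio)
    if test_ratio < 1e-9:
        n_val = n - n_train              # no holdout: validation takes the remainder
    else:
        n_val = min(round(n * val_ratio), n - n_train)
    return {
        "train": order[:n_train],
        "val": order[n_train:n_train + n_val],
        "test": order[n_train + n_val:],
    }


split = make_split(range(100), train_ratio=0.9, val_ratio=0.1)
print({name: len(members) for name, members in split.items()})
# {'train': 90, 'val': 10, 'test': 0}

# Same shape as the documented split file: {"train": [...], "val": [...], "test": [...]}
text = json.dumps(split)
```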
Data pipeline & training — module CLIs (python -m)
Each step is a runnable module. Every step that takes --config accepts one config file or several split files (merged section-by-section).
| Command | In | Out |
|---|---|---|
| python -m gnn_vuln.data.prepare --input <parquet> --format bigvul --out-dir <dir> --joern-cli <joern> | raw rows (parquet) | per-function CPGs + cwe_vocab.json |
| python -m gnn_vuln.data.build_pt --config <yaml…> --split train | CPG dir | processed .pt (UniXcoder node features) |
| python -m gnn_vuln.data.merge --config <yaml…> --sources <s1> <s2> … --out-source <name> [--dedup] | built .pts | one merged .pt (label space unified) |
| python -m gnn_vuln.train --config <yaml…> | .pt + config | checkpoint + training_summary + split.json |
prepare flags: --binary, --top-cwe N, --sample-per-class N, --workers N.
Installed console scripts: train and evaluate (equivalent to python -m gnn_vuln.train and python -m gnn_vuln.evaluate).
The whole raw→pt→train flow:
python -m gnn_vuln.data.prepare --input data.parquet --format bigvul --out-dir data/raw --joern-cli <joern>
python -m gnn_vuln.data.build_pt --config config.yaml --split train
python -m gnn_vuln.train --config config.yaml
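If you prefer to drive the same three steps from Python, one option is a thin wrapper around subprocess. The sketch below only assembles the commands shown above and runs them in order. The helpers `pipeline_commands` and `run_pipeline` are hypothetical, and the default is a dry run that prints the commands without executing anything.

```python
import subprocess
import sys


def pipeline_commands(parquet: str, config: str, out_dir: str, joern_cli: str) -> list:
    """Build the prepare -> build_pt -> train commands (hypothetical helper)."""
    py = sys.executable  # run the modules with the current interpreter
    return [
        [py, "-m", "gnn_vuln.data.prepare", "--input", parquet, "--format", "bigvul",
         "--out-dir", out_dir, "--joern-cli", joern_cli],
        [py, "-m", "gnn_vuln.data.build_pt", "--config", config, "--split", "train"],
        [py, "-m", "gnn_vuln.train", "--config", config],
    ]


def run_pipeline(commands: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError, so a failed step stops the pipeline.
            subprocess.run(argv, check=True)


commands = pipeline_commands("data.parquet", "config.yaml", "data/raw", "C:/joern/joern-cli")
run_pipeline(commands)  # dry run: prints the three commands
```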
Evaluation outputs & GNN_VULN_API_MODE
Evaluator separates compute from persistence so a caller can decide what hits disk:
- Evaluator.compute() -> EvalResult: runs inference + metrics, returns everything in memory, writes nothing.
- Evaluator.save_artifacts(res): research persistence (predictions.csv, localization_scores.csv, metrics_summary.json, ROC / confusion / PR plots).
- Evaluator.save_summary(res): writes only metrics_summary.json (the small handoff).
- Evaluator.run() = compute() + save_artifacts() (the research/CLI default).
python -m gnn_vuln.evaluate --checkpoint <pt> runs the full research path. Pass
--metrics-only (or set GNN_VULN_API_MODE=1) to write just metrics_summary.json — for a
service that reads the metrics back and persists them elsewhere, with no bulky per-sample CSVs
or plots on disk.
GNN_VULN_API_MODE=1 also tells the trainer to skip research-only outputs
(training_log.csv, training_curves.png); the small handoffs split.json +
training_summary.json are still written. Set it when embedding the library in a service; leave
it unset for research runs that want the full artifacts for analysis.
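On the service side, the pattern described above amounts to setting the variable for the child process, running the evaluation, and reading metrics_summary.json back. The following is a sketch of that glue, not library code: `service_env`, `evaluate_command`, and `read_metrics` are hypothetical helpers, the directory that holds metrics_summary.json is supplied by the caller because this reference does not state it, and the JSON content in the demo is a placeholder rather than the real metrics schema.

```python
import json
import os
import sys
import tempfile
from pathlib import Path


def service_env(base=None) -> dict:
    """Copy of the environment with GNN_VULN_API_MODE=1 set for a child process."""
    env = dict(os.environ if base is None else base)
    env["GNN_VULN_API_MODE"] = "1"
    return env


def evaluate_command(checkpoint: str) -> list:
    """Command for a metrics-only evaluation run, using the flags documented above."""
    return [sys.executable, "-m", "gnn_vuln.evaluate", "--checkpoint", checkpoint, "--metrics-only"]


def read_metrics(run_dir: Path) -> dict:
    """Load metrics_summary.json from a caller-supplied directory."""
    with open(run_dir / "metrics_summary.json", encoding="utf-8") as handle:
        return json.load(handle)


# Demo with a stand-in file; a real run would be started with
# subprocess.run(evaluate_command(...), env=service_env(), check=True).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    # Placeholder content, not the real schema of metrics_summary.json.
    (run_dir / "metrics_summary.json").write_text(
        json.dumps({"placeholder_metric": 0.0}), encoding="utf-8"
    )
    metrics = read_metrics(run_dir)

print(metrics)  # {'placeholder_metric': 0.0}
```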
Package layout
gnn_vuln/
  inference.py          VulnPredictor, load_model, predict, predict_from_file
  config.py             Config (data/model/train/ewc/replay), from_yaml / from_yamls
  train.py              trainer (python -m gnn_vuln.train)
  evaluate.py           evaluation (python -m gnn_vuln.evaluate)
  models/               lmgat_codebert, lmgat_seqgnn: the architectures (built via config)
  data/
    prepare.py          raw rows → Joern CPG (python -m)
    build_pt.py         CPG → .pt (python -m)
    joern_runner.py     process_function (Joern wrapper)
    dataset_lm.py       CodeBERTGraphDataset (PyG InMemoryDataset, UniXcoder features)
    node_embedder.py    frozen LM per-node embeddings
The library resolves its data/checkpoint root from $GNN_VULN_ROOT (else the current working
directory), so it behaves the same installed-from-PyPI as in a source checkout.
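The lookup rule can be sketched as follows. This illustrates the described behaviour and is not the library's own code: `resolve_root` is a hypothetical name, and treating an empty GNN_VULN_ROOT the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def resolve_root(env=None) -> Path:
    """Return $GNN_VULN_ROOT if set, else the current working directory.

    Hypothetical sketch; an empty value is treated as unset (an assumption).
    """
    env = os.environ if env is None else env
    root = env.get("GNN_VULN_ROOT")
    return Path(root) if root else Path.cwd()


print(resolve_root({"GNN_VULN_ROOT": "/srv/gnn_vuln"}))  # the configured root
print(resolve_root({}) == Path.cwd())                    # True
```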