GNN-based vulnerability detection for code — Final Project (Tugas Akhir)

gnn_vuln — Library API Reference

The installable model library behind the vulnerability-detection service. This is the complete public surface: what to import, the inputs, and the outputs.

The API is not file-based. You pass a function source string and get a result dict back. The only files involved are the model checkpoint and its config (weights and config live on disk, as usual) and the Joern CPG, which is created in a private temp directory and never exposed to you. In-memory in, in-memory out.


Install

# 1. torch + PyG sparse ext from their own indexes (PyPI can't resolve these alone)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu     # or cu124
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.6.0+cpu.html
# 2. the library
pip install gnn-vuln

You also need Joern (for CPG generation) and a JDK 21 on the host. Point the predictor at the joern-cli directory through the joern_cli argument.
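Before the first prediction it can help to confirm that the host prerequisites are in place. The following is a minimal preflight sketch, not part of gnn_vuln: the helper names (missing_joern_tools, java_on_path) are hypothetical, and the launcher names it looks for (joern-parse, joern-export) are an assumption based on a stock joern-cli directory, not a statement about which tools the library actually invokes. It also does not check the JDK version.

```python
import shutil
from pathlib import Path

# Assumption: a stock joern-cli directory ships these launchers
# (with a .bat suffix on Windows). The library may use others.
ASSUMED_TOOLS = ("joern-parse", "joern-export")


def missing_joern_tools(joern_cli_dir, tools=ASSUMED_TOOLS):
    """Return the launcher names that are absent from joern_cli_dir."""
    root = Path(joern_cli_dir)
    missing = []
    for name in tools:
        candidates = (name, name + ".bat")
        if not any((root / candidate).is_file() for candidate in candidates):
            missing.append(name)
    return missing


def java_on_path():
    """True if a `java` executable is discoverable on PATH."""
    return shutil.which("java") is not None
```

Calling missing_joern_tools("C:/joern/joern-cli") returns an empty list when both launchers are found; anything else is worth fixing before calling predict_code.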


Inference — gnn_vuln.inference

VulnPredictor (high-level, recommended)

from gnn_vuln.inference import VulnPredictor

predictor = VulnPredictor.from_checkpoint(
    checkpoint="checkpoints/<run>/best_model.pt",   # trained weights (.pt file)
    config="configs/<arch>/config.yaml",            # its config (file, or pass a list)
    device="cuda",                                  # "cpu" | "cuda"
)
predictor.class_names = ["benign", "CWE-787", ...]  # optional: override label names

| Method | Input | Output |
|---|---|---|
| predict_code(code, joern_cli, max_nodes=2500, top_k_lines=None) | function source string | result dict, or None if Joern produced no CPG |
| predict_codes(codes, joern_cli, max_nodes=2500, top_k_lines=None) | list[str] | list of result dicts (None per entry on Joern failure) |
| predict(data, top_k_lines=None) | a PyG Data object (already built) | result dict |
| predict_from_file(cpg_path, max_nodes=1000, top_k_lines=None) | path to a Joern CPG file | result dict, or None |

# the everyday call — string in, dict out (Joern handled internally)
result = predictor.predict_code(
    "void f(char *s){ char b[8]; strcpy(b, s); }",
    joern_cli="C:/joern/joern-cli",
    top_k_lines=5,
)
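
Because predict_codes returns None for every entry where Joern failed, callers usually need to separate usable results from failures before doing anything else. A minimal sketch of that bookkeeping follows; split_batch is a hypothetical helper, and the sample results are stand-ins written by hand, not real model output.

```python
def split_batch(codes, results):
    """Pair each input index with its result and collect Joern failures.

    `results` is expected to be aligned with `codes`, with None marking
    an entry for which no CPG was produced.
    """
    if len(codes) != len(results):
        raise ValueError("codes and results must have the same length")
    ok, failed = [], []
    for idx, result in enumerate(results):
        if result is None:
            failed.append(idx)
        else:
            ok.append((idx, result))
    return ok, failed


# Hand-written stand-ins for what predictor.predict_codes(...) might return.
codes = ["void f(void){}", "int g("]
results = [{"prediction": "benign", "is_vulnerable": False, "confidence": 0.98}, None]

ok, failed = split_batch(codes, results)
print(f"{len(ok)} analysed, {len(failed)} without a CPG: {failed}")
```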

Result dict (schema)

{
  "prediction":          "CWE-120",          # predicted class name
  "class_id":            7,                   # predicted class index
  "is_vulnerable":       True,                # class_id > 0
  "confidence":          0.87,                # softmax prob of the predicted class [0,1]
  "class_probabilities": {"benign": 0.01, "CWE-120": 0.87, ...},
  "suspicious_lines":    [{"line": 3, "score": 0.92, "code": "strcpy(b, s);"}, ...],  # score-desc
  "cls_embedding":       [0.013, -0.44, ...], # pre-head function vector (for search/drift)
}
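
As an illustration of consuming this schema, the sketch below turns a result into a coarse triage label and a short list of the highest-scoring lines. The helper names, the label strings, and the 0.5 confidence threshold are all assumptions chosen for the example; only the dict keys come from the schema above.

```python
def triage(result, min_confidence=0.5):
    """Map a result dict (or None) to a coarse label.

    Labels and threshold are illustrative, not part of gnn_vuln.
    """
    if result is None:
        return "no-cpg"      # Joern produced no CPG
    if not result["is_vulnerable"]:
        return "benign"
    if result["confidence"] < min_confidence:
        return "review"      # vulnerable class, but low confidence
    return "flag"


def top_lines(result, k=3):
    """Format the k highest-scoring suspicious lines as strings."""
    entries = sorted(result.get("suspicious_lines", []),
                     key=lambda e: e["score"], reverse=True)
    return [f'{e["line"]}: {e["code"]} ({e["score"]:.2f})' for e in entries[:k]]


# Sample shaped like the schema above (values copied from it).
sample = {
    "prediction": "CWE-120",
    "class_id": 7,
    "is_vulnerable": True,
    "confidence": 0.87,
    "suspicious_lines": [{"line": 3, "score": 0.92, "code": "strcpy(b, s);"}],
}
print(triage(sample), top_lines(sample, k=1))
```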

When the multiclass statement head is in use, each suspicious_lines entry may also carry predicted_cwe and per-line class_probabilities. cls_embedding is the function-level representation fed to the output head.
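
The schema notes that cls_embedding is meant for search and drift use cases. One common way to compare such vectors is cosine similarity; the sketch below shows that in plain Python. It is an assumption that cosine similarity suits these embeddings, the helper names are hypothetical, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def nearest(query, index):
    """Return the keys of `index` ordered from most to least similar.

    `index` maps a function name to a stored cls_embedding.
    """
    return sorted(index,
                  key=lambda name: cosine_similarity(query, index[name]),
                  reverse=True)


# Toy vectors standing in for result["cls_embedding"] values.
index = {"copy_buf": [1.0, 0.1], "add": [0.0, 1.0]}
print(nearest([0.9, 0.2], index))
```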

Module functions (lower-level)

from gnn_vuln.inference import load_model, predict, predict_from_file

model, class_names = load_model(checkpoint, config, device="cpu")   # -> (nn.Module, list[str])
result = predict(model, data, class_names, device=None, top_k_lines=None)   # PyG Data -> dict
result = predict_from_file(model, cpg_path, class_names, pretrained_lm=..., ...)  # file -> dict

CPG generation — gnn_vuln.data.joern_runner

Only needed if you want the CPG file yourself; predict_code calls this for you.

from gnn_vuln.data.joern_runner import process_function
from pathlib import Path

cpg_path = process_function(
    code="int add(int a,int b){return a+b;}",  # source string
    idx=0,
    out_dir=Path("./out"),
    joern_cli_dir=Path("C:/joern/joern-cli"),
    fmt="graphml",         # "graphml" | "json"
    lang=None,             # None = auto-detect (c/cpp/java/js/py)
)   # -> Path to the written CPG, or None on failure

Config — gnn_vuln.config

from gnn_vuln.config import Config

cfg = Config.from_yaml("N48.yaml")                              # one monolithic file
cfg = Config.from_yamls(["data.yaml", "model.yaml", "train.yaml"])  # split, merged in order
# cfg.data, cfg.model, cfg.train, cfg.ewc, cfg.replay  — dataclasses
cfg.data.mode          # "binary" | "multiclass"
cfg.model.architecture # "lmgat_codebert" | "lmgat_seqgnn"
cfg.train.epochs       # 100

from_yamls lets you split data / model / train configs into separate files; a single file is just the one-element case (identical behaviour).
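
To make "merged in order" concrete, here is a sketch of section-wise merging over plain dicts, standing in for already-parsed YAML files. It is not the library's implementation: merge_sections is a hypothetical helper, and the rule that a later file overrides keys from an earlier one within the same section is an assumption. Only the section names (data, model, train, ewc, replay) come from the text above.

```python
from copy import deepcopy

SECTIONS = ("data", "model", "train", "ewc", "replay")


def merge_sections(parts):
    """Merge parsed config files section by section, in list order.

    Assumption: when two files define the same key in the same section,
    the later file wins.
    """
    merged = {}
    for part in parts:
        for section, values in part.items():
            if section not in SECTIONS:
                raise KeyError(f"unknown section: {section}")
            merged.setdefault(section, {}).update(deepcopy(values))
    return merged


# Stand-ins for the parsed contents of data.yaml, model.yaml, train.yaml.
data_part = {"data": {"mode": "multiclass", "train_ratio": 0.8}}
model_part = {"model": {"architecture": "lmgat_codebert"}}
train_part = {"train": {"epochs": 100, "seed": 42}}

merged = merge_sections([data_part, model_part, train_part])
print(sorted(merged))
```

Under this sketch a single file is the one-element list, which matches the "identical behaviour" statement above.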

Train/val/test split

The split (dataset.get_splits, used by both train and evaluate) is seeded and deterministic. Control it via config:

cfg.data.train_ratio   # 0.8  — seeded split; test ratio = 1 - train - val
cfg.data.val_ratio     # 0.1  — e.g. 0.9 / 0.1 → 90/10/0 (no test holdout, prod)
cfg.train.seed         # 42   — shuffle seed (reproducible across runs/Python versions)
cfg.data.split_file    # ""   — path to {"train":[id],"val":[],"test":[]} keyed on parquet_id;
                       #        OVERRIDES the ratios (bring-your-own / match-a-baseline split)
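
The ratio semantics above can be illustrated with a small seeded split over a list of ids. This is a sketch of the general technique, not dataset.get_splits itself: seeded_split is a hypothetical function, and the rounding and shuffling details are assumptions, so it will not reproduce the library's actual partition.

```python
import random


def seeded_split(ids, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Deterministically split ids into (train, val, test) lists.

    The test share is whatever remains: 1 - train_ratio - val_ratio.
    """
    if train_ratio < 0 or val_ratio < 0 or train_ratio + val_ratio > 1 + 1e-9:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    order = sorted(ids)                     # fixed starting order
    random.Random(seed).shuffle(order)      # seeded, so repeatable
    n = len(order)
    n_train = round(n * train_ratio)
    if 1 - train_ratio - val_ratio < 1e-9:
        n_val = n - n_train                 # zero test ratio: no holdout
    else:
        n_val = min(round(n * val_ratio), n - n_train)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])


train_ids, val_ids, test_ids = seeded_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))
```

With train_ratio=0.9 and val_ratio=0.1 the third list comes back empty, which corresponds to the no-holdout configuration.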

python -m gnn_vuln.train writes <results_dir>/<run>/split.json — the realized train/val/test parquet_ids — next to training_summary.json, so the exact split is always recoverable.
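
When supplying your own split_file, or reusing a split.json from an earlier run, a quick sanity check is to confirm that the three id lists are present and disjoint. The sketch below does that with the standard library. load_split is a hypothetical helper, and it assumes split.json has the same {"train": [...], "val": [...], "test": [...]} shape described for split_file; the integer ids are placeholders for parquet_id values.

```python
import json
import tempfile
from pathlib import Path


def load_split(path):
    """Load a split file and verify that no id appears in two partitions."""
    split = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = {"train", "val", "test"} - split.keys()
    if missing:
        raise ValueError(f"missing partitions: {sorted(missing)}")
    seen = {}
    for name in ("train", "val", "test"):
        for pid in split[name]:
            if pid in seen:
                raise ValueError(f"id {pid!r} is in both {seen[pid]} and {name}")
            seen[pid] = name
    return split


# Write a tiny example file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "split.json"
    path.write_text(json.dumps({"train": [1, 2, 3], "val": [4], "test": []}))
    split = load_split(path)
    print({name: len(ids) for name, ids in split.items()})
```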

A 0-ratio test split (e.g. 0.9 / 0.1 → no test) is supported: training + validation run as usual and the end-of-training test evaluation is skipped (no crash, no test metrics). Use it for a production model that should train on all labelled data without a holdout.


Data pipeline & training — module CLIs (python -m)

Each step is a runnable module. Every step that takes --config accepts one config file or several split files (merged section by section).

| Command | In | Out |
|---|---|---|
| python -m gnn_vuln.data.prepare --input <parquet> --format bigvul --out-dir <dir> --joern-cli <joern> | raw rows (parquet) | per-function CPGs + cwe_vocab.json |
| python -m gnn_vuln.data.build_pt --config <yaml…> --split train | CPG dir | processed .pt (UniXcoder node features) |
| python -m gnn_vuln.data.merge --config <yaml…> --sources <s1> <s2> … --out-source <name> [--dedup] | built .pts | one merged .pt (label space unified) |
| python -m gnn_vuln.train --config <yaml…> | .pt + config | checkpoint + training_summary + split.json |

prepare flags: --binary, --top-cwe N, --sample-per-class N, --workers N. Installed console scripts: train, evaluate (= python -m gnn_vuln.train / .evaluate).

The whole raw→pt→train flow:

python -m gnn_vuln.data.prepare  --input data.parquet --format bigvul --out-dir data/raw --joern-cli <joern>
python -m gnn_vuln.data.build_pt --config config.yaml --split train
python -m gnn_vuln.train         --config config.yaml
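
If you prefer to drive the same three steps from Python, for example in a CI job, one option is to build the argument lists and hand them to subprocess. The sketch below only assembles the commands shown above; pipeline_commands and run_pipeline are hypothetical helpers, and the example paths are placeholders.

```python
import subprocess
import sys


def pipeline_commands(parquet, raw_dir, joern_cli, config):
    """Return the prepare, build_pt and train commands as argv lists."""
    py = sys.executable  # run the modules with the current interpreter
    return [
        [py, "-m", "gnn_vuln.data.prepare", "--input", parquet,
         "--format", "bigvul", "--out-dir", raw_dir, "--joern-cli", joern_cli],
        [py, "-m", "gnn_vuln.data.build_pt", "--config", config,
         "--split", "train"],
        [py, "-m", "gnn_vuln.train", "--config", config],
    ]


def run_pipeline(commands):
    """Run each step in order, stopping at the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = pipeline_commands("data.parquet", "data/raw",
                             "C:/joern/joern-cli", "config.yaml")
for argv in commands:
    print(" ".join(argv[1:]))
```

Passing argv lists rather than a shell string avoids quoting problems with paths that contain spaces.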

Package layout

gnn_vuln/
  inference.py            VulnPredictor, load_model, predict, predict_from_file
  config.py               Config (data/model/train/ewc/replay), from_yaml / from_yamls
  train.py                trainer  (python -m gnn_vuln.train)
  evaluate.py             evaluation (python -m gnn_vuln.evaluate)
  models/                 lmgat_codebert, lmgat_seqgnn — the architectures (built via config)
  data/
    prepare.py            raw rows → Joern CPG            (python -m)
    build_pt.py           CPG → .pt                       (python -m)
    joern_runner.py       process_function — Joern wrapper
    dataset_lm.py         CodeBERTGraphDataset (PyG InMemoryDataset, UniXcoder features)
    node_embedder.py      frozen LM per-node embeddings

The library resolves its data/checkpoint root from $GNN_VULN_ROOT (else the current working directory), so it behaves the same installed-from-PyPI as in a source checkout.
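
The lookup rule described here can be mirrored in a few lines, which is handy when your own scripts need to find the same directory. This is a sketch of the stated behaviour, not the library's code: resolve_root is a hypothetical function, and treating an empty GNN_VULN_ROOT as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_root(environ=None):
    """Return $GNN_VULN_ROOT as a Path, else the current working directory."""
    env = os.environ if environ is None else environ
    root = env.get("GNN_VULN_ROOT")
    # Assumption: an empty value is treated the same as an unset variable.
    return Path(root) if root else Path.cwd()


print(resolve_root({"GNN_VULN_ROOT": "/srv/gnn_vuln"}))
print(resolve_root({}))
```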
