AI-native code security scanner with cascade analysis and Firecracker-microVM DAST runtime validation
Argus
An AI-native code security scanner (Semantic Deep Analysis) that proves exploitability at runtime.
Argus combines a cost-graduated LLM cascade (Gemini Flash-Lite → Sonnet 4.6 → Opus 4.6) with a sandbox tier that executes suspect code in a Firecracker microVM and observes what it actually does. Static-analysis findings get promoted to CONFIRMED only when the sandbox captures concrete runtime evidence — a network call, a file write, a process spawn. Findings that cannot be triggered are marked UNREACHED; findings the file's own defenses block are BLOCKED. No more "the LLM said it might be malicious."
Open source, Apache 2.0, BYOK. You pay your providers directly — Anthropic + Google for the cascade, Fly.io for the optional DAST sandbox. Argus collects nothing.
What makes it different
Most scanners stop at "this code matches a vulnerability pattern." Argus runs the code, watches what it does, and reports per-finding outcomes:
| Status | Meaning |
|---|---|
| `CONFIRMED` | The sandbox observed the exploit firing at runtime. PoC + event trace are surfaced with the finding. |
| `BLOCKED` | The attack was tested; the file's own code defended against it (sanitization, escaping, allowlist, etc.). |
| `UNREACHED` | The attack was tested; the code path is genuinely unreachable. |
| `NOT_TESTED` | Sandbox couldn't execute the test (with a sub-reason: `infra_stub` / `inconclusive` / `not_planned`). |
vs. other approaches
| Approach | Output | False-positive burden | Evidence type |
|---|---|---|---|
| Pattern-match scanner (regex / AST) | Syntactic match | High | None |
| Single frontier LLM (single-call) | Probabilistic opinion (semantic) | Medium-high | LLM reasoning |
| Argus | Runtime-verified verdict | Low | Sandbox traces |
A CONFIRMED finding looks like this in argus scan output:
{
  "cwe": "CWE-200",
  "type": "data_exfiltration",
  "severity": "critical",
  "status": "CONFIRMED",
  "confidence": 1.0,
  "runtime_evidence": "Mock HTTP server at 127.0.0.1:8000 captured POST body containing 'FAKE_PRIVATE_KEY_CONTENT' and 'ssh-rsa AAAAFAKEKEY user@host'. The malware decoded its base64 payload (process_exit step=0) and POSTed the contents of ~/.ssh/ to the rewritten C2 endpoint, exactly as L1's hypothesis predicted.",
  "proof_of_concept": "On any Unix host with SSH keys present, execution sends the full contents of ~/.ssh/ to the remote C2 server over HTTPS."
}
This is Argus's moat. Static and single-LLM scanners report suspicion. Argus reports what the code actually did — with concrete evidence, or clear proof it didn't.
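Downstream tooling will usually want only the findings that carry runtime proof. The sketch below assumes findings arrive as a JSON list of objects shaped like the sample above; the exact envelope of Argus's JSON output is not documented here, and the second finding in the sample data is invented purely for illustration.

```python
import json

# Illustrative sample data. Only the field names are taken from the finding
# shown above; the second entry is made up to exercise the filter.
raw = """
[
  {"cwe": "CWE-200", "type": "data_exfiltration", "severity": "critical",
   "status": "CONFIRMED", "confidence": 1.0,
   "runtime_evidence": "Mock HTTP server captured POST body"},
  {"cwe": "CWE-78", "type": "command_injection", "severity": "high",
   "status": "UNREACHED", "confidence": 0.4}
]
"""


def confirmed(findings: list[dict]) -> list[dict]:
    """Keep only findings the sandbox proved at runtime."""
    return [f for f in findings if f.get("status") == "CONFIRMED"]


findings = json.loads(raw)
for finding in confirmed(findings):
    print(finding["cwe"], finding["severity"], finding.get("runtime_evidence", ""))
```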
Benchmark — Argus vs frontier single-call scanners
Scored against a ground-truth oracle derived from external security research and a multi-vendor LLM consensus (majority agreement):
Verdict-exact (higher = better)
Argus (cascade + DAST) ████████████████████ 91.3%
Gemini 3.1 Pro █████████████████░░░ 82.6%
Grok 4.3 █████████████████░░░ 82.6%
Opus 4.6 █████████████████░░░ 78.3%
GPT 5.4 ████████████████░░░░ 73.9%
Argus scores 13.0 pp higher than Opus 4.6 and 17.4 pp higher than GPT 5.4 on verdict-exact accuracy. On the rich-oracle subset Argus also leads on finding quality: CWE F1 0.297 vs Opus 0.180 (a 65% relative lift) and capability F1 0.771 vs Opus 0.720. Mean verdict-distance: 0.087 vs Opus 0.217.
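The headline gaps follow directly from the chart figures. As a quick arithmetic check (the helper names `pp_gap` and `relative_lift` are ad hoc, not part of Argus):

```python
# Verdict-exact scores (%) as listed in the chart above.
scores = {
    "Argus": 91.3,
    "Gemini 3.1 Pro": 82.6,
    "Grok 4.3": 82.6,
    "Opus 4.6": 78.3,
    "GPT 5.4": 73.9,
}


def pp_gap(a: str, b: str) -> float:
    """Difference between two scanners in percentage points."""
    return round(scores[a] - scores[b], 1)


def relative_lift(new: float, base: float) -> int:
    """Relative improvement of `new` over `base`, in whole percent."""
    return round((new - base) / base * 100)


print(pp_gap("Argus", "Opus 4.6"))   # 13.0
print(pp_gap("Argus", "GPT 5.4"))    # 17.4
print(relative_lift(0.297, 0.180))   # 65
```

Percentage points measure the absolute gap between two accuracy figures, while the CWE F1 lift is relative to the Opus baseline, which is why the two kinds of number are not directly comparable.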
But the differentiator the single-call scanners can't produce is runtime evidence. On the same suite, Argus's DAST tier observed 25 CONFIRMED exploits + 1 BLOCKED with concrete sandbox-captured artefacts — network calls, exfil POST bodies, process traces. By verifying which findings are actually exploitable versus mere pattern matches, Argus minimizes the false-positive flood that drowns security teams using static-only scanners. Unlike single-call LLMs that must guess exploitability, Argus's DAST tier tests it — turning many "maybe" findings into proven CONFIRMED exploits or clean UNREACHED / BLOCKED resolutions.
Methodology + per-file breakdown: bench_results/v1_1_launch/launch_report.md. Re-run is one command: python -m methodology.run_phase_a_report.
How the cascade keeps it cheap
Most files in a real codebase are clean. Argus is built around that observation: spend $0.0001 to dispatch a clean file in 1 second, $0.07 to deep-analyze a suspicious one, and only invoke the sandbox tier on the small subset of files where runtime confirmation actually matters.
File
↓
[$0] Preprocessing hash, deobfuscation, deps, attack-vector flags
↓
[Gemini Flash-Lite] Triage CLEAN | LOW | HIGH ~$0.0001/file
↓
├─ CLEAN → return
├─ LOW → Gemini Flash combined analysis ~$0.02/file
└─ HIGH → Sonnet 4.6 combined analysis ~$0.07/file (default)
↓ borderline / high-stakes
Opus 4.6 deep analysis ~$0.15/file (~20% of HIGH)
↓
[N=3 Sonnet ensemble] borderline-uncertainty path
↓
[DAST sandbox] Sonnet orchestrator + Firecracker microVM
(minimal / networked / ml_tools images)
↓ inconclusive after 2 iterations
Opus iter-3 escalation
↓
[Engine guard] DAST never lowers L1's verdict without
sandbox-grounded refutation
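The routing in the diagram can be sketched as a function that takes its model runners as arguments. This is an illustration under stated assumptions rather than the engine's real code: `route_file`, `ScanResult`, the `Runner` signature, and the `needs_escalation` predicate are hypothetical, and the ensemble and DAST stages are left out for brevity.

```python
from collections.abc import Callable
from dataclasses import dataclass

# Hypothetical runner shape: file content in, label or verdict out.
Runner = Callable[[str], str]


@dataclass
class ScanResult:
    verdict: str
    tiers_used: list[str]


def route_file(
    content: str,
    triage_runner: Runner,
    flash_runner: Runner,
    sonnet_runner: Runner,
    opus_runner: Runner,
    needs_escalation: Callable[[str], bool],
) -> ScanResult:
    """Send a file only as deep into the cascade as triage requires."""
    tiers = ["triage"]
    label = triage_runner(content)
    if label == "CLEAN":
        return ScanResult("clean", tiers)  # cheapest path: triage only
    if label == "LOW":
        tiers.append("flash")
        return ScanResult(flash_runner(content), tiers)
    tiers.append("sonnet")  # HIGH goes to Sonnet by default
    verdict = sonnet_runner(content)
    if needs_escalation(verdict):  # borderline or high-stakes
        tiers.append("opus")
        verdict = opus_runner(content)
    return ScanResult(verdict, tiers)


# Stub runners stand in for real provider calls, so this runs with no API spend.
result = route_file(
    "print('hello')",
    triage_runner=lambda _: "CLEAN",
    flash_runner=lambda _: "clean",
    sonnet_runner=lambda _: "suspicious",
    opus_runner=lambda _: "malicious",
    needs_escalation=lambda verdict: verdict == "suspicious",
)
print(result)
```

Because every runner is a parameter, the same function can be exercised in tests with lambdas and in production with real provider clients.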
Cost projection per 100 files
| Stage | Calls | API spend |
|---|---|---|
| Triage (Flash-Lite) | 100 | $0.10 |
| LOW analysis (Flash, ~50 files) | 50 | $1.00 |
| HIGH analysis (Sonnet, ~15 files) | 15 | $1.05 |
| HIGH + Opus escalation (~5 files) | 5 | $1.00 |
| Borderline ensemble (Opus, ~3 files) | 3 | $0.60 |
| DAST verification (~3 files) | 3 | $0.90 |
| Total | | ~$4.65 |
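The total is the sum of the per-stage spend column. A short check, with the rows copied from the table as listed (the variable names are ad hoc):

```python
from decimal import Decimal

# (stage, calls, API spend in USD), copied from the table above.
stages = [
    ("Triage (Flash-Lite)", 100, Decimal("0.10")),
    ("LOW analysis (Flash)", 50, Decimal("1.00")),
    ("HIGH analysis (Sonnet)", 15, Decimal("1.05")),
    ("HIGH + Opus escalation", 5, Decimal("1.00")),
    ("Borderline ensemble", 3, Decimal("0.60")),
    ("DAST verification", 3, Decimal("0.90")),
]

total = sum((spend for _, _, spend in stages), Decimal("0"))
print(total)        # 4.65
print(total / 100)  # 0.0465 average per scanned file
```

Decimal arithmetic avoids the binary floating-point rounding that would otherwise creep into a sum of cent-level amounts.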
Hard cost caps (--max-cost <USD> per file, or ScanConfig.max_cost_per_file_usd) abort scans that exceed your declared budget. You'll never get a surprise bill from Argus — the bill comes from your API providers, on a meter you control.
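One way to picture a per-file cap is a meter that refuses the next call once the projected spend would pass the limit. This is a sketch of the idea only: `CostMeter`, `BudgetExceeded`, and the check-before-charge policy are assumptions, not Argus's implementation of `max_cost_per_file_usd`.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push a file past its cost cap."""


class CostMeter:
    """Hypothetical per-file spend tracker with a hard cap in USD."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, tier: str, estimated_cost_usd: float) -> None:
        # Check before spending, so the cap is never overrun.
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.max_cost_usd:
            raise BudgetExceeded(
                f"{tier}: ${projected:.4f} would exceed cap ${self.max_cost_usd:.2f}"
            )
        self.spent_usd = projected


meter = CostMeter(max_cost_usd=0.10)
meter.charge("triage", 0.0001)
meter.charge("sonnet", 0.07)
try:
    meter.charge("opus", 0.15)  # would take the file to about $0.22
except BudgetExceeded as exc:
    print(f"scan aborted: {exc}")
```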
Quick start
pip install argus-ai-scanner
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
argus scan path/to/your/file.py
Requirements:
- Python 3.12+
- An Anthropic API key — console.anthropic.com
- A Google AI Studio key — aistudio.google.com
- Optional: a Fly.io account if you want the DAST sandbox tier (Fly setup runbook)
Single-file scan
# Default: cascade + DAST on confirmed-malicious verdicts
uv run argus scan suspicious_package.py
# Tunable DAST coverage — also DAST suspicious files (~30-50% more API spend)
uv run argus scan suspicious_package.py \
--dast-trigger-verdicts suspicious,malicious,critical_malicious
# Strictest budget mode — DAST only the highest-severity verdict tier
uv run argus scan suspicious_package.py --dast-trigger-verdicts critical_malicious
# Hard cost cap, any verdict
uv run argus scan suspicious_package.py --max-cost 0.50
# Discovery mode — proactive payload sweep for CWEs L1 missed (+~$0.25/file)
uv run argus scan suspicious_package.py --enable-discovery
# Skip DAST entirely (no Fly setup required; cascade-only verdicts)
uv run argus scan suspicious_package.py --no-dast
Repo scan (whole project)
argus scan-repo PATH walks a directory tree, applies file-type and .gitignore filters, and dispatches every supported file through the cascade. For private repos, clone locally first using your existing git credentials, then point Argus at the local path — Argus reads files from disk, not via the GitHub API.
# Whole project, current directory
cd ~/work/my-project
uv run argus scan-repo .
# PR / CI mode — only files changed vs main
uv run argus scan-repo . --diff origin/main
# CI with budget + SARIF output for GitHub Code Scanning
uv run argus scan-repo . \
--diff origin/main \
--max-cost 5.00 \
--output sarif \
--output-file findings.sarif
# Add a custom exclude pattern on top of .gitignore
uv run argus scan-repo . --exclude "vendor/**" --exclude "**/*.generated.*"
What gets scanned: the file-type allowlist covers Python, JavaScript / TypeScript, shell, Java bytecode, Markdown / RST / AsciiDoc (AI-injection surface), HTML / SVG / XML (XSS / XXE), and supply-chain manifests (package.json, requirements.txt, Cargo.lock, go.mod, Gemfile, composer.json, etc.). AI-agent config sentinels (CLAUDE.md, AGENTS.md, .cursorrules, mcp.json, claude_desktop_config.json, devcontainer.json, …) are explicitly recognized — these are the prime vectors for malicious-instructions-in-config attacks against coding agents. Always-ignored: .git, node_modules, __pycache__, .venv, build dirs, etc.
Output formats: --output markdown (default; human summary) / json (full per-file results) / sarif (SARIF v2.1.0 JSON, uploadable to GitHub Code Scanning).
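SARIF v2.1.0 keeps results under `runs[].results[]`, so a CI step can summarize a report with a few lines of Python. The sketch below uses an inline sample document so it runs on its own; the rule IDs and messages in the sample are invented, and which SARIF fields Argus actually populates is an assumption.

```python
import json
from collections import Counter

# Minimal SARIF-shaped sample; rule IDs and messages are illustrative only.
sample = json.loads("""
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {"driver": {"name": "argus"}},
      "results": [
        {"ruleId": "CWE-200", "level": "error",
         "message": {"text": "data exfiltration confirmed at runtime"}},
        {"ruleId": "CWE-79", "level": "warning",
         "message": {"text": "possible XSS sink"}}
      ]
    }
  ]
}
""")


def count_by_level(sarif: dict) -> Counter:
    """Count results per SARIF level across all runs."""
    counts: Counter = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # "warning" is the usual SARIF fallback when level is omitted.
            counts[result.get("level", "warning")] += 1
    return counts


print(dict(count_by_level(sample)))
```

To apply this to a real report, load the file written by `--output-file` with `json.load` and pass the resulting dict to `count_by_level`.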
DAST sandbox tier
DAST is optional. Without it, Argus ships verdicts using the L1 cascade alone. With it, you get per-finding CONFIRMED / BLOCKED / UNREACHED evidence backed by real runtime traces.
When enabled, every DAST plan runs in an ephemeral Firecracker microVM (Fly.io managed). The orchestrator:
- Reads L1's hypotheses about how the file might be exploitable
- Generates a concrete plan: sandbox commands, expected oracle, image hint
- Submits to the microVM, which runs the file with file-content materialized at `/workspace/<basename>`, captures network calls via DNS hijack, and emits a structured event stream
- Reads back the events, scores each hypothesis as `CONFIRMED` / `BLOCKED` / `UNREACHED` / `NOT_TESTED`
- Surfaces the captured evidence (`runtime_evidence` field per finding)
Three sandbox images cover most workloads:
| Image | Contents | Use cases |
|---|---|---|
| `minimal-v1` | Python 3.13 + Node.js + npm + JRE + bash + curl | Pickle exploits, file I/O, subprocess, basic crypto |
| `networked-v1` | minimal + curl / wget / nc / dig / openssl | Exfiltration confirmation via real DNS / network captures |
| `ml_tools-v1` | networked + torch CPU + transformers + safetensors | Malicious model loaders, pickled `__reduce__` payloads |
Multi-language coverage today: Python, JavaScript / TypeScript, bash, Java bytecode. Roadmap: Go, Rust, Java source (compile required), .NET.
Full setup: docs/dast-setup.md.
Privacy
In two-tier (no DAST) mode, the files you scan go only to the LLM providers you configured (Anthropic and Google), under your own API keys. With DAST enabled, file content is also shipped (gzip + base64) to your own Fly app over the Fly Machines API. Nothing is routed through any Argus-operated infrastructure.
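The gzip + base64 packaging mentioned above is a standard way to move arbitrary file bytes through a JSON API. A minimal round-trip sketch using only the standard library (the function names `pack` and `unpack` are hypothetical, and the real wire format may carry extra metadata):

```python
import base64
import gzip


def pack(content: bytes) -> str:
    """Compress, then encode as ASCII-safe text suitable for a JSON field."""
    return base64.b64encode(gzip.compress(content)).decode("ascii")


def unpack(payload: str) -> bytes:
    """Reverse of pack: decode base64, then decompress."""
    return gzip.decompress(base64.b64decode(payload))


source = b"import os\nprint(os.getcwd())\n" * 50
payload = pack(source)
print(len(source), len(payload))  # repetitive input compresses well
print(unpack(payload) == source)
```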
Argus has no telemetry, no opt-in analytics, and no usage reporting. The CLI does not phone home.
Architecture invariants
The non-negotiable design rules — break these in a PR and expect a long review:
- Preprocessing is deterministic and free. No model calls in `preprocessing/`. If you're tempted, the change belongs in `analysis/`.
- The cascade short-circuits cheap files cheap. A clean file costs $0.0001 (triage only). Don't add expensive defaults.
- All runners are injectable. `scan_file(triage_runner=, sonnet_runner=, opus_runner=, dast_runner=)`. The engine never hard-codes provider calls, which is how unit tests run with no API spend.
- DAST never silently lowers an L1 verdict. A `malicious→suspicious` downgrade only fires when every L1 finding is sandbox-grounded as `BLOCKED` or `UNREACHED`. Without that, L1's verdict stands and `dast_keep_l1` is recorded.
- Cost guardrails enforced before any user-facing release. `max_cost_per_file_usd` aborts mid-scan rather than overrun.
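The downgrade rule in the fourth invariant reduces to a single predicate over the per-finding statuses. A sketch under stated assumptions: the function name `apply_engine_guard` and the `dast_downgrade` label are hypothetical, while `dast_keep_l1` and the status names come from the text above.

```python
REFUTED = {"BLOCKED", "UNREACHED"}


def apply_engine_guard(l1_verdict: str, finding_statuses: list[str]) -> tuple[str, str]:
    """Return (final_verdict, decision_label).

    A malicious verdict is lowered to suspicious only when every L1 finding
    was refuted in the sandbox. An empty list never justifies a downgrade.
    """
    all_refuted = bool(finding_statuses) and all(s in REFUTED for s in finding_statuses)
    if l1_verdict == "malicious" and all_refuted:
        return "suspicious", "dast_downgrade"
    return l1_verdict, "dast_keep_l1"


print(apply_engine_guard("malicious", ["BLOCKED", "UNREACHED"]))
print(apply_engine_guard("malicious", ["BLOCKED", "NOT_TESTED"]))
print(apply_engine_guard("malicious", []))
```

The explicit emptiness check matters because `all()` over an empty list is `True` in Python, which would otherwise let a scan with no tested findings lower the verdict.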
Documentation
| Topic | Page |
|---|---|
| Install + first scan | docs/install.md |
| API key sourcing | docs/api-keys.md |
| Cascade architecture | docs/architecture.md |
| Cost guide + budget knobs | docs/cost-guide.md |
| DAST sandbox setup (Fly.io) | docs/dast-setup.md |
| Contributing | CONTRIBUTING.md |
| Security disclosures | SECURITY.md |
Development
uv sync --extra dev
uv run pytest tests/unit -v # always; no API spend
uv run ruff check . && uv run ruff format .
uv run mypy --strict .
uv run pytest tests/integration -v # before runner / engine changes; spends API
Argus is Python 3.12+, mypy --strict, ruff for lint and format. Pydantic v2 for cross-boundary structures. structlog for logging — never print() outside the CLI. Tests are split into unit (mocked, mandatory) and integration (live API, optional + manual). CI runs unit + lint + types on every PR; integration tests stay local because nobody wants surprise API bills on their fork.
See CONTRIBUTING.md for the full PR process.
License
Apache 2.0. Copyright © 2026 David Shochat and contributors.