Lightweight Python runner that interdicts suspicious startup behavior.
pydepgate
A lightweight Python runner that interdicts suspicious startup behavior.
pydepgate inspects Python packages and environments for code that executes silently at interpreter startup. This is the attack class used in the March 2026 LiteLLM supply-chain compromise and catalogued as MITRE ATT&CK T1546.018.
Status
Static analysis is functional end-to-end.
pydepgate can statically analyze wheels, sdists, installed packages, and single loose files for the patterns used in real-world Python supply-chain attacks. The detection covers payload encoding, dynamic code execution, string obfuscation, suspicious stdlib usage, and a broad code-density layer that catches obfuscation, Unicode trickery, and machine-generated identifier patterns.
What works today:
- Static analysis of .whl files, sdists (.tar.gz/.tgz/etc.), installed packages by name, and individual loose files via --single.
- Five production analyzers: encoding_abuse, dynamic_execution, string_ops, suspicious_stdlib, and code_density.
- Defense in depth on real attack shapes. The LiteLLM 1.82.8 .pth payload, for example, fires across four analyzers simultaneously (ENC001, DYN002, DENS010, DENS011) so an attacker has to evade every layer to get past the scanner.
- A rules engine that promotes severity based on file kind and signal context, fully data-driven via TOML or JSON. The default rule set includes 32 rules dedicated to density-layer signals alone.
- A safe partial evaluator that resolves obfuscated string expressions without executing user code.
- An SSH-randomart-style finding-distribution map rendered inline with human-readable scan output, showing where in a file the findings cluster and at what severity.
- Command-line interface with scan (including --single for iteration on individual files) and explain subcommands, environment variable support, configurable severity thresholds, and CI-friendly output modes.
- Three output formats: human-readable terminal, JSON, and a stub for SARIF (planned for v0.5).
What is in active development:
- The comment_analysis analyzer.
- Runtime interdiction (exec mode).
- Environment auditing (preflight mode).
Available on PyPI as pydepgate.
The problem
Python's interpreter runs several kinds of code automatically at startup, before any user script executes:
- .pth files in site-packages/. Any line beginning with import is passed to exec() by site.py during interpreter initialization.
- sitecustomize.py and usercustomize.py. Imported automatically if present.
- __init__.py top-level code in any imported package.
- setup.py. Executed during pip install for source distributions.
- Console-script entry points. Generated and executed by pip install.
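To make the .pth vector concrete, here is a minimal sketch (not pydepgate's own code) that lists the lines in a directory's .pth files that site.py would execute. The function name executable_pth_lines and the throwaway demo directory are illustrative assumptions; the only behavior relied on is that site.py executes lines starting with import followed by a space or tab.

```python
import tempfile
from pathlib import Path


def executable_pth_lines(site_dir: Path) -> list[tuple[str, int, str]]:
    """Return (filename, line number, text) for .pth lines that site.py would exec."""
    hits = []
    for pth in sorted(site_dir.glob("*.pth")):
        text = pth.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            # site.py executes lines that start with "import" followed by a
            # space or tab; other non-comment lines are treated as paths.
            if line.startswith(("import ", "import\t")):
                hits.append((pth.name, lineno, line))
    return hits


# Build a throwaway directory so the example never touches a real environment.
_demo_dir = Path(tempfile.mkdtemp())
(_demo_dir / "paths.pth").write_text("/opt/extra/lib\n# a comment\n")
(_demo_dir / "hook.pth").write_text("import os; os.environ.get('HOME')\n")

demo_hits = executable_pth_lines(_demo_dir)
print(demo_hits)
```

Pointing the same function at a real site-packages directory (for example, one returned by site.getsitepackages()) only reads the files; nothing in them is executed.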
Each of these is a legitimate Python feature. Each has been used in
real-world supply-chain attacks. Existing Python security tooling
(pip-audit, safety, bandit) does not inspect these startup vectors.
The .pth vector in particular has been acknowledged as a security gap
in CPython issue #113659
but has no patch.
Installation
pip install pydepgate
Requires Python 3.11 or later. No third-party runtime dependencies.
Usage
Scan a wheel:
pydepgate scan some-package-1.0.0-py3-none-any.whl
Scan a source distribution:
pydepgate scan some-package-1.0.0.tar.gz
Scan an installed package by name:
pydepgate scan litellm
Scan a single loose file (useful for iterating on test fixtures, ad-hoc inspection of a suspicious file, or reproducing a finding without restructuring the file into a package):
pydepgate scan --single suspicious_module.py
pydepgate scan --single fixture.pth
pydepgate scan --single garbage.py --as init_py
--single bypasses wheel/sdist/installed-package dispatch and analyzes
the file directly. The file kind is auto-detected from the filename:
.pth files are treated as pth; files named setup.py,
__init__.py, sitecustomize.py, or usercustomize.py are classified
as their natural kind; anything else defaults to setup_py (the most
permissive context, ideal for surfacing every signal at realistic
attack-shape severity). Override with --as:
setup_py / init_py / pth / sitecustomize / usercustomize.
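The auto-detection described above can be sketched as a small lookup. This is an illustration of the documented behavior only; the function name detect_file_kind and its override parameter are hypothetical and are not pydepgate's API.

```python
from pathlib import PurePath
from typing import Optional

# File kinds accepted by --as, as listed above.
VALID_KINDS = {"setup_py", "init_py", "pth", "sitecustomize", "usercustomize"}

# Filenames that are classified as their natural kind.
_NAMED_KINDS = {
    "setup.py": "setup_py",
    "__init__.py": "init_py",
    "sitecustomize.py": "sitecustomize",
    "usercustomize.py": "usercustomize",
}


def detect_file_kind(filename: str, override: Optional[str] = None) -> str:
    """Classify a loose file by name; an explicit override (like --as) wins."""
    if override is not None:
        if override not in VALID_KINDS:
            raise ValueError(f"unknown file kind: {override!r}")
        return override
    name = PurePath(filename).name
    if name.endswith(".pth"):
        return "pth"
    # Anything unrecognized falls back to setup_py, the most permissive context.
    return _NAMED_KINDS.get(name, "setup_py")


print(detect_file_kind("fixture.pth"))
print(detect_file_kind("garbage.py", override="init_py"))
```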
Explain what a signal means and what triggers it:
pydepgate explain STDLIB001
pydepgate explain DENS010
pydepgate explain --rule litellm-pth-stdlib
pydepgate explain --list
In CI, use --ci for compact JSON output and proper exit codes:
pydepgate --ci scan some-package.whl
Filter findings by severity:
pydepgate scan some-package.whl --min-severity high
Apply a custom rules file:
pydepgate scan some-package.whl --rules-file company-rules.gate
Scan an entire library archive. Using --min-severity is recommended here, as this mode is noisy by design:
pydepgate scan --deep somefile.whl
Exit codes
- 0: Clean. No findings (or no findings above --min-severity).
- 1: Findings present, but none HIGH or CRITICAL.
- 2: At least one HIGH or CRITICAL finding.
- 3: Tool error. pydepgate could not complete the scan.
These are stable as part of the v0.1+ contract.
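A CI wrapper might interpret these codes along the following lines. This is a sketch under assumptions: the ScanExit enum and the should_fail_build policy helper are hypothetical names, and the policy itself (always fail on 2 and 3, optionally on 1) is one reasonable choice rather than something pydepgate prescribes.

```python
from enum import IntEnum


class ScanExit(IntEnum):
    """Exit codes as documented above."""
    CLEAN = 0
    FINDINGS = 1
    HIGH_OR_CRITICAL = 2
    TOOL_ERROR = 3


def should_fail_build(returncode: int, fail_on_any_finding: bool = False) -> bool:
    """Hypothetical CI policy: decide whether a pipeline step should fail."""
    code = ScanExit(returncode)  # raises ValueError for undocumented codes
    if code in (ScanExit.HIGH_OR_CRITICAL, ScanExit.TOOL_ERROR):
        return True
    return fail_on_any_finding and code is ScanExit.FINDINGS


# In a pipeline the return code would come from something like
# subprocess.run(["pydepgate", "--ci", "scan", "some-package.whl"]).returncode
print(should_fail_build(1), should_fail_build(2))
```

Treating exit code 3 as a failure keeps a broken scan from being mistaken for a clean one.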
Environment variables
All flags can be set via environment variables. Explicit flags override environment values.
| Variable | Equivalent flag |
|---|---|
| PYDEPGATE_CI | --ci |
| PYDEPGATE_FORMAT | --format |
| PYDEPGATE_NO_COLOR (or NO_COLOR) | --no-color |
| PYDEPGATE_MIN_SEVERITY | --min-severity |
| PYDEPGATE_STRICT_EXIT | --strict-exit |
| PYDEPGATE_RULES_FILE | --rules-file |
What pydepgate detects
The current analyzer set covers five major classes of suspicious behavior in startup vectors:
Encoding abuse (ENC001). Patterns where encoded content is decoded and executed in a single chain, e.g. exec(base64.b64decode(payload)). Catches base64, hex, codec-based, zlib, bz2, lzma, and gzip variants.
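The decode-then-execute shape can be illustrated with a short AST walk. This is a simplified sketch, not the encoding_abuse analyzer itself: the DECODERS set is an assumed subset, and find_decode_then_exec is a hypothetical name. The source is only parsed, never run.

```python
import ast
from typing import Optional

EXEC_NAMES = {"exec", "eval"}
# Assumed subset of decoder calls; the real analyzer covers more variants.
DECODERS = {"b64decode", "unhexlify", "decompress", "fromhex"}


def _call_name(node: ast.Call) -> Optional[str]:
    """Return the called name for foo(...) or module.foo(...) shapes."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def _contains_decoder(node: ast.AST) -> bool:
    """True if a decoder call appears anywhere inside this expression."""
    return any(
        isinstance(inner, ast.Call) and _call_name(inner) in DECODERS
        for inner in ast.walk(node)
    )


def find_decode_then_exec(source: str) -> list[int]:
    """Line numbers where exec/eval wraps a decoder call in a single chain."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and _call_name(node) in EXEC_NAMES
        and any(_contains_decoder(arg) for arg in node.args)
    ]


print(find_decode_then_exec("import base64; exec(base64.b64decode('cHJpbnQoMSkK'))"))
```

Matching on the attribute name alone means an alias such as e = exec would slip past this sketch, which is the kind of gap the dynamic-execution signals are described as covering.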
Dynamic execution (DYN001-007). Direct calls to exec, eval, compile, or __import__; access to exec primitives via getattr, globals(), locals(), vars(), or __builtins__ subscripts; compile-then-exec across the file; and aliased call shapes that catch e = exec; e(...) evasions.
String obfuscation (STR001-004). Obfuscated string expressions that resolve to the names of exec primitives, dangerous stdlib functions, or sensitive module names. Uses a safe partial evaluator that statically computes what string an expression would produce, without executing user code. Catches:
- Concatenation: 'ev' + 'al'
- Character codes: chr(101) + chr(118) + chr(97) + chr(108)
- Slicing: 'lave'[::-1]
- str.join of literal pieces: ''.join(['e','v','a','l'])
- bytes.fromhex('6576616c').decode()
- f-string assembly with literal interpolation
- Single-assignment variables containing obfuscated values
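A partial evaluator of this kind can be sketched in a few dozen lines. The version below handles only four of the shapes listed above (concatenation, chr with a literal argument, full reversal, and join over literal pieces) and counts the operations used. The Resolution class and the resolve and resolve_source functions are illustrative names, not pydepgate's internals. The input is parsed with ast and never executed.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    value: str       # the string the expression would evaluate to
    operations: int  # number of assembly steps needed to produce it


def _is_minus_one(node: Optional[ast.expr]) -> bool:
    # In the AST, -1 is UnaryOp(USub, Constant(1)), not Constant(-1).
    return (isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub)
            and isinstance(node.operand, ast.Constant) and node.operand.value == 1)


def resolve(node: ast.AST) -> Optional[Resolution]:
    """Compute a string expression from its AST alone; None means not resolvable."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return Resolution(node.value, 0)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):  # 'ev' + 'al'
        left, right = resolve(node.left), resolve(node.right)
        if left is None or right is None:
            return None
        return Resolution(left.value + right.value, left.operations + right.operations + 1)
    if isinstance(node, ast.Subscript) and isinstance(node.slice, ast.Slice):  # 'lave'[::-1]
        sl, base = node.slice, resolve(node.value)
        if base is not None and sl.lower is None and sl.upper is None and _is_minus_one(sl.step):
            return Resolution(base.value[::-1], base.operations + 1)
        return None
    if isinstance(node, ast.Call) and not node.keywords and len(node.args) == 1:
        func, arg = node.func, node.args[0]
        if isinstance(func, ast.Name) and func.id == "chr":  # chr(101)
            if isinstance(arg, ast.Constant) and type(arg.value) is int and 0 <= arg.value <= 0x10FFFF:
                return Resolution(chr(arg.value), 1)
            return None
        if isinstance(func, ast.Attribute) and func.attr == "join":  # ''.join([...])
            sep = resolve(func.value)
            if sep is None or not isinstance(arg, (ast.List, ast.Tuple)):
                return None
            parts = [resolve(element) for element in arg.elts]
            if any(part is None for part in parts):
                return None
            ops = sep.operations + sum(part.operations for part in parts) + 1
            return Resolution(sep.value.join(part.value for part in parts), ops)
    return None


def resolve_source(expression: str) -> Optional[Resolution]:
    """Parse (never execute) a source expression and try to resolve it."""
    return resolve(ast.parse(expression, mode="eval").body)


print(resolve_source("chr(101) + chr(118) + chr(97) + chr(108)"))
```

Anything the sketch does not model, such as a function call with side effects, resolves to None rather than being guessed at.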
Suspicious stdlib usage (STDLIB001-003). Calls to stdlib functions that are highly unusual in startup vectors:
- STDLIB001: process spawn (os.system, subprocess.Popen, subprocess.run, os.exec*, etc.)
- STDLIB002: network operations (urllib.request.urlopen, socket.socket, http.client, etc.)
- STDLIB003: native code loading (ctypes.CDLL, ctypes.WinDLL, etc.)
Confidence is HIGH by default. The rules engine promotes these to CRITICAL when they appear in setup.py or in a .pth file (where they have no legitimate business existing). This is the rule that fires on LiteLLM 1.82.8.
The "harder they hide it the stronger the signal" model is realized through operation counting: an expression that required many obfuscation operations to assemble a sensitive name is treated as more confidently malicious than one that required few.
Code density (DENS001-051). A broad layer that catches the things obfuscated code looks like even when no single primitive call is suspicious on its own. Thirteen distinct signals across six sublayers:
Lexical (line-shape):
- DENS001: single-line token compression (minification or bundler-mimicry shapes)
- DENS002: semicolon chaining of multiple statements on one line
String content:
- DENS010: high-entropy string literals (Shannon entropy consistent with base64, compressed, or encrypted content)
- DENS011: literals using only base64-alphabet characters, even without an accompanying decode call
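For reference, the two string-content measurements can be sketched as follows. The formula is standard Shannon entropy over character frequencies; the min_length parameter and the exact alphabet (standard base64, without the URL-safe variant) are assumptions, since the analyzer's real thresholds are not documented here.

```python
import math
import string
from collections import Counter

# Standard base64 alphabet plus padding; the URL-safe variant is not included.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, from observed character frequencies."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_base64_alphabet_only(text: str, min_length: int = 16) -> bool:
    """True for long-enough literals drawn purely from the base64 alphabet."""
    return len(text) >= min_length and set(text) <= BASE64_ALPHABET


print(shannon_entropy("abcd"))
print(is_base64_alphabet_only("cHJpbnQoMSkK", min_length=8))
```

A repeated character scores 0 bits and four distinct characters score 2 bits, so natural-language strings generally land well below densely encoded content of similar length.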
Identifier shape:
- DENS020: low-vowel-ratio identifiers (machine-generated or deliberately mangled names like _xkjwbq)
- DENS021: confusable single-character identifiers (l, O, I)
Unicode:
- DENS030: invisible Unicode characters in source (zero-width spaces, RTL overrides; the Trojan Source class catalogued as CVE-2021-42574)
- DENS031: Unicode homoglyphs in identifiers (Cyrillic and Greek lookalikes used to evade string-match scanners)
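A rough approximation of these two checks is possible with the standard unicodedata module. This sketch treats every character in the Unicode "Cf" (format) category as invisible and reports every non-ASCII character in an identifier, which is coarser than true homoglyph detection; both function names are hypothetical.

```python
import unicodedata


def invisible_characters(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for format-category characters."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            # Category "Cf" covers zero-width spaces and bidi overrides.
            if unicodedata.category(ch) == "Cf":
                hits.append((lineno, col, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits


def non_ascii_identifier_chars(identifier: str) -> list[str]:
    """Names of non-ASCII characters in an identifier (a coarse homoglyph hint)."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in identifier if ord(ch) > 127]


print(invisible_characters("x = 1\u200b\ny = 2"))
print(non_ascii_identifier_chars("p\u0430yload"))
```

Because the second check flags all non-ASCII characters, it would also fire on intentional non-Latin naming, so it is best read as a contributing hint rather than a verdict.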
Structural:
- DENS040: AST depth disproportionate to line count (compression hidden inside expression trees)
- DENS041: deeply nested lambdas or comprehensions (functional-style obfuscation)
- DENS042: large byte-range integer arrays (122-element lists of 0-255 ints, the shellcode-staging shape)
Docstring:
- DENS050: high-entropy docstrings (the docstring-as-payload smuggling pattern)
- DENS051: dynamic __doc__ reference passed to a callable (the runtime decode-and-execute half of the smuggling pattern)
The default rule set ships 32 rules covering these signals across five file kinds, calibrated so that the same content scans differently depending on where it lives. A high-entropy base64 literal in .pth is CRITICAL (no benign use case); the same literal in __init__.py is MEDIUM (some packages legitimately ship encoded blobs); the same literal anywhere else is LOW (UUIDs and hashes happen). DENS021 is universally INFO because PEP-8-style confusables aren't a security finding by themselves; they only matter as a contributing signal when other signals fire.
Layered detection in practice
The LiteLLM 1.82.8 .pth payload is a single line:
import base64; exec(base64.b64decode('cHJpbnQoMSkK'))
A scanner that grepped for exec would catch it. A scanner that grepped for base64.b64decode would catch it. But an attacker who knew about either of those checks could trivially evade both. pydepgate fires five separate findings on this line from four independent analyzers:
- ENC001 (encoding_abuse): decode-then-execute pattern
- DYN002 (dynamic_execution): exec() with non-literal argument at module scope
- DENS001 (code_density): token-dense single line
- DENS010 (code_density): high-entropy string literal
- DENS011 (code_density): base64-alphabet string literal
Plus the rule layer promotes all of them to CRITICAL because the file is a .pth. To evade pydepgate, an attacker has to defeat every analyzer simultaneously while still producing a working .pth payload. Each evasion narrows what's possible; the intersection of all evasions is the empty set for any shape that could realistically execute on Python startup.
The rules engine
Analyzers emit raw signals. The rules engine maps signals to severity-rated findings using a data-driven rule set. Default rules ship in JSON; users can override or augment them with a pydepgate.gate file (TOML or JSON, auto-detected) in the project root, the venv root, or specified via --rules-file.
A rule is a small structured object:
{
"id": "litellm-pth-stdlib",
"match": {
"signal_id": "STDLIB001",
"file_kind": "pth"
},
"actions": [
{"type": "set_severity", "severity": "critical"}
]
}
Three actions are supported: set_severity, suppress, and set_description. User rules always take precedence over default rules, regardless of specificity. Suppressed findings are tracked separately so users can see what would have fired and why it didn't.
Run pydepgate explain --list to see all default rules and signals, with descriptions of what they catch and how rules promote them.
Writing rules
Rules live in pydepgate.gate files. The format is either TOML or
JSON; pydepgate auto-detects from content. A rule has three parts:
identity (an id), a match (which signals it applies to), and an
action (what to do when matched).
Discovery
When you run pydepgate scan, rules are loaded from the first match
of:
- The --rules-file CLI flag, if given.
- The PYDEPGATE_RULES_FILE environment variable.
- ./pydepgate.gate in the current directory.
- <venv>/pydepgate.gate in the active virtualenv, if any.

If multiple files exist, only the first is loaded. The others are listed in the scan summary so you can see what was skipped.
Minimal rule (TOML)
[[rule]]
id = "my-package-uses-large-base64"
signal_id = "DENS010"
path_glob = "my_package/embedded/*.py"
action = "suppress"
explain = "We legitimately ship a 200KB embedded model in this dir."
The id is yours. signal_id is what to match (see
pydepgate explain --list for the catalogue). path_glob is an
fnmatch-style pattern matched against the internal path of the file.
action is one of set_severity, suppress, or set_description.
explain is optional but encouraged: it shows up in
pydepgate explain --rule USER_my-package-uses-large-base64.
Match conditions
All non-empty match fields must be satisfied for a rule to apply. The supported fields:
| Field | Matches against |
|---|---|
| signal_id | Signal.signal_id (e.g. "DENS010") |
| analyzer | Signal.analyzer (e.g. "code_density") |
| file_kind | The triage decision: pth, setup_py, init_py, sitecustomize, library_py, etc. |
| scope | Signal.scope: module, function, class |
| path_glob | fnmatch pattern against the file's internal path |
| context_contains | Dict of {key: value} pairs that must appear in Signal.context with strict equality |
| context_predicates | Dict of {key: {operator: value}} pairs evaluated against Signal.context (richer than context_contains, see below) |
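The "all non-empty fields must be satisfied" rule amounts to an AND over whichever fields are set. The sketch below models that for the simple fields; the Signal and Match dataclasses are hypothetical stand-ins for illustration, not pydepgate's real data model, and context_predicates is left out.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class Signal:  # hypothetical stand-in for the real Signal object
    signal_id: str
    analyzer: str
    scope: str
    context: dict = field(default_factory=dict)


@dataclass
class Match:  # unset (None or empty) fields place no constraint
    signal_id: Optional[str] = None
    analyzer: Optional[str] = None
    file_kind: Optional[str] = None
    scope: Optional[str] = None
    path_glob: Optional[str] = None
    context_contains: dict = field(default_factory=dict)


def rule_matches(match: Match, signal: Signal, file_kind: str, internal_path: str) -> bool:
    """True only if every non-empty match field is satisfied."""
    checks = [
        match.signal_id is None or match.signal_id == signal.signal_id,
        match.analyzer is None or match.analyzer == signal.analyzer,
        match.file_kind is None or match.file_kind == file_kind,
        match.scope is None or match.scope == signal.scope,
        match.path_glob is None or fnmatchcase(internal_path, match.path_glob),
        all(
            key in signal.context and signal.context[key] == value
            for key, value in match.context_contains.items()
        ),
    ]
    return all(checks)


sig = Signal("DENS010", "code_density", "module", {"length": 20000})
print(rule_matches(Match(signal_id="DENS010", file_kind="pth"), sig, "pth", "evil.pth"))
```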
Context predicates
context_predicates extends context_contains with comparison
operators. Each predicate takes the form {field: {op: value}}. The
inner dict has exactly one operator key. Multiple predicates on
different fields are AND-ed.
# Block any base64-shaped string of 10KB or larger anywhere
[[rule]]
id = "block-large-base64"
signal_id = "DENS010"
context_predicates = { length = { gte = 10240 } }
action = "set_severity"
severity = "critical"
# Suppress confusable single-char identifiers in test files only
[[rule]]
id = "ignore-confusables-in-tests"
signal_id = "DENS021"
path_glob = "tests/**/*.py"
context_predicates = { identifier = { in = ["l", "O", "I"] } }
action = "suppress"
Available operators:
| Category | Operators | Value type |
|---|---|---|
| Numeric | eq, ne, gt, gte, lt, lte | int or float |
| String | eq, ne, contains, startswith, endswith | string |
| Collection | in, not_in | list, tuple, or set |
Type mismatches (e.g. gte against a string) cause the predicate to
silently fail to match rather than error. To AND multiple conditions
on the same field, write multiple rules.
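The operator table and the silent-mismatch behavior can be modeled with a small evaluator. This is a sketch of the semantics as described, with hypothetical function names; details such as treating a missing context field as a non-match are assumptions.

```python
import operator

NUMERIC_OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
}
STRING_OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "contains": lambda actual, expected: expected in actual,
    "startswith": str.startswith,
    "endswith": str.endswith,
}


def _is_number(value) -> bool:
    # bool is a subclass of int; exclude it so True is not treated as 1.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def evaluate_predicate(actual, op: str, expected) -> bool:
    """Apply one operator; type mismatches return False instead of raising."""
    if op in ("in", "not_in"):
        if not isinstance(expected, (list, tuple, set)):
            return False
        return (actual in expected) == (op == "in")
    if _is_number(actual) and _is_number(expected) and op in NUMERIC_OPS:
        return NUMERIC_OPS[op](actual, expected)
    if isinstance(actual, str) and isinstance(expected, str) and op in STRING_OPS:
        return STRING_OPS[op](actual, expected)
    return False


def predicates_match(context: dict, predicates: dict) -> bool:
    """AND together one-operator predicates on different context fields."""
    for field_name, spec in predicates.items():
        if field_name not in context or len(spec) != 1:
            return False
        op, expected = next(iter(spec.items()))
        if not evaluate_predicate(context[field_name], op, expected):
            return False
    return True


print(predicates_match({"length": 20000}, {"length": {"gte": 10240}}))
```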
Equivalent JSON
{
"_pydepgate_format": "json",
"_pydepgate_version": 1,
"rules": [
{
"id": "block-large-base64",
"signal_id": "DENS010",
"context_predicates": {"length": {"gte": 10240}},
"action": "set_severity",
"severity": "critical"
}
]
}
Actions
- set_severity: requires a severity field (info, low, medium, high, critical).
- suppress: drop the finding from the scan output. The suppression is still recorded; pydepgate scan -v shows what was suppressed and which rule did it.
- set_description: requires a description field; replaces the finding's text.
Precedence
When multiple rules match a signal, pydepgate picks one winner using this order:
- Source priority: user rules win over system rules, which win over defaults, regardless of specificity.
- Specificity: among rules of the same source, the rule with more match fields wins. Each context_predicates entry counts as one match field.
- Load order: among ties on source and specificity, the earlier rule wins.

This means a user [[rule]] with the same shape as a default rule always wins. If you want your rule to lose to a more-specific default, add fewer match fields than the rule you want to override.
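The three-level ordering reads naturally as a sort key. A minimal sketch, assuming a hypothetical LoadedRule record that carries the source, the number of match fields, and the load position:

```python
from dataclasses import dataclass

# Lower rank wins: user rules beat system rules, which beat defaults.
SOURCE_RANK = {"user": 0, "system": 1, "default": 2}


@dataclass
class LoadedRule:  # hypothetical record, for illustration only
    id: str
    source: str        # "user", "system", or "default"
    match_fields: int  # specificity; each context_predicates entry counts as one
    load_index: int    # position in load order


def pick_winner(matching: list[LoadedRule]) -> LoadedRule:
    """Choose one rule: source priority, then specificity, then load order."""
    return min(
        matching,
        key=lambda rule: (SOURCE_RANK[rule.source], -rule.match_fields, rule.load_index),
    )


rules = [
    LoadedRule("default-specific", "default", 3, 0),
    LoadedRule("default-broad", "default", 1, 1),
    LoadedRule("user-broad", "user", 1, 2),
]
print(pick_winner(rules).id)
```

Because the source rank is the first element of the key, specificity is only ever compared among rules from the same source.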
Validation
Rules are validated when loaded. Errors are accumulated and reported together; if any rule fails validation, the entire file is rejected (no rules loaded). Common errors:
- Unknown field name: Did you mean 'context_contains'?
- Unknown operator: Did you mean 'gte'?
- Multiple operators in one predicate: must be exactly one per field.
- Missing severity for set_severity action.

Run pydepgate scan --rules-file my.gate once after editing to confirm everything parses.
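Accumulating errors with typo suggestions can be sketched with difflib from the standard library. Whether pydepgate's loader uses difflib is an assumption; the field and operator lists below are taken from this document, and validate_rule is a hypothetical name.

```python
import difflib

KNOWN_FIELDS = [
    "id", "signal_id", "analyzer", "file_kind", "scope", "path_glob",
    "context_contains", "context_predicates",
    "action", "severity", "description", "explain",
]
KNOWN_OPERATORS = [
    "eq", "ne", "gt", "gte", "lt", "lte",
    "contains", "startswith", "endswith", "in", "not_in",
]


def _suggest(word: str, known: list[str]) -> str:
    """Return a ' Did you mean ...?' suffix when a close match exists."""
    close = difflib.get_close_matches(word, known, n=1)
    return f" Did you mean '{close[0]}'?" if close else ""


def validate_rule(rule: dict) -> list[str]:
    """Collect every problem in a rule instead of stopping at the first."""
    errors = []
    for name in rule:
        if name not in KNOWN_FIELDS:
            errors.append(f"unknown field '{name}'.{_suggest(name, KNOWN_FIELDS)}")
    for field_name, spec in rule.get("context_predicates", {}).items():
        if len(spec) != 1:
            errors.append(f"predicate on '{field_name}' must have exactly one operator")
        for op in spec:
            if op not in KNOWN_OPERATORS:
                errors.append(f"unknown operator '{op}'.{_suggest(op, KNOWN_OPERATORS)}")
    if rule.get("action") == "set_severity" and "severity" not in rule:
        errors.append("action 'set_severity' requires a 'severity' field")
    return errors


bad_rule = {"id": "r1", "signal_id": "DENS010", "context_contans": {}, "action": "set_severity"}
for message in validate_rule(bad_rule):
    print(message)
```

A loader following the all-or-nothing policy described above would reject the whole file whenever any rule returns a non-empty error list.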
Design constraints
- Zero runtime dependencies. Standard library only. This is a load-bearing design constraint, not a stylistic preference: every additional dependency is a supply-chain attack surface for a tool whose job is to defend against supply-chain attacks.
- Safe by construction. Parsers and the partial evaluator never execute, compile, or import input content. Every operation modeled by the resolver is reimplemented from scratch using only Python builtins on values the resolver itself produced.
- Self-integrity at bootstrap. Critical stdlib references are captured into locals before any untrusted code runs (relevant when the runtime engine ships in v0.4).
- Lightweight. The full test suite runs in roughly seven seconds on a modern laptop, including subprocess-based CLI tests against installed packages.
Relationship to PyDepGuard
pydepgate is a narrow, single-purpose tool focused on startup-vector interdiction. PyDepGuard is a broader Python security framework covering runtime sandboxing and dependency management. The startup-vector engine developed in pydepgate is intended to eventually integrate with PyDepGuard as a subsystem; until then, the two projects are developed independently.
Users who need only startup-vector protection should use pydepgate. Users who need the full runtime security model should use PyDepGuard directly.
Architecture
The codebase is organized as a layered pipeline:
parsers/            bytes -> structured representations (pth, pysource, wheel, sdist)
introspection/      installed package enumeration via importlib.metadata
traffic_control/    path-based triage; decides what to analyze
analyzers/          structured representations -> raw signals
    _resolver.py        safe partial evaluator (shared infrastructure)
    _visitor.py         scope tracking and AST utilities (shared)
    encoding_abuse      ENC001
    dynamic_execution   DYN001-007
    string_ops          STR001-004
    suspicious_stdlib   STDLIB001-003
    density_analyzer    DENS001-051
rules/              signals + context -> severity-rated findings
    base.py             rule data model and matching logic
    defaults.py         default rule set (90+ rules across all signals)
    loader.py           TOML/JSON parser with validation and typo suggestions
    explanations.py     structured explain-output content
engines/            orchestration (currently: static)
visualizers/        inline rendering helpers for the human reporter
    density_map         SSH-randomart-style finding-distribution renderer
cli/                argparse, dispatch, reporters, explain subcommand
Analyzers do not see raw bytes. They walk parsed representations and emit Signal objects. The rules engine wraps signals with severity to produce Finding objects, applying user and default rules in priority order. The CLI renders findings in human, JSON, or SARIF format.
The _resolver.py module is reusable infrastructure for any analyzer that needs to know what an expression evaluates to. It returns structured ResolutionResult objects with success/failure status, operation counts, partial values, and resolved fragment lists.
The static engine exposes three entry points for single-file analysis. scan_file(path) reads bytes and routes through triage by filename. scan_bytes(content, internal_path, ...) is the per-file workhorse that artifact enumerators (wheel, sdist, installed) call once per in-scope file. scan_loose_file_as(path, file_kind) bypasses triage entirely and forces a file kind, preserving the real path through to finding contexts; this is the entry point used by pydepgate scan --single.
Development
git clone https://github.com/nuclear-treestump/pydep-vector-runner
cd pydep-vector-runner
pip install -e .
python -m unittest discover tests -v
The test suite has grown to approximately 500 tests as the analyzer set has expanded. Tests are organized by module and include happy-path coverage, evasion batteries, false-positive batteries, robustness checks against adversarial inputs, integration tests against synthetic wheels and sdists, and CLI tests via subprocess.
To regenerate the binary .pth test fixtures after editing them:
python scripts/generate_fixtures.py
Safety notes
This project builds tooling to defend against Python supply-chain attacks. The test fixtures in tests/fixtures/ and the synthetic samples used in integration tests model the structural shape of known attacks (LiteLLM 1.82.8, Trojan Source CVE-2021-42574, others catalogued under T1546.018) but contain only inert payloads. No actual malicious code is present in this repository.
For regression testing against real malicious samples, use the OSSF malicious-packages, Datadog malicious-software-packages-dataset, or lxyeternal/pypi_malregistry datasets. Do so in disposable VMs or containers, and do not commit samples to this repository.
Known limitations
pydepgate's static analysis is honest about what it can and cannot catch. Documented gaps include:
Analysis gaps:
- Function return tracking. code = make_payload() where make_payload() internally calls compile(...) is not flagged.
- __builtins__ as a Name subscript (rather than via a function call).
- Tuple unpacking, augmented assignment, and conditional assignments in the resolver's variable tracking.
- Lambda scope precision (lambdas count as their enclosing scope).
- Aliased stdlib imports such as from subprocess import Popen as P. This is a gap for now; support is planned.
Density-layer caveats:
- DENS020 (low-vowel-ratio identifiers) and DENS040 (AST depth) both produce false positives on legitimate machine-generated code (Cython output, parser tables, generated configuration). They ship at LOW severity outside startup vectors so they surface as contributing signals rather than standalone alerts.
- DENS031 (homoglyphs) can fire on legitimate non-English variable names in non-Latin codebases. The default rule keeps it at HIGH rather than CRITICAL outside startup vectors so users with intentional non-Latin naming can suppress with a single user rule.
Author
Ikari (@0xIkari)
License
Apache 2.0. See LICENSE.