Extract and analyze printable strings from binary files for malware analysis and forensics
String Analyzer
String extraction for CTI and malware-analysis workflows: surface URLs, IPs, paths, registry keys, APIs, commands, encoded data, and analyst-ready prompts from binaries and memory artifacts.
CTI Use
Use String Analyzer when a sample or dump needs fast indicator discovery before reverse engineering or sandboxing. The output is designed to feed IOC review, infrastructure pivoting, YARA/Sigma ideas, and ATT&CK-mapped analyst notes.
Defender Outputs
| Output | Use |
|---|---|
| Categorized strings | IOC and behavior discovery |
| URLs / IPs / emails | Pivot and enrichment leads |
| Registry / paths / DLLs | Host behavior context |
| API names | Capability triage |
| Decoded candidates | Obfuscation review |
| AI-ready prompt | Structured analyst follow-up |
String Analyzer extracts and analyzes printable strings from binary files. It is designed for malware analysts, reverse engineers, and forensics investigators who need to quickly surface URLs, IPs, registry keys, API names, and other indicators from executables, memory dumps, or disk images—and optionally generate an AI-ready analysis prompt.
- Zero runtime dependencies (Python standard library only).
- Single entry point: one CLI with batch and interactive modes.
- Library-friendly API: use analyze_file() or lower-level functions in your own scripts.
📖 Practical guide (Medium) — step-by-step usage, workflows, and examples.
Table of contents
- Features
- Installation
- Quick start
- Usage
- Pattern categories
- Programmatic API
- Examples
- Configuration and limits
- Security and safety
- Development
- License
Features
| Feature | Description |
|---|---|
| String extraction | ASCII and UTF-16LE (Windows PE); configurable min length and max_bytes; chunked read for large files. |
| Entropy | Shannon entropy (chunked when max_bytes set); high entropy suggests packed/encrypted content. |
| Pattern detection | Strict IPv4 (0–255), IPv6 (full and abbreviated), URLs (http/https/ftp/file/ws/wss), obfuscated URLs (hxxp, etc.), emails, MAC addresses, registry keys, system paths, DLLs, 300+ Windows APIs, CMD/PowerShell, obfuscation patterns. |
| Embedded extraction | URLs, IPs, emails, MACs found inside long strings (not only whole-line matches). |
| Decoding | Base64 (standard and URL-safe) and hex; decoded candidates in report. |
| Suspicious keywords | Extended set: malware, miner, steal, persist, evasion, etc., plus .NET namespaces. |
| Sensitive mode | --sensitive: lower obfuscation thresholds and more keywords for stricter triage. |
| Output formats | Unfiltered dump, categorized report, or AI-ready markdown prompt. |
| CLI & API | Full CLI (--encoding, --sensitive, --no-embedded); programmatic analyze_file(); no global state. |
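The String extraction row can be pictured with a small standard-library sketch. This is not the package's implementation; the function name extract_printable_strings and the exact byte ranges are assumptions chosen for illustration. It scans a byte buffer for runs of printable ASCII and for the same characters in UTF-16LE form (each followed by a NUL byte).

```python
import re
from typing import Set


def extract_printable_strings(data: bytes, min_length: int = 4) -> Set[str]:
    """Return unique ASCII and UTF-16LE strings of at least min_length characters.

    Hypothetical helper for illustration; not the package's extract_strings().
    """
    found: Set[str] = set()

    # ASCII: runs of printable bytes (space through tilde).
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    for match in ascii_run.finditer(data):
        found.add(match.group().decode("ascii"))

    # UTF-16LE: a printable byte followed by a NUL byte, repeated.
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    for match in utf16_run.finditer(data):
        found.add(match.group().decode("utf-16-le"))

    return found


# A buffer with one ASCII string, one UTF-16LE string, and a run that is too short.
sample = (
    b"\x00\x01http://example.test\xff"
    + "kernel32.dll".encode("utf-16-le")
    + b"\x00\x02ab"
)
strings = extract_printable_strings(sample)
```

Raising min_length trades recall for less noise, which is the same trade-off the --min-length option exposes.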
Installation
Requirements: Python 3.8 or newer.
git clone https://github.com/anpa1200/String-Analyzer-.git && cd String-Analyzer-
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .
After installation you get the string-analyzer command. From the project root you can also run:
python -m string_analyzer
Development (optional): pip install -e ".[dev]" adds pytest and ruff for tests and linting.
Quick start
# Categorized report (default)
string-analyzer /path/to/binary -o report.txt
# All extracted strings, no categorization
string-analyzer /path/to/binary --unfiltered -o strings.txt
# AI-ready analysis prompt
string-analyzer /path/to/binary --ai-prompt -o prompt.md
# Interactive: prompt for file and output type
string-analyzer
Usage
Command-line options
| Option | Description |
|---|---|
| file | Path to the binary file. Omit to run interactive mode. |
| -o, --output PATH | Output file (default: <basename>_strings.txt). |
| --min-length N | Minimum string length to extract (default: 4). |
| --max-bytes N | Stop reading after N bytes (safety for very large files). |
| --unfiltered | Output all extracted strings, one per line (no categories). |
| --filtered | Output categorized report (default when not using --unfiltered or --ai-prompt). |
| --ai-prompt | Generate markdown prompt for an AI assistant. |
| --analyze-with {gemini,codex} | Send categorized prompt to gemini-cli or codex-cli and print the AI analysis. Saves the prompt to -o; use --ai-output to save the AI response. |
| --ai-output PATH | Save the AI response to this file (when using --analyze-with). |
| --encoding {ascii,utf16,both} | Extract ASCII only, UTF-16LE only, or both (default: both). |
| --sensitive | Lower obfuscation thresholds; more suspicious keywords. |
| --no-embedded | Do not extract URLs/IPs/emails from inside long strings. |
| -i, --interactive | Force interactive mode (prompt for file and options). |
| -q, --quiet | Suppress non-error messages. |
| -v, --verbose | Verbose logging. |
| --version | Show version. |
| --help | Show help. |
Output modes
- Unfiltered (--unfiltered): sorted list of all extracted strings. Use for grepping or feeding into other tools.
- Filtered (default): categorized report with entropy, plus sections such as URLS, IPS, WINDOWS_API_COMMANDS, DLLS, OBFUSCATED, etc.
- AI prompt (--ai-prompt): same categories in a markdown prompt asking an AI to analyze behavior and functionality (e.g. for malware triage).
External AI analysis (--analyze-with)
The --analyze-with option sends the categorized string report directly to an AI CLI so you get an analysis in one command instead of copying a prompt by hand.
- What it does: After extracting and categorizing strings (URLs, IPs, APIs, DLLs, obfuscation, etc.), the tool builds the same markdown prompt used by --ai-prompt, writes it to the path given by -o (so you can keep or reuse it), then pipes that prompt into the chosen CLI. The AI's reply is printed to the terminal; you can save it with --ai-output PATH.
- Values:
  - gemini: uses gemini-cli (looks for gemini or gemini-cli on your PATH).
  - codex: uses Codex CLI (codex exec - with the prompt on stdin).
- Requirements: You must have one of these installed and on your PATH: Gemini CLI (e.g. npm i -g @google/generative-ai-cli) or Codex CLI. The tool does not call cloud APIs itself; it only invokes the local CLI, which handles authentication and the model.
- Example: string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
  This saves the prompt to prompt.txt, sends it to Gemini, and writes the AI's analysis to analysis.md.
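The lookup-and-pipe behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: resolve_ai_command and run_ai are hypothetical names, and the sketch assumes the Gemini executable accepts the prompt on stdin with no extra arguments.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def resolve_ai_command(
    provider: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the argv for the chosen provider, or None if no CLI is on PATH."""
    if provider == "gemini":
        # Try both executable names mentioned in the description.
        for name in ("gemini", "gemini-cli"):
            path = which(name)
            if path:
                return [path]
        return None
    if provider == "codex":
        path = which("codex")
        # "exec -" tells the Codex CLI to read the prompt from stdin.
        return [path, "exec", "-"] if path else None
    raise ValueError(f"unknown provider: {provider!r}")


def run_ai(prompt: str, command: List[str]) -> str:
    """Pipe the prompt to the CLI on stdin and return whatever it prints."""
    completed = subprocess.run(
        command, input=prompt, capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Passing which as a parameter keeps the lookup testable without a real CLI installed; a caller would normally rely on the shutil.which default.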
Interactive mode
Run string-analyzer with no file argument (or use string-analyzer -i). The tool will:
- Ask for the file path.
- Ask whether to output all strings (unfiltered) or a filtered report.
- If filtered: ask whether to generate an AI prompt or a normal report.
- Ask for the output file path (with a default suggestion).
Interactive mode limits input to 50 MB by default to avoid accidental resource use.
Pattern categories
Strings are classified into the following categories (empty categories are omitted from output):
| Category | Description |
|---|---|
| WINDOWS_API_COMMANDS | Known Windows API function names (300+). |
| DLLS | Strings matching typical DLL names (e.g. *.dll). |
| URLS | HTTP/HTTPS and similar URLs. |
| IPS | IPv4 addresses. |
| IPV6 | IPv6 addresses. |
| EMAILS | Email-like strings. |
| WINDOWS_REGISTRY_KEYS | Registry path patterns. |
| POWERSHELL_COMMANDS | PowerShell cmdlets/commands. |
| CMD_COMMANDS | CMD shell commands. |
| FILES | File path / filename patterns. |
| SYSTEM_PATHS | System directory paths. |
| OBFUSCATED | Patterns suggesting obfuscation (e.g. h[.]xxp, dotted IPs). |
| DECODED_BASE64 | Strings that successfully decode from Base64 to printable text. |
| DECODED_HEX | Strings that successfully decode from hex to printable text. |
| SUSPICIOUS_KEYWORDS | Substrings associated with malware (e.g. key terms). |
| SUSPICIOUS_DOTNET | .NET-related suspicious namespaces/keywords. |
| MAC_ADDRESSES | MAC addresses (e.g. 00:1A:2B:3C:4D:5E). |
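To make the IPS and URLS rows concrete, here is a minimal sketch of strict IPv4 matching (each octet limited to 0–255) and of pulling indicators out of a longer string. The regular expressions and the find_embedded name are assumptions for illustration; the package's actual patterns may differ.

```python
import re
from typing import Dict, List

# One octet in the range 0-255, without accepting values such as 256 or 999.
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Lookarounds stop a match from starting or ending inside a longer dotted
# number, so "1.2.3.4.5" and "999.1.1.1" are rejected. The trade-off is that
# an address directly followed by a sentence-ending period is also skipped.
IPV4_RE = re.compile(rf"(?<![0-9.]){_OCTET}(?:\.{_OCTET}){{3}}(?![0-9.])")

# Schemes taken from the feature list; the character class is a simplification.
URL_RE = re.compile(r"(?:https?|ftp|wss?)://[^\s\"'<>]+", re.IGNORECASE)


def find_embedded(text: str) -> Dict[str, List[str]]:
    """Return IPv4 addresses and URLs found anywhere inside text."""
    return {"IPS": IPV4_RE.findall(text), "URLS": URL_RE.findall(text)}


line = "connect 10.0.0.1:8080 then GET http://evil.example/a.php?id=1 ver 999.1.1.1"
hits = find_embedded(line)
```

Searching within the string rather than requiring a whole-line match is what lets indicators surface from format strings and log templates.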
The tool also computes file entropy. Combined with a low count of “useful” patterns (APIs, DLLs, CMD/PowerShell), high entropy can indicate a packed or obfuscated binary; this is noted in the report and in the AI prompt.
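For reference, Shannon entropy over bytes can be computed as in the sketch below. The function name shannon_entropy_bytes is hypothetical and this is not the package's compute_file_entropy(); it only shows the standard formula, which ranges from 0 (a single repeated byte) to 8 bits per byte (all 256 values equally likely).

```python
import math
from collections import Counter


def shannon_entropy_bytes(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps each byte value to how often it occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


low = shannon_entropy_bytes(b"A" * 1024)            # one symbol only
high = shannon_entropy_bytes(bytes(range(256)) * 4)  # uniform over all bytes
```

Compressed or encrypted sections tend to sit near the upper end of that range, which is why a high value paired with few recognizable APIs or DLLs is treated as a packing hint.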
Programmatic API
Use the package in your own Python code:
from string_analyzer import (
analyze_file,
extract_strings,
detect_patterns,
compute_file_entropy,
generate_normal_output,
generate_ai_prompt,
shannon_entropy,
)
from string_analyzer.analyzer import (
is_likely_obfuscated,
is_mostly_printable,
try_base64_decode,
try_hex_decode,
)
One-shot analysis
result = analyze_file(
"/path/to/binary",
min_length=4,
max_bytes=None,
encoding="both", # "ascii", "utf16", or "both"
extract_embedded=True, # find URLs/IPs inside long strings
sensitive=False, # True: lower obfuscation thresholds
)
# result["file"], result["entropy"], result["strings"], result["patterns"], result["obfuscated"]
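Because the function reference describes strings and pattern values as sets, the result cannot be passed to json.dumps directly. A small conversion step, sketched here with a hypothetical result_to_json helper and a stand-in dictionary shaped like the keys listed above, makes it exportable.

```python
import json
from typing import Any


def result_to_json(result: dict) -> str:
    """Serialize an analysis-style dict, turning sets into sorted lists."""

    def convert(value: Any) -> Any:
        if isinstance(value, (set, frozenset)):
            return sorted(value)
        if isinstance(value, dict):
            return {key: convert(item) for key, item in value.items()}
        return value

    # default=str covers values such as pathlib.Path, should the file key hold one.
    return json.dumps(convert(result), indent=2, sort_keys=True, default=str)


# Stand-in data for illustration; real values would come from analyze_file().
example = {
    "file": "sample.bin",
    "entropy": 6.2,
    "strings": {"b", "a"},
    "patterns": {"URLS": {"http://example.test"}},
    "obfuscated": False,
}
exported = result_to_json(example)
```

Sorting the sets keeps the output stable between runs, which helps when diffing reports for two samples.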
Step-by-step
from pathlib import Path
path = Path("sample.bin")
entropy = compute_file_entropy(path)
strings = extract_strings(path, min_length=4, max_bytes=10_000_000)
patterns = detect_patterns(strings) # New dict every time; no global state
obfuscated = is_likely_obfuscated(patterns, entropy)
report = generate_normal_output(patterns, entropy, obfuscated)
# Or: prompt_text = generate_ai_prompt(patterns, entropy, obfuscated)
Function reference
| Function | Description |
|---|---|
| analyze_file(path, min_length=4, max_bytes=None) | Full analysis; returns dict with file, entropy, strings, patterns, obfuscated. Also accepts encoding, extract_embedded, and sensitive (see One-shot analysis). |
| extract_strings(path, min_length=4, max_bytes=None) | Extract unique printable strings; returns set[str]. |
| compute_file_entropy(path) | Shannon entropy of file bytes. |
| shannon_entropy(s) | Shannon entropy of a string. |
| detect_patterns(strings) | Categorize strings; returns new dict[str, set[str]]. |
| is_likely_obfuscated(patterns, file_entropy) | Heuristic: few “useful” patterns and entropy > threshold. |
| generate_normal_output(patterns, entropy, obfuscated) | Formatted filtered report text. |
| generate_ai_prompt(patterns, entropy, obfuscated) | Markdown prompt text for AI analysis. |
| is_mostly_printable(s, threshold=0.9) | Whether the string is mostly printable ASCII. |
| try_base64_decode(s) | Decode Base64 if valid and printable; else None. |
| try_hex_decode(s) | Decode hex if valid and printable; else None. |
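The "decode if valid and printable; else None" contract of the two decode helpers can be approximated with the standard library alone. The sketch below uses hypothetical names (decode_base64_candidate, decode_hex_candidate, mostly_printable) and a 0.9 printable ratio borrowed from the is_mostly_printable default; the package's own checks may be stricter or looser.

```python
import base64
import binascii
import string
from typing import Optional

_PRINTABLE = set(string.printable.encode("ascii"))  # byte values, as ints


def mostly_printable(data: bytes, threshold: float = 0.9) -> bool:
    """True when at least `threshold` of the bytes are printable ASCII."""
    if not data:
        return False
    return sum(b in _PRINTABLE for b in data) / len(data) >= threshold


def decode_base64_candidate(s: str) -> Optional[str]:
    """Decode standard or URL-safe Base64; return None if invalid or binary."""
    # Map the URL-safe alphabet onto the standard one and restore padding.
    normalized = s.strip().replace("-", "+").replace("_", "/")
    normalized += "=" * (-len(normalized) % 4)
    try:
        raw = base64.b64decode(normalized, validate=True)
    except (binascii.Error, ValueError):
        return None
    return raw.decode("ascii", errors="replace") if mostly_printable(raw) else None


def decode_hex_candidate(s: str) -> Optional[str]:
    """Decode a hex string; return None if invalid or binary."""
    try:
        raw = bytes.fromhex(s.strip())
    except ValueError:
        return None
    return raw.decode("ascii", errors="replace") if mostly_printable(raw) else None
```

The printable check matters because many ordinary words are syntactically valid Base64; without it, nearly every short alphanumeric string would produce a bogus decoded candidate.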
Examples
Malware triage — get an AI prompt for a sample:
string-analyzer suspect.exe --ai-prompt -o triage_prompt.md
# Then paste triage_prompt.md into your AI assistant.
Large file — limit read size and get a filtered report:
string-analyzer memory.dump --max-bytes 100000000 -o report.txt
Script — use API and only print URLs and IPs:
from string_analyzer import analyze_file
r = analyze_file("sample.bin")
for s in r["patterns"].get("URLS", []):
print(s)
for s in r["patterns"].get("IPS", []):
print(s)
Longer strings only:
string-analyzer binary --min-length 8 -o long_strings.txt
Maximum sensitivity (UTF-16 + embedded URLs + lower obfuscation bar):
string-analyzer suspect.exe --encoding both --sensitive -o report.txt
Send to Gemini or Codex for AI analysis (requires gemini-cli or codex on PATH):
string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
string-analyzer suspect.exe --analyze-with codex --ai-output analysis.md
Configuration and limits
- Minimum string length: --min-length (default 4). Longer values reduce noise and speed up analysis.
- Maximum bytes read: --max-bytes. Omit for no limit; set for very large files to avoid high memory use.
- Obfuscation heuristic: Implemented using MIN_USEFUL_COUNT (default 10) and ENTROPY_THRESHOLD (default 5.0) in string_analyzer.patterns. A file is flagged as likely obfuscated when the number of “useful” patterns (Windows API, DLLs, CMD, PowerShell) is below the count threshold and file entropy is above the entropy threshold.
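The obfuscation rule stated above reduces to a two-condition check. The following sketch restates it with the documented defaults; the function name looks_obfuscated and the exact set of category keys counted as "useful" are assumptions based on the category table, not the contents of string_analyzer.patterns.

```python
from typing import Dict, Set

MIN_USEFUL_COUNT = 10    # default quoted in the list above
ENTROPY_THRESHOLD = 5.0  # default quoted in the list above

# Assumed mapping of "Windows API, DLLs, CMD, PowerShell" to category keys.
USEFUL_CATEGORIES = (
    "WINDOWS_API_COMMANDS",
    "DLLS",
    "CMD_COMMANDS",
    "POWERSHELL_COMMANDS",
)


def looks_obfuscated(
    patterns: Dict[str, Set[str]],
    file_entropy: float,
    min_useful: int = MIN_USEFUL_COUNT,
    entropy_threshold: float = ENTROPY_THRESHOLD,
) -> bool:
    """Flag a file when useful patterns are scarce and entropy is high."""
    useful = sum(len(patterns.get(category, ())) for category in USEFUL_CATEGORIES)
    return useful < min_useful and file_entropy > entropy_threshold


packed_like = {"DLLS": {"kernel32.dll"}}
flagged = looks_obfuscated(packed_like, file_entropy=7.6)
```

Both conditions must hold, so a small but plain-text script (low entropy) and a large, API-rich binary with some compressed resources (many useful patterns) are both left unflagged.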
Security and safety
- Input files: String Analyzer only reads the file and extracts printable strings; it does not execute or interpret code. Still, avoid running it on untrusted binaries in a sensitive environment without proper isolation.
- Large files: Use --max-bytes (or the max_bytes parameter in the API) to cap how much is read; interactive mode uses a 50 MB default.
- Output: Reports may contain URLs, IPs, and other indicators. Handle output according to your security and privacy policies.
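A byte cap combined with chunked reading can look like the sketch below. It is a generic pattern rather than the package's reader; read_capped and the 1 MiB chunk size are illustrative assumptions.

```python
from typing import Iterator, Optional


def read_capped(
    path: str,
    max_bytes: Optional[int] = None,
    chunk_size: int = 1 << 20,
) -> Iterator[bytes]:
    """Yield chunks from a file, stopping once max_bytes have been read.

    max_bytes=None means no limit, mirroring the CLI when --max-bytes is omitted.
    """
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            # Never request more than what is left under the cap.
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(want)
            if not chunk:
                break  # end of file
            if remaining is not None:
                remaining -= len(chunk)
            yield chunk
```

Memory use stays bounded by chunk_size regardless of file size. A real extractor built on this would also need to carry partial strings across chunk boundaries, which the sketch leaves out.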
Development
pip install -e ".[dev]"
ruff check string_analyzer tests
pytest tests/ -v
CI runs on push/PR: Ruff lint and pytest on Python 3.8, 3.10, and 3.12.
Documentation: Practical guide (Medium) · docs/DOCUMENTATION.md (patterns, heuristics, workflows)
Related repositories & articles
| Resource | Link |
|---|---|
| String-Analyzer (this repo) | GitHub · Medium: String Analyzer Guide |
| Static-malware-Analysis-Orchestrator | GitHub — one-command pipeline (triage, strings, PE imports, unpack) · Medium: Full workflow |
| PE-Import-Analyzer | GitHub · Medium: PE Import Analyzer Guide |
| Unpacker | GitHub · Medium: Unpacker Guide |
| Basic-File-Information-Gathering-Script | GitHub · Medium: File Metadata & Static Analysis |
| Author | Medium @1200km |
License
Distributed under the GNU General Public License v3.0. See LICENSE for details.
Contributions are welcome; please open an issue or submit a pull request.