Extract and analyze printable strings from binary files for malware analysis and forensics

String Analyzer

String extraction for CTI and malware-analysis workflows: surface URLs, IPs, paths, registry keys, APIs, commands, encoded data, and analyst-ready prompts from binaries and memory artifacts.

CTI Use

Use String Analyzer when a sample or dump needs fast indicator discovery before reverse engineering or sandboxing. The output is designed to feed IOC review, infrastructure pivoting, YARA/Sigma ideas, and ATT&CK-mapped analyst notes.

Defender Outputs

Output	Use
Categorized strings	IOC and behavior discovery
URLs / IPs / emails	Pivot and enrichment leads
Registry / paths / DLLs	Host behavior context
API names	Capability triage
Decoded candidates	Obfuscation review
AI-ready prompt	Structured analyst follow-up

String Analyzer extracts and analyzes printable strings from binary files. It is designed for malware analysts, reverse engineers, and forensics investigators who need to quickly surface URLs, IPs, registry keys, API names, and other indicators from executables, memory dumps, or disk images—and optionally generate an AI-ready analysis prompt.

Zero runtime dependencies (Python standard library only).
Single entry point: one CLI with batch and interactive modes.
Library-friendly API: use analyze_file() or lower-level functions in your own scripts.

📖 Practical guide (Medium) — step-by-step usage, workflows, and examples.

Features
Installation
Quick start
Usage
Pattern categories
Programmatic API
Examples
Configuration and limits
Security and safety
Development
License

Features

Feature	Description
String extraction	ASCII and UTF-16LE (Windows PE); configurable min length and `max_bytes`; chunked read for large files.
Entropy	Shannon entropy (chunked when `max_bytes` set); high entropy suggests packed/encrypted content.
Pattern detection	Strict IPv4 (0–255), IPv6 (full and abbreviated), URLs (http/https/ftp/file/ws/wss), obfuscated URLs (hxxp, etc.), emails, MAC addresses, registry keys, system paths, DLLs, 300+ Windows APIs, CMD/PowerShell, obfuscation patterns.
Embedded extraction	URLs, IPs, emails, MACs found inside long strings (not only whole-line matches).
Decoding	Base64 (standard and URL-safe) and hex; decoded candidates in report.
Suspicious keywords	Extended set: malware, miner, steal, persist, evasion, etc., plus .NET namespaces.
Sensitive mode	`--sensitive`: lower obfuscation thresholds and more keywords for stricter triage.
Output formats	Unfiltered dump, categorized report, or AI-ready markdown prompt.
CLI & API	Full CLI (`--encoding`, `--sensitive`, `--no-embedded`); programmatic `analyze_file()`; no global state.
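
The extraction step in the table above can be sketched as follows, assuming the whole input fits in memory. The helper names `extract_ascii` and `extract_utf16le` are hypothetical and are not the package's API; the real tool also reads large files in chunks, which this sketch does not attempt.

```python
import re
from typing import Set


def extract_ascii(data: bytes, min_length: int = 4) -> Set[str]:
    """Return unique runs of printable ASCII bytes (0x20-0x7e)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return {m.group().decode("ascii") for m in pattern.finditer(data)}


def extract_utf16le(data: bytes, min_length: int = 4) -> Set[str]:
    """Return unique runs of printable ASCII stored as UTF-16LE.

    Each character is a printable byte followed by a zero byte, which is
    how Windows PE files commonly store wide strings.
    """
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    return {m.group().decode("utf-16-le") for m in pattern.finditer(data)}
```

Running both passes and taking the union of the results roughly corresponds to `--encoding both`. A byte-level scan like this can pick up a stray leading character when an ASCII byte happens to sit directly before a wide string, so treat short UTF-16 hits with some caution.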

Installation

Requirements: Python 3.8 or newer.

git clone https://github.com/anpa1200/String-Analyzer-.git && cd String-Analyzer-
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -e .

After installation you get the string-analyzer command. From the project root you can also run:

python -m string_analyzer

Development (optional): pip install -e ".[dev]" adds pytest and ruff for tests and linting.

Quick start

# Categorized report (default)
string-analyzer /path/to/binary -o report.txt

# All extracted strings, no categorization
string-analyzer /path/to/binary --unfiltered -o strings.txt

# AI-ready analysis prompt
string-analyzer /path/to/binary --ai-prompt -o prompt.md

# Interactive: prompt for file and output type
string-analyzer

Usage

Command-line options

Option	Description
`file`	Path to the binary file. Omit to run interactive mode.
`-o`, `--output PATH`	Output file (default: `<basename>_strings.txt`).
`--min-length N`	Minimum string length to extract (default: 4).
`--max-bytes N`	Stop reading after N bytes (safety for very large files).
`--unfiltered`	Output all extracted strings, one per line (no categories).
`--filtered`	Output categorized report (default when not using `--unfiltered` or `--ai-prompt`).
`--ai-prompt`	Generate markdown prompt for an AI assistant.
`--analyze-with {gemini,codex}`	Send categorized prompt to gemini-cli or codex-cli and print the AI analysis. Saves the prompt to `-o`; use `--ai-output` to save the AI response.
`--ai-output PATH`	Save the AI response to this file (when using `--analyze-with`).
`--encoding {ascii,utf16,both}`	Extract ASCII only, UTF-16LE only, or both (default: both).
`--sensitive`	Lower obfuscation thresholds; more suspicious keywords.
`--no-embedded`	Do not extract URLs/IPs/emails from inside long strings.
`-i`, `--interactive`	Force interactive mode (prompt for file and options).
`-q`, `--quiet`	Suppress non-error messages.
`-v`, `--verbose`	Verbose logging.
`--version`	Show version.
`--help`	Show help.

Output modes

Unfiltered (--unfiltered): sorted list of all extracted strings. Use for grepping or feeding into other tools.
Filtered (default): categorized report with entropy, plus sections such as URLS, IPS, WINDOWS_API_COMMANDS, DLLS, OBFUSCATED, etc.
AI prompt (--ai-prompt): same categories in a markdown prompt asking an AI to analyze behavior and functionality (e.g. for malware triage).

External AI analysis (`--analyze-with`)

The --analyze-with option sends the categorized string report directly to an AI CLI so you get an analysis in one command instead of copying a prompt by hand.

What it does: After extracting and categorizing strings (URLs, IPs, APIs, DLLs, obfuscation, etc.), the tool builds the same markdown prompt used by --ai-prompt, writes it to the path given by -o (so you can keep or reuse it), then pipes that prompt into the chosen CLI. The AI’s reply is printed to the terminal; you can save it with --ai-output PATH.
Values: gemini — uses gemini-cli (looks for gemini or gemini-cli on your PATH). codex — uses Codex CLI (codex exec - with the prompt on stdin).
Requirements: You must have one of these installed and on your PATH: Gemini CLI (e.g. npm i -g @google/gemini-cli) or Codex CLI. The tool does not call cloud APIs itself; it only invokes the local CLI, which handles authentication and the model.
Example:
string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
This saves the prompt to prompt.txt, sends it to Gemini, and writes the AI’s analysis to analysis.md.
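
The "locate a CLI on PATH, then pipe the prompt to its stdin" flow described above can be sketched with the standard library. This is an illustration under assumptions, not the tool's actual code: `find_cli` and `run_with_prompt` are hypothetical names, and the exact arguments each AI CLI expects are left to the caller.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def find_cli(candidates: Sequence[str]) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def run_with_prompt(argv: List[str], prompt: str, timeout: int = 300) -> str:
    """Run a command with the prompt on stdin and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the command runs too long.
    """
    completed = subprocess.run(
        argv,
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout
```

For example, `find_cli(["gemini", "gemini-cli"])` mirrors the lookup order mentioned above, and the Codex form would pass `[codex_path, "exec", "-"]` as `argv`. Passing an argument list rather than a shell string keeps file names from the sample out of shell interpretation.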

Interactive mode

Run string-analyzer with no file argument (or use string-analyzer -i). The tool will:

Ask for the file path.
Ask whether to output all strings (unfiltered) or a filtered report.
If filtered: ask whether to generate an AI prompt or a normal report.
Ask for the output file path (with a default suggestion).

Interactive mode limits input to 50 MB by default to avoid accidental resource use.

Pattern categories

Strings are classified into the following categories (empty categories are omitted from output):

Category	Description
`WINDOWS_API_COMMANDS`	Known Windows API function names (300+).
`DLLS`	Strings matching typical DLL names (e.g. `*.dll`).
`URLS`	HTTP/HTTPS and similar URLs.
`IPS`	IPv4 addresses.
`IPV6`	IPv6 addresses.
`EMAILS`	Email-like strings.
`WINDOWS_REGISTRY_KEYS`	Registry path patterns.
`POWERSHELL_COMMANDS`	PowerShell cmdlets/commands.
`CMD_COMMANDS`	CMD shell commands.
`FILES`	File path / filename patterns.
`SYSTEM_PATHS`	System directory paths.
`OBFUSCATED`	Patterns suggesting obfuscation (e.g. defanged `hxxp://` URLs, `[.]`-dotted IPs).
`DECODED_BASE64`	Strings that successfully decode from Base64 to printable text.
`DECODED_HEX`	Strings that successfully decode from hex to printable text.
`SUSPICIOUS_KEYWORDS`	Substrings associated with malware (e.g. `miner`, `steal`, `persist`, `evasion`).
`SUSPICIOUS_DOTNET`	.NET-related suspicious namespaces/keywords.
`MAC_ADDRESSES`	MAC addresses (e.g. `00:1A:2B:3C:4D:5E`).
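
To illustrate what "strict IPv4 (0-255)" and embedded matching can look like for the `IPS` category, here is a small sketch. The regular expression and the name `find_ipv4` are assumptions made for this example; the package's own patterns may differ.

```python
import re
from typing import List

# One octet in the range 0-255, without leading zeros.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets, not glued to other digits or dots on either side, so
# version-like strings such as 1.2.3.4.5 are not reported.
IPV4_RE = re.compile(
    r"(?<![\d.])" + r"\.".join([_OCTET] * 4) + r"(?!\d|\.\d)"
)


def find_ipv4(text: str) -> List[str]:
    """Return IPv4 addresses found anywhere inside a longer string."""
    return IPV4_RE.findall(text)
```

Because the pattern is searched rather than anchored to the whole string, it also finds addresses inside command lines or URLs, which is the behavior `--no-embedded` switches off.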

The tool also computes file entropy. Combined with a low count of “useful” patterns (APIs, DLLs, CMD/PowerShell), high entropy can indicate a packed or obfuscated binary; this is noted in the report and in the AI prompt.
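
Shannon entropy over file bytes can be computed in chunks so that memory use stays flat. The sketch below is a minimal illustration; `file_entropy` and its `chunk_size` parameter are hypothetical and do not claim to match `compute_file_entropy()`.

```python
import math
from collections import Counter
from typing import Optional


def file_entropy(path: str, max_bytes: Optional[int] = None,
                 chunk_size: int = 1 << 20) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0), read chunk by chunk."""
    counts: Counter = Counter()
    total = 0
    with open(path, "rb") as fh:
        while max_bytes is None or total < max_bytes:
            want = chunk_size
            if max_bytes is not None:
                want = min(chunk_size, max_bytes - total)
            chunk = fh.read(want)
            if not chunk:
                break
            counts.update(chunk)  # iterating bytes yields ints 0-255
            total += len(chunk)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Values near 8.0 mean the bytes look close to uniformly random, which is typical of compressed or encrypted data; plain text and ordinary code sections usually score noticeably lower.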

Programmatic API

Use the package in your own Python code:

from string_analyzer import (
    analyze_file,
    extract_strings,
    detect_patterns,
    compute_file_entropy,
    generate_normal_output,
    generate_ai_prompt,
    shannon_entropy,
)
from string_analyzer.analyzer import (
    is_likely_obfuscated,
    is_mostly_printable,
    try_base64_decode,
    try_hex_decode,
)

One-shot analysis

result = analyze_file(
    "/path/to/binary",
    min_length=4,
    max_bytes=None,
    encoding="both",        # "ascii", "utf16", or "both"
    extract_embedded=True,  # find URLs/IPs inside long strings
    sensitive=False,        # True: lower obfuscation thresholds
)
# result["file"], result["entropy"], result["strings"], result["patterns"], result["obfuscated"]
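
If you want to store the result as JSON, note that Python sets are not JSON-serializable. The helper below is a sketch that assumes `result["strings"]` is a set and `result["patterns"]` maps category names to sets, as the function reference suggests; `result_to_json` is a hypothetical name, not part of the package.

```python
import json
from typing import Any, Dict


def result_to_json(result: Dict[str, Any]) -> str:
    """Serialize an analysis result, turning sets into sorted lists."""

    def default(obj: Any) -> Any:
        if isinstance(obj, (set, frozenset)):
            return sorted(obj)
        return str(obj)  # e.g. a pathlib.Path in result["file"]

    return json.dumps(result, default=default, indent=2, sort_keys=True)
```

Sorting the sets keeps the output stable between runs, which makes reports easier to diff.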

Step-by-step

from pathlib import Path
path = Path("sample.bin")
entropy = compute_file_entropy(path)
strings = extract_strings(path, min_length=4, max_bytes=10_000_000)
patterns = detect_patterns(strings)  # New dict every time; no global state
obfuscated = is_likely_obfuscated(patterns, entropy)
report = generate_normal_output(patterns, entropy, obfuscated)
# Or: prompt_text = generate_ai_prompt(patterns, entropy, obfuscated)

Function reference

Function	Description
`analyze_file(path, min_length=4, max_bytes=None, encoding="both", extract_embedded=True, sensitive=False)`	Full analysis; returns dict with `file`, `entropy`, `strings`, `patterns`, `obfuscated`.
`extract_strings(path, min_length=4, max_bytes=None)`	Extract unique printable strings; returns `set[str]`.
`compute_file_entropy(path)`	Shannon entropy of file bytes.
`shannon_entropy(s)`	Shannon entropy of a string.
`detect_patterns(strings)`	Categorize strings; returns new `dict[str, set[str]]`.
`is_likely_obfuscated(patterns, file_entropy)`	Heuristic: few “useful” patterns and entropy > threshold.
`generate_normal_output(patterns, entropy, obfuscated)`	Formatted filtered report text.
`generate_ai_prompt(patterns, entropy, obfuscated)`	Markdown prompt text for AI analysis.
`is_mostly_printable(s, threshold=0.9)`	Whether the string is mostly printable ASCII.
`try_base64_decode(s)`	Decode Base64 if valid and printable; else `None`.
`try_hex_decode(s)`	Decode hex if valid and printable; else `None`.
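
The "decode if valid and printable, else `None`" contract of the two decode helpers can be illustrated with the standard library. This is a sketch under assumptions: the names `base64_candidate`, `hex_candidate`, and `mostly_printable`, the minimum length of 8, and the UTF-8 requirement are choices made for this example, not the package's implementation.

```python
import base64
import binascii
import string
from typing import Optional

_PRINTABLE = set(string.printable)


def mostly_printable(text: str, threshold: float = 0.9) -> bool:
    """True when at least `threshold` of the characters are printable ASCII."""
    if not text:
        return False
    return sum(ch in _PRINTABLE for ch in text) / len(text) >= threshold


def base64_candidate(s: str, min_length: int = 8) -> Optional[str]:
    """Decode standard or URL-safe Base64; return text or None."""
    if len(s) < min_length:
        return None
    normalized = s.translate(str.maketrans("-_", "+/"))
    normalized += "=" * (-len(normalized) % 4)  # restore stripped padding
    try:
        raw = base64.b64decode(normalized, validate=True)
        text = raw.decode("utf-8")
    except (binascii.Error, ValueError):  # UnicodeDecodeError is a ValueError
        return None
    return text if mostly_printable(text) else None


def hex_candidate(s: str, min_length: int = 8) -> Optional[str]:
    """Decode an even-length hex string; return text or None."""
    if len(s) < min_length or len(s) % 2:
        return None
    try:
        text = bytes.fromhex(s).decode("utf-8")
    except ValueError:
        return None
    return text if mostly_printable(text) else None
```

The printable check matters because many ordinary identifiers are also syntactically valid Base64 or hex; requiring readable output filters most of those false positives.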

Examples

Malware triage — get an AI prompt for a sample:

string-analyzer suspect.exe --ai-prompt -o triage_prompt.md
# Then paste triage_prompt.md into your AI assistant.

Large file — limit read size and get a filtered report:

string-analyzer memory.dump --max-bytes 100000000 -o report.txt

Script — use API and only print URLs and IPs:

from string_analyzer import analyze_file

result = analyze_file("sample.bin")
# A category may be missing when nothing matched, so fall back to an empty tuple.
for category in ("URLS", "IPS"):
    for value in sorted(result["patterns"].get(category, ())):
        print(value)

Longer strings only:

string-analyzer binary --min-length 8 -o long_strings.txt

Maximum sensitivity (UTF-16 + embedded URLs + lower obfuscation bar):

string-analyzer suspect.exe --encoding both --sensitive -o report.txt

Send to Gemini or Codex for AI analysis (requires gemini-cli or codex on PATH):

string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
string-analyzer suspect.exe --analyze-with codex --ai-output analysis.md

Configuration and limits

Minimum string length: --min-length (default 4). Larger values reduce noise and speed up analysis.
Maximum bytes read: --max-bytes. Omit for no limit; set for very large files to avoid high memory use.
Obfuscation heuristic: Implemented using MIN_USEFUL_COUNT (default 10) and ENTROPY_THRESHOLD (default 5.0) in string_analyzer.patterns. A file is flagged as likely obfuscated when the number of “useful” patterns (Windows API, DLLs, CMD, PowerShell) is below the count threshold and file entropy is above the entropy threshold.
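
The heuristic described above fits in a few lines. The sketch below reuses the documented constant names and defaults, but the function name `looks_obfuscated` and the exact set of category keys counted as "useful" are assumptions based on this README.

```python
from typing import Dict, Set

MIN_USEFUL_COUNT = 10    # documented default
ENTROPY_THRESHOLD = 5.0  # documented default

# Categories this README calls "useful" (assumed to be these dict keys).
USEFUL_CATEGORIES = (
    "WINDOWS_API_COMMANDS",
    "DLLS",
    "CMD_COMMANDS",
    "POWERSHELL_COMMANDS",
)


def looks_obfuscated(patterns: Dict[str, Set[str]], file_entropy: float,
                     min_useful: int = MIN_USEFUL_COUNT,
                     threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag files with few useful strings and high byte entropy."""
    useful = sum(len(patterns.get(cat, ())) for cat in USEFUL_CATEGORIES)
    return useful < min_useful and file_entropy > threshold
```

Both conditions must hold: a high-entropy file with plenty of API and DLL names is not flagged, and neither is a low-entropy file that simply contains few strings.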

Security and safety

Input files: String Analyzer only reads the file and extracts printable strings; it does not execute or interpret code. Still, avoid running it on untrusted binaries in a sensitive environment without proper isolation.
Large files: Use --max-bytes (or the max_bytes parameter in the API) to cap how much is read; interactive mode uses a 50 MB default.
Output: Reports may contain URLs, IPs, and other indicators. Handle output according to your security and privacy policies.
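
One common way to handle indicators safely when sharing a report is to defang them so they are not clickable. The helpers below are a hypothetical sketch and are not part of String Analyzer; they replace every dot, including dots in the URL path, which is cruder than some defanging tools.

```python
import re


def defang(indicator: str) -> str:
    """Make a URL or IP non-clickable: http -> hxxp, '.' -> '[.]'."""
    out = re.sub(r"^http", "hxxp", indicator, flags=re.IGNORECASE)
    return out.replace(".", "[.]")


def refang(indicator: str) -> str:
    """Reverse defang() for use in analysis tooling."""
    out = indicator.replace("[.]", ".")
    return re.sub(r"^hxxp", "http", out, flags=re.IGNORECASE)
```

For example, `defang("10.0.0.1")` yields `10[.]0[.]0[.]1`, which can be pasted into tickets or chat without creating a live link.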

Development

pip install -e ".[dev]"
ruff check string_analyzer tests
pytest tests/ -v

CI runs on push/PR: Ruff lint and pytest on Python 3.8, 3.10, and 3.12.

Documentation: Practical guide (Medium) · docs/DOCUMENTATION.md (patterns, heuristics, workflows)

Related repositories & articles

Resource	Link
String-Analyzer (this repo)	GitHub · Medium: String Analyzer Guide
Static-malware-Analysis-Orchestrator	GitHub — one-command pipeline (triage, strings, PE imports, unpack) · Medium: Full workflow
PE-Import-Analyzer	GitHub · Medium: PE Import Analyzer Guide
Unpacker	GitHub · Medium: Unpacker Guide
Basic-File-Information-Gathering-Script	GitHub · Medium: File Metadata & Static Analysis
Author	Medium @1200km

License

Distributed under the GNU General Public License v3.0. See LICENSE for details.

Contributions are welcome; please open an issue or submit a pull request.


This version

2.0.0

Jun 14, 2026

Download files

Source Distribution

string_analyzer-2.0.0.tar.gz (42.8 kB view details)

Uploaded Jun 14, 2026 Source

Built Distribution

string_analyzer-2.0.0-py3-none-any.whl (36.9 kB view details)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file string_analyzer-2.0.0.tar.gz.

File metadata

Download URL: string_analyzer-2.0.0.tar.gz
Upload date: Jun 14, 2026
Size: 42.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for string_analyzer-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c1f87a3a50a90c7dabbeddeec9de9744c7194ad9f5bd19d7a46f571de6da3625`
MD5	`24e2761a6a03193cb0ba60b9bb2ffe41`
BLAKE2b-256	`23f720675f2be545886758ce311434850e89218751dc57c40919a271bea7fdd6`

Provenance

The following attestation bundles were made for string_analyzer-2.0.0.tar.gz:

Publisher: publish.yml on anpa1200/String-Analyzer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: string_analyzer-2.0.0.tar.gz
- Subject digest: c1f87a3a50a90c7dabbeddeec9de9744c7194ad9f5bd19d7a46f571de6da3625
- Sigstore transparency entry: 1813464308
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: anpa1200/String-Analyzer@7a9211e5491951a32b0bb3a0ac8c30b448d9ac17
- Branch / Tag: refs/heads/main
- Owner: https://github.com/anpa1200
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7a9211e5491951a32b0bb3a0ac8c30b448d9ac17
- Trigger Event: workflow_dispatch

File details

Details for the file string_analyzer-2.0.0-py3-none-any.whl.

File metadata

Download URL: string_analyzer-2.0.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 36.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for string_analyzer-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5ab1f803b00b9d6f4fbefa9c9daf8dd3737a9a921728390ce36507c0971ddb12`
MD5	`9c3eb8ad4f6149c0f5f7ca815a2a085f`
BLAKE2b-256	`a13bec3835a3ed6bc4617c79b5dcb449e504081f1fe6eab8123735fd6e04d468`

Provenance

The following attestation bundles were made for string_analyzer-2.0.0-py3-none-any.whl:

Publisher: publish.yml on anpa1200/String-Analyzer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: string_analyzer-2.0.0-py3-none-any.whl
- Subject digest: 5ab1f803b00b9d6f4fbefa9c9daf8dd3737a9a921728390ce36507c0971ddb12
- Sigstore transparency entry: 1813464339
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: anpa1200/String-Analyzer@7a9211e5491951a32b0bb3a0ac8c30b448d9ac17
- Branch / Tag: refs/heads/main
- Owner: https://github.com/anpa1200
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7a9211e5491951a32b0bb3a0ac8c30b448d9ac17
- Trigger Event: workflow_dispatch
