A Python package for detecting hidden Unicode and ASCII characters.
🕵️♂️ ByteSleuth — The Ghost Hunter for Hidden Characters
"Elementary, my dear dev. The ghosts of hidden characters won't escape this audit!" — CharlockHolmes, the detective inside ByteSleuth
ByteSleuth is a powerful Unicode & ASCII character scanner designed to detect obfuscation, invisible threats, and suspicious bytes lurking in text or code. Whether you're hunting down ghost characters or analyzing unexpected encoding issues, ByteSleuth ensures a clean and transparent result.
🚀 Key Features
- ✅ Detects ASCII control characters (e.g., NUL, BEL, ESC)
- ✅ Flags Unicode invisibles and directional controls (e.g., U+200B, U+202E)
- ✅ Optionally sanitizes input by removing hidden/malicious characters
- ✅ Works seamlessly with files, directories, and stdin/PIPE
- ✅ Supports logging for audit trails
- ✅ Generates SHA256 hash before/after sanitization
- ✅ Outputs JSON reports (stdout or file)
- ✅ Concurrent directory scanning for speed
- ✅ Fail-on-detect mode for CI/CD and pre-commit
- ✅ Backup/restore before sanitization
- ✅ VSCode extension for easy integration
- ✅ Pre-commit & CI/CD integration examples
- ✅ Real-world examples included
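To make the first two features concrete, here is a minimal sketch of how such detection could work with only the standard library. It is not ByteSleuth's actual implementation: `find_hidden` and `ALLOWED_CONTROLS` are hypothetical names, and treating tab, LF, and CR as harmless is an assumption.

```python
import unicodedata

# Hypothetical helper, not ByteSleuth's own code.
# Assumption: ordinary whitespace controls are not worth flagging.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def find_hidden(text: str) -> list[tuple[int, str, str, int]]:
    """Return (code point, name, character, index) for each suspicious character."""
    findings = []
    for idx, char in enumerate(text):
        category = unicodedata.category(char)
        # Cc = control characters (NUL, BEL, ESC, ...)
        # Cf = format characters (U+200B zero width space, U+202E right-to-left override, ...)
        if category in ("Cc", "Cf") and char not in ALLOWED_CONTROLS:
            # Control characters have no Unicode name, so fall back to a placeholder.
            name = unicodedata.name(char, f"CONTROL-{ord(char):04X}")
            findings.append((ord(char), name, char, idx))
    return findings


sample = "safe\x07text\u200bwith\u202eghosts"
for cp, name, char, idx in find_hidden(sample):
    print(f"U+{cp:04X} {name} at index {idx}")
```

Relying on Unicode general categories keeps the sketch short; a real scanner would likely maintain a more deliberate list of code points.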
🔧 CLI Usage
python src/byte_sleuth.py <target> [options]
CLI Options
| Option | Description |
|---|---|
| `target` | File or directory to scan (or use PIPE input) |
| `-s, --sanitize` | Automatically remove suspicious characters |
| `-l, --log` | Log file to write results (default: scanner.log) |
| `-r, --report [file]` | Print JSON report to stdout or save to file |
| `-f, --no-backup` | Disable backup creation |
| `-v, --verbose` | Enable verbose output (shows hashes, findings) |
| `-d, --debug` | Enable debug output |
| `-q, --quiet` | Suppress all output except errors |
| `-S, --sanitize-only` | Only sanitize, do not scan/report |
| `-F, --fail-on-detect` | Exit with code 1 if suspicious characters are found |
| `-V, --version` | Show version and exit |
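The interplay between sanitizing, backups, and the before/after hashes can be sketched roughly as follows. This is an illustration only: `sanitize_file`, `sha256_of`, and the `.bak` backup suffix are assumptions, not names or behavior confirmed by ByteSleuth.

```python
import hashlib
import shutil
import unicodedata
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA256 digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sanitize_file(path: Path, backup: bool = True) -> tuple[str, str]:
    """Strip Cc/Cf characters in place and return (hash_before, hash_after)."""
    before = sha256_of(path)
    if backup:
        # Assumption: the backup sits next to the original with a .bak suffix.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    text = path.read_bytes().decode("utf-8")
    cleaned = "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Write bytes directly so line endings are not translated.
    path.write_bytes(cleaned.encode("utf-8"))
    return before, sha256_of(path)
```

Comparing the two digests shows whether sanitization changed anything; identical hashes mean the file was already clean.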
CLI Examples
python src/byte_sleuth.py suspicious.txt -s -v
python src/byte_sleuth.py ./data/ -r report.json
cat file.txt | python src/byte_sleuth.py -s > sanitized.txt
python src/byte_sleuth.py src/ -F # For CI/pre-commit: fail if any issue found
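For directory targets such as `./data/`, the concurrent scanning listed under Key Features might look something like the sketch below. The thread pool, the function names `count_hidden` and `scan_directory`, and the default of four workers are all assumptions made for illustration.

```python
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_hidden(path: Path) -> int:
    """Count Cc/Cf characters in one file, ignoring tab, LF, and CR."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return sum(
        1 for ch in text
        if ch not in "\t\n\r" and unicodedata.category(ch) in ("Cc", "Cf")
    )


def scan_directory(root: str, workers: int = 4) -> dict[str, int]:
    """Scan every file under root in parallel; return only files with findings."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input files.
        counts = pool.map(count_hidden, files)
        return {str(p): n for p, n in zip(files, counts) if n}
```

Threads suit this workload because most of the time is spent waiting on file reads rather than on computation.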
📦 Using ByteSleuth in Your Python Projects
Installation
Install from PyPI (the published distribution is named bytesleuth):
pip install bytesleuth
Basic Usage in Python
from byte_sleuth import ByteSleuth
scanner = ByteSleuth(sanitize=True)
findings = scanner.scan_file("example.txt")
for cp, name, char, idx in findings:
    print(f"⚠️ Suspicious Character: {name} (U+{cp:04X}) at position {idx} → {repr(char)}")
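Because findings arrive as `(cp, name, char, idx)` tuples, turning them into a JSON report for later automation is straightforward. The sketch below assumes only that tuple shape; the function name `findings_to_report` and the report layout are hypothetical and need not match what `--report` emits.

```python
import json


def findings_to_report(path: str, findings: list[tuple[int, str, str, int]]) -> str:
    """Serialize (cp, name, char, idx) tuples to a JSON string."""
    report = {
        "file": path,
        "count": len(findings),
        "findings": [
            # The raw character is omitted so the report itself stays free of invisibles.
            {"codepoint": f"U+{cp:04X}", "name": name, "index": idx}
            for cp, name, _char, idx in findings
        ],
    }
    return json.dumps(report, indent=2)


example = [(0x200B, "ZERO WIDTH SPACE", "\u200b", 12)]
print(findings_to_report("example.txt", example))
```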
🔁 Automation & Integration
- Pre-commit hook: Block commits with hidden characters
- CI/CD pipelines: Fail builds if issues are found
- VSCode extension: Scan open files with one click
- JSON reports: For audit or further automation
Pre-commit Example
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: byte-sleuth-scan
        name: ByteSleuth Unicode & ASCII Scanner
        entry: python src/byte_sleuth.py src/ -F
        language: system
        pass_filenames: false
GitHub Actions Example
- name: Scan for hidden characters
  run: python src/byte_sleuth.py src/ -F
🧑💻 VSCode Extension
- Scan the current file for hidden/suspicious characters
- See results directly in VSCode
- Easy to install and use (see vscode-extension/README.md)
🧠 Why Use ByteSleuth?
Some characters are invisible but dangerous—causing confusion in source code, configs, or documents. Common attack vectors include:
- Zero-width spaces for code obfuscation
- Bidirectional override characters
- Hidden ASCII control codes
- Formatting trickery affecting debugging & diffs
ByteSleuth gives you a detective's magnifying glass to expose them all. 🔍
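A short standard-library demonstration shows why these characters matter. The two strings below render identically in many editors and fonts, yet they are different values; the variable names are purely illustrative.

```python
import unicodedata

visible = "is_admin"
spoofed = "is_\u200badmin"  # zero-width space hidden inside the string

print(visible == spoofed)          # False
print(len(visible), len(spoofed))  # 8 9

for ch in spoofed:
    # Cf is the Unicode "format" category, which includes zero-width
    # and bidirectional control characters.
    if unicodedata.category(ch) == "Cf":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")  # U+200B ZERO WIDTH SPACE
```

A comparison, dictionary lookup, or diff involving `spoofed` will behave differently from one involving `visible`, even though a reviewer may see no difference on screen.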
Comparison with other tools
| Tool | Unicode | ASCII Control | Sanitization | JSON Report | CLI/Automation | VSCode Integration |
|---|---|---|---|---|---|---|
| ByteSleuth | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| grep/sed | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ |
| ad-hoc scripts | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ |
- ByteSleuth covers Unicode, ASCII, sanitizes, generates reports, and integrates easily with automation and VSCode.
- grep/sed are great for simple ASCII, but do not cover Unicode or sanitization.
- Ad-hoc scripts are fragile and hard to maintain.
🚀 Roadmap
- Expand sanitization methods
- Improve CLI interactivity
- Output JSON reports
- VSCode Extension
- HTML reports
- Support for more file formats (zip, PDF, etc.)
- Public changelog/roadmap
📄 License
MIT — Feel free to sleuth away!
File details
Details for the file bytesleuth-1.0.1-py3-none-any.whl.
File metadata
- Download URL: bytesleuth-1.0.1-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 354a2de6f414a77543b3ee658b93678b2577daca454410cd37a2d8be44064189 |
| MD5 | 8e191601c45a07aa2c584f66974968f5 |
| BLAKE2b-256 | 965f3de4406cc7e90f414c36597dd48bcf5938013b3c410df8bbaa6637bbc5b8 |