A Python package for detecting hidden Unicode and ASCII characters.

🕵️‍♂️ ByteSleuth — The Ghost Hunter for Hidden Characters

"Elementary, my dear dev. The ghosts of hidden characters won't escape this audit!" — CharlockHolmes, the detective inside ByteSleuth

ByteSleuth is a Unicode and ASCII character scanner that detects obfuscation, invisible characters, and suspicious bytes lurking in text or code. Whether you are hunting ghost characters or debugging unexpected encoding issues, ByteSleuth reports what it finds and where, and can optionally strip it out.


🚀 Key Features

  • ✅ Detects ASCII control characters (e.g., NUL, BEL, ESC)
  • ✅ Flags Unicode invisibles and directional controls (e.g., U+200B, U+202E)
  • ✅ Optionally sanitizes input by removing hidden/malicious characters
  • ✅ Works seamlessly with files, directories, and stdin/PIPE
  • ✅ Supports logging for audit trails
  • ✅ Generates SHA256 hash before/after sanitization
  • ✅ Outputs JSON reports (stdout or file)
  • Concurrent directory scanning for speed
  • Fail-on-detect mode for CI/CD and pre-commit
  • Backup/restore before sanitization
  • VSCode extension for easy integration
  • Pre-commit & CI/CD integration examples
  • Real-world examples included
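
To make the first two features concrete, here is a minimal sketch of category-based detection using only the standard library. It is not ByteSleuth's implementation: the function name `find_hidden` and the choice to ignore newline, carriage return, and tab are assumptions made for illustration. The tuple shape mirrors the `(cp, name, char, idx)` findings shown under Basic Usage in Python.

```python
import unicodedata

def find_hidden(text):
    """Return (code point, name, char, index) for control and format characters."""
    allowed = {"\n", "\r", "\t"}  # assumption: ordinary whitespace is not flagged
    findings = []
    for idx, ch in enumerate(text):
        if ch in allowed:
            continue
        # Cc = control characters (NUL, BEL, ESC, ...)
        # Cf = format characters (U+200B zero-width space, U+202E override, ...)
        if unicodedata.category(ch) in ("Cc", "Cf"):
            # Control characters have no Unicode name, so supply a fallback label
            name = unicodedata.name(ch, f"CONTROL-{ord(ch):04X}")
            findings.append((ord(ch), name, ch, idx))
    return findings

sample = "pay\u200bload\x07 \u202Etxt"
for cp, name, ch, idx in find_hidden(sample):
    print(f"U+{cp:04X} {name} at index {idx}")
```

Checking the general category catches whole classes of characters without maintaining a hand-written list, at the cost of also flagging legitimate format characters such as the soft hyphen.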

🔧 CLI Usage

python byte_sleuth/byte_sleuth.py <target> [options]

CLI Options

Option                  Description
target                  File or directory to scan (or use PIPE input)
-s, --sanitize          Automatically remove suspicious characters
-l, --log               Log file to write results (default: scanner.log)
-r, --report [file]     Print JSON report to stdout or save to file
-f, --no-backup         Disable backup creation
-v, --verbose           Enable verbose output (shows hashes, findings)
-d, --debug             Enable debug output
-q, --quiet             Suppress all output except errors
-S, --sanitize-only     Only sanitize, do not scan/report
-F, --fail-on-detect    Exit with code 1 if suspicious characters are found
-V, --version           Show version and exit

CLI Examples

# Scan and sanitize a file, showing hashes and findings
python byte_sleuth/byte_sleuth.py suspicious.txt -s -v

# Scan a directory, output JSON report to file
python byte_sleuth/byte_sleuth.py ./data/ -r report.json

# Sanitize stdin (PIPE), output to sanitized.txt
cat file.txt | python byte_sleuth/byte_sleuth.py -s > sanitized.txt

# Scan from PIPE and fail (exit 1) if any suspicious character is found (for CI/pre-commit)
cat file.txt | python byte_sleuth/byte_sleuth.py -F

# Log all removed characters from PIPE to a custom log file
cat file.txt | python byte_sleuth/byte_sleuth.py -s -l removed_chars.log > sanitized.txt

# Scan a directory and fail if any file contains suspicious characters (CI/pre-commit)
python byte_sleuth/byte_sleuth.py src/ -F
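
The sanitize-and-hash behaviour behind `-s -v` can be pictured with a short standard-library sketch. This is an illustration under stated assumptions, not the tool's actual code: `sanitize` and `sha256_hex` are hypothetical helpers, and the removal rule (drop every `Cc`/`Cf` character except newline, carriage return, and tab) is a guess at a reasonable policy.

```python
import hashlib
import unicodedata

def sha256_hex(text):
    """SHA256 of the UTF-8 encoding, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sanitize(text):
    """Drop control and format characters, keeping ordinary whitespace."""
    keep = {"\n", "\r", "\t"}  # assumption: these are never removed
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) not in ("Cc", "Cf")
    )

original = "user\u200bname = 'root'\x00\n"
cleaned = sanitize(original)

# Differing digests are a cheap audit signal that sanitization changed the data
print("before:", sha256_hex(original))
print("after: ", sha256_hex(cleaned))
print("changed:", original != cleaned)
```

Recording both digests lets an audit log show that a file was modified, and which version was produced, without storing the content itself.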

📦 Using ByteSleuth in Your Python Projects

Installation

Install from PyPI (the distribution files are published under the name bytesleuth):

pip install bytesleuth

Basic Usage in Python

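from byte_sleuth import ByteSleuth

# sanitize=True enables removal of hidden characters (see the -s CLI option)
scanner = ByteSleuth(sanitize=True)
findings = scanner.scan_file("example.txt")

# Each finding unpacks as (code point, name, character, index)
for cp, name, char, idx in findings:
    print(f"⚠️ Suspicious Character: {name} (U+{cp:04X}) at position {idx}: {char!r}")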

🔁 Automation & Integration

  • Pre-commit hook: Block commits with hidden characters
  • CI/CD pipelines: Fail builds if issues are found
  • VSCode extension: Scan open files with one click
  • JSON reports: For audit or further automation

Pre-commit Example

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: byte-sleuth-scan
        name: ByteSleuth Unicode & ASCII Scanner
        entry: python byte_sleuth/byte_sleuth.py src/ -F
        language: system
        pass_filenames: false

GitHub Actions Example

- name: Scan for hidden characters
  run: cat file.txt | python byte_sleuth/byte_sleuth.py -F
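
For readers wondering what a fail-on-detect directory scan involves, the following is a rough standard-library sketch of the idea: scan files concurrently, then turn the result into an exit code. It does not reproduce ByteSleuth's internals; `count_hidden`, `scan_tree`, and `exit_code_for` are hypothetical names, and the detection rule is the same simplified `Cc`/`Cf` check assumed for illustration.

```python
import sys
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_hidden(path):
    """Count control/format characters in one file (tab, CR, LF excluded)."""
    # errors="replace" keeps the sketch from crashing on non-UTF-8 input
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return sum(
        1 for ch in text
        if ch not in "\n\r\t" and unicodedata.category(ch) in ("Cc", "Cf")
    )

def scan_tree(root):
    """Scan every file under root in a thread pool; return {path: count} for hits."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    with ThreadPoolExecutor() as pool:
        counts = list(pool.map(count_hidden, files))
    return {str(p): n for p, n in zip(files, counts) if n}

def exit_code_for(root):
    """Return 1 if any file is flagged, else 0, reporting hits on stderr."""
    flagged = scan_tree(root)
    for path, n in sorted(flagged.items()):
        print(f"{path}: {n} suspicious character(s)", file=sys.stderr)
    return 1 if flagged else 0
```

A CI step or hook would end with `sys.exit(exit_code_for("src/"))`, since a non-zero exit status is what makes pre-commit or a pipeline step fail.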

🧑‍💻 VSCode Extension

  • Scan the current file for hidden/suspicious characters
  • See results directly in VSCode
  • Easy to install and use (see vscode-extension/README.md)

🧠 Why Use ByteSleuth?

Some characters are invisible but dangerous—causing confusion in source code, configs, or documents. Common attack vectors include:

  • Zero-width spaces for code obfuscation
  • Bidirectional override characters
  • Hidden ASCII control codes
  • Formatting trickery affecting debugging & diffs

ByteSleuth gives you a detective's magnifying glass to expose them all. 🔍
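
A few lines of plain Python show why these characters matter. Nothing here depends on ByteSleuth; the identifier and file name are made-up examples.

```python
visible = "is_admin"
spoofed = "is_\u200badmin"  # zero-width space hidden after the underscore

# The two strings usually render identically, yet they are different values
print(visible == spoofed)           # False
print(len(visible), len(spoofed))   # 8 9
print(ascii(spoofed))               # 'is_\u200badmin'

# A right-to-left override can make a file name look like it ends in ".jpg"
# in bidi-aware renderers, while the real suffix is still ".exe"
filename = "photo\u202Egpj.exe"
print(filename.endswith(".exe"))    # True
```

Printing with `ascii()` or `repr()` is a quick manual check, but it does not scale to a whole repository, which is where a scanner earns its keep.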

Comparison with other tools

Tool Unicode ASCII Control Sanitization JSON Report CLI/Automation VSCode Integration
ByteSleuth ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
grep/sed ✔️ ✔️
ad-hoc scripts ✔️ ✔️

  • ByteSleuth covers Unicode and ASCII control characters, sanitizes input, generates JSON reports, and integrates with automation and VSCode.
  • grep/sed are great for simple ASCII, but do not cover Unicode or sanitization.
  • Ad-hoc scripts are fragile and hard to maintain.

🚀 Roadmap

  • Expand sanitization methods
  • Improve CLI interactivity
  • Output JSON reports
  • VSCode Extension
  • HTML reports
  • Support for more file formats (zip, PDF, etc.)
  • Public changelog/roadmap

📄 License

MIT — Feel free to sleuth away!

Download files

Source Distribution

bytesleuth-1.0.3.tar.gz (491.1 kB view details)

Uploaded Source

Built Distribution

bytesleuth-1.0.3-py3-none-any.whl (490.1 kB view details)

Uploaded Python 3

File details

Details for the file bytesleuth-1.0.3.tar.gz.

File metadata

  • Download URL: bytesleuth-1.0.3.tar.gz
  • Upload date:
  • Size: 491.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for bytesleuth-1.0.3.tar.gz
Algorithm Hash digest
SHA256 898982c05e4aa08da362ef5e24d66478e67434fdd674436e6c773b124eb5a3ff
MD5 c68bdae8eb16ffc18f296239e9fe0f44
BLAKE2b-256 e36fe689e2a6f3decdfb8932243c6d44239f877d1fccb8f5f9049153ad1cdaaf

File details

Details for the file bytesleuth-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: bytesleuth-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 490.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for bytesleuth-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 11f74892ebd7faad3ce7db69a8deed975483a783cbecb76639f62c82bde35012
MD5 d52dcfbece1f9eed0235784984f70353
BLAKE2b-256 70e34389f257558b38803d0d096012a9e9ab573f7b8d77d44e4c913fb7b64ad9
