whatenc

Text encoding type classifier.

whatenc is a command-line tool that uses a gradient-boosted tree classifier to detect the encoding of a given string or file.

The model is trained on text samples from the English, Greek, Russian, Hebrew, and Arabic Wikipedia corpora, chosen to represent a diverse set of writing systems (Latin, Greek, Cyrillic, Hebrew, and Arabic scripts). Each line is encoded using multiple encoding schemes to generate labeled examples.
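
The labeling step described above can be sketched with the Python standard library. This is an illustration only, not the project's actual training code: the `ENCODERS` map, the `gzip64` helper, and `labeled_examples` are hypothetical names, the labels cover only a subset of the supported encodings, and it assumes that `gzip64` means gzip compression followed by Base64.

```python
import base64
import gzip
import hashlib
from urllib.parse import quote


def gzip64(data: bytes) -> str:
    # Assumption: "gzip64" means gzip-compress, then Base64-encode.
    # mtime=0 keeps the gzip header (and so the output) deterministic.
    return base64.b64encode(gzip.compress(data, mtime=0)).decode("ascii")


# Hypothetical label -> encoder map; labels mirror the names whatenc reports.
ENCODERS = {
    "plain": lambda b: b.decode("utf-8"),
    "base32": lambda b: base64.b32encode(b).decode("ascii"),
    "base64": lambda b: base64.b64encode(b).decode("ascii"),
    "base85": lambda b: base64.b85encode(b).decode("ascii"),
    "hex": lambda b: b.hex(),
    "url": lambda b: quote(b, safe=""),
    "gzip64": gzip64,
    "md5": lambda b: hashlib.md5(b).hexdigest(),
    "sha256": lambda b: hashlib.sha256(b).hexdigest(),
}


def labeled_examples(line: str) -> list[tuple[str, str]]:
    """Encode one corpus line with every scheme; return (text, label) pairs."""
    raw = line.encode("utf-8")
    return [(encode(raw), label) for label, encode in ENCODERS.items()]


if __name__ == "__main__":
    for text, label in labeled_examples("hi"):
        print(f"{label:8} {text}")
```

Each corpus line yields one training row per scheme, so every class sees the same underlying text and differs only in how that text is encoded.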

How It Works

whatenc applies a feature-based approach to characterize text, then feeds these features into a gradient-boosted decision tree model to classify the encoding.

Feature Extraction

Each input string is converted into a feature vector describing its statistical properties.

Features include:

| Feature | Description |
| --- | --- |
| Length (n) | Number of characters in the input |
| Alphabetic / Digit Ratios | Ratio of letters and of digits to total length |
| Padding Ratio (=) | Ratio of `=` padding characters to total length; padding is common in Base64 and Base32 encodings |
| Compressibility | Ratio of compressed to raw byte length |
| Shannon Entropy | Measure of randomness in the single-character frequency distribution |
| Bigram Entropy | Measure of randomness in the two-character (bigram) frequency distribution |
| Non-ASCII Ratio | Fraction of characters outside the ASCII range |
| Word Density | Ratio of string length to word count |
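
A minimal sketch of how such a feature vector could be computed is shown here. It is not taken from the whatenc source: the function names, the dictionary keys, the choice of `zlib` as the compressor, and whitespace splitting for the word count are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter


def entropy(items) -> float:
    """Shannon entropy (in bits) of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(text: str) -> dict[str, float]:
    """Hypothetical feature extractor following the table of features."""
    n = len(text)
    if n == 0:
        raise ValueError("input must be non-empty")
    raw = text.encode("utf-8")
    # Overlapping two-character windows; empty for single-character input.
    bigrams = [text[i:i + 2] for i in range(n - 1)]
    # Assumption: words are whitespace-separated tokens.
    word_count = max(len(text.split()), 1)
    return {
        "length": float(n),
        "alpha_ratio": sum(ch.isalpha() for ch in text) / n,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n,
        "padding_ratio": text.count("=") / n,
        # Assumption: zlib stands in for whatever compressor the tool uses.
        "compressibility": len(zlib.compress(raw)) / len(raw),
        "shannon_entropy": entropy(text),
        "bigram_entropy": entropy(bigrams),
        "non_ascii_ratio": sum(ord(ch) > 127 for ch in text) / n,
        "word_density": n / word_count,
    }


if __name__ == "__main__":
    for name, value in extract_features("aGk=").items():
        print(f"{name:16} {value:.3f}")
```

For very short inputs the compressibility ratio exceeds 1 because the compressor's header outweighs any savings, which is one reason length is worth keeping as a separate feature.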

Supported Encodings

whatenc currently recognizes the following formats and transformations:

| Category | Encodings |
| --- | --- |
| Base encodings | base32, base64, base85, hex, url |
| Text ciphers | morse |
| Compression | gzip64 |
| Hash digests | md5, sha1, sha224, sha256, sha384, sha512 |

Installation

You can install whatenc using pipx:

pipx install whatenc

Usage

Pass either a string or a file path:

whatenc hello
whatenc samples.txt

Examples

[+] input: ZW5jb2RlIHRvIGJhc2U2NCBmb3JtYXQ=
   [~] top guess   = base64
      [=] base64   = 0.875
      [=] base32   = 0.101
      [=] gzip64   = 0.019

[+] input: hi
   [~] top guess   = plain
      [=] plain    = 0.772
      [=] base64   = 0.081
      [=] base32   = 0.075

[+] input: bfa99df33b137bc8fb5f5407d7e58da8
   [~] top guess   = md5
      [=] md5      = 1.000
      [=] sha1     = 0.000
      [=] url      = 0.000
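
The example inputs above can be sanity-checked independently with the standard library. The sketch below is not part of whatenc; `decode_base64`, `looks_like_hex_digest`, and `DIGEST_HEX_LENGTHS` are hypothetical helpers. It decodes the Base64 sample and shows that the third input has the length and alphabet of an MD5 hex digest, which is consistent with the guess but does not prove the string is a hash.

```python
import base64
import hashlib
import string

# Hex digest length for each supported hash, derived from hashlib itself.
DIGEST_HEX_LENGTHS = {
    name: hashlib.new(name).digest_size * 2
    for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")
}


def decode_base64(text: str) -> str:
    """Strictly decode Base64 text and interpret the bytes as UTF-8."""
    return base64.b64decode(text, validate=True).decode("utf-8")


def looks_like_hex_digest(text: str) -> list[str]:
    """Return the hash names whose hex digest length matches the input."""
    if not text or any(ch not in string.hexdigits for ch in text):
        return []
    return [name for name, size in DIGEST_HEX_LENGTHS.items() if size == len(text)]


if __name__ == "__main__":
    print(decode_base64("ZW5jb2RlIHRvIGJhc2U2NCBmb3JtYXQ="))  # encode to base64 format
    print(looks_like_hex_digest("bfa99df33b137bc8fb5f5407d7e58da8"))  # ['md5']
```

The Base64 sample is also 32 characters long, so length alone cannot separate it from an MD5 digest; the uppercase letters and the trailing `=` are what rule out hex.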
