Text encoding type classifier
whatenc is a command-line tool that uses a gradient-boosted tree classifier to detect the encoding of a given string or file.
The model is trained on text samples from the English, Greek, Russian, Hebrew, and Arabic Wikipedia corpora, chosen to represent a diverse set of writing systems (Latin, Greek, Cyrillic, Hebrew, and Arabic scripts). Each line is encoded using multiple encoding schemes to generate labeled examples.
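
The labeling step described above can be sketched as follows. This is a minimal illustration, not whatenc's actual training code: the `ENCODERS` table and `make_examples` helper are hypothetical names, only a subset of the supported labels is shown, and `gzip64` is assumed here to mean gzip compression followed by Base64.

```python
import base64
import gzip
import hashlib
import urllib.parse

# Hypothetical encoder table: label -> function from UTF-8 bytes to encoded text.
# "gzip64" is assumed to mean gzip followed by Base64.
ENCODERS = {
    "plain": lambda b: b.decode("utf-8"),
    "base32": lambda b: base64.b32encode(b).decode("ascii"),
    "base64": lambda b: base64.b64encode(b).decode("ascii"),
    "base85": lambda b: base64.b85encode(b).decode("ascii"),
    "hex": lambda b: b.hex(),
    "url": lambda b: urllib.parse.quote(b, safe=""),
    "gzip64": lambda b: base64.b64encode(gzip.compress(b, mtime=0)).decode("ascii"),
    "md5": lambda b: hashlib.md5(b).hexdigest(),
    "sha256": lambda b: hashlib.sha256(b).hexdigest(),
}


def make_examples(line):
    """Encode one corpus line with every scheme, returning (text, label) pairs."""
    raw = line.encode("utf-8")
    return [(encode(raw), label) for label, encode in ENCODERS.items()]


for text, label in make_examples("hello"):
    print(f"{label:>7}  {text}")
```

Each corpus line yields one labeled sample per scheme, so the classes stay balanced by construction.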
How It Works
whatenc applies a feature-based approach to characterize text, then feeds these features into a gradient-boosted decision tree model to classify the encoding.
Feature Extraction
Each input string is converted into a feature vector describing its statistical properties.
Features include:
| Feature | Description |
|---|---|
| Length (n) | Number of characters in the input |
| Alphabetic / Digit Ratios | Ratio of letters and digits to total length |
| Padding Ratio (=) | Common in Base64/32 encodings |
| Compressibility | Ratio of compressed to raw byte length |
| Shannon Entropy | Measure of randomness in single-character frequency distribution |
| Bigram Entropy | Measure of randomness in two-character (bigram) frequency distribution |
| Non-ASCII Ratio | Fraction of characters outside the ASCII range |
| Word Density | Ratio of string length to word count |
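
A rough sketch of how such a feature vector could be computed with the standard library is shown below. The function names, dictionary keys, and the choice of `zlib` for the compressibility measure are assumptions for illustration; the exact definitions used by whatenc may differ.

```python
import math
import zlib
from collections import Counter


def entropy(items):
    """Shannon entropy, in bits, of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def extract_features(text):
    """Map a non-empty string to a dict of the statistics listed in the table."""
    n = len(text)
    if n == 0:
        raise ValueError("empty input")
    raw = text.encode("utf-8")
    bigrams = [text[i:i + 2] for i in range(n - 1)]
    words = text.split()
    return {
        "length": n,
        "alpha_ratio": sum(ch.isalpha() for ch in text) / n,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n,
        "padding_ratio": text.count("=") / n,
        # zlib is an assumed stand-in for whatever compressor the tool uses.
        "compressibility": len(zlib.compress(raw)) / len(raw),
        "shannon_entropy": entropy(text),
        "bigram_entropy": entropy(bigrams),
        "non_ascii_ratio": sum(ord(ch) > 127 for ch in text) / n,
        # Guard against inputs that contain no whitespace-separated words.
        "word_density": n / max(len(words), 1),
    }


print(extract_features("ZW5jb2RlIHRvIGJhc2U2NCBmb3JtYXQ="))
```

For short inputs the compressibility ratio can exceed 1, because the compressor's header overhead outweighs any savings.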
Supported Encodings
whatenc currently recognizes the following formats and transformations:
| Category | Encodings |
|---|---|
| Base encodings | base32, base64, base85, hex, url |
| Text ciphers | morse |
| Compression | gzip64 |
| Hash digests | md5, sha1, sha224, sha256, sha384, sha512 |
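
The hash digests in the last row differ mainly in length, which is one reason a length feature helps separate them. The sketch below is a simple heuristic for illustration only, not the classifier whatenc uses; `HASHES`, `hex_digest_lengths`, and `candidate_hashes` are hypothetical names.

```python
import hashlib

HASHES = ["md5", "sha1", "sha224", "sha256", "sha384", "sha512"]


def hex_digest_lengths():
    """Length in hex characters of each digest (two hex characters per byte)."""
    return {name: hashlib.new(name).digest_size * 2 for name in HASHES}


def candidate_hashes(text):
    """Return the hash names whose hex digest length matches a hex string."""
    is_hex = len(text) > 0 and all(ch in "0123456789abcdefABCDEF" for ch in text)
    if not is_hex:
        return []
    return [name for name, size in hex_digest_lengths().items() if size == len(text)]


print(hex_digest_lengths())
print(candidate_hashes("bfa99df33b137bc8fb5f5407d7e58da8"))
```

A length match only narrows the candidates: any 32-character hex string passes the check for md5, whether or not it is actually a digest.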
Installation
You can install whatenc using pipx:
pipx install whatenc
Usage
whatenc hello
whatenc samples.txt
Examples
[+] input: ZW5jb2RlIHRvIGJhc2U2NCBmb3JtYXQ=
[~] top guess = base64
[=] base64 = 0.875
[=] base32 = 0.101
[=] gzip64 = 0.019
[+] input: hi
[~] top guess = plain
[=] plain = 0.772
[=] base64 = 0.081
[=] base32 = 0.075
[+] input: bfa99df33b137bc8fb5f5407d7e58da8
[~] top guess = md5
[=] md5 = 1.000
[=] sha1 = 0.000
[=] url = 0.000
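
A top guess can be checked by attempting the corresponding decode. The helper below is a small sketch that is not part of whatenc (`try_base64` is a hypothetical name); it confirms that the first example input is valid Base64 while the second is not.

```python
import base64
import binascii


def try_base64(text):
    """Return the decoded UTF-8 text, or None if text is not valid Base64."""
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


print(try_base64("ZW5jb2RlIHRvIGJhc2U2NCBmb3JtYXQ="))  # encode to base64 format
print(try_base64("hi"))  # None
```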
Download files
File details
Details for the file whatenc-0.5.1.tar.gz.
File metadata
- Download URL: whatenc-0.5.1.tar.gz
- Upload date:
- Size: 319.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 914d2c7228be93ee4f724872478518268af7c62f10ffd24b77f70dfde9b1b78e |
| MD5 | b51a1cd82fb456b28045d78461ad5a9c |
| BLAKE2b-256 | 49a26869e399b73084dec8a408e32a8f38ce3ed0013e3c75f6d1c5595428a939 |
File details
Details for the file whatenc-0.5.1-py3-none-any.whl.
File metadata
- Download URL: whatenc-0.5.1-py3-none-any.whl
- Upload date:
- Size: 329.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df9cf9f5d45dea31e19390731a3bf529d1f90158d2f410675fb2a3b1f761e7d7 |
| MD5 | 2e2855a2ebe5f040a45973225a90ebf3 |
| BLAKE2b-256 | 959271bff4271111d84b33a877d09bb40f89720890cbc9a87cb4a52307e35cc6 |