Simple text encoding type classifier
whatenc is a command-line tool that uses a gradient-boosted tree classifier to detect the encoding of a given string or file.
The model is trained on text samples from the Wikipedia corpus, with lines encoded using multiple encoding schemes to generate labeled examples.
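The labeling idea can be sketched with the standard library alone. This is a minimal illustration, not whatenc's actual training pipeline: the `ENCODERS` mapping and `make_labeled_examples` are hypothetical names, and only a subset of the supported encodings is shown.

```python
import base64
import codecs
import hashlib
import urllib.parse

# Hypothetical label -> encoder mapping (illustrative subset only).
ENCODERS = {
    "plain": lambda s: s,
    "base64": lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    "hex": lambda s: s.encode("utf-8").hex(),
    "url": lambda s: urllib.parse.quote(s, safe=""),
    "rot13": lambda s: codecs.encode(s, "rot_13"),
    "sha256": lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest(),
}


def make_labeled_examples(lines):
    """Return (encoded_text, label) pairs for every non-blank line and encoder."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for label, encode in ENCODERS.items():
            examples.append((encode(line), label))
    return examples


if __name__ == "__main__":
    for text, label in make_labeled_examples(["hello world"]):
        print(f"{label:>6}  {text}")
```

Each source line yields one training example per encoding, so the classes stay balanced by construction.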
How It Works
whatenc applies a feature-based approach to characterize text, then feeds these features into a gradient-boosted decision tree model to classify the encoding.
Feature Extraction
Each input string is converted into a feature vector describing its statistical properties.
Features include:
| Feature | Description |
|---|---|
| Length (n) | Number of characters in the input |
| n % 4 | Useful for identifying base-N encodings |
| Printable Ratio | Fraction of characters in string.printable |
| Alphabetic / Digit Ratios | Ratio of letters and digits to total length |
| Padding Ratio (=) | Common in Base64/32 encodings |
| Compressibility | Ratio of compressed to raw byte length |
| Shannon Entropy | Measure of randomness in character distribution |
| English Letter Correlation | Correlation between letter frequencies and English letter frequency distribution |
| Stopword Ratio | Fraction of English stopwords |
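Several of the features above can be approximated in a few lines of Python. The sketch below is an assumption about how such features might be computed, not whatenc's own code: `extract_features` is a hypothetical helper, zlib stands in for whichever compressor the tool uses, and the English-correlation and stopword features are left out.

```python
import math
import string
import zlib
from collections import Counter


def extract_features(text):
    """Compute a handful of the statistical features listed in the table."""
    n = len(text)
    if n == 0:
        raise ValueError("input must be non-empty")
    raw = text.encode("utf-8")
    counts = Counter(text)
    # Shannon entropy in bits per character.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "length": n,
        "length_mod_4": n % 4,
        "printable_ratio": sum(ch in string.printable for ch in text) / n,
        "alpha_ratio": sum(ch.isalpha() for ch in text) / n,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n,
        "padding_ratio": text.count("=") / n,
        # Assumption: zlib as the compressor; compressed size over raw size.
        "compressibility": len(zlib.compress(raw)) / len(raw),
        "entropy": entropy,
    }


if __name__ == "__main__":
    for name, value in extract_features("aGVsbG8gd29ybGQ=").items():
        print(f"{name}: {value}")
```

For short inputs the compressibility ratio can exceed 1 because of the compressor's fixed overhead, which is itself a useful signal of input length.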
Supported Encodings
whatenc currently recognizes the following formats and transformations:
| Category | Encodings |
|---|---|
| Base encodings | base32, base64, base85, hex, url |
| Text ciphers | rot13, rot47, morse |
| Compression | gzip64 |
| Hash digests | md5, sha1, sha224, sha256, sha384, sha512 |
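Two of the less common labels can be illustrated as follows. ROT47 is the standard rotation over the 94 printable ASCII characters from `!` to `~`. The reading of `gzip64` as "gzip, then Base64" is an assumption based on the name and is not confirmed by the project; both function names are hypothetical.

```python
import base64
import gzip


def rot47(text):
    """Rotate each printable ASCII character ('!'..'~') by 47 positions."""
    out = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            out.append(chr(33 + (code - 33 + 47) % 94))
        else:
            out.append(ch)  # spaces and non-ASCII pass through unchanged
    return "".join(out)


def gzip64(text):
    """Assumed meaning of 'gzip64': gzip-compress, then Base64-encode."""
    return base64.b64encode(gzip.compress(text.encode("utf-8"))).decode("ascii")


if __name__ == "__main__":
    print(rot47("hello"))
    print(gzip64("hello world"))
```

Because 47 is half of 94, applying `rot47` twice returns the original text.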
Installation
You can install whatenc using pipx:
pipx install whatenc
Usage
Pass either a string or a file path as the argument:
whatenc aGVsbG8gd29ybGQ=
whatenc samples.txt
Examples
[+] input: aGVsbG8gd29ybGQ=
[=] top guess = base64
[~] base64 = 0.455
[~] plain = 0.312
[~] url = 0.126
[+] input: hello
[=] top guess = plain
[~] plain = 0.552
[~] url = 0.246
[~] rot13 = 0.192
[+] input: uryyb jbeyq
[=] top guess = rot13
[~] rot13 = 0.555
[~] plain = 0.440
[~] url = 0.004