The Real First Universal Charset Detector. No C++ Bindings, Using Voodoo and Magical Artifacts.
Welcome to Charset Detection for Humans 👋
The Real First Universal Charset Detector
A library that helps you read text from an unknown charset encoding.
Motivated by chardet, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.
>>>>> ❤️ Try Me Online Now, Then Adopt Me ❤️ <<<<<
This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.
| Feature | Chardet | Charset Normalizer | cChardet |
|---|---|---|---|
| Fast | ❌ | ❌ | ✅ |
| Universal** | ❌ | ✅ | ❌ |
| Reliable without distinguishable standards | ❌ | ✅ | ✅ |
| Reliable with distinguishable standards | ✅ | ✅ | ✅ |
| Free & Open | ✅ | ✅ | ✅ |
| License | LGPL-2.1 | MIT | MPL-1.1 |
| Native Python | ✅ | ✅ | ❌ |
| Detect spoken language | ❌ | ✅ | N/A |
| Supported Encoding | 30 | 🎉 90 | 40 |
| Package | Accuracy | Mean per file (ns) | File per sec (est) |
|---|---|---|---|
| chardet | 93.5 % | 126 081 168 ns | 7.931 file/sec |
| cchardet | 97.0 % | 1 668 145 ns | 599.468 file/sec |
| charset-normalizer | 97.25 % | 209 503 253 ns | 4.773 file/sec |
** : Chardet and cChardet clearly use encoding-specific code, even if it covers most of the encodings in common use.
Your support
Please ⭐ this repository if this project helped you!
✨ Installation
Using PyPI

```sh
pip install charset_normalizer
```
🚀 Basic Usage
CLI
This package comes with a CLI.

```
usage: normalizer [-h] [--verbose] [--normalize] [--replace] [--force]
                  file [file ...]
```

```
normalizer ./data/sample.1.fr.srt
+----------------------+----------+----------+------------------------------------+-------+-----------+
| Filename             | Encoding | Language | Alphabets                          | Chaos | Coherence |
+----------------------+----------+----------+------------------------------------+-------+-----------+
| data/sample.1.fr.srt | cp1252   | French   | Basic Latin and Latin-1 Supplement | 0.0 % | 84.924 %  |
+----------------------+----------+----------+------------------------------------+-------+-----------+
```
Python
Just print out the normalized text:

```python
from charset_normalizer import CharsetNormalizerMatches as CnM

print(CnM.from_path('./my_subtitle.srt').best().first())
```
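You can also work from bytes already in memory. A minimal sketch, assuming the same `CharsetNormalizerMatches` API exposes a matching `from_bytes` entry point:

```python
from charset_normalizer import CharsetNormalizerMatches as CnM

# 'cp1252' here is only an example of a source encoding unknown to the caller.
raw = 'Bonjour, où êtes-vous ?'.encode('cp1252')

# best() keeps the lowest-mess matches, first() picks the most likely one.
result = CnM.from_bytes(raw).best().first()
print(result)  # the most plausible, readable decoding of the payload
```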
Normalize any text file:

```python
from charset_normalizer import CharsetNormalizerMatches as CnM

try:
    CnM.normalize('./my_subtitle.srt')  # should write my_subtitle-***.srt to disk
except IOError as e:
    print('Sadly, we are unable to perform charset normalization.', str(e))
```
Upgrade your code without effort:

```python
from charset_normalizer import detect
```

The above import will behave the same as chardet's detect function.
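Since it mirrors chardet, existing call sites should keep working. A minimal sketch, assuming chardet's documented result keys (`encoding`, `confidence`, `language`):

```python
from charset_normalizer import detect

# Feed it raw bytes, exactly as you would with chardet.detect().
with open('./my_subtitle.srt', 'rb') as fp:
    result = detect(fp.read())

# Same shape as chardet's output,
# e.g. {'encoding': 'cp1252', 'confidence': ..., 'language': ...}
print(result['encoding'], result['confidence'])
```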
See the docs for advanced usage: readthedocs.io
😇 Why
When I started using Chardet, I noticed that it is no longer reliable, and that it is unmaintained and most likely will stay that way.
I don't care about the originating charset encoding, because two different tables can produce two identical files. What I want is to get readable text, the best I can.
In a way, I'm brute-forcing text decoding. How cool is that? 😎
Don't confuse the ftfy package with charset-normalizer or chardet. ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer's is to convert a raw file in an unknown encoding to Unicode.
🍰 How
- Discard all charset encoding tables that could not fit the binary content.
- Measure the chaos, or mess, once the content is opened with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- Finally, if too many matches remain, measure coherence (see the sketch below).
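The list above maps to a simple loop. Here is a minimal, self-contained sketch of that pipeline, not the library's actual internals; `measure_chaos` and `measure_coherence` are hypothetical stand-ins for the project's scoring:

```python
CANDIDATES = ['utf_8', 'cp1252', 'latin_1', 'utf_16']

def measure_chaos(text: str) -> float:
    # Hypothetical stand-in: share of control characters that rarely
    # appear in human-written text.
    suspicious = sum(1 for ch in text if ord(ch) < 32 and ch not in '\r\n\t')
    return suspicious / max(len(text), 1)

def measure_coherence(text: str) -> float:
    # Hypothetical stand-in: share of alphabetic characters.
    return sum(ch.isalpha() for ch in text) / max(len(text), 1)

def naive_detect(raw: bytes):
    matches = []
    for encoding in CANDIDATES:
        try:
            text = raw.decode(encoding)   # 1. discard tables that cannot fit
        except (UnicodeDecodeError, LookupError):
            continue
        matches.append((measure_chaos(text), encoding, text))  # 2. measure the mess
    if not matches:
        return None
    lowest = min(chaos for chaos, _, _ in matches)
    finalists = [m for m in matches if m[0] == lowest]            # 3. keep lowest mess
    return max(finalists, key=lambda m: measure_coherence(m[2]))  # 4. tie-break on coherence
```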
Wait a minute, what are chaos/mess and coherence according to YOU?
Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed them, then established some ground rules about what is obviously a mess. I know that my interpretation of what is chaotic is very subjective; feel free to contribute to improve or rewrite it.
Coherence: For each language on Earth, we have computed ranked letter-appearance occurrences (the best we can). That intel is worth something here: I use those records against the decoded text to check whether I can detect intelligent design.
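To make the coherence idea concrete, here is a toy version of a letter-frequency check. `FRENCH_RANK` below is a rough, hypothetical ordering for illustration, not the project's actual data:

```python
from collections import Counter

# Rough, hypothetical ranking of French letters by frequency (illustrative only).
FRENCH_RANK = 'esaitnrulodcmpévqfbghjàxèyêz'

def coherence_ratio(text: str, reference_rank: str = FRENCH_RANK, top: int = 10) -> float:
    letters = [ch for ch in text.lower() if ch.isalpha()]
    observed = [ch for ch, _ in Counter(letters).most_common(top)]
    expected = set(reference_rank[:top])
    # How many of the text's most common letters are also among the
    # language's most common letters?
    return sum(ch in expected for ch in observed) / max(len(observed), 1)

print(coherence_ratio('Ceci est un petit texte en français tout à fait ordinaire.'))
```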
⚡ Known limitations
- Not intended to work on text content that is not in a (human) spoken language, e.g. encrypted text.
- Language detection is unreliable when text contains two or more languages sharing identical letters.
- Not well tested with tiny content.
👤 Contributing
Contributions, issues and feature requests are very much welcome.
Feel free to check the issues page if you want to contribute.
📝 License
Copyright © 2019 Ahmed TAHRI @Ousret.
This project is MIT licensed.
Letter-appearance frequencies used in this project © 2012 Denny Vrandečić