Detect flaws and encoding errors in Unicode text
Project description
flawunicode
Detect unreadable Unicode text
Ever encountered text like this while crawling the internet or inspecting your raw corpus?
srtytyrtyrty
Á¶À̽ÃƼ, ¡®3on3 ÇÁ¸®½ºÅ¸ÀÏ¡¯ 2Á¾ÀÇ ¿¡µð¼Ç ¹øµé Ãâ½Ã
��>+ٽT}$@�������Э����ٗ_���=���e��
flawunicode aims to pick these out for you. It ranks each piece of Unicode text and outputs a score from -1 to 1 indicating the "completeness" of the text. If the score is below 0.4, the text is likely not readable by a human.
Usage
import flawunicode
text = "fdsfdxvdhjkf"
flawunicode.detect(text)
>> 0.2727272727272727
flawunicode.detect("Hello World!")
>> 0.6439393939393939
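A common workflow is to filter a corpus against the 0.4 threshold mentioned above. A minimal sketch of that pattern (in practice you would pass `flawunicode.detect` as the scorer; the trivial ASCII-ratio stand-in below is not flawunicode's statistic and is used only so the snippet runs on its own):

```python
def keep_readable(texts, score_fn, threshold=0.4):
    """Keep only the texts whose score meets the threshold."""
    return [t for t in texts if score_fn(t) >= threshold]

def ascii_ratio(text):
    """Stand-in scorer: fraction of ASCII letters and spaces.
    (Illustrative only; replace with flawunicode.detect.)"""
    if not text:
        return 0.0
    ok = sum(c.isascii() and (c.isalpha() or c.isspace()) for c in text)
    return ok / len(text)

corpus = ["Hello World!", "\ufffd\ufffd>+\ufffdT}$@"]
print(keep_readable(corpus, ascii_ratio))  # the garbled string is dropped
```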
Note
The underlying statistics come from a news corpus in the Currents API database, so social-network-style text may be ranked with a low score. In that case, compute bi-gram character frequencies from your own corpus and it should work fine.
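flawunicode's corpus format for custom statistics is not documented here, but computing your own character bi-gram frequencies is straightforward. A minimal sketch, assuming your corpus is a plain list of strings:

```python
from collections import Counter

def bigram_frequencies(corpus):
    """Count character bi-grams across a corpus and
    return their relative frequencies."""
    counts = Counter()
    for text in corpus:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

freqs = bigram_frequencies(["hello world", "hello there"])
print(freqs["he"])  # "he" occurs 3 times out of 20 bi-grams -> 0.15
```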
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file flawunicode-0.1.3.tar.gz
File metadata
- Download URL: flawunicode-0.1.3.tar.gz
- Upload date:
- Size: 16.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2f940388153a80ecdb926ffeaf26c419425b7423a65279f3c2509c5ad7d1803b
MD5 | e395d00d43c857180970ca95899b6368
BLAKE2b-256 | 89dc8b10e4f9340a25ae24133b4f2ecd628a49b7f3092761682eb020a102c5d2