Detect flaws or encoding errors in Unicode text

Project description

flawunicode

Detect unreadable unicode text

Ever come across text like this when crawling the internet or inspecting your raw corpus?

srtytyrtyrty
Á¶À̽ÃƼ, ¡®3on3 ÇÁ¸®½ºÅ¸ÀÏ¡¯ 2Á¾ÀÇ ¿¡µð¼Ç ¹øµé Ãâ½Ã
��>+ٽT}$@�������Э����ٗ_���=���e��

flawunicode aims to pick these out for you. It scores each Unicode text and outputs a value from -1 to 1 that indicates the "completeness" of the text. If the score is lower than 0.4, the text is likely not readable by a human.

Usage

import flawunicode
text = "fdsfdxvdhjkf"
flawunicode.detect(text)
# >> 0.2727272727272727
flawunicode.detect("Hello World!")
# >> 0.6439393939393939
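
The 0.4 threshold mentioned above can be applied to a whole batch of texts. The following is a minimal sketch, not part of flawunicode itself: `split_by_score` and its `scorer` parameter are hypothetical names introduced here, and the scorer is passed in so the snippet runs on its own. In practice you would presumably pass `flawunicode.detect` as the scorer.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_score(
    texts: Iterable[str],
    scorer: Callable[[str], float],
    threshold: float = 0.4,
) -> Tuple[List[str], List[str]]:
    """Split texts into (readable, suspect) lists using a score threshold.

    `scorer` is any function that maps a string to a score; the 0.4 default
    follows the threshold suggested in the project description.
    """
    readable: List[str] = []
    suspect: List[str] = []
    for text in texts:
        # Scores below the threshold are treated as likely unreadable.
        if scorer(text) >= threshold:
            readable.append(text)
        else:
            suspect.append(text)
    return readable, suspect


# Hypothetical usage with the real package (assumes flawunicode is installed):
#   import flawunicode
#   readable, suspect = split_by_score(lines, flawunicode.detect)
```

Keeping the threshold as a parameter makes it easy to tune for a corpus where 0.4 proves too strict or too lenient.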

Note

The underlying statistics come from a news corpus in the Currents API database, so social-network-style text may be ranked with a low score. If that affects you, calculate the frequently used character bi-grams for your own corpus and it should be fine.
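
As a rough illustration of what "calculating your own bi-gram characters" could look like, the sketch below counts adjacent character pairs in a corpus and converts the counts to relative frequencies. The function name `bigram_frequencies` is hypothetical, and this page does not document how flawunicode stores or loads its statistics, so treat the output format as an assumption rather than something the library accepts directly.

```python
from collections import Counter
from typing import Dict, Iterable


def bigram_frequencies(corpus: Iterable[str]) -> Dict[str, float]:
    """Return the relative frequency of each character bi-gram in the corpus."""
    counts: Counter = Counter()
    for text in corpus:
        # Slide a window of two characters across the text.
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    if total == 0:
        # No text had two or more characters, so there is nothing to report.
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


# Example: build a frequency table from a couple of sample lines.
frequencies = bigram_frequencies(["lol omg", "omg lol"])
most_common = sorted(frequencies, key=frequencies.get, reverse=True)[:3]
```

Bi-grams are counted within each text only, so pairs never span two separate documents.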

Download files

Source Distribution

flawunicode-0.1.3.tar.gz (16.5 MB view details)

Uploaded Source

File details

Details for the file flawunicode-0.1.3.tar.gz.

File metadata

  • Download URL: flawunicode-0.1.3.tar.gz
  • Size: 16.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for flawunicode-0.1.3.tar.gz
Algorithm Hash digest
SHA256 2f940388153a80ecdb926ffeaf26c419425b7423a65279f3c2509c5ad7d1803b
MD5 e395d00d43c857180970ca95899b6368
BLAKE2b-256 89dc8b10e4f9340a25ae24133b4f2ecd628a49b7f3092761682eb020a102c5d2
