Get detailed unicode information about characters and text

Unilyze: Get detailed unicode information

Unichar Class

This module helps you get very detailed Unicode information about single characters. It is simple to use and presents data from unicode.org in an easily readable and usable form.

First we import the Unilyze lib:

>> from unilyze import Unichar
>> from pprint import pprint

Now we can create a Unichar instance and use it:

>> uc = Unichar()
>> info = uc.ucd_info("a")
>> pprint(info)

{'ASCII_Hex_Digit': False,
 'Age': 'V1_1',
 'Alphabetic': True,
 ...}

This returns a large dict of attributes for the character. There are more than 100 attributes for each character! See FULL OUTPUT.

You can also get the raw data like this:

>> raw_info = uc.ucd_info_short("J")
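
To get a feel for what a per-character property lookup looks like, here is a minimal sketch that uses only Python's standard `unicodedata` module. It is not how Unilyze works internally and covers just a handful of properties; the function name `basic_info` and the choice of keys are assumptions made for illustration.

```python
import unicodedata


def basic_info(ch):
    """Return a few Unicode properties of one character (hypothetical helper)."""
    return {
        "Name": unicodedata.name(ch, ""),  # empty string if the character is unnamed
        "General_Category": unicodedata.category(ch),
        "Bidi_Class": unicodedata.bidirectional(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        "Decomposition": unicodedata.decomposition(ch),
    }


info = basic_info("a")
print(info["Name"], info["General_Category"])
```

The standard library exposes far fewer properties than the 100+ attributes mentioned above, which is the gap Unilyze is meant to fill.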

You can also find out which languages use a Unicode character:

>> info = uc.lng_usage("å")
>> pprint(info)

{'main': ['Danish',
          'Finnish',
          'Javanese',
          'Kalaallisut',...
}

Here you get a large dict of languages, grouped under keys such as 'main'. See FULL OUTPUT.

Unistat Class

This class is used to get statistics for strings instead of single characters. It sums up the information for every character in the string.

>> from unilyze.unistat import Unistat
>> from pprint import pprint

>> us = Unistat()
>> us.add_text("This is a small test! 123")

>> unistat = us.unistat()
>> pprint(unistat, compact=True)

{'ASCII_Hex_Digit': {True: {'chars': {'1', '3', 'a', '2', 'e'},
                            'total-count': 6}},
 'Age': {'V1_1': {'chars': {' ', '!', '1', '2', '3', 'T', 'a', 'e', 'h', 'i',
                            'l', 'm', 's', 't'},
                  'total-count': 25}},
 ...}

Again we get a large output, grouped by UCD property. For each property value it lists the characters that have it and a total count. See FULL OUTPUT.
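
The grouping idea can be sketched with the standard library for a single property. The example below groups the same test string by general category and produces the same `chars` / `total-count` shape as the output above. It is only an illustration, not Unilyze's implementation, and the function name `group_by_category` is made up for this sketch.

```python
import unicodedata
from collections import defaultdict


def group_by_category(text):
    """Group characters by general category (hypothetical helper)."""
    groups = defaultdict(lambda: {"chars": set(), "total-count": 0})
    for ch in text:
        entry = groups[unicodedata.category(ch)]
        entry["chars"].add(ch)      # distinct characters with this value
        entry["total-count"] += 1   # every occurrence is counted
    return dict(groups)


stats = group_by_category("This is a small test! 123")
print(stats["Nd"])
```

Note that `total-count` counts occurrences, not distinct characters, so the counts over all categories add up to the length of the string.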

A simple count of each character can be done like this:

>> charstat = us.charstat()
>> print(charstat)

{'T': 1, 'h': 1, 'i': 2, 's': 4, ' ': 5, 'a': 2, 'm': 1, 'l': 2, 't': 2, 'e': 1, '!': 1, '1': 1, '2': 1, '3': 1}
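
For comparison, the same kind of per-character count can be reproduced with `collections.Counter` from the standard library. This is just a sketch to show what the numbers mean; whether `charstat()` is implemented this way is not something this page states.

```python
from collections import Counter

text = "This is a small test! 123"

# Counter keeps keys in first-seen order, and dict() preserves that order.
counts = dict(Counter(text))
print(counts)
```

The counts add up to 25, the length of the test string.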

Final notes

For full usage, look in the examples folder.

All the data is based on the Unicode version 13 definition files from www.unicode.org. You should create only one instance of Unichar or Unistat, because each instance loads 60 MB of data into memory. Besides using a lot of memory, loading also takes some time (a second or so).
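
One way to make sure the data is loaded only once is to hand out a cached instance. The sketch below uses `functools.lru_cache` for that; the helper name `shared_instance` is an assumption for illustration and is not part of Unilyze.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def shared_instance(cls):
    """Build cls() on the first call and return that same object afterwards."""
    return cls()


# Hypothetical usage, assuming unilyze is installed:
#   from unilyze import Unichar
#   uc = shared_instance(Unichar)        # loads the data once
#   same_uc = shared_instance(Unichar)   # reuses the already loaded instance
```

Creating the instance once at module level and importing it where needed works just as well; the cached helper only delays the loading cost until the first use.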

Have fun

/ Alex Skov Jensen
