A Unicode and character set explorer.

charex is a Unicode and character set explorer for understanding issues with character set translation and Unicode normalization.

Why Did I Make This?

I find the ambiguity of text data interesting. In memory it’s all ones and zeros. There is nothing inherent to the data that makes 0x20 mean a space character, but we’ve mostly agreed that it does. That “mostly” part is what’s interesting to me, and it’s where a lot of fun problems lie.
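As a small illustration of that "mostly", here is a sketch that uses only the codecs shipped with Python (it is not charex code). The byte 0x20 decodes to a space under ASCII, but under the EBCDIC code page cp037 it is a control code, and the space lives at a different address:

```python
# The same byte means different things in different character sets.
raw = bytes([0x20])

as_ascii = raw.decode('ascii')     # ASCII: 0x20 is SPACE.
as_ebcdic = raw.decode('cp037')    # EBCDIC (cp037): 0x20 is not a space.

print(repr(as_ascii))
print(f'U+{ord(as_ebcdic):04X}')

# In cp037 the space character is encoded as a different byte.
space_in_ebcdic = ' '.encode('cp037')
print(space_in_ebcdic.hex())
```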

How Do I Use This?

It’s on PyPI, so you can install it with pip, as long as you are using Python 3.11 or higher:

pip install charex

charex has four modes of operation:

  • Direct command line invocation,

  • An interactive shell,

  • A graphical user interface (GUI),

  • An application programming interface (API).

Command Line

To get help for direct invocation from the command line:

$ charex -h

Interactive Shell

To launch the interactive shell:

$ charex

That will bring you to the charex shell:

Welcome to the charex shell.
Press ? for a list of comands.

charex>

From here you can type ? to see the list of available commands:

Welcome to the charex shell.
Press ? for a list of comands.

charex> ?
The following commands are available:

  * cd: Decode the given address in all codecs.
  * ce: Encode the given character in all codecs.
  * cl: List registered character sets.
  * clear: Clear the terminal.
  * ct: Count denormalization results.
  * dm: Build a denormalization map.
  * dn: Perform denormalizations.
  * dt: Display details for a code point.
  * el: List the registered escape schemes.
  * es: Escape a string using the given scheme.
  * fl: List registered normalization forms.
  * nl: Perform normalizations.
  * sh: Run in an interactive shell.
  * up: List the Unicode properties.
  * uv: List the valid values for a Unicode property.

For help on individual commands, use "help {command}".

charex>

Then type help followed by the name of a command to learn what it does:

charex> help dn
usage: charex dn [-h] [-m MAXDEPTH] [-n NUMBER] [-r] [-s SEED] form base

Denormalize a string.

positional arguments:
  form                  The normalization form for the denormalization. Valid
                        options are: casefold, nfc, nfd, nfkc, nfkd.
  base                  The base normalized string.

options:
  -h, --help            show this help message and exit
  -m MAXDEPTH, --maxdepth MAXDEPTH
                        Maximum number of reverse normalizations to use for
                        each character.
  -n NUMBER, --number NUMBER
                        Maximum number of results to return.
  -r, --random          Randomize the denormalization.
  -s SEED, --seed SEED  Seed the randomized denormalization.

charex>
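To give a feel for what "denormalizing" a string means, here is a rough brute-force sketch built on unicodedata from the standard library. It looks for single characters that normalize to a given base character under a given form. The function name reverse_normalizations is made up for this example, and charex's own data and algorithm may work differently:

```python
import unicodedata


def reverse_normalizations(base: str, form: str = 'NFKC') -> list[str]:
    """Find single characters, other than base, that normalize to base."""
    found = []
    for codepoint in range(0x110000):
        # Skip surrogates, which are not valid standalone characters.
        if 0xD800 <= codepoint <= 0xDFFF:
            continue
        char = chr(codepoint)
        if char != base and unicodedata.normalize(form, char) == base:
            found.append(char)
    return found


# Show a few of the characters that collapse to "A" under NFKC.
variants = reverse_normalizations('A')
for char in variants[:5]:
    print(f'U+{ord(char):04X} {unicodedata.name(char, "<unnamed>")}')
```

Scanning every code point is slow but simple. The exact results depend on the Unicode version bundled with the running Python.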

GUI

To launch the charex GUI:

$ charex gui

API

To import charex into a Python script and get a summary of a Unicode character:

>>> import charex
>>>
>>>
>>> value = 'a'
>>> char = charex.Character(value)
>>> print(char.summarize())
a U+0061 (LATIN SMALL LETTER A)
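For comparison, a line in the same shape can be put together with just the standard library. This is only a sketch of the format shown above, not how charex builds it, and summarize_with_stdlib is a hypothetical helper name:

```python
import unicodedata


def summarize_with_stdlib(char: str) -> str:
    """Build a 'char U+XXXX (NAME)' line for a single character."""
    # Fall back to a placeholder for characters without a Unicode name.
    name = unicodedata.name(char, '<no name>')
    return f'{char} U+{ord(char):04X} ({name})'


print(summarize_with_stdlib('a'))
# a U+0061 (LATIN SMALL LETTER A)
```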

What Version of Unicode Does This Support?

Parts of charex rely on unicodedata in the Python Standard Library. This limits charex to the Unicode version supported by the version of Python you are running. There may be a bit of a lag as new Python versions are released, but as of this release charex supports:

  • Python 3.11: Unicode 14.0

  • Python 3.12: Unicode 15.0

  • Python 3.13: Unicode 15.1

Support for Unicode 16.0 should come around the release of Python 3.14.
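If you aren't sure which Unicode version your interpreter provides, unicodedata reports it directly. A minimal check, using only the standard library:

```python
import sys
import unicodedata

# unidata_version is the version of the Unicode database bundled with Python.
python_version = f'{sys.version_info.major}.{sys.version_info.minor}'
unicode_version = unicodedata.unidata_version

print(f'Python {python_version}: Unicode {unicode_version}')
```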

What Is Left To Do?

The following features are planned for v0.2.4 or later releases:

  • Support for Unicode v16.0 for Python 3.14.

  • Emoji combiner.

  • Basic doctests for all public classes and functions.

  • Web API.

  • Registration for character set codecs.

The list of Unicode properties can be found here: Index

The list of Unihan properties is here: tr38

Changes in v0.2.3

The following are the changes in v0.2.3:

  • Move dependency management to poetry.

  • Move package into /src directory.

  • Use tox for regression testing.

  • Support Unicode 15.0 for running under Python 3.12.

    • Add Unicode 15.0 files.

    • Add “kalternatetotalstrokes” property.

    • Move “kcihait” property source.

    • Add “idna2008” property.

  • Support Unicode 15.1 for running under Python 3.13.

    • Add Unicode 15.1 files.

    • Remove seven Unihan properties.

      • kHKSCS

      • kIRGDaiKanwaZiten

      • kKPS0

      • kKPS1

      • kKSC0

      • kKSC1

      • kRSKangXi

    • Add six Unihan properties.

      • kJapanese

      • kMojiJoho

      • kSMSZD2003Index

      • kSMSZD2003Readings

      • kVietnameseNumeric

      • kZhuangNumeric

    • Add “IDS_Unary_Operator” property.

    • Add “ID_Compat_Math_Start” property.

    • Add “ID_Compat_Math_Continue” property.

    • Add “NFKC_Simple_Casefold” property.

    • Check if multiple “kPrimaryNumeric” values are supported (see U+5146 and U+79ED).

  • Use Unicode version supported by Python version.

    • Update the path_map to handle different Unicode versions.

    • Update the prop_map to handle different Unicode versions.

    • Can specify the Unicode version of a FileCache.

    • Create a FileCache for the Unicode version supported by the running Python on launch.

  • Generate denormalizations for each version.

  • Raise an AttributeError rather than a KeyError when an attribute doesn’t exist.
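One practical reason the AttributeError change matters: hasattr and the three-argument form of getattr only suppress AttributeError, so a KeyError would leak out of them. The sketch below shows this with a hypothetical PropertyBag class, which is not part of charex:

```python
class PropertyBag:
    """Hypothetical stand-in for an object that looks up attributes in a dict."""

    def __init__(self, **props):
        self._props = props

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        try:
            return self._props[name]
        except KeyError:
            # Translate to the exception attribute access is expected to raise.
            raise AttributeError(name) from None


bag = PropertyBag(age='1.1')
print(bag.age)
print(hasattr(bag, 'missing'))
print(getattr(bag, 'missing', 'n/a'))
```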

How Do I Run the Tests?

charex uses the pytest package for unit testing. It also comes with a makefile that automates testing. To run the tests:

  • Install poetry: pip install poetry

  • Install the development dependencies: poetry install

  • To run just the unit tests: make test

  • To run the full test suite: make pre

Common Problems

ModuleNotFoundError: No module named ‘_tkinter’

If you get the above error when running charex or its tests, it’s likely your Python install doesn’t have tkinter linked. How you fix it depends upon your Python install. If you are using Python 3.13 installed with homebrew on macOS, you can probably fix it with:

brew install python-tk@3.13
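To check whether your interpreter has tkinter linked, before or after applying a fix, you can ask Python whether it can find the _tkinter extension module. This is a small standard-library sketch, and tkinter_is_linked is a hypothetical helper name:

```python
import importlib.util


def tkinter_is_linked() -> bool:
    """Report whether the _tkinter extension module can be found."""
    # find_spec returns None when the module is not available.
    return importlib.util.find_spec('_tkinter') is not None


if tkinter_is_linked():
    print('tkinter is available.')
else:
    print('tkinter is missing; expect the ModuleNotFoundError described above.')
```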
