Python module that identifies Chinese text as Simplified or Traditional.

Project description

About

Hanzi Identifier is a simple Python module that identifies a string of text as having Simplified or Traditional characters.

>>> import hanzidentifier
>>> hanzidentifier.has_chinese('Hello my name is John.')
False
>>> hanzidentifier.is_simplified('John说:你好!')
True
>>> hanzidentifier.is_traditional('John說:你好!')
True
>>> hanzidentifier.has_chinese('Country in Simplified: 国家. Country in Traditional: 國家.')
True

Here it is without the helper functions:

>>> hanzidentifier.identify('Hello my name is Thomas.') is hanzidentifier.UNKNOWN
True
>>> hanzidentifier.identify('Thomas 说:你好!') is hanzidentifier.SIMPLIFIED
True
>>> hanzidentifier.identify('Thomas 說:你好!') is hanzidentifier.TRADITIONAL
True
>>> hanzidentifier.identify('你好!') is hanzidentifier.BOTH
True
>>> hanzidentifier.identify('Country in Simplified: 国家. Country in Traditional: 國家.') is hanzidentifier.MIXED
True

hanzidentifier.identify has five possible return values:

  • hanzidentifier.UNKNOWN: there are no recognized Chinese characters in the string.
  • hanzidentifier.BOTH: the string is compatible with both Simplified and Traditional character systems.
  • hanzidentifier.TRADITIONAL: the string consists of Traditional characters.
  • hanzidentifier.SIMPLIFIED: the string consists of Simplified characters.
  • hanzidentifier.MIXED: the string contains both characters recognized solely as Traditional and characters recognized solely as Simplified.

Characters that aren’t found in CC-CEDICT are ignored when determining a string’s identity. Hanzi Identifier uses the CC-CEDICT data provided by Zhon to identify Chinese characters.

Because the Traditional and Simplified Chinese character systems overlap, a string containing Simplified characters could identify as hanzidentifier.SIMPLIFIED or hanzidentifier.BOTH, depending on whether all of its characters are also Traditional characters.
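
The rules above can be illustrated with a small, self-contained sketch. This is not Hanzi Identifier's actual implementation: the function toy_identify, the two hand-picked character sets, and the integer constants are all assumptions made for illustration, whereas the real module draws its character data from CC-CEDICT via Zhon. It only shows how the five return values can arise from two overlapping sets.

```python
# Toy character sets, hand-picked for illustration only (hypothetical data).
# 你, 好 and 家 appear in both sets because they are shared by both systems.
TRADITIONAL_CHARS = set("國家說你好")
SIMPLIFIED_CHARS = set("国家说你好")

# Hypothetical constants; the real module's constant values may differ.
UNKNOWN, BOTH, TRADITIONAL, SIMPLIFIED, MIXED = range(5)


def toy_identify(text):
    """Classify text using the toy sets above (illustrative sketch only)."""
    # Characters outside both sets are ignored, mirroring how characters
    # not found in CC-CEDICT are ignored by the real module.
    known = {c for c in text if c in TRADITIONAL_CHARS or c in SIMPLIFIED_CHARS}
    if not known:
        return UNKNOWN

    traditional_only = known - SIMPLIFIED_CHARS
    simplified_only = known - TRADITIONAL_CHARS

    if traditional_only and simplified_only:
        return MIXED
    if traditional_only:
        return TRADITIONAL
    if simplified_only:
        return SIMPLIFIED
    # Every recognized character belongs to both systems.
    return BOTH


if __name__ == "__main__":
    for sample in ["Hello", "你好!", "Thomas 说:你好!", "Thomas 說:你好!", "国家 國家"]:
        print(repr(sample), toy_identify(sample))
```

In this sketch, '你好!' is classified as BOTH because neither character is exclusive to one system, while adding 说 tips the result to SIMPLIFIED.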

Hanzi Identifier’s functions accept and return unicode.

Install

Hanzi Identifier runs on Python 2.7 and 3. It requires Zhon to run.

$ pip install hanzidentifier

Bugs/Feature Requests

Hanzi Identifier uses its GitHub Issues page to track bugs, feature requests, and support questions.

Change Log

v1.0 (2014-04-12)

Version 1.0 merges some changes from Dragon Mapper. It is not backwards compatible with the previous versions of Hanzi Identifier (e.g. some of the constants are named differently).

v0.1 (2013-04-24)

  • Initial release.

License

Hanzi Identifier is released under the OSI-approved MIT License. See the file LICENSE.txt for more information.

