Classify text as Chinese Simplified, Chinese Traditional, or Japanese using a statistical model.

CJClassifier

A focused, pure-Python library for distinguishing between Japanese, Chinese Simplified, and Chinese Traditional text using a statistical model of ideograph frequencies built from the Japanese- and Chinese-language Wikipedia corpora.

No external dependencies. The bundled model is ~7 MB on disk and ~29 MB in memory (loaded once and cached).

Install

pip install cjclassifier

Usage

from cjclassifier import CJClassifier, CJLanguage

cjc = CJClassifier.load()

cjc.detect("今天天气很好,我们去公园散步")   # => CJLanguage.CHINESE_SIMPLIFIED
cjc.detect("今天天氣很好,我們去公園散步")   # => CJLanguage.CHINESE_TRADITIONAL
cjc.detect("事務所")                         # => CJLanguage.JAPANESE  (all Kanji)
cjc.detect("ひらがなとカタカナと")           # => CJLanguage.JAPANESE  (all kana)
cjc.detect("hello")                          # => CJLanguage.UNKNOWN

Detailed results

from cjclassifier.classifier import Results

results = Results()
cjc.detect("今天天气很好", results)

results.result             # CJLanguage.CHINESE_SIMPLIFIED
results.gap                # confidence gap: 0 = dead heat, 1 = no contest
results.total_scores       # per-language log-probability totals
results.to_short_string()  # e.g. "zh-hans:1.00,zh-hant:0.97,ja:0.85"
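
One way to use the confidence gap is to treat low-gap detections as undecided. The sketch below is illustrative only: the helper name `accept_or_none` and the `min_gap` threshold of 0.05 are hypothetical and not part of the library, and the `SimpleNamespace` objects merely stand in for a populated `Results` so the snippet runs without the model. It reads only the `result` and `gap` attributes shown above.

```python
from types import SimpleNamespace


def accept_or_none(results, min_gap=0.05):
    """Return results.result when the gap clears min_gap, else None.

    min_gap is a hypothetical threshold; tune it on your own data.
    """
    return results.result if results.gap >= min_gap else None


# Stand-ins for a populated Results object (assumption: only .result and .gap are read).
confident = SimpleNamespace(result="zh-hans", gap=0.40)
close_call = SimpleNamespace(result="ja", gap=0.01)

print(accept_or_none(confident))   # detection accepted
print(accept_or_none(close_call))  # too close to call, treated as undecided
```

With the real library, the same helper could presumably be passed the `results` object filled in by `cjc.detect(text, results)`.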

How it works

CJClassifier uses a unigram + bigram statistical model trained on the Chinese and Japanese Wikipedia corpora. For every character and character pair in the CJ range, the model stores per-language log-probabilities. At classification time, the library sums these log-probabilities across the input and picks the language with the highest total.
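
The scoring idea described above can be sketched in a few lines. This is a toy illustration, not the library's implementation: the probabilities in `TOY_MODEL` are made up, the `FLOOR` penalty for unseen characters and pairs is an assumption, and the function names are hypothetical. Unlike the real classifier, it also does not filter out characters outside the CJ range.

```python
import math

# Assumed penalty for a character or pair the toy model has never seen.
FLOOR = math.log(1e-6)

# Made-up log-probabilities; the real model is built from Wikipedia corpora.
TOY_MODEL = {
    "zh-hans": {"天": math.log(0.01), "气": math.log(0.02), "天气": math.log(0.005)},
    "zh-hant": {"天": math.log(0.01), "氣": math.log(0.02), "天氣": math.log(0.005)},
    "ja":      {"天": math.log(0.01), "気": math.log(0.02), "天気": math.log(0.005)},
}


def total_scores(text, model=TOY_MODEL, floor=FLOOR):
    """Sum unigram and bigram log-probabilities of text for each language."""
    totals = {}
    for lang, table in model.items():
        total = 0.0
        for i, ch in enumerate(text):
            total += table.get(ch, floor)                 # unigram term
            if i + 1 < len(text):
                total += table.get(text[i:i + 2], floor)  # bigram term
        totals[lang] = total
    return totals


def classify(text):
    """Pick the language whose summed log-probability is highest."""
    totals = total_scores(text)
    return max(totals, key=totals.get)


for sample in ("天气", "天氣", "天気"):
    print(sample, classify(sample))
```

Because the three variants of the second character appear in only one table each, the matching language avoids the floor penalty on both the unigram and the bigram term and therefore wins.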

A Java implementation and the model-building tools are also available in the same repository: github.com/jlpka/cjclassifier

License

Apache License 2.0
