Classify text as Chinese Simplified, Chinese Traditional, or Japanese using a statistical model.

CJClassifier

A focused, pure-Python library for distinguishing between Japanese, Chinese Simplified, and Chinese Traditional text. It uses a statistical model of ideograph frequencies built from the Japanese- and Chinese-language Wikipedia corpora.

No external dependencies. The bundled model is ~7 MB on disk and ~29 MB in memory (loaded once and cached).

Install

pip install cjclassifier

Usage

from cjclassifier import CJClassifier, CJLanguage

cjc = CJClassifier.load()

cjc.detect("今天天气很好,我们去公园散步")   # => CJLanguage.CHINESE_SIMPLIFIED
cjc.detect("今天天氣很好,我們去公園散步")   # => CJLanguage.CHINESE_TRADITIONAL
cjc.detect("事務所")                         # => CJLanguage.JAPANESE  (all Kanji)
cjc.detect("ひらがなとカタカナと")           # => CJLanguage.JAPANESE  (all kana)
cjc.detect("hello")                          # => CJLanguage.UNKNOWN

Detailed results

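from cjclassifier import CJClassifier
from cjclassifier.classifier import Results

cjc = CJClassifier.load()   # or reuse an instance you have already loaded

# Pass a Results object as the second argument to collect the details
results = Results()
cjc.detect("今天天气很好", results)

results.result             # CJLanguage.CHINESE_SIMPLIFIED
results.gap                # confidence gap: 0 = dead heat, 1 = no contest
results.total_scores       # per-language log-probability totals
results.to_short_string()  # e.g. "zh-hans:1.00,zh-hant:0.97,ja:0.85"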
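
For logging or quick inspection, the short string can be turned back into a dictionary. The helper below is a minimal sketch, not part of the library: `parse_short_string` is a hypothetical name, and the `label:score,label:score` layout is assumed from the single example above rather than from a documented format.

```python
def parse_short_string(short):
    """Parse 'label:score,label:score' into a dict of floats.

    Hypothetical helper: the layout is assumed from the example
    "zh-hans:1.00,zh-hant:0.97,ja:0.85", not from a documented contract.
    """
    scores = {}
    for item in short.split(","):
        # Split on the last colon so the label keeps any earlier characters
        label, _, value = item.rpartition(":")
        scores[label.strip()] = float(value)
    return scores


scores = parse_short_string("zh-hans:1.00,zh-hant:0.97,ja:0.85")
best = max(scores, key=scores.get)   # label with the highest score
print(best, scores)
```

If the real output ever differs from the example, prefer reading `results.result` and `results.total_scores` directly.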

How it works

CJClassifier uses a unigram + bigram statistical model trained on the Chinese and Japanese Wikipedia corpora. For every character and character pair in the CJ range, the model stores per-language log-probabilities. At classification time, the library sums these log-probabilities over the characters and character pairs of the input and picks the language with the highest total.
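
The scoring idea can be illustrated with a toy model. This is a sketch under stated assumptions, not the library's implementation: the three one-sentence "corpora", the `FLOOR` penalty for unseen n-grams, and the names `train`, `score`, and `classify` are all made up for illustration. Unlike the real model, it does not restrict itself to the CJ range.

```python
import math
from collections import Counter

# Toy training text (hypothetical); the real model is built from Wikipedia.
CORPUS = {
    "zh-hans": "今天天气很好我们去公园散步",
    "zh-hant": "今天天氣很好我們去公園散步",
    "ja": "今日は天気がいいので公園を散歩します",
}

# Assumed log-probability for n-grams a language has never seen.
FLOOR = math.log(1e-6)


def bigrams(text):
    """Return the adjacent character pairs of text."""
    return [a + b for a, b in zip(text, text[1:])]


def log_probs(counts):
    """Convert raw counts to log relative frequencies."""
    total = sum(counts.values())
    return {gram: math.log(n / total) for gram, n in counts.items()}


def train(corpus):
    """Build one unigram + bigram log-probability table per language."""
    model = {}
    for lang, text in corpus.items():
        table = log_probs(Counter(text))            # unigrams
        table.update(log_probs(Counter(bigrams(text))))  # bigrams
        model[lang] = table
    return model


def score(model, text):
    """Sum per-language log-probabilities over all unigrams and bigrams."""
    grams = list(text) + bigrams(text)
    return {
        lang: sum(table.get(g, FLOOR) for g in grams)
        for lang, table in model.items()
    }


def classify(model, text):
    """Pick the language with the highest total log-probability."""
    totals = score(model, text)
    return max(totals, key=totals.get)


model = train(CORPUS)
print(classify(model, "天气很好"))    # zh-hans
print(classify(model, "天氣很好"))    # zh-hant
print(classify(model, "公園を散歩"))  # ja
```

Because the totals are sums of logs, every unseen character or pair subtracts a large fixed penalty, so a single script-specific character such as 气 versus 氣 is often enough to separate the candidates.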

A Java implementation and the model-building tools are also available in the same repository: github.com/jlpka/cjclassifier

License

Apache License 2.0
