Classify text as Chinese Simplified, Chinese Traditional, or Japanese using a statistical model.
CJClassifier
A focused, pure-Python library for distinguishing between Japanese, Chinese Simplified, and Chinese Traditional text, using a statistical model of ideograph frequencies built from the Japanese- and Chinese-language Wikipedia corpora.
No external dependencies. The bundled model is ~7 MB on disk and ~29 MB in memory (loaded once and cached).
Install
pip install cjclassifier
Usage
from cjclassifier import CJClassifier, CJLanguage
cjc = CJClassifier.load()
cjc.detect("今天天气很好,我们去公园散步") # => CJLanguage.CHINESE_SIMPLIFIED
cjc.detect("今天天氣很好,我們去公園散步") # => CJLanguage.CHINESE_TRADITIONAL
cjc.detect("事務所") # => CJLanguage.JAPANESE (all Kanji)
cjc.detect("ひらがなとカタカナと") # => CJLanguage.JAPANESE (all kana)
cjc.detect("hello") # => CJLanguage.UNKNOWN
Detailed results
from cjclassifier.classifier import Results
results = Results()
cjc.detect("今天天气很好", results)
results.result # CJLanguage.CHINESE_SIMPLIFIED
results.gap # confidence gap: 0 = dead heat, 1 = no contest
results.total_scores # per-language log-probability totals
results.to_short_string() # e.g. "zh-hans:1.00,zh-hant:0.97,ja:0.85"
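If the short string needs to go into logs or a spreadsheet, it can be split back into numbers. The sketch below assumes the format is exactly what the example above shows (comma-separated `label:score` pairs); `parse_short_string` is a hypothetical helper written for illustration and is not part of cjclassifier.

```python
def parse_short_string(short):
    """Turn 'zh-hans:1.00,zh-hant:0.97,ja:0.85' into a dict of floats.

    Assumption: the format matches the example in this README
    (comma-separated 'label:score' pairs). Not part of cjclassifier.
    """
    scores = {}
    for part in short.split(","):
        # Split on the last colon so the label is everything before it.
        label, _, value = part.rpartition(":")
        scores[label.strip()] = float(value)
    return scores


parsed = parse_short_string("zh-hans:1.00,zh-hant:0.97,ja:0.85")
top_label = max(parsed, key=parsed.get)  # "zh-hans"
```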
How it works
CJClassifier uses a unigram + bigram statistical model trained on the Chinese and Japanese Wikipedia corpora. For every character and character-pair in the CJ range, the model stores per-language log-probabilities. At classification time the library sums these log-probabilities across the input and picks the language with the highest score.
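The scoring idea described above can be illustrated with a toy model. This is a minimal sketch, not the library's implementation: the three one-sentence "corpora", the `FLOOR` penalty for unseen n-grams, and the function names `ngrams`, `train`, and `classify` are all assumptions made for the example, and unlike the real model it does not restrict itself to the CJ character range.

```python
import math
from collections import Counter

# Assumed log-probability for n-grams never seen in a language (illustrative).
FLOOR = math.log(1e-6)


def ngrams(text):
    """Yield every single character, then every adjacent character pair."""
    yield from text
    for a, b in zip(text, text[1:]):
        yield a + b


def train(samples):
    """Build {language: {ngram: log relative frequency}} from toy text."""
    model = {}
    for lang, text in samples.items():
        counts = Counter(ngrams(text))
        total = sum(counts.values())
        model[lang] = {g: math.log(n / total) for g, n in counts.items()}
    return model


def classify(model, text):
    """Sum log-probabilities per language and return the best one."""
    scores = {
        lang: sum(table.get(g, FLOOR) for g in ngrams(text))
        for lang, table in model.items()
    }
    return max(scores, key=scores.get), scores


# Tiny stand-in corpora; the real model is trained on Wikipedia.
toy_model = train({
    "zh-hans": "今天天气很好我们去公园散步",
    "zh-hant": "今天天氣很好我們去公園散步",
    "ja": "今日は天気がいいので公園を散歩します",
})

best, scores = classify(toy_model, "天气很好")  # best == "zh-hans"
```

Because the simplified form 气 and its bigrams appear only in the simplified sample, the other two languages pay the `FLOOR` penalty for each of them and end up with lower totals.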
A Java implementation and the model-building tools are also available in the same repository: github.com/jlpka/cjclassifier
License
Apache License 2.0
File details
Details for the file cjclassifier-1.0.5.tar.gz.
File metadata
- Download URL: cjclassifier-1.0.5.tar.gz
- Upload date:
- Size: 7.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20c390fa5d9a6d950bd6a446917d616837750b66c58ddf5371cc5717dbf8c128 |
| MD5 | 09e57a435ba9ec823a973b330c8375cc |
| BLAKE2b-256 | bc346f3a85ffc196a909149ff9c58ffb0b73d704f421ac23bb7a16849d89eab2 |
File details
Details for the file cjclassifier-1.0.5-py3-none-any.whl.
File metadata
- Download URL: cjclassifier-1.0.5-py3-none-any.whl
- Upload date:
- Size: 7.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e24f7e8a57a75d206a630fa706e33298c3303d20756d52fe70696218e64e31d |
| MD5 | f716a2e203ad7e1af4e644c7291295be |
| BLAKE2b-256 | 4845f6ad5a7244b35c96dadbd7d6db6b6c9dbd6e362764b7556b514f95f6d34d |