ZHLID: Fine-grained Chinese Language Identification Package

ZHLID is an open-source, model-based language identification (LID) tool that specializes in fine-grained Chinese varieties.

Features

Unlike general-purpose LID tools, ZHLID focuses on distinguishing between closely related Chinese varieties, including:

Traditional Chinese (繁體中文) – written in the traditional character set, used in formal and classical texts.
Simplified Chinese (簡體中文) – written in the simplified character set, designed for easier reading and writing.
Cantonese (粵語) – written form reflecting spoken Cantonese with unique vocabulary and grammar.
Classical Chinese (Traditional) (繁體文言文) – literary Chinese in traditional characters with concise, classical syntax.
Classical Chinese (Simplified) (簡體文言文) – literary Chinese in simplified characters, used in modern reprints and education.

This makes ZHLID useful for linguistic research, corpus analysis, preprocessing for NLP tasks, or any application requiring accurate recognition of Chinese textual forms.

The following table compares ZHLID with other popular LID tools supporting Chinese detection:

[Table: rows ZHLID (ours), langdetect, GlotLID, langid.py, CLD3, Lingua; columns General Chinese, Traditional Chinese, Simplified Chinese, Classical Chinese, Cantonese. The per-tool support marks are missing from this copy.]

Installation

Install via pip

pip install zhlid

Install from source

pip install git+https://github.com/Musubi-ai/ZHLID

Usage

from zhlid import load_model

# Load the pretrained ZHLID model by its identifier.
model = load_model("MusubiAI/ZHLID", device_map="auto")

# One sample per supported variety.
text = [
    "王夫之者,字而農,衡陽人,明末清初哲學家。張獻忠陷衡州,夫之匿南嶽,賊執其父以為質。夫之自引刀遍刺肢體,舁往易父。",
    "金山阿伯係清末民初時嘅一種現象。金山阿伯係指嗰啲生活喺廣東地方,因為搵唔夠錢畀家人生活,要出洋到舊金山或新金山做苦工,掘金礦。",
    "燧人氏,古之三皇,有巢氏之子。 风姓,讳允婼,华夏族。燧人钻火,教人熟食,立国曰燧明,为后世奉为「火祖」,号燧皇。立一百一十年,崩,子伏羲嗣。\n\n**引据**\n《风俗通义·皇霸篇》\n*",
    "在量子力学中,量子涨落(quantum fluctuation。或量子真空涨落,真空涨落)是在空间任意位置对于能量的暂时变化。 \n从维尔纳·海森堡的不确定性原理可以推导出这结论。",
    "在政治中,政治議程是政府官員以及政府以外的個人在任何給定時間都認真關注的主題或問題/議題的列表。"
]

# Classify all samples; each result holds a label and a confidence score.
res = model.predict(text, batch_size=5)
print(res)
# [
#     {'label': 'zhtw_classical', 'confidence_score': 0.9999634027}, 
#     {'label': 'yue', 'confidence_score': 0.9376096725}, 
#     {'label': 'zhcn_classical', 'confidence_score': 0.9999793768}, 
#     {'label': 'zhcn', 'confidence_score': 0.9944804907}, 
#     {'label': 'zhtw', 'confidence_score': 0.9998573065}
# ]
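
The result list can be post-processed with plain Python. The following is a minimal sketch, not part of the ZHLID API: it assumes the output shape shown above (one dict per input, with `label` and `confidence_score` keys), and the `group_by_label` helper, the `LABEL_NAMES` mapping, and the 0.95 threshold are hypothetical choices made for illustration. The label names are inferred from the example output only.

```python
from collections import defaultdict

# Assumed label-to-name mapping, inferred from the example output above.
LABEL_NAMES = {
    "zhtw": "Traditional Chinese",
    "zhcn": "Simplified Chinese",
    "yue": "Cantonese",
    "zhtw_classical": "Classical Chinese (Traditional)",
    "zhcn_classical": "Classical Chinese (Simplified)",
}


def group_by_label(texts, predictions, min_confidence=0.95):
    """Group texts by predicted label; low-confidence texts go under "uncertain"."""
    if len(texts) != len(predictions):
        raise ValueError("texts and predictions must have the same length")
    groups = defaultdict(list)
    for text, pred in zip(texts, predictions):
        if pred["confidence_score"] >= min_confidence:
            key = pred["label"]
        else:
            key = "uncertain"
        groups[key].append(text)
    return dict(groups)


# Stand-in inputs; in practice `predictions` would come from model.predict(texts).
texts = ["text A", "text B", "text C", "text D", "text E"]
predictions = [
    {"label": "zhtw_classical", "confidence_score": 0.9999634027},
    {"label": "yue", "confidence_score": 0.9376096725},
    {"label": "zhcn_classical", "confidence_score": 0.9999793768},
    {"label": "zhcn", "confidence_score": 0.9944804907},
    {"label": "zhtw", "confidence_score": 0.9998573065},
]

groups = group_by_label(texts, predictions, min_confidence=0.95)
for label, members in groups.items():
    print(LABEL_NAMES.get(label, label), members)
```

With this threshold, the Cantonese sample (confidence about 0.94) lands in the "uncertain" group, which is one way to route borderline texts to manual review during corpus preprocessing.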

Evaluation

To evaluate ZHLID on our benchmark dataset, run:

python evaluate.py

We compare ZHLID's top-1 accuracy with that of GlotLID and langdetect. Because GlotLID provides only a general "cmn_Hani" label for Chinese, a GlotLID prediction on Traditional or Simplified Chinese text counts as correct whenever it outputs this label. A dash marks a category for which no score is reported.

| Top-1 accuracy | Traditional Chinese | Simplified Chinese | Classical Chinese (Traditional) | Classical Chinese (Simplified) | Cantonese |
|---|---|---|---|---|---|
| ZHLID (ours) | 1.0 | 1.0 | 0.9 | 1.0 | 0.96 |
| GlotLID | 0.98 | 0.98 | - | - | 0.9 |
| langdetect | 0.3 | 0.9 | - | - | - |
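
The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `evaluate.py`: the `top1_accuracy_by_category` helper and the toy data are hypothetical, the gold labels reuse the label names from the usage example, and the `yue_Hani` label for the coarse tool is an assumed example label.

```python
from collections import defaultdict


def top1_accuracy_by_category(gold, predicted, accepted=None):
    """Return {category: fraction of samples whose predicted label is accepted}.

    `accepted` maps a gold category to the set of predicted labels that count
    as correct; a category without an entry accepts only its own label.
    """
    accepted = accepted or {}
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if p in accepted.get(g, {g}):
            correct[g] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy data for illustration only, not the benchmark dataset.
gold = ["zhtw", "zhtw", "zhcn", "zhcn", "yue"]

# A fine-grained model is scored by exact label match.
fine_grained = ["zhtw", "zhcn", "zhcn", "zhcn", "yue"]
fine_scores = top1_accuracy_by_category(gold, fine_grained)
print(fine_scores)

# A tool with one general Chinese label is scored by whether it emits that
# label for both Traditional and Simplified inputs.
coarse = ["cmn_Hani", "cmn_Hani", "cmn_Hani", "yue_Hani", "yue_Hani"]
accepted = {"zhtw": {"cmn_Hani"}, "zhcn": {"cmn_Hani"}, "yue": {"yue_Hani"}}
coarse_scores = top1_accuracy_by_category(gold, coarse, accepted)
print(coarse_scores)
```

Under this rule a coarse tool can score well on the Traditional and Simplified columns without telling the two apart, which is worth keeping in mind when reading the GlotLID row.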

Citation

If you use ZHLID in your research, please cite this repository:

@misc{zhlid2025,
  title  = {ZHLID: Fine-grained Chinese Language Identification Package},
  author = {Lung-Chuan Chen},
  year   = {2025},
  howpublished = {\url{https://github.com/Musubi-ai/ZHLID}}
}
