Traditional Chinese text preprocessing for search engines — CKIP segmentation + bigram indexing with pluggable domain dictionaries

trad-zh-search

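Search-engine-agnostic preprocessing for Traditional Chinese text: CKIP word segmentation plus bigram index generation, with a pluggable domain dictionary system.

Requires Python 3.9+ · Part of the notoriouslab open-source toolkit.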

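Why this tool is needed

Chinese support in mainstream search engines (Meilisearch, Elasticsearch, SQLite FTS5) mostly relies on jieba, which is trained on Simplified Chinese and segments Traditional Chinese poorly. Terms such as 「稱義」 (justification), 「靈恩」 (charismatic), and 「聖靈充滿」 (filled with the Holy Spirit) are split incorrectly, and Simplified/Traditional normalization is even less reliable.

trad-zh-search is distilled from hands-on experience running a search system over 8,000+ Traditional Chinese articles. It packages that system's core preprocessing approach as a general-purpose tool: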

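Feature	Description
CKIP segmentation	ckip-transformers (albert-tiny), better suited to Traditional Chinese than jieba
Bigram index	Sliding-window bigrams over CJK characters, so that substrings remain searchable
Optional CKIP	Falls back to bigram-only automatically when CKIP is not installed; degraded but usable
Domain dictionaries	YAML format, pluggable. The first release ships a Christian Traditional Chinese dictionary (1,394 custom words)
Automatic dictionary building	Feed in a batch of documents and CKIP NER extracts proper nouns automatically
Meilisearch adapter	TokenResult → multi-field document format in one line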

Quick start

pip install trad-zh-search

# (Optional) CKIP segmentation support; quoted so shells such as zsh do not expand the brackets
pip install "trad-zh-search[ckip]"

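CKIP installation notes

trad-zh-search[ckip] also installs PyTorch (about 700 MB or more), an underlying dependency of ckip-transformers. The first call to tokenize() automatically downloads the albert-tiny model (about 50 MB) from HuggingFace, which requires outbound network access. For offline environments, see HuggingFace offline mode.

The tool also works without CKIP: it falls back automatically to bigram-only mode, which is still better suited to Traditional Chinese than jieba.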
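
The automatic fallback can be approximated with a standard optional-dependency check. The sketch below is not the library's actual code: the helper names `ckip_available` and `choose_mode` are hypothetical, and the only assumption about the real package is that ckip-transformers is importable as `ckip_transformers`.

```python
import importlib.util


def ckip_available() -> bool:
    """Return True if the ckip_transformers package can be imported."""
    return importlib.util.find_spec("ckip_transformers") is not None


def choose_mode() -> str:
    """Pick a preprocessing mode (hypothetical helper, for illustration only)."""
    return "ckip+bigram" if ckip_available() else "bigram-only"


print(choose_mode())
```

Checking with `find_spec` avoids importing PyTorch just to learn whether CKIP is present, which keeps the bigram-only path lightweight.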

Three lines of code

from trad_zh_search import tokenize

result = tokenize("轉型正義委員會的調查報告")
print(result.bigrams)   # ['轉型', '型正', '正義', '義委', '委員', '員會', '會的', '的調', '調查', '查報', '報告']
print(result.tokens)    # CKIP segmentation result (when CKIP is installed)
print(result.used_ckip) # True / False
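
To make the bigram output above concrete, here is a minimal sketch of sliding-window bigram generation over runs of CJK characters. It is an illustration, not the package's implementation: the function name `cjk_bigrams` is hypothetical, and it assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as CJK, whereas the real tool may cover more ranges.

```python
import re

# One or more consecutive characters from the basic CJK Unified Ideographs block.
# Assumption: the real tool may also handle extension blocks and other scripts.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text: str) -> list[str]:
    """Return overlapping two-character windows for each run of CJK characters."""
    bigrams = []
    for run in CJK_RUN.findall(text):
        # A run of n characters yields n - 1 bigrams; a single character yields none.
        bigrams.extend(run[i:i + 2] for i in range(len(run) - 1))
    return bigrams


print(cjk_bigrams("轉型正義委員會的調查報告"))
```

Windows never cross a non-CJK character in this sketch, so Latin words and punctuation split the text into independent runs.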

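With a domain dictionary

from trad_zh_search import tokenize, load_dictionary

# Load the built-in Christian Traditional Chinese dictionary
# (named zh_dict to avoid shadowing the built-in dict type)
zh_dict = load_dictionary("christian-zh-hant")
result = tokenize("台北靈糧堂的主日崇拜", dictionary=zh_dict)
# CKIP output merged with custom words: 「台北靈糧堂」 is not split into 「台北/靈糧/堂」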
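
One way to picture the effect of custom words is a post-processing pass that re-joins adjacent tokens whenever their concatenation is a dictionary entry. This is only a sketch under that assumption; the package may instead pass the words to CKIP directly, and `merge_custom_words` is a hypothetical name.

```python
def merge_custom_words(tokens: list[str], custom_words: set[str]) -> list[str]:
    """Greedily join adjacent tokens whose concatenation is a custom word."""
    merged = []
    i = 0
    while i < len(tokens):
        joined = ""
        best_end = None
        # Extend the span one token at a time and remember the longest match.
        for j in range(i, len(tokens)):
            joined += tokens[j]
            if joined in custom_words:
                best_end = j
        if best_end is not None and best_end > i:
            merged.append("".join(tokens[i:best_end + 1]))
            i = best_end + 1
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["台北", "靈糧", "堂", "的", "主日", "崇拜"]
print(merge_custom_words(tokens, {"台北靈糧堂"}))
```

The pass prefers the longest matching span starting at each position, so a short custom word does not block a longer one that begins at the same token.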

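Building a dictionary automatically from documents

from pathlib import Path
from trad_zh_search import build_dictionary, save_dictionary

# Feed in a batch of texts → CKIP NER extracts proper nouns → a dictionary is produced
# my_articles: an iterable of file paths that you supply
texts = [Path(f).read_text(encoding="utf-8") for f in my_articles]
my_dict = build_dictionary(texts, min_freq=2)
save_dictionary(my_dict, "my_domain.yaml")  # the saved file can be fine-tuned by hand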
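
The `min_freq` parameter suggests a frequency threshold on extracted entities. The sketch below shows one plausible reading, assuming the count is the total number of occurrences across all documents (the library might count documents instead). The function `filter_by_frequency` and the sample entity lists are hypothetical, standing in for whatever NER returns.

```python
from collections import Counter


def filter_by_frequency(entities_per_doc: list[list[str]], min_freq: int = 2) -> list[str]:
    """Keep entities that occur at least min_freq times across all documents."""
    counts = Counter(entity for doc in entities_per_doc for entity in doc)
    return sorted(word for word, n in counts.items() if n >= min_freq)


# Hypothetical NER output for three documents.
ner_output = [
    ["台北靈糧堂", "馬偕"],
    ["台北靈糧堂"],
    ["長老教會", "長老教會"],
]
print(filter_by_frequency(ner_output, min_freq=2))
```

A threshold of 2 drops one-off mentions, which are the entries most likely to be NER noise, while keeping names that recur in the corpus.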

Meilisearch integration

from trad_zh_search import tokenize, load_dictionary
from trad_zh_search.adapters.meilisearch import to_meilisearch, to_meilisearch_synonyms

zh_dict = load_dictionary("christian-zh-hant")

# Document preprocessing (doc_id, title, and content come from your own data)
doc = to_meilisearch(
    fields={
        "title": tokenize(title, dictionary=zh_dict),
        "content": tokenize(content, dictionary=zh_dict),
    },
    original={"id": doc_id, "title": title, "content": content},
)
# → {"id": ..., "title": ..., "title_ckip": "...", "title_bigram": "...",
#    "content": ..., "content_ckip": "...", "content_bigram": "...", ...}

# Synonym settings
synonyms = to_meilisearch_synonyms(zh_dict)
# → {"敬拜": ["崇拜", "主日"], "崇拜": ["敬拜", "主日"], ...}
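
The bidirectional synonym shape shown above can be reproduced with a few lines of plain Python. This sketch assumes each dictionary entry defines one synonym group made of the key plus its listed words; `to_bidirectional` is a hypothetical name, not the adapter's internal function.

```python
def to_bidirectional(synonyms: dict[str, list[str]]) -> dict[str, list[str]]:
    """Expand one-directional synonym groups so every word maps to the others."""
    expanded: dict[str, list[str]] = {}
    for head, others in synonyms.items():
        group = [head, *others]
        for word in group:
            # Each member of the group points at all the other members.
            expanded[word] = [w for w in group if w != word]
    return expanded


print(to_bidirectional({"判決": ["裁定", "裁判"]}))
```

If a word appears in more than one group, the later group overwrites the earlier one in this sketch; a fuller version would merge the lists.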

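Search recommendation: add title_ckip, title_bigram, content_ckip, and content_bigram to searchableAttributes. Real-world benchmarks show that CKIP and bigrams complement each other: bigrams provide high recall, and CKIP provides high precision.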
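
As an illustration of that recommendation, the sketch below builds a settings payload containing the four derived fields. `searchableAttributes` is a real Meilisearch index setting; the helper `searchable_attributes` and the choice of `title` and `content` as base fields are assumptions taken from the example document, and sending the payload to a server is left out.

```python
import json

BASE_FIELDS = ["title", "content"]   # field names from the example document
SUFFIXES = ["_ckip", "_bigram"]      # suffixes seen in the adapter output


def searchable_attributes(fields: list[str], suffixes: list[str]) -> list[str]:
    """List every derived field name, grouped by base field."""
    return [field + suffix for field in fields for suffix in suffixes]


settings = {"searchableAttributes": searchable_attributes(BASE_FIELDS, SUFFIXES)}
print(json.dumps(settings, indent=2))
```

In Meilisearch the order of this list also influences attribute ranking, so putting the title fields first favors title matches.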

TokenResult

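from dataclasses import dataclass

@dataclass
class TokenResult:
    original: str        # input text (after truncation)
    tokens: list[str]    # CKIP segmentation result (empty list without CKIP)
    bigrams: list[str]   # CJK bigrams (always generated)
    used_ckip: bool      # whether CKIP was used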

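Dictionary format

YAML format with three optional sections:

# Custom segmentation vocabulary (Phase 1 core)
ckip_custom_words:
  - 轉型正義
  - 國家人權委員會

# Alias mapping (for a future entity-resolver)
aliases:
  人權會: 國家人權委員會

# Synonym groups (adapters can output these directly as search-engine synonyms)
synonyms:
  判決: [裁定, 裁判]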

API reference

Function	Description
`tokenize(text, dictionary?, max_chars?)`	Segmentation + bigrams; returns a TokenResult
`tokenize_batch(texts, dictionary?, batch_size?)`	Batch version
`load_dictionary(name)`	Load a built-in dictionary
`load_dictionary_file(path)`	Load a YAML dictionary file
`merge_dictionaries(*dicts)`	Merge multiple dictionaries
`save_dictionary(dict, path)`	Save a dictionary as YAML
`build_dictionary(texts, min_freq?)`	Automatic dictionary extraction via NER (requires CKIP)
`to_meilisearch(fields, original?)`	TokenResult → Meilisearch document format
`to_meilisearch_synonyms(dict)`	Synonyms → Meilisearch bidirectional format

License

MIT

Release history

0.2.0: Mar 23, 2026
0.1.1: Mar 23, 2026
0.1.0: Mar 23, 2026 (this version)

Download files

Source Distribution

trad_zh_search-0.1.0.tar.gz (38.5 kB)

Uploaded Mar 23, 2026 Source

Built Distribution

trad_zh_search-0.1.0-py3-none-any.whl (25.4 kB)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file trad_zh_search-0.1.0.tar.gz.

File metadata

Download URL: trad_zh_search-0.1.0.tar.gz
Upload date: Mar 23, 2026
Size: 38.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for trad_zh_search-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fdfb00acad5718d65274ae0915432f88898be3e1fc6f46d9cfdc9d691c5f72d7`
MD5	`209ef4f78d001f5a02c43433ad44ce95`
BLAKE2b-256	`a5f55fea45a9a9c7173822e4ec22acb6b8e9628baa8524f17cc1fd948012f93a`

File details

Details for the file trad_zh_search-0.1.0-py3-none-any.whl.

File metadata

Download URL: trad_zh_search-0.1.0-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 25.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for trad_zh_search-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`54a82f67803f91f884c23462c52c3dfaefaf20fa326fc5219719a1bf1128517a`
MD5	`156e0dd453ee5bbd5c6a7f06bc92f9e8`
BLAKE2b-256	`7b66478077135695d5012f333287c0b9506bfe3c2790dbb0e15f93299bb53147`
