Use C++ and pybind11 to speed up jieba (Chinese Word Segmentation Utilities)

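jieba_fast_dat: a high-performance Chinese word segmentation and part-of-speech tagging tool

In my own use I found that dictionary loading kept getting slower as the dictionary grew (eventually taking more than 10 seconds). The original jieba and jieba_fast have also gone unmaintained for a long time, and some of their dependencies now emit warnings on current mainstream Python versions, which is unpleasant to look at.

So, while keeping (most of) the original functionality, I updated and extended the library. The main optimizations are: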

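Technical optimizations

  • DAT dictionary structure: all dictionaries are stored in a Double-Array Trie (DAT), giving low memory usage and very fast lookups.
  • C++ core algorithms: key algorithms (such as Viterbi) are implemented in C++ and exposed to Python through pybind11, combining Python's flexibility with C++ performance.
  • CPU-first principle: every algorithm and library is chosen for CPU execution efficiency; nothing depends on a GPU.
  • Traditional Chinese enhancement: the default system dictionary and IDF file are replaced with the Traditional Chinese optimized dictionaries provided by the original jieba project, so no extra configuration is needed.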
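To make the Double-Array Trie idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation (the changelog indicates the real one is C++ built on cedar); the class name `DoubleArrayTrie`, the naive slot search, and the `prefixes` helper are all assumptions made for illustration. The core rule is that a transition from state `s` on character code `c` lands on `t = base[s] + c` and is valid only when `check[t] == s`.

```python
class DoubleArrayTrie:
    """Toy double-array trie (illustrative only, not jieba_fast_dat's code)."""

    END = 0     # character code reserved for the end-of-word marker
    FREE = -1   # check value of an unused slot
    ROOT = -2   # check value of the root slot, which has no parent

    def __init__(self, words):
        # Map each character to a small positive code; 0 is the end marker.
        chars = sorted({ch for w in words for ch in w})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}
        self.base = [0]
        self.check = [self.ROOT]
        # Build an ordinary nested-dict trie, then pack it into the two arrays.
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(self.code[ch], {})
            node[self.END] = {}
        self._place(0, root)

    def _grow(self, size):
        while len(self.base) < size:
            self.base.append(0)
            self.check.append(self.FREE)

    def _place(self, state, children):
        if not children:
            return
        codes = sorted(children)
        b = 1
        # Naive search for the first base where every child slot is free.
        while True:
            self._grow(b + codes[-1] + 1)
            if all(self.check[b + c] == self.FREE for c in codes):
                break
            b += 1
        self.base[state] = b
        for c in codes:               # reserve all child slots first
            self.check[b + c] = state
        for c in codes:               # then pack each subtree
            self._place(b + c, children[c])

    def _step(self, state, c):
        t = self.base[state] + c
        if t < len(self.check) and self.check[t] == state:
            return t
        return -1

    def __contains__(self, word):
        state = 0
        for ch in word:
            c = self.code.get(ch)
            if c is None:
                return False
            state = self._step(state, c)
            if state < 0:
                return False
        return self._step(state, self.END) >= 0

    def prefixes(self, text, start=0):
        """Return every dictionary word that begins at text[start]."""
        state, found = 0, []
        for i in range(start, len(text)):
            c = self.code.get(text[i])
            if c is None:
                break
            state = self._step(state, c)
            if state < 0:
                break
            if self._step(state, self.END) >= 0:
                found.append(text[start:i + 1])
        return found


dat = DoubleArrayTrie(["東北", "東北季風", "季風", "發威"])
print("東北" in dat, "東" in dat)
print(dat.prefixes("東北季風發威"))
```

Each lookup step is two array reads and one comparison, which is where the low memory use and fast queries come from. A production implementation would use a much smarter slot search and compact tail storage.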

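Major differences: trade-offs made for speed

  • Python version restriction: we embrace modern development and support only Python >= 3.10.
  • Platform support: Linux and macOS (Intel/Apple Silicon) are supported. Windows is not supported yet.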
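The two restrictions above can be checked before installing or importing. The following is a small sketch using only the standard library; the helper name `is_supported` and the constants are hypothetical and not part of jieba_fast_dat.

```python
import platform
import sys

# platform.system() reports macOS as "Darwin".
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}
MIN_PYTHON = (3, 10)


def is_supported(version=None, system=None):
    """Return True if the interpreter and OS match the stated requirements."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    system = platform.system() if system is None else system
    return version >= MIN_PYTHON and system in SUPPORTED_SYSTEMS


if __name__ == "__main__":
    print("This environment is supported:", is_supported())
```

Passing `version` and `system` explicitly makes the check easy to test without changing the running environment.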

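Changelog

  • Cumulative PyPI installs: 3k (as of 20260506)
  • 20260506 Added macOS support and improved Linux GLIBC compatibility; added automated builds with GitHub Actions; upgraded version to 0.59.
  • 20251222 Optimized the cache structure and refactored the C++ code for a large performance gain; upgraded version to 0.58.
  • 20251221 Optimized user dictionary loading, moved more of the overall structure into C++, fixed the I/O logic, and improved performance again; upgraded version to 0.57.
  • 20251204 Strengthened cedar and added a cache mechanism for custom dictionaries; upgraded version to 0.56.
  • 20251124 Major overall refactor; ensured results match native jieba; fixed dictionary errors; upgraded version to 0.55.
  • 20251106 [0.54] Rebuilt the core segmentation engine, migrated Viterbi entirely to C++ for a large performance gain, and moved to the C++17 standard.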

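The numbers speak for themselves: up to 62x faster segmentation

We ran an in-depth performance comparison using a large Traditional Chinese dictionary (1.3 million entries). jieba_fast_dat outperformed native jieba on every metric in the table below.

Performance comparison (final summary)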

| Metric | Native jieba | jieba_fast_dat | Speedup |
|---|---|---|---|
| Main dictionary load (Cold Init) | 2.579 s | 1.035 s | 2.49x |
| Main dictionary load (With Cache) | 1.847 s | 0.021 s | 86.07x |
| HMM model load (Import Load) | 0.110 s | 0.062 s | 1.78x |
| Custom dictionary load (No Cache) | 4.508 s | 1.591 s | 2.83x |
| Custom dictionary load (With Cache) | 5.592 s | 0.011 s | 515.88x |
| Segmentation (HMM=False) | 0.843 s | 0.014 s | 61.27x |
| POS tagging (HMM=False) | 0.909 s | 0.036 s | 24.99x |
| Segmentation (HMM=True) | 0.962 s | 0.015 s | 62.93x |
| POS tagging (HMM=True) | 1.013 s | 0.040 s | 25.33x |

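Test environment: Linux, Python 3.12, using a large Traditional Chinese dictionary. Segmentation and tagging figures are the total time summed over multiple runs.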
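The original benchmark script is not shown here, so the following is only a sketch of how a "total time over multiple runs" figure and a speedup ratio could be measured. The function names `total_seconds` and `speedup`, the repeat count, and the sample text list are assumptions.

```python
import time


def total_seconds(segment, texts, repeats=5):
    """Sum the wall-clock time of running `segment` over all texts, `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        for text in texts:
            list(segment(text))  # exhaust generators so the work really runs
    return time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline time to candidate time (higher means faster candidate)."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    texts = ["東北季風發威!4縣市豪大雨特報「雨下整夜」 一路濕到這天"] * 100
    try:
        import jieba_fast_dat as jieba
    except ImportError:
        print("jieba_fast_dat is not installed; nothing to benchmark")
    else:
        jieba.cut("warm up")  # cut() is lazy, so this alone does not load anything
        list(jieba.cut("warm up"))  # force dictionary loading before timing
        print(f"segmentation total: {total_seconds(jieba.cut, texts):.3f} s")
```

Running the same harness against native jieba and dividing the two totals with `speedup` gives a figure comparable in kind to the last column of the table, though absolute numbers depend on hardware and dictionary size.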

🚀 Installation

Install the latest release from PyPI

pip install jieba_fast_dat

Install the latest code from GitHub

pip install git+https://github.com/carycha/jieba_fast_dat

Install a specific version from GitHub

pip install git+https://github.com/carycha/jieba_fast_dat@0.58

🛠️ Usage

Basic segmentation

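import jieba_fast_dat as jieba

text = "東北季風發威!4縣市豪大雨特報「雨下整夜」 一路濕到這天"
# cut() yields tokens lazily; join them for display.
print("Accurate mode:", "/".join(jieba.cut(text)))
# cut_all=True lists every dictionary word found in the text.
print("Full mode:", "/".join(jieba.cut(text, cut_all=True)))
# Search-engine mode additionally splits long words for indexing.
print("Search engine mode:", "/".join(jieba.cut_for_search(text)))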
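For intuition about what accurate mode does when the HMM is off, upstream jieba describes its approach as building a directed acyclic graph (DAG) of all dictionary words in the sentence and then using dynamic programming to pick the most probable path. The sketch below follows that description; the frequency table is invented toy data, the function names are hypothetical, and whether jieba_fast_dat's C++ core is organized the same way is an assumption.

```python
import math

# Toy word frequencies (made up for illustration, not from any real dictionary).
FREQ = {"豪": 2, "大": 5, "雨": 5, "大雨": 10, "特": 2, "報": 2, "特報": 6}


def build_dag(text, freq):
    """For each start index, list the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in freq]
        dag[i] = ends or [i]  # unknown characters stand alone
    return dag


def best_route(text, freq):
    """Pick the segmentation with the highest sum of log word probabilities."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(text, freq)
    n = len(text)
    route = {n: (0.0, 0)}
    # Work right to left so route[j + 1] is known when scoring a word ending at j.
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(text[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words


print("/".join(best_route("豪大雨特報", FREQ)))
```

With these toy counts the route through 大雨 and 特報 scores higher than any route through single characters. Full mode, by contrast, would report every word in the DAG rather than one best path.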

POS tagging

import jieba_fast_dat.posseg as pseg

text = "東北季風發威!4縣市豪大雨特報「雨下整夜」 一路濕到這天"
words = pseg.cut(text)
for word, flag in words:
    print(f"{word}/{flag}")
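A common next step after tagging is to keep only words with certain tags. The helper below is a hypothetical sketch, not part of the library: it accepts any iterable of (word, flag) pairs, which is the shape the loop above unpacks. The sample pairs are copied from the jieba_fast_dat HMM ON row of the comparison table in this document.

```python
def words_with_tags(pairs, prefixes=("n",)):
    """Keep words whose POS flag starts with one of the given prefixes."""
    return [word for word, flag in pairs if flag.startswith(tuple(prefixes))]


# Excerpt of the tagged output listed later in this document.
sample = [("東北", "ns"), ("季風", "n"), ("發威", "v"), ("!", "x"), ("大雨", "n")]
print(words_with_tags(sample))          # noun-like tags: n, ns, ...
print(words_with_tags(sample, ("v",)))  # verbs
```

With the real library, the same function could be applied to `pseg.cut(text)` directly, assuming its items unpack into a word and a flag as in the example above.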

Loading a user dictionary

import jieba_fast_dat as jieba

# Example contents of userdict.txt:
# 創新模式 3
# 程式設計 5 n
jieba.load_userdict("userdict.txt")
print("After loading the user dictionary:", "/".join(jieba.cut("雨要下到什麼時候?氣象署:今雨勢最猛 週日長榮馬拉松要穿雨衣")))
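The two sample lines in the comment above suggest a line format of a word, an optional frequency, and an optional POS tag, separated by single spaces. The parser below is a sketch of that assumed format for validating a dictionary file before loading it; `parse_userdict_line` and `LINE_RE` are hypothetical names and this is not the library's loader.

```python
import re

# Assumed format: word, then optional integer frequency, then optional POS tag.
LINE_RE = re.compile(r"^(.+?)(?: (\d+))?(?: ([a-zA-Z]+))?$")


def parse_userdict_line(line):
    """Return (word, freq or None, tag or None), or None for a blank line."""
    line = line.strip()
    if not line:
        return None
    word, freq, tag = LINE_RE.match(line).groups()
    return word, int(freq) if freq else None, tag


for raw in ["創新模式 3", "程式設計 5 n", "", "豪大雨"]:
    print(repr(raw), "->", parse_userdict_line(raw))
```

Checking each line this way can surface malformed entries early, for example a frequency that is not an integer, which would otherwise be absorbed into the word.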

Comparison of segmentation and POS tagging results

All tests use the following text:

東北季風發威!4縣市豪大雨特報「雨下整夜」 一路濕到這天

Segmentation differences

| Mode | Original jieba_fast | jieba_fast_dat |
|---|---|---|
| HMM OFF | 東/北/季/風/發/威/!/4/縣/市/豪/大雨/特/報/「/雨/下/整夜/」/ /一路/濕/到/這/天 | 東北/季風/發威/!/4/縣市/豪/大雨/特報/「/雨/下/整夜/」/ /一路/濕/到/這天 |
| HMM ON | 東北/季風/發威/!/4/縣市/豪/大雨/特報/「/雨下/整夜/」/ /一路/濕到/這天 | 東北/季風/發威/!/4/縣市/豪/大雨/特報/「/雨下/整夜/」/ /一路/濕到/這天 |
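In the table above, HMM ON merges 雨/下 into 雨下 and 濕/到 into 濕到. In jieba-style segmenters this comes from tagging characters not covered by the dictionary with B/M/E/S labels (begin, middle, end, single) and decoding the best label sequence with the Viterbi algorithm. The sketch below shows that decoding step with invented toy probabilities; it is not the project's C++ implementation, and all names and numbers are assumptions.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a word, or Single-character word
# States that may legally precede each state.
PREV = {"B": "ES", "M": "BM", "E": "BM", "S": "ES"}
MIN_LOG = -3.14e100  # stand-in for log(0)


def viterbi(obs, start_p, trans_p, emit_p):
    """Most probable B/M/E/S sequence for a non-empty string of characters."""
    scores = {s: start_p.get(s, MIN_LOG) + emit_p[s].get(obs[0], MIN_LOG) for s in STATES}
    paths = {s: [s] for s in STATES}
    for ch in obs[1:]:
        new_scores, new_paths = {}, {}
        for s in STATES:
            emit = emit_p[s].get(ch, MIN_LOG)
            score, prev = max(
                (scores[p] + trans_p[p].get(s, MIN_LOG) + emit, p) for p in PREV[s]
            )
            new_scores[s] = score
            new_paths[s] = paths[prev] + [s]
        scores, paths = new_scores, new_paths
    # A sentence must end on a complete word, so only E or S may come last.
    _, last = max((scores[s], s) for s in "ES")
    return paths[last]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])
    return words


# Toy log-probabilities, invented for this example only.
half = math.log(0.5)
start_p = {"B": half, "S": half}
trans_p = {"B": {"M": half, "E": half}, "M": {"M": half, "E": half},
           "E": {"B": half, "S": half}, "S": {"B": half, "S": half}}
emit_p = {"B": {"雨": math.log(0.9), "下": math.log(0.1)}, "M": {},
          "E": {"雨": math.log(0.1), "下": math.log(0.9)},
          "S": {"雨": math.log(0.1), "下": math.log(0.1)}}

tags = viterbi("雨下", start_p, trans_p, emit_p)
print(tags, tags_to_words("雨下", tags))
```

Because the toy model makes 雨 likely to begin a word and 下 likely to end one, the decoder joins them, which mirrors the HMM ON behavior in the table. Real models are trained on a corpus and cover thousands of characters.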

POS tagging differences

| Mode | Original jieba_fast | jieba_fast_dat |
|---|---|---|
| HMM OFF | 東/zg 北/ns 季/n 風/zg 發/zg 威/ns !/x 4/eng 縣/x 市/n 豪/n 大雨/n 特/d 報/zg 「/x 雨/n 下/f 整夜/b 」/x  /x 一路/m 濕/x 到/v 這/zg 天/q | 東北/ns 季風/n 發威/v !/x 4/eng 縣市/n 豪/n 大雨/n 特報/n 「/x 雨/n 下/f 整夜/b 」/x  /x 一路/m 濕/x 到/v 這天/r |
| HMM ON | 東北/ns 季風/n 發威/v !/x 4/m 縣/n 市豪/n 大雨/n 特報/n 「/x 雨/n 下/f 整夜/b 」/x  /x 一路/m 濕到/v 這天/r | 東北/ns 季風/n 發威/v !/x 4/x 縣市/n 豪/n 大雨/n 特報/n 「/x 雨/n 下/f 整夜/b 」/x  /x 一路/m 濕到/x 這天/r |
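Differences in long tagged rows like those above are hard to spot by eye. The following sketch, with a hypothetical helper `tag_diff`, compares two lists of (word, flag) pairs and reports what appears on only one side. The sample data is an excerpt of the HMM ON row; the set-based comparison ignores position and repeated tokens, which is a deliberate simplification.

```python
def tag_diff(left, right):
    """Return (pairs only in left, pairs only in right) as two sets."""
    left_set, right_set = set(left), set(right)
    return left_set - right_set, right_set - left_set


# Excerpt of the HMM ON row: original jieba_fast vs. jieba_fast_dat.
jieba_fast_out = [("4", "m"), ("縣", "n"), ("市豪", "n"), ("大雨", "n"),
                  ("濕到", "v"), ("這天", "r")]
jieba_fast_dat_out = [("4", "x"), ("縣市", "n"), ("豪", "n"), ("大雨", "n"),
                      ("濕到", "x"), ("這天", "r")]

only_fast, only_dat = tag_diff(jieba_fast_out, jieba_fast_dat_out)
print("only in jieba_fast:    ", only_fast)
print("only in jieba_fast_dat:", only_dat)
```

For this excerpt the differences are the split of 縣市豪 and the tags assigned to 4 and 濕到; tokens such as 大雨/n that agree on both sides drop out.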

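Support and encouragement

If you value efficiency, speed, and stability, and appreciate our small contribution to Chinese NLP:

⭐ Star the project! Your recognition is our biggest motivation to keep developing.

📢 Spread the word! Let every developer still suffering from slow loading know about this tool.

🤝 Open an Issue or PR! You are welcome to join us and make this tool even better.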

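📄 License

jieba_fast_dat is released under the MIT License. See the LICENSE file for details.

🤝 Contributing

Contributions of any kind are welcome! If you have suggestions, feature requests, or bug reports, feel free to open an Issue or submit a Pull Request.

🌟 Acknowledgements

This project is an optimized and enhanced version built on the jieba and jieba_fast libraries. Thanks to the original authors and all contributors.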
