A tool for converting Chinese characters to pinyin, with support for multi-process sharing

Pinyin-Bridge User Guide

⚠️ The dataset is currently limited (about 8,000 entries). It will be expanded and completed in later updates.

Installation

pip install pinyin-bridge

Development-mode install:

git clone <repository>
cd pinyin-bridge
pip install -e .

Local development environment with uv:

uv venv .venv
uv pip install -e ".[dev]"
./.venv/bin/python -B -m pytest -q

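Word-level data is maintained in seed files inside the repository, under pinyin_bridge/data/seed/*.tsv. After editing them, run the script below for a basic validation pass: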

./.venv/bin/python scripts/validate_seed_data.py

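To keep the data correct, every record in a seed file must carry review_status and source_ref. The runtime currently loads only approved data, and any entry that disagrees with pypinyin's default dictionary must state the reason in notes.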
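The rules above can be sketched as a small checker. This is an illustration only, not the code in scripts/validate_seed_data.py: the column names text, pinyin and notes, the base_readings mapping (a stand-in for a pypinyin lookup), and the deliberately flawed sample rows are all assumptions.

```python
import csv
import io

REQUIRED_FIELDS = ("review_status", "source_ref")


def validate_seed_rows(tsv_text, base_readings):
    """Return a list of (line_number, message) problems found in a seed TSV.

    `base_readings` maps a word to its default-dictionary reading and stands
    in for a pypinyin lookup (an assumption that keeps the sketch offline).
    """
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                problems.append((line_number, f"missing {field}"))
        default = base_readings.get(row.get("text", ""))
        differs = default is not None and default != row.get("pinyin")
        if differs and not (row.get("notes") or "").strip():
            problems.append((line_number, "override without notes"))
    return problems


def approved_rows(tsv_text):
    """Keep only rows whose review_status is 'approved'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row.get("review_status") == "approved"]


# Hypothetical sample: the second row is intentionally wrong.
SAMPLE = (
    "text\tpinyin\treview_status\tsource_ref\tnotes\n"
    "中国\tzhōng guó\tapproved\tref-1\t\n"
    "银行\tyín xíng\tpending\t\t\n"
)

BASE = {"中国": "zhōng guó", "银行": "yín háng"}
print(validate_seed_rows(SAMPLE, BASE))
print([row["text"] for row in approved_rows(SAMPLE)])
```

In this sample, the second data row is reported twice: once for the empty source_ref and once for differing from the base reading without a note.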

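Entries that need manual correction all go in pinyin_bridge/data/seed/manual_reviewed.tsv. Keep this data to a minimum: include only high-confidence overrides whose source can be stated, and preferably add them to the golden-sample tests as well.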

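Candidates awaiting review are kept separately in pinyin_bridge/data/seed/review_candidates.tsv and do not enter the runtime by default. Use the script below to export an audit report: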

./.venv/bin/python scripts/audit_review_candidates.py

The report lists whether each candidate is currently missing, covered, or override, so that you can confirm them manually in batches.
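One plausible reading of the three audit labels is sketched below. The label names come from the report. The rules behind them, the audit_status helper, and the sample readings are assumptions, not the confirmed logic of scripts/audit_review_candidates.py.

```python
from collections import defaultdict


def audit_status(text, candidate_pinyin, runtime_readings):
    """Classify one review candidate against the readings already loaded."""
    current = runtime_readings.get(text)
    if current is None:
        return "missing"    # the runtime has no reading for this word yet
    if current == candidate_pinyin:
        return "covered"    # the runtime already produces the same reading
    return "override"       # accepting the candidate would change the reading


def group_by_status(candidates, runtime_readings):
    """Bucket (text, pinyin) candidates by label for batch review."""
    groups = defaultdict(list)
    for text, pinyin in candidates:
        groups[audit_status(text, pinyin, runtime_readings)].append(text)
    return dict(groups)


# Hypothetical data for illustration.
RUNTIME = {"中国": "zhōng guó", "重庆": "zhòng qìng"}
CANDIDATES = [("中国", "zhōng guó"), ("重庆", "chóng qìng"), ("乔布斯", "qiáo bù sī")]
print(group_by_status(CANDIDATES, RUNTIME))
```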

Quick start

import pinyin_bridge

print(pinyin_bridge.pinyin("中国"))
# ['zhong', 'guo']

print(pinyin_bridge.pinyin("中国", style="num"))
# ['zhong1', 'guo2']

print(pinyin_bridge.pinyin("重", heteronym=True))
# [['zhong', 'chong', 'tong']]

print(pinyin_bridge.classify("中国", kind="final"))
# [{'name': '后鼻音韵母', 'detail': [['中', 'zhōng']]}, ...]

print(pinyin_bridge.segment("北京欢迎你"))
# ['北京', '欢迎', '你']

print(pinyin_bridge.reverse("zhong1"))
# [{'text': '中', 'pinyin': 'zhōng', 'scope': 'char'}]

Public API

The public interface currently consists of only the following functions:

  • analyze(text, *, segment=True, heteronym=False)
  • pinyin(text, *, style="plain", unit="char", segment=True, heteronym=False)
  • classify(text, *, kind="all", segment=True)
  • segment(text)
  • reverse(query, *, tone="auto", scope="all")
  • add_word(text, pinyin, source="user")
  • add_words(words, source="user")
  • delete_word(text, source="user")
  • list_words(source="user")
  • clear_words()

pinyin()

pinyin_bridge.pinyin("中国")
# ['zhong', 'guo']

pinyin_bridge.pinyin("中国", style="mark")
# ['zhōng', 'guó']

pinyin_bridge.pinyin("北京欢迎你", style="mark", unit="word")
# ['běi jīng', 'huān yíng', 'nǐ']

pinyin_bridge.pinyin("重", heteronym=True)
# [['zhong', 'chong', 'tong']]

Parameters:

  • style: plain / num / mark
  • unit: char / word
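  • segment: whether to segment the text into words before analysis
  • heteronym: whether to output candidate readings for polyphonic characters. When enabled, a position holding a polyphonic character returns a list of candidates, while a position with a settled reading still returns a string. In multi-character context, segmentation and pypinyin's contextual disambiguation narrow the candidates first, so the result does not enumerate every combination for the whole sentence.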
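To make the relationship between the three style values concrete, here is a minimal standalone sketch that derives num-like and plain-like forms from a tone-marked syllable using Unicode decomposition. It is not the library's implementation; mark_to_num and mark_to_plain are hypothetical helpers, and leaving a neutral-tone syllable without a digit (and keeping ü as ü) are assumptions about conventions the source does not specify.

```python
import unicodedata

# Combining diacritics that carry the four Mandarin tones.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def mark_to_num(syllable):
    """Convert a tone-marked syllable such as 'zhōng' to 'zhong1'."""
    tone = ""
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]   # remember the tone, drop the mark
        else:
            kept.append(ch)
    # Recompose so a remaining diaeresis stays attached to its vowel (ü).
    return unicodedata.normalize("NFC", "".join(kept)) + tone


def mark_to_plain(syllable):
    """Drop the tone entirely: 'zhōng' -> 'zhong'."""
    return mark_to_num(syllable).rstrip("1234")


print([mark_to_num(s) for s in "běi jīng".split()])   # ['bei3', 'jing1']
print(mark_to_plain("guó"))                           # guo
```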

analyze()

analysis = pinyin_bridge.analyze("中国")
print([char.chosen.final for token in analysis.tokens for char in token.chars if char.chosen])
# ['ong', 'uo']

Returns a TextAnalysis, which suits cases that need access to tokens, candidate readings, and fine-grained analysis results.

classify()

pinyin_bridge.classify("中国")
# {
#   '声母': [...],
#   '韵母': [...],
#   '声调': [...],
# }

pinyin_bridge.classify("中国", kind="final")
# [{'name': '后鼻音韵母', 'detail': [['中', 'zhōng']]}, ...]

kind accepts all, initial, final, and tone.

segment()

pinyin_bridge.segment("北京欢迎你")
# ['北京', '欢迎', '你']

reverse()

pinyin_bridge.reverse("zhong")
pinyin_bridge.reverse("zhong1")

Looks up text by pinyin and returns a list of entries containing text, pinyin, and scope.

Parameters:

  • tone: auto / strict / ignore
  • scope: all / char / word
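The source does not spell out what the three tone modes do. The sketch below shows one plausible interpretation on a tiny hand-written index: strict requires the tone number to match, ignore matches on the toneless syllable, and auto is strict only when the query carries a tone number. The semantics, SAMPLE_INDEX, and reverse_lookup are all assumptions made for illustration.

```python
import re

# A tiny hand-written index of numbered pinyin -> characters (illustrative).
SAMPLE_INDEX = {"zhong1": ["中", "钟"], "zhong4": ["重", "众"], "guo2": ["国"]}


def split_tone(query):
    """Split 'zhong1' into ('zhong', '1'); no digit yields tone ''."""
    match = re.fullmatch(r"([a-zü]+)([1-5]?)", query.strip().lower())
    if match is None:
        raise ValueError(f"not a pinyin syllable: {query!r}")
    return match.group(1), match.group(2)


def reverse_lookup(query, tone="auto", index=SAMPLE_INDEX):
    base, digit = split_tone(query)
    if tone == "auto":
        # Assumed rule: be strict only when the query carries a tone number.
        tone = "strict" if digit else "ignore"
    if tone not in ("strict", "ignore"):
        raise ValueError(f"unknown tone mode: {tone!r}")
    hits = []
    for key, chars in index.items():
        key_base, key_digit = split_tone(key)
        if key_base != base:
            continue
        if tone == "strict" and key_digit != digit:
            continue
        hits.extend(chars)
    return hits


print(reverse_lookup("zhong1"))   # ['中', '钟']
print(reverse_lookup("zhong"))    # ['中', '钟', '重', '众']
```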

User dictionary

pinyin_bridge.add_word("乔布斯", "qiáo bù sī")
pinyin_bridge.add_words({"苹果": "píng guǒ"})
print(pinyin_bridge.list_words())
pinyin_bridge.delete_word("乔布斯")
pinyin_bridge.clear_words()

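By default, the user dictionary is written to pinyin_bridge/data/dict.db inside the package. If that path is not writable, specify another location through an environment variable: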

export PINYIN_BRIDGE_DB_PATH=/path/to/dict.db

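When the default path is not writable and PINYIN_BRIDGE_DB_PATH is not set explicitly, the library automatically falls back to a writable database in the system temp directory.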
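The lookup order described above (explicit environment variable, then the packaged default, then a temp-directory fallback) might look roughly like this. The function resolve_db_path and the pinyin_bridge subdirectory under the temp directory are hypothetical; only the variable name PINYIN_BRIDGE_DB_PATH comes from the documentation.

```python
import os
import tempfile
from pathlib import Path


def resolve_db_path(default_path, env=os.environ):
    """Pick a database path: explicit env var, then default, then temp dir."""
    explicit = env.get("PINYIN_BRIDGE_DB_PATH")
    if explicit:
        return Path(explicit)
    default_path = Path(default_path)
    # Check the file itself if it exists, otherwise its parent directory.
    target = default_path if default_path.exists() else default_path.parent
    if os.access(target, os.W_OK):
        return default_path
    # Assumed fallback location; the real file name and folder may differ.
    return Path(tempfile.gettempdir()) / "pinyin_bridge" / "dict.db"


print(resolve_db_path("/no-such-dir/dict.db", env={}))
```

Passing env as a parameter keeps the function easy to test without mutating the real process environment.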

Multi-process sharing

Redis is used to synchronize the user dictionary across processes. Enable it by setting an environment variable:

export PINYIN_BRIDGE_REDIS_URL=redis://localhost:6379

Or specify it in code:

from pinyin_bridge.data.shared_store import SharedDict

# Force Redis
shared = SharedDict("pinyin_bridge:user", mode="redis")

# Force in-memory storage
shared = SharedDict("pinyin_bridge:user", mode="memory")

# Auto-detect (default)
shared = SharedDict("pinyin_bridge:user")

Example operations:

shared.set("中国", "zhong guo")
shared.get("中国")  # 'zhong guo'
shared.bulk_set([("你好", "ni hao"), ("世界", "shi jie")])
shared.get_all()  # {'你好': 'ni hao', '世界': 'shi jie'}
shared.delete("中国")
shared.clear()
shared.is_using_redis  # True/False

All user-dictionary operations (add_word, delete_word, clear_words) are synchronized to Redis automatically, so multiple worker processes can share them in real time.
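For readers curious how an auto-detect mode can work in general, here is a minimal sketch of the pattern: try Redis when a URL is configured, and fall back to an in-process dict when the redis package is missing or the server does not answer. This is not SharedDict's implementation; MemoryStore, RedisStore and open_store are hypothetical names, and storing entries in a single Redis hash is an assumption.

```python
import os


class MemoryStore:
    """In-process fallback with the same small surface as the Redis branch."""

    is_using_redis = False

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class RedisStore:
    """Keeps entries in one Redis hash so every worker sees the same data."""

    is_using_redis = True

    def __init__(self, client, name):
        self._client = client
        self._name = name

    def set(self, key, value):
        self._client.hset(self._name, key, value)

    def get(self, key):
        return self._client.hget(self._name, key)


def open_store(name, url=None):
    """Return a RedisStore when a server answers a ping, else a MemoryStore."""
    url = url or os.environ.get("PINYIN_BRIDGE_REDIS_URL")
    if url:
        try:
            import redis  # optional dependency (redis-py)

            client = redis.Redis.from_url(
                url, decode_responses=True, socket_connect_timeout=0.5
            )
            client.ping()
            return RedisStore(client, name)
        except Exception:
            pass  # package missing or server unreachable: fall back below
    return MemoryStore()
```

Note that a memory fallback is private to each process, so entries written there are not visible to other workers.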

CLI usage

pinyin-bridge "中国"
pinyin-bridge "中国" --style num
pinyin-bridge "北京欢迎你" --unit word
pinyin-bridge "中国" --classify final
pinyin-bridge "zhong1" --reverse --json
pinyin-bridge --add "乔布斯:qiáo bù sī"
pinyin-bridge -f input.txt -o output.txt

Supported options:

  • --style {plain,num,mark}
  • --unit {char,word}
  • --classify [initial|final|tone|all]
  • --reverse
  • --json
  • --add TEXT:PINYIN
  • -f, --file
  • -o, --output

Notes:

  • Plain pinyin output is space-separated text by default.
  • --classify and --reverse output structured JSON.
  • --json outputs an array payload with text / result fields.
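The --add option takes its value in TEXT:PINYIN form. A minimal sketch of splitting such a value is shown below; parse_add_argument is a hypothetical helper, and the CLI's actual parsing and error handling may differ.

```python
def parse_add_argument(value):
    """Split 'TEXT:PINYIN' on the first colon into (text, pinyin)."""
    text, sep, pinyin = value.partition(":")
    text = text.strip()
    pinyin = " ".join(pinyin.split())  # collapse repeated whitespace
    if not sep or not text or not pinyin:
        raise ValueError(f"expected TEXT:PINYIN, got {value!r}")
    return text, pinyin


print(parse_add_argument("乔布斯:qiáo bù sī"))   # ('乔布斯', 'qiáo bù sī')
```

The resulting pair has the same shape as the arguments of add_word(text, pinyin).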

Version

Current package version: 0.2.0

Changelog

0.2.0

  • Dictionary data is embedded in the package (pinyin_bridge/data/dict.db)
  • Added Redis shared storage for multi-process synchronization
  • Seed files moved to part1_characters_new.tsv and part2_phrases_new.tsv
