Chinese character to pinyin conversion tool with multi-process sharing support

Pinyin-Bridge User Guide

⚠️ The data set is currently limited (about 8,000 entries). It will be expanded and completed in later releases.

Installation

pip install pinyin-bridge

Install in development mode:

git clone <repository>
cd pinyin-bridge
pip install -e .

Local development environment with uv:

uv venv .venv
uv pip install -e ".[dev]"
./.venv/bin/python -B -m pytest -q

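Word-level data is maintained in seed files inside the repository, located at pinyin_bridge/data/seed/*.tsv. After editing them, run the script below for a basic validation pass: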

./.venv/bin/python scripts/validate_seed_data.py

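To keep the data correct, every record in a seed file must carry review_status and source_ref. The runtime currently loads only approved data, and any entry that disagrees with pypinyin's default dictionary must state the reason in notes.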
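The rule above can be pictured with a small standalone sketch. It is not the project's validate_seed_data.py: the column names other than review_status, source_ref, and notes, the sample rows, and the load_approved helper are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical column layout and sample rows; the real seed files may differ.
SAMPLE_TSV = (
    "text\tpinyin\treview_status\tsource_ref\tnotes\n"
    "中国\tzhōng guó\tapproved\tdict-a\t\n"
    "重庆\tchóng qìng\tpending\tdict-a\t\n"
    "银行\tyín háng\tapproved\t\t\n"
)

REQUIRED_FIELDS = ("review_status", "source_ref")


def load_approved(tsv_text):
    """Return (approved_rows, errors) for a seed file given as text."""
    approved, errors = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [name for name in REQUIRED_FIELDS
                   if not (row.get(name) or "").strip()]
        if missing:
            # A record without the required metadata is rejected outright.
            errors.append((line_no, missing))
            continue
        if row["review_status"] == "approved":
            # Only approved records would reach the runtime.
            approved.append(row)
    return approved, errors


approved, errors = load_approved(SAMPLE_TSV)
print([row["text"] for row in approved])  # ['中国']
print(errors)                             # [(4, ['source_ref'])]
```

The pending row is valid but skipped, and the row with an empty source_ref is reported with its line number.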

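Entries that need manual correction all go in pinyin_bridge/data/seed/manual_reviewed.tsv. Keep this data as small as possible: include only high-confidence overrides with a citable source, and ideally add each one to the golden-sample tests as well.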

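Candidate words awaiting review are maintained separately in pinyin_bridge/data/seed/review_candidates.tsv and do not enter the runtime by default. The script below exports an audit report: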

./.venv/bin/python scripts/audit_review_candidates.py

The report lists whether each candidate is currently missing, covered, or override, so you can confirm them manually in batches.
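One plausible reading of the three statuses is sketched below. The definitions are an assumption, not taken from audit_review_candidates.py: missing means the default dictionary has no entry, covered means the default reading already equals the candidate, and override means the two differ. The audit_status helper and the sample readings are made up for the demo.

```python
# Stand-in for the default dictionary; the readings are demo values only.
DEFAULT_READINGS = {
    "中国": "zhōng guó",
    "长大": "cháng dà",
}


def audit_status(text, candidate_pinyin, default_readings):
    """Classify one review candidate against the default readings."""
    default = default_readings.get(text)
    if default is None:
        return "missing"    # no default reading at all
    if default == candidate_pinyin:
        return "covered"    # the default already agrees with the candidate
    return "override"       # the candidate would replace the default


candidates = [
    ("中国", "zhōng guó"),
    ("长大", "zhǎng dà"),
    ("乔布斯", "qiáo bù sī"),
]
for text, reading in candidates:
    print(text, audit_status(text, reading, DEFAULT_READINGS))
# 中国 covered
# 长大 override
# 乔布斯 missing
```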

Quick start

import pinyin_bridge

print(pinyin_bridge.pinyin("中国"))
# ['zhong', 'guo']

print(pinyin_bridge.pinyin("中国", style="num"))
# ['zhong1', 'guo2']

print(pinyin_bridge.pinyin("重", heteronym=True))
# [['zhong', 'chong', 'tong']]

print(pinyin_bridge.classify("中国", kind="final"))
# [{'name': '后鼻音韵母', 'detail': [['中', 'zhōng']]}, ...]

print(pinyin_bridge.segment("北京欢迎你"))
# ['北京', '欢迎', '你']

print(pinyin_bridge.reverse("zhong1"))
# [{'text': '中', 'pinyin': 'zhōng', 'scope': 'char'}]

Public API

The public interface currently consists of only the following functions:

  • analyze(text, *, segment=True, heteronym=False)
  • pinyin(text, *, style="plain", unit="char", segment=True, heteronym=False)
  • classify(text, *, kind="all", segment=True)
  • segment(text)
  • reverse(query, *, tone="auto", scope="all")
  • add_word(text, pinyin, source="user")
  • add_words(words, source="user")
  • delete_word(text, source="user")
  • list_words(source="user")
  • clear_words()

pinyin()

pinyin_bridge.pinyin("中国")
# ['zhong', 'guo']

pinyin_bridge.pinyin("中国", style="mark")
# ['zhōng', 'guó']

pinyin_bridge.pinyin("北京欢迎你", style="mark", unit="word")
# ['běi jīng', 'huān yíng', 'nǐ']

pinyin_bridge.pinyin("重", heteronym=True)
# [['zhong', 'chong', 'tong']]

Parameters:

  • style: plain / num / mark
  • unit: char / word
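  • segment: whether to segment the text into words before analysis
  • heteronym: whether to output candidate readings for polyphonic characters. When enabled, a position holding a polyphonic character returns a list of candidates, while an unambiguous reading is still returned as a string. In multi-character contexts, segmentation and pypinyin's context-based disambiguation narrow the candidates first, so the combinations for the whole sentence are not enumerated exhaustively.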
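To make the three style values concrete, here is a standalone sketch that renders one numbered syllable as plain, num, or mark. It does not use the library's internals. The render helper is hypothetical, and the tone-mark placement follows the usual convention: a or e takes the mark, then the o of ou, otherwise the last vowel.

```python
import re

TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def render(syllable_num, style="plain"):
    """Render a numbered syllable such as 'zhong1' in the given style."""
    match = re.fullmatch(r"([a-zü]+)([1-5]?)", syllable_num)
    if match is None:
        raise ValueError(f"not a numbered pinyin syllable: {syllable_num!r}")
    base, tone = match.group(1), match.group(2)
    if style == "plain":
        return base
    if style == "num":
        return syllable_num
    if style != "mark":
        raise ValueError(f"unknown style: {style!r}")
    if tone in ("", "5"):  # neutral tone carries no mark
        return base
    if "a" in base:
        index = base.index("a")
    elif "e" in base:
        index = base.index("e")
    elif "ou" in base:
        index = base.index("o")
    else:
        index = max(base.rfind(vowel) for vowel in TONE_MARKS)
    if index < 0:  # no vowel to carry the mark, e.g. 'ng'
        return base
    marked = TONE_MARKS[base[index]][int(tone) - 1]
    return base[:index] + marked + base[index + 1:]


for style in ("plain", "num", "mark"):
    print(style, [render(s, style) for s in ("zhong1", "guo2")])
# plain ['zhong', 'guo']
# num ['zhong1', 'guo2']
# mark ['zhōng', 'guó']
```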

analyze()

analysis = pinyin_bridge.analyze("中国")
print([char.chosen.final for token in analysis.tokens for char in token.chars if char.chosen])
# ['ong', 'uo']

Returns a TextAnalysis, suited to cases that need access to tokens, candidate readings, and fine-grained analysis results.

classify()

pinyin_bridge.classify("中国")
# {
#   '声母': [...],
#   '韵母': [...],
#   '声调': [...],
# }

pinyin_bridge.classify("中国", kind="final")
# [{'name': '后鼻音韵母', 'detail': [['中', 'zhōng']]}, ...]

kind accepts all, initial, final, or tone.

segment()

pinyin_bridge.segment("北京欢迎你")
# ['北京', '欢迎', '你']

reverse()

pinyin_bridge.reverse("zhong")
pinyin_bridge.reverse("zhong1")

Looks up text by pinyin and returns a list of entries containing text, pinyin, and scope.

Parameters:

  • tone: auto / strict / ignore
  • scope: all / char / word
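The document does not spell out what the three tone modes do, so the sketch below shows only one plausible interpretation: strict requires the tone to match, ignore disregards tones, and auto behaves as strict when the query ends in a tone digit and as ignore otherwise. The INDEX table and the lookup helper are hypothetical and independent of the library.

```python
import re

# Tiny illustrative index: numbered pinyin -> characters (demo data only).
INDEX = {
    "zhong1": ["中", "钟"],
    "zhong4": ["重", "众"],
    "chong2": ["重", "虫"],
}


def lookup(query, tone="auto"):
    """Return characters whose reading matches query under the tone mode."""
    if tone == "auto":
        # Assumed rule: a trailing tone digit makes the match strict.
        tone = "strict" if re.search(r"[1-5]$", query) else "ignore"
    if tone == "strict":
        return list(INDEX.get(query, []))
    if tone != "ignore":
        raise ValueError(f"unknown tone mode: {tone!r}")
    base = query.rstrip("12345")
    hits = []
    for key, chars in INDEX.items():
        if key.rstrip("12345") != base:
            continue
        for char in chars:
            if char not in hits:  # keep first occurrence, drop duplicates
                hits.append(char)
    return hits


print(lookup("zhong1"))                 # ['中', '钟']
print(lookup("zhong"))                  # ['中', '钟', '重', '众']
print(lookup("zhong1", tone="ignore"))  # ['中', '钟', '重', '众']
print(lookup("zhong", tone="strict"))   # []
```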

User dictionary

pinyin_bridge.add_word("乔布斯", "qiáo bù sī")
pinyin_bridge.add_words({"苹果": "píng guǒ"})
print(pinyin_bridge.list_words())
pinyin_bridge.delete_word("乔布斯")
pinyin_bridge.clear_words()

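By default the user dictionary is written to pinyin_bridge/data/dict.db inside the package. If that path is not writable, set an environment variable to choose another location: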

export PINYIN_BRIDGE_DB_PATH=/path/to/dict.db

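When the default path is not writable and PINYIN_BRIDGE_DB_PATH is not set explicitly, the library automatically falls back to a writable database in the system temporary directory.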
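The resolution order described above could look roughly like the following. This is a sketch under assumptions, not the library's code: the resolve_db_path helper, the writability check on the parent directory, and the pinyin_bridge/dict.db location under the temporary directory are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_db_path(default_path, environ=None):
    """Pick the user-dictionary path: env var, then default, then temp dir."""
    environ = os.environ if environ is None else environ
    explicit = environ.get("PINYIN_BRIDGE_DB_PATH")
    if explicit:
        return Path(explicit)  # an explicit setting always wins
    default_path = Path(default_path)
    if os.access(default_path.parent, os.W_OK):
        return default_path    # the package data directory is writable
    # Hypothetical fallback location inside the system temporary directory.
    return Path(tempfile.gettempdir()) / "pinyin_bridge" / "dict.db"


print(resolve_db_path("/pkg/data/dict.db",
                      {"PINYIN_BRIDGE_DB_PATH": "/srv/dict.db"}))
```

Passing environ as a plain dict keeps the function easy to exercise without touching the real process environment.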

Multi-process sharing

User-dictionary changes are synchronized across processes through Redis. Set an environment variable to enable it:

export PINYIN_BRIDGE_REDIS_URL=redis://localhost:6379

Or specify it in code:

from pinyin_bridge.data.shared_store import SharedDict

# Force Redis
shared = SharedDict("pinyin_bridge:user", mode="redis")

# Force in-memory storage
shared = SharedDict("pinyin_bridge:user", mode="memory")

# Auto-detect (default)
shared = SharedDict("pinyin_bridge:user")

Example operations:

shared.set("中国", "zhong guo")
shared.get("中国")  # 'zhong guo'
shared.bulk_set([("你好", "ni hao"), ("世界", "shi jie")])
shared.get_all()  # {'你好': 'ni hao', '世界': 'shi jie'}
shared.delete("中国")
shared.clear()
shared.is_using_redis  # True/False

All user-dictionary operations (add_word, delete_word, clear_words) are synchronized to Redis automatically, so multiple worker processes can share them in real time.

CLI usage

pinyin-bridge "中国"
pinyin-bridge "中国" --style num
pinyin-bridge "北京欢迎你" --unit word
pinyin-bridge "中国" --classify final
pinyin-bridge "zhong1" --reverse --json
pinyin-bridge --add "乔布斯:qiáo bù sī"
pinyin-bridge -f input.txt -o output.txt

Supported options:

  • --style {plain,num,mark}
  • --unit {char,word}
  • --classify [initial|final|tone|all]
  • --reverse
  • --json
  • --add TEXT:PINYIN
  • -f, --file
  • -o, --output

Notes:

  • Plain pinyin output defaults to space-separated text.
  • --classify and --reverse output structured JSON.
  • --json outputs an array payload with text / result fields.
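As a rough picture of an array payload with text / result fields, the sketch below builds one in plain Python. The exact shape the CLI emits is not documented here, so the one-object-per-input layout and the build_payload helper are assumptions.

```python
import json


def build_payload(items):
    """Build a list of {'text', 'result'} objects from (text, result) pairs."""
    return [{"text": text, "result": result} for text, result in items]


payload = build_payload([("中国", ["zhong", "guo"])])
# ensure_ascii=False keeps the Chinese text readable in the output.
print(json.dumps(payload, ensure_ascii=False))
# [{"text": "中国", "result": ["zhong", "guo"]}]
```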

Version

Current package version: 0.2.0

Changelog

0.2.0

  • Dictionary data is now embedded in the package (pinyin_bridge/data/dict.db)
  • Added Redis shared storage for multi-process synchronization
  • Seed files moved to part1_characters_new.tsv and part2_phrases_new.tsv
