A Chinese character to pinyin conversion tool with multi-process sharing support

Pinyin-Bridge User Guide

Installation

pip install pinyin-bridge

Development-mode install:

git clone <repository>
cd pinyin-bridge
pip install -e .

Local development environment with uv:

uv venv .venv
uv pip install -e ".[dev]"
./.venv/bin/python -B -m pytest -q

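Word-level data is maintained in seed files inside the repository, under pinyin_bridge/data/seed/*.tsv. After editing them, run the following script for a basic validation: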

./.venv/bin/python scripts/validate_seed_data.py

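To keep the data correct, every record in a seed file must carry both review_status and source_ref. The runtime currently loads only approved records, and any entry that disagrees with the pypinyin default dictionary must state the reason in notes.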
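
The seed rules above can be sketched as a small validator. This is only an illustration and not the logic of scripts/validate_seed_data.py: the column names text, pinyin and notes, the header row, and the default_readings mapping (a stand-in for the pypinyin default dictionary) are all assumptions.

```python
import csv
import io

REQUIRED_FIELDS = ("review_status", "source_ref")


def validate_rows(tsv_text, default_readings):
    """Return (loadable_rows, errors) for a seed file given as TSV text.

    default_readings is a hypothetical {text: pinyin} mapping that stands in
    for the pypinyin default dictionary.
    """
    loadable, errors = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            errors.append(f"line {line_no}: missing {', '.join(missing)}")
            continue
        default = default_readings.get(row["text"])
        differs = default is not None and default != row["pinyin"]
        if differs and not (row.get("notes") or "").strip():
            errors.append(f"line {line_no}: differs from default, notes required")
            continue
        # Only approved records are treated as loadable at runtime.
        if row["review_status"] == "approved":
            loadable.append(row)
    return loadable, errors


SAMPLE = (
    "text\tpinyin\treview_status\tsource_ref\tnotes\n"
    "中国\tzhōng guó\tapproved\tdict-a\t\n"
    "银行\tyín háng\tpending\tdict-a\t\n"
    "重庆\tchóng qìng\tapproved\t\t\n"
)

rows, problems = validate_rows(SAMPLE, {"中国": "zhōng guó"})
print([r["text"] for r in rows])
print(problems)
```

In this sample only the first record is loadable: the second is not approved, and the third lacks source_ref.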

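Entries that need manual correction all go in pinyin_bridge/data/seed/manual_reviewed.tsv. Keep this data as small as possible: include only high-confidence overrides whose source can be explained, and preferably add each one to the golden-sample tests as well.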

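Candidate words awaiting review are kept separately in pinyin_bridge/data/seed/review_candidates.tsv and do not enter the runtime by default. The following script exports the audit results: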

./.venv/bin/python scripts/audit_review_candidates.py

The report lists whether each candidate is currently missing, covered, or override, so that you can confirm entries manually in batches.
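
One plausible reading of the three statuses is sketched below. The definitions are assumptions, not taken from scripts/audit_review_candidates.py: missing means the runtime has no reading for the word, covered means the runtime reading already equals the candidate, and override means the candidate would change the runtime reading. The function names and sample data are hypothetical.

```python
def audit_status(candidate_pinyin, current_pinyin):
    """Classify one candidate against the reading the runtime currently gives."""
    if current_pinyin is None:
        return "missing"
    if current_pinyin == candidate_pinyin:
        return "covered"
    return "override"


def audit(candidates, current):
    """Group candidate words by status; both arguments are {text: pinyin} dicts."""
    report = {"missing": [], "covered": [], "override": []}
    for text, pinyin in candidates.items():
        report[audit_status(pinyin, current.get(text))].append(text)
    return report


report = audit(
    {"中国": "zhōng guó", "银行": "yín háng", "乔布斯": "qiáo bù sī"},
    {"中国": "zhōng guó", "银行": "yín xíng"},
)
print(report)
```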

Quick Start

import pinyin_bridge

print(pinyin_bridge.pinyin("中国"))
# ['zhong', 'guo']

print(pinyin_bridge.pinyin("中国", style="num"))
# ['zhong1', 'guo2']

print(pinyin_bridge.pinyin("重", heteronym=True))
# [['zhong', 'chong', 'tong']]

print(pinyin_bridge.classify("中国", kind="final"))
# [{'name': '后鼻音韵母', 'detail': [['中', 'zhōng']]}, ...]

print(pinyin_bridge.segment("北京欢迎你"))
# ['北京', '欢迎', '你']

print(pinyin_bridge.reverse("zhong1"))
# [{'text': '中', 'pinyin': 'zhōng', 'scope': 'char'}]

Public API

The public interface currently consists of the following functions only:

  • analyze(text, *, segment=True, heteronym=False)
  • pinyin(text, *, style="plain", unit="char", segment=True, heteronym=False)
  • classify(text, *, kind="all", segment=True)
  • segment(text)
  • reverse(query, *, tone="auto", scope="all")
  • add_word(text, pinyin, source="user")
  • add_words(words, source="user")
  • delete_word(text, source="user")
  • list_words(source="user")
  • clear_words()

pinyin()

pinyin_bridge.pinyin("中国")
# ['zhong', 'guo']

pinyin_bridge.pinyin("中国", style="mark")
# ['zhōng', 'guó']

pinyin_bridge.pinyin("北京欢迎你", style="mark", unit="word")
# ['běi jīng', 'huān yíng', 'nǐ']

pinyin_bridge.pinyin("重", heteronym=True)
# [['zhong', 'chong', 'tong']]

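Parameters:

  • style: plain / num / mark
  • unit: char / word
  • segment: whether to segment the text into words before analysis
  • heteronym: whether to output candidate readings for polyphonic characters. When enabled, positions holding a polyphonic character return a list of candidates, while unambiguous readings are still returned as strings. In multi-character contexts, the segmentation result and pypinyin's contextual disambiguation narrow the candidates first, so the output does not enumerate every combination for the whole sentence.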
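
To make the relationship between the three style values concrete, here is a minimal sketch that derives the plain and mark forms from a num-style syllable. It is not how the library produces its output; the function names are hypothetical, and it assumes the usual tone-mark placement rule (a or e takes the mark, then the o in ou, otherwise the last vowel) with v standing for ü.

```python
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def split_tone(syllable):
    """Split 'zhong1' into ('zhong', 1); 0 means neutral or no tone digit."""
    if syllable and syllable[-1] in "12345":
        tone = int(syllable[-1])
        return syllable[:-1], 0 if tone == 5 else tone
    return syllable, 0


def mark_position(base):
    """Index of the vowel that carries the tone mark, or -1 if there is none."""
    for vowel in "ae":
        if vowel in base:
            return base.index(vowel)
    if "ou" in base:
        return base.index("ou")
    for i in range(len(base) - 1, -1, -1):
        if base[i] in TONE_MARKS:
            return i
    return -1


def num_to_mark(syllable):
    """Convert a num-style syllable such as 'guo2' to mark style."""
    base, tone = split_tone(syllable)
    base = base.replace("v", "ü")
    pos = mark_position(base)
    if tone == 0 or pos < 0:
        return base
    return base[:pos] + TONE_MARKS[base[pos]][tone - 1] + base[pos + 1:]


def num_to_plain(syllable):
    """Drop the tone digit, giving the plain style."""
    return split_tone(syllable)[0]


print([num_to_mark(s) for s in ["zhong1", "guo2"]])
print([num_to_plain(s) for s in ["zhong1", "guo2"]])
```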

analyze()

analysis = pinyin_bridge.analyze("中国")
print([char.chosen.final for token in analysis.tokens for char in token.chars if char.chosen])
# ['ong', 'uo']

Returns a TextAnalysis, suited to cases that need access to tokens, candidate readings, and fine-grained analysis results.

classify()

pinyin_bridge.classify("中国")
# {
#   '声母': [...],
#   '韵母': [...],
#   '声调': [...],
# }

pinyin_bridge.classify("中国", kind="final")
# [{'name': '后鼻音韵母', 'detail': [['中', 'zhōng']]}, ...]

kind accepts all, initial, final, or tone.
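
The initial and final kinds rest on splitting a syllable into its initial and its final. The sketch below shows that split on plain-style syllables, plus a simple nasal-final check. It is an illustration only: the helper names and English group labels are hypothetical (the library returns Chinese group names such as '后鼻音韵母'), and y and w are assumed not to count as initials.

```python
# Two-letter initials come first so that "zh" wins over "z".
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
)


def split_syllable(plain):
    """Split a plain-style syllable into (initial, final)."""
    for initial in INITIALS:
        if plain.startswith(initial):
            return initial, plain[len(initial):]
    return "", plain  # zero-initial syllable such as "an" or "er"


def final_group(final):
    """Rough grouping of a final by its nasal ending."""
    if final.endswith("ng"):
        return "back nasal"
    if final.endswith("n"):
        return "front nasal"
    return "other"


for syllable in ["zhong", "guo", "huan"]:
    initial, final = split_syllable(syllable)
    print(syllable, initial, final, final_group(final))
```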

segment()

pinyin_bridge.segment("北京欢迎你")
# ['北京', '欢迎', '你']

reverse()

pinyin_bridge.reverse("zhong")
pinyin_bridge.reverse("zhong1")

Looks up text by pinyin and returns a list of entries containing text, pinyin, and scope.

Parameters:

  • tone: auto / strict / ignore
  • scope: all / char / word
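
The exact semantics of the tone modes are not spelled out here, so the following sketch shows just one plausible interpretation: strict matches the tone digit exactly, ignore matches any tone, and auto picks strict when the query carries a tone digit and ignore otherwise. The index contents and the lookup function are hypothetical and unrelated to the library's internals.

```python
# Hypothetical reverse index keyed by num-style pinyin.
INDEX = {
    "zhong1": ["中", "钟"],
    "zhong4": ["重", "众"],
    "guo2": ["国"],
}


def lookup(query, tone="auto", index=INDEX):
    """Return the texts whose reading matches query under the given tone mode."""
    base = query.rstrip("12345")
    has_digit = len(base) < len(query)
    if tone == "auto":
        tone = "strict" if has_digit else "ignore"
    if tone == "strict":
        # Exact key match; a query without a tone digit matches nothing here.
        return list(index.get(query, []))
    # tone == "ignore": compare keys with their tone digits stripped.
    return [
        text
        for key, texts in index.items()
        if key.rstrip("12345") == base
        for text in texts
    ]


print(lookup("zhong1"))
print(lookup("zhong"))
print(lookup("zhong1", tone="ignore"))
```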

User Dictionary

pinyin_bridge.add_word("乔布斯", "qiáo bù sī")
pinyin_bridge.add_words({"苹果": "píng guǒ"})
print(pinyin_bridge.list_words())
pinyin_bridge.delete_word("乔布斯")
pinyin_bridge.clear_words()

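By default, the user dictionary is written to pinyin_bridge/data/dict.db inside the package. If that path is not writable, set the location through an environment variable: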

export PINYIN_BRIDGE_DB_PATH=/path/to/dict.db

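When the default path is not writable and PINYIN_BRIDGE_DB_PATH is not set explicitly, the library automatically falls back to a writable database in the system temporary directory.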
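
The lookup order described above (explicit environment variable, then the default path, then a temporary directory) can be sketched as follows. This is an assumption-laden illustration, not the library's code: resolve_db_path is a hypothetical helper, and the pinyin_bridge subdirectory used for the fallback is made up.

```python
import os
import tempfile
from pathlib import Path


def resolve_db_path(default_path, env=None):
    """Pick the database path: env override, writable default, or temp fallback."""
    env = os.environ if env is None else env
    explicit = env.get("PINYIN_BRIDGE_DB_PATH")
    if explicit:
        return Path(explicit)
    default_path = Path(default_path)
    # A missing or read-only parent directory counts as not writable.
    if os.access(default_path.parent, os.W_OK):
        return default_path
    # Hypothetical fallback location inside the system temporary directory.
    return Path(tempfile.gettempdir()) / "pinyin_bridge" / default_path.name


print(resolve_db_path("/no/such/dir/dict.db", env={}))
print(resolve_db_path("/no/such/dir/dict.db",
                      env={"PINYIN_BRIDGE_DB_PATH": "/path/to/dict.db"}))
```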

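Multi-Process Sharing

User dictionary synchronization across processes is implemented with Redis. Set an environment variable to enable it: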

export PINYIN_BRIDGE_REDIS_URL=redis://localhost:6379

Or specify it in code:

from pinyin_bridge.data.shared_store import SharedDict

# Force Redis
shared = SharedDict("pinyin_bridge:user", mode="redis")

# Force in-memory storage
shared = SharedDict("pinyin_bridge:user", mode="memory")

# Auto-detect (default)
shared = SharedDict("pinyin_bridge:user")

Example operations:

shared.set("中国", "zhong guo")
shared.get("中国")  # 'zhong guo'
shared.bulk_set([("你好", "ni hao"), ("世界", "shi jie")])
shared.get_all()  # {'中国': 'zhong guo', '你好': 'ni hao', '世界': 'shi jie'}
shared.delete("中国")
shared.clear()
shared.is_using_redis  # True/False

All user dictionary operations (add_word, delete_word, clear_words) are synchronized to Redis automatically, so multiple worker processes can share them in real time.
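
For unit tests that should not depend on a Redis server, a small in-memory stand-in with the same method names as the SharedDict calls shown above can be handy. The class below is hypothetical and not part of pinyin_bridge. It shares nothing across processes, and returning None for a missing key is an assumption.

```python
class MemorySharedDict:
    """Hypothetical in-memory stand-in mirroring the SharedDict calls shown above."""

    def __init__(self, name):
        self.name = name
        self._data = {}

    @property
    def is_using_redis(self):
        return False

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)  # None when the key is absent (assumption)

    def bulk_set(self, pairs):
        # pairs is an iterable of (text, pinyin) tuples.
        self._data.update(pairs)

    def get_all(self):
        return dict(self._data)  # copy, so callers cannot mutate the store

    def delete(self, key):
        self._data.pop(key, None)

    def clear(self):
        self._data.clear()


shared = MemorySharedDict("pinyin_bridge:user")
shared.set("中国", "zhong guo")
shared.bulk_set([("你好", "ni hao"), ("世界", "shi jie")])
snapshot = shared.get_all()
shared.delete("中国")
print(snapshot)
print(shared.get("中国"))
```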

CLI Usage

pinyin-bridge "中国"
pinyin-bridge "中国" --style num
pinyin-bridge "北京欢迎你" --unit word
pinyin-bridge "中国" --classify final
pinyin-bridge "zhong1" --reverse --json
pinyin-bridge --add "乔布斯:qiáo bù sī"
pinyin-bridge -f input.txt -o output.txt

Supported options:

  • --style {plain,num,mark}
  • --unit {char,word}
  • --classify [initial|final|tone|all]
  • --reverse
  • --json
  • --add TEXT:PINYIN
  • -f, --file
  • -o, --output

Notes:

  • Plain pinyin output is space-separated text by default.
  • --classify and --reverse output structured JSON.
  • --json outputs an array payload with text / result fields.
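
As a rough model of the option surface listed above, the following argparse sketch accepts the same flags and renders the two output shapes described in the notes. It is not the actual CLI source. The defaults (plain, char), treating a bare --classify as all, and the helper names render and parse_add are assumptions.

```python
import argparse
import json


def build_parser():
    """Build a parser that accepts the options listed above."""
    parser = argparse.ArgumentParser(prog="pinyin-bridge")
    parser.add_argument("text", nargs="?")
    parser.add_argument("--style", choices=["plain", "num", "mark"], default="plain")
    parser.add_argument("--unit", choices=["char", "word"], default="char")
    # A bare --classify is assumed to mean "all".
    parser.add_argument("--classify", nargs="?", const="all",
                        choices=["initial", "final", "tone", "all"])
    parser.add_argument("--reverse", action="store_true")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--add", metavar="TEXT:PINYIN")
    parser.add_argument("-f", "--file")
    parser.add_argument("-o", "--output")
    return parser


def render(text, result, as_json):
    """Space-separated text by default, or a JSON array payload with text / result."""
    if as_json:
        return json.dumps([{"text": text, "result": result}], ensure_ascii=False)
    return " ".join(result)


def parse_add(spec):
    """Split a TEXT:PINYIN spec at the first colon."""
    text, _, pinyin = spec.partition(":")
    return text, pinyin


args = build_parser().parse_args(["中国", "--classify", "final"])
print(args.text, args.classify)
print(render("中国", ["zhong", "guo"], as_json=False))
print(render("中国", ["zhong", "guo"], as_json=True))
```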

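Version

Current package version: 0.2.0

Changelog

0.2.0

  • Dictionary data is embedded in the package (pinyin_bridge/data/dict.db)
  • Added Redis shared storage for multi-process synchronization
  • Seed files moved to part1_characters_new.tsv and part2_phrases_new.tsv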

