High-performance BM25 Chinese text search with jieba-rs tokenizer (Rust implementation)

BM25-Jieba Chinese Text Search

A high-performance BM25 library for Chinese text search, built with Rust and PyO3 and using jieba-rs for Chinese word segmentation.

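Features

  • 🚀 High performance: implemented in Rust, several times faster than a pure-Python implementation
  • 🔤 Chinese word segmentation: built-in jieba-rs tokenizer
  • 🎯 Precise search: the classic BM25 algorithm
  • 🐍 Python API: a simple, easy-to-use interface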

Installation

# Install in development mode
uv run maturin develop

# Or build a wheel
maturin build --release
pip install target/wheels/*.whl

Quick Start

from bm25_jieba import BM25

# Prepare the documents
documents = [
    "Python是一种广泛使用的高级编程语言",
    "机器学习是人工智能的一个分支",
    "深度学习是机器学习的子领域",
]

# Create and fit the model
bm25 = BM25(k1=1.5, b=0.75)
bm25.fit(documents)

# Search
results = bm25.search("机器学习", top_k=3)
for doc_idx, score in results:
    print(f"[{score:.4f}] {documents[doc_idx]}")

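API Reference

BM25(k1=1.5, b=0.75)

Creates a BM25 instance.

| Parameter | Type | Default | Description |
|---|---|---|---|
| k1 | float | 1.5 | Term-frequency saturation parameter |
| b | float | 0.75 | Document-length normalization parameter |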
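
To make the two parameters concrete, here is a minimal sketch of the term-frequency part of the textbook Okapi BM25 score. It is an illustration only: `tf_component` is a hypothetical helper written for this example, not part of bm25_jieba, and it assumes the library follows the textbook form.

```python
def tf_component(tf: float, doc_len: float, avg_doc_len: float,
                 k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency part of the textbook Okapi BM25 score for one term."""
    # b blends between no length normalization (b=0) and full normalization (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: in an average-length document, each extra occurrence adds less,
# and the value never exceeds k1 + 1.
saturation = [round(tf_component(tf, 100, 100), 3) for tf in (1, 2, 5, 20)]

# Length normalization: the same term count is worth less in a longer document.
short_doc = tf_component(3, 50, 100)
long_doc = tf_component(3, 200, 100)

# With b=0 the document length is ignored entirely.
no_norm_short = tf_component(3, 50, 100, b=0.0)
no_norm_long = tf_component(3, 200, 100, b=0.0)

print(saturation)  # [1.0, 1.429, 1.923, 2.326]
print(short_doc > long_doc, no_norm_short == no_norm_long)  # True True
```

A larger k1 lets repeated terms keep contributing for longer, while a larger b penalizes long documents more strongly.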

fit(documents: list[str])

Fits the model on a corpus of documents.

search(query: str, top_k: int | None = None) -> list[tuple[int, float]]

Searches for the most relevant documents and returns a list of (document index, score) tuples.

get_scores(query: str) -> list[float]

Returns the BM25 score of every document.
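
The two query methods are related: a ranked result list can be derived from per-document scores. The sketch below shows one way to do that in plain Python. `rank_from_scores` is a hypothetical helper, not a bm25_jieba function, and the scores are made up. How the library itself breaks ties, or whether it returns zero-score documents, is not documented here and may differ.

```python
from typing import List, Optional, Tuple


def rank_from_scores(scores: List[float],
                     top_k: Optional[int] = None) -> List[Tuple[int, float]]:
    """Turn per-document scores into (doc_index, score) pairs, best first."""
    # sorted() is stable, so documents with equal scores keep their index order.
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up scores standing in for the output of get_scores(query).
example_scores = [0.0, 1.25, 0.8]
top_two = rank_from_scores(example_scores, top_k=2)
print(top_two)  # [(1, 1.25), (2, 0.8)]
```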

Development

# Install dependencies
uv sync

# Build and install
uv run maturin develop

# Run the tests
uv run pytest

# Run the demo
uv run python examples/demo.py

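Tech Stack

| Component | Version | Purpose |
|---|---|---|
| PyO3 | 0.27.2 | Rust-Python bindings |
| maturin | 1.11.5 | Build tool |
| jieba-rs | 0.8.1 | Chinese word segmentation |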

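Benchmarks

Measured on an Apple M1 (10,000 documents, about 50 characters each):

| Metric | Result |
|---|---|
| Indexing speed | ~38,000 docs/s |
| Search throughput | ~2,000 QPS |
| Search latency | 0.5-1.2 ms |

# Run the benchmark
uv run python tests/benchmark.py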
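
For readers who want to take similar measurements on their own corpus, the following is a rough timing sketch. It is not the project's tests/benchmark.py, whose contents are not shown here. `measure_search` is a hypothetical helper that times any callable, and a stand-in function is used so the example runs without the library installed.

```python
import time
from typing import Callable, Dict, Sequence


def measure_search(search_fn: Callable[[str], object],
                   queries: Sequence[str], repeats: int = 100) -> Dict[str, float]:
    """Time repeated calls to search_fn and report QPS and mean latency."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search_fn(query)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    calls = repeats * len(queries)
    return {
        "calls": float(calls),
        "qps": calls / elapsed,
        "mean_latency_ms": 1000.0 * elapsed / calls,
    }


# Stand-in search function; swap in e.g. lambda q: bm25.search(q, top_k=10).
stats = measure_search(lambda q: sorted(q), ["机器学习", "深度学习"], repeats=50)
print(stats)
```

Numbers from a harness like this depend heavily on hardware, corpus size, and query mix, so they are only comparable within the same setup.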

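Algorithm Validation

Test corpus: 19 documents, 6 queries

| Check | Result | Notes |
|---|---|---|
| Formula correctness | ✅ | Manual calculation matches the implementation |
| Ranking consistency | ✅ 6/6 | Query rankings are identical to rank-bm25 |
| Absolute scores | ⚠️ | Differ slightly because of the +1 IDF correction (expected) |

# Run the validation
uv sync --group validation
uv run python tests/validate.py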
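
The ranking-consistency check can be pictured as comparing the document order induced by two score lists. The sketch below illustrates that idea with made-up scores. It is not the project's tests/validate.py, and `ranking` and `same_ranking` are hypothetical helper names.

```python
from typing import List, Sequence


def ranking(scores: Sequence[float]) -> List[int]:
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def same_ranking(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """True when both score lists order the documents identically."""
    return ranking(scores_a) == ranking(scores_b)


# Made-up scores for one query from two implementations: the absolute
# values differ, but the induced document order is the same.
ours = [0.31, 1.87, 0.95]
reference = [0.12, 1.40, 0.66]
print(ranking(ours), same_ranking(ours, reference))  # [1, 2, 0] True
```

Tied scores can make two otherwise equivalent rankings look different, so a real check may need a tolerance or an explicit tie-breaking rule.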

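Notes on the Algorithm Implementation

This implementation uses an IDF formula with a +1 correction:

IDF(t) = ln((N - df + 0.5) / (df + 0.5) + 1)

Here N is the number of documents in the corpus and df is the number of documents that contain term t.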
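
For context, the IDF is one factor of the per-term score. The block below writes out the textbook Okapi BM25 scoring function with this IDF plugged in. The source only states the IDF formula, so the rest is an assumption about the implementation, and the symbols f(t, D) (frequency of term t in document D), |D| (document length in tokens), and avgdl (average document length) are introduced here for illustration.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5} + 1\right)
```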

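Differences from the standard BM25Okapi:

| Formula | Characteristics |
|---|---|
| Standard: ln((N-df+0.5)/(df+0.5)) | Can produce a negative IDF |
| This implementation: ln(...+1) | Guarantees IDF ≥ 0 |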
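
The difference between the two formulas is easiest to see numerically. This small sketch evaluates both for a rare, a mid-frequency, and a very common term. The function names are illustrative, and the df values are made up. The corpus size of 19 simply mirrors the validation corpus.

```python
import math


def idf_standard(n_docs: int, df: int) -> float:
    """Standard form from the table; negative when df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def idf_plus_one(n_docs: int, df: int) -> float:
    """Form with the +1 correction; positive whenever 0 <= df <= n_docs."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)


# 19 documents; df is the number of documents containing the term.
for df in (1, 9, 15):
    print(df, round(idf_standard(19, df), 4), round(idf_plus_one(19, df), 4))
```

For the term that appears in 15 of 19 documents, the standard form goes negative while the +1 form stays positive. Both forms still give rarer terms a larger weight.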

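Implications

  • Consistent ranking - produced the same ranking as rank-bm25 on all 6 validation queries
  • ⚠️ Different absolute scores - score values differ slightly because of the +1 correction
  • Numerically stable - no negative values, so no extra handling is needed

This variant is fully adequate when only the relative ranking matters, not the absolute scores.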

License

MIT
