High-performance BM25 Chinese text search with jieba-rs tokenizer (Rust implementation)

Project description

BM25-Jieba: Chinese Text Search

A high-performance BM25 library for Chinese text search, built with Rust and PyO3 and using jieba-rs for Chinese word segmentation.

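Features

  • 🚀 High performance: implemented in Rust, several times faster than pure Python
  • 🔤 Chinese word segmentation: built-in jieba-rs tokenizer
  • 🎯 Accurate search: the classic BM25 algorithm
  • 🐍 Python API: a simple, easy-to-use interface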

Installation

# Install in development mode
uv run maturin develop

# Or build a wheel and install it
maturin build --release
pip install target/wheels/*.whl

Quick Start

from bm25_jieba import BM25

# Prepare the documents
documents = [
    "Python是一种广泛使用的高级编程语言",
    "机器学习是人工智能的一个分支",
    "深度学习是机器学习的子领域",
]

# Create the model and fit it on the documents
bm25 = BM25(k1=1.5, b=0.75)
bm25.fit(documents)

# Search
results = bm25.search("机器学习", top_k=3)
for doc_idx, score in results:
    print(f"[{score:.4f}] {documents[doc_idx]}")

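API Reference

BM25(k1=1.5, b=0.75)

Create a BM25 instance.

| Parameter | Type | Default | Description |
|---|---|---|---|
| k1 | float | 1.5 | Term-frequency saturation parameter |
| b | float | 0.75 | Document-length normalization parameter |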
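
To make the roles of k1 and b concrete, here is a minimal sketch of the term-frequency component of the textbook BM25 formula. It is not taken from this library's Rust source: the helper name `tf_component` and its arguments are illustrative assumptions, and the library may differ in detail.

```python
def tf_component(tf: float, doc_len: float, avg_doc_len: float,
                 k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency part of the textbook BM25 score for a single term.

    Hypothetical helper for illustration; not part of the bm25_jieba API.
    """
    # b controls how strongly the document length (relative to the corpus
    # average) scales the denominator; b = 0 disables length normalization.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls saturation: the value grows with tf but stays below k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # Saturation: each extra occurrence adds less than the previous one.
    for tf in (1, 2, 5, 20):
        print(tf, round(tf_component(tf, doc_len=50, avg_doc_len=50), 4))
    # Length normalization: the same tf in a longer document scores lower.
    print(round(tf_component(2, doc_len=100, avg_doc_len=50), 4))
```

Under these assumptions, a larger k1 lets repeated terms keep contributing for longer, while a larger b penalizes long documents more strongly.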

fit(documents: list[str])

Fit the model on a corpus of documents.

search(query: str, top_k: int = None) -> list[tuple[int, float]]

Search for the most relevant documents and return a list of (document index, score) tuples.

get_scores(query: str) -> list[float]

Return the BM25 score of every document for the query.
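
The two query methods are presumably related: a top-k search result can be derived from the full score list. The sketch below shows that relationship in plain Python. The function `top_k_from_scores` is a hypothetical helper, not part of the library, and whether the library's `search` orders ties or filters zero-score documents this way is not stated in the source.

```python
import heapq
from typing import Optional


def top_k_from_scores(scores: list[float],
                      top_k: Optional[int] = None) -> list[tuple[int, float]]:
    """Turn per-document scores into (doc_index, score) pairs, best first.

    Hypothetical helper for illustration; not part of the bm25_jieba API.
    """
    indexed = list(enumerate(scores))
    if top_k is None:
        # No limit: return every document, highest score first.
        return sorted(indexed, key=lambda pair: pair[1], reverse=True)
    # heapq.nlargest avoids sorting the whole list when top_k is small.
    return heapq.nlargest(top_k, indexed, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(top_k_from_scores([0.0, 1.2, 0.7], top_k=2))
```

With the documented API, the input list would come from `bm25.get_scores(query)`.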

Development

# Install dependencies
uv sync

# Build and install
uv run maturin develop

# Run the tests
uv run pytest

# Run the demo
uv run python examples/demo.py

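Tech Stack

| Component | Version | Purpose |
|---|---|---|
| PyO3 | 0.27.2 | Rust-Python bindings |
| maturin | 1.11.5 | Build tool |
| jieba-rs | 0.8.1 | Chinese word segmentation |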

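Benchmarks

Measured on an Apple M1 (10,000 documents, about 50 characters each):

| Metric | Result |
|---|---|
| Indexing speed | ~38,000 docs/s |
| Search throughput | ~2,000 QPS |
| Search latency | 0.5-1.2 ms |

# Run the benchmark
uv run python tests/benchmark.py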
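
For readers who want to reproduce throughput and latency figures on their own data, the following is a minimal timing sketch. It is not the project's `tests/benchmark.py`, whose contents are not shown here; the function `measure_qps` and its parameters are assumptions made for illustration.

```python
import time
from typing import Callable, Sequence


def measure_qps(search: Callable[[str], object], queries: Sequence[str],
                repeats: int = 100) -> tuple[float, float]:
    """Return (queries per second, mean latency in ms) for a search callable.

    Hypothetical helper for illustration; not part of the bm25_jieba API.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search(query)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    total = repeats * len(queries)
    return total / elapsed, elapsed / total * 1000.0


if __name__ == "__main__":
    # Stand-in callable; replace it with a real search function.
    qps, latency_ms = measure_qps(lambda q: len(q), ["机器学习", "深度学习"])
    print(f"{qps:.0f} QPS, {latency_ms:.4f} ms mean latency")
```

With a fitted model, the callable could be `lambda q: bm25.search(q, top_k=10)`. Mean latency measured this way is single-threaded and includes the Python call overhead.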

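Algorithm Validation

Test corpus: 19 documents, 6 queries

| Check | Result | Notes |
|---|---|---|
| Formula correctness | Pass | Hand calculation matches the implementation |
| Ranking consistency | Pass | 6/6 queries rank identically to rank-bm25 |
| Absolute scores | ⚠️ | Differ slightly because of the IDF +1 correction (expected) |

# Run the validation
uv sync --group validation
uv run python tests/validate.py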

Implementation Notes

This implementation uses an IDF formula with a +1 correction:

IDF(t) = ln((N - df + 0.5) / (df + 0.5) + 1)

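Differences from standard BM25Okapi:

| Variant | Formula | Property |
|---|---|---|
| Standard | ln((N - df + 0.5) / (df + 0.5)) | Can produce a negative IDF |
| This implementation | ln((N - df + 0.5) / (df + 0.5) + 1) | Guarantees IDF ≥ 0 |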
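
The difference between the two formulas is easiest to see numerically. The sketch below evaluates both for a term that appears in most documents; the function names are illustrative and not part of the library. Note that (N - df + 0.5) / (df + 0.5) + 1 simplifies to (N + 1) / (df + 0.5), which exceeds 1 whenever df ≤ N, so the corrected IDF is always positive.

```python
import math


def idf_standard(n_docs: int, df: int) -> float:
    """IDF without the +1 correction: negative when df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def idf_plus_one(n_docs: int, df: int) -> float:
    """IDF with the +1 correction, as given in the formula above."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)


if __name__ == "__main__":
    n_docs = 19  # same size as the validation corpus
    for df in (1, 5, 10, 15, 19):
        print(df, round(idf_standard(n_docs, df), 4),
              round(idf_plus_one(n_docs, df), 4))
```

For df = 15 out of 19 documents the uncorrected value is negative while the corrected one stays positive.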

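Impact

  • Consistent ranking - results rank in the same order as standard implementations such as rank-bm25 (6/6 queries on the validation corpus)
  • ⚠️ Different absolute scores - because of the +1 correction, score values differ slightly
  • Numerically stable - no negative values, so no extra handling is needed

This variant is fully suitable when only the relative ranking matters, not the absolute scores.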
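
The effect on ranking and scores can be checked on a toy corpus. The sketch below is a pure-Python reference scorer written for this illustration, not the library's Rust implementation. The documents are pre-tokenized by hand (the segmentation is an assumption, not jieba-rs output), and the names `idf`, `bm25_scores`, and `ranking` are hypothetical. On this corpus both IDF variants give the same ordering; that is a property of this example (both query terms are rare, so neither IDF is negative), not a general guarantee.

```python
import math
from collections import Counter
from typing import Sequence


def idf(n_docs: int, df: int, plus_one: bool) -> float:
    """IDF with or without the +1 correction."""
    ratio = (n_docs - df + 0.5) / (df + 0.5)
    return math.log(ratio + 1.0) if plus_one else math.log(ratio)


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                plus_one: bool, k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every pre-tokenized document against the query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if tf[term]:
                score += (idf(n_docs, df[term], plus_one)
                          * tf[term] * (k1 + 1.0) / (tf[term] + length_norm))
        scores.append(score)
    return scores


def ranking(scores: Sequence[float]) -> list[int]:
    """Document indices sorted by score, best first (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hand-tokenized toy corpus (assumed segmentation).
DOCS = [
    ["python", "编程", "语言"],
    ["机器", "学习", "人工智能", "分支"],
    ["深度", "学习", "机器", "学习", "子领域"],
    ["数据库", "索引", "查询"],
    ["搜索", "引擎", "排序"],
]
QUERY = ["机器", "学习"]

standard = bm25_scores(QUERY, DOCS, plus_one=False)
corrected = bm25_scores(QUERY, DOCS, plus_one=True)

if __name__ == "__main__":
    print(ranking(standard), ranking(corrected))
    print([round(s, 4) for s in standard])
    print([round(s, 4) for s in corrected])
```

The absolute scores differ between the two variants, while the order of the matching documents is the same in this example.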

License

MIT

Download files

Source Distribution

bm25_jieba-0.1.2.tar.gz (27.6 kB), Source

Built Distributions

bm25_jieba-0.1.2-cp314-cp314-win_amd64.whl (3.3 MB), CPython 3.14, Windows x86-64
bm25_jieba-0.1.2-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB), CPython 3.14, manylinux: glibc 2.17+ x86-64
bm25_jieba-0.1.2-cp314-cp314-macosx_11_0_arm64.whl (3.4 MB), CPython 3.14, macOS 11.0+ ARM64
bm25_jieba-0.1.2-cp313-cp313-win_amd64.whl (3.3 MB), CPython 3.13, Windows x86-64
bm25_jieba-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB), CPython 3.13, manylinux: glibc 2.17+ x86-64
bm25_jieba-0.1.2-cp313-cp313-macosx_11_0_arm64.whl (3.4 MB), CPython 3.13, macOS 11.0+ ARM64
bm25_jieba-0.1.2-cp312-cp312-win_amd64.whl (3.3 MB), CPython 3.12, Windows x86-64
bm25_jieba-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB), CPython 3.12, manylinux: glibc 2.17+ x86-64
bm25_jieba-0.1.2-cp312-cp312-macosx_11_0_arm64.whl (3.4 MB), CPython 3.12, macOS 11.0+ ARM64
bm25_jieba-0.1.2-cp311-cp311-win_amd64.whl (3.3 MB), CPython 3.11, Windows x86-64
bm25_jieba-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB), CPython 3.11, manylinux: glibc 2.17+ x86-64
bm25_jieba-0.1.2-cp311-cp311-macosx_11_0_arm64.whl (3.4 MB), CPython 3.11, macOS 11.0+ ARM64

File details

Details for the file bm25_jieba-0.1.2.tar.gz.

File metadata

  • Download URL: bm25_jieba-0.1.2.tar.gz
  • Upload date:
  • Size: 27.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_jieba-0.1.2.tar.gz
Algorithm Hash digest
SHA256 27515dac404fb793450da33e5c306f30ecaf140f86b1fdac79cf2dde40c3a2d1
MD5 32d2b321b2ad5f63fd7cb42029407bd0
BLAKE2b-256 4d8209b7567ee1caa5328385b685aa3b3b832d7cb5ddb9d2dea118bb5cdccf57

Provenance

The following attestation bundles were made for bm25_jieba-0.1.2.tar.gz:

Publisher: release.yml on twn39/bm25-jieba

File details

Details for the file bm25_jieba-0.1.2-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: bm25_jieba-0.1.2-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 3.3 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_jieba-0.1.2-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 ae0ce8454f8112801d9079a685199fcf674122afb3558f6a3d849c42ee280a67
MD5 b4ad27e5d14047b8f3e67433b1e6f536
BLAKE2b-256 5113931e0f8f4e29e15573faa4c5120cd5fba46f506ba65f9e4e37c9d48bd816

File details

Details for the file bm25_jieba-0.1.2-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ca9f153644d89ed686c3b4ec47570aa6b25f582668195df92d48d05f069829c1
MD5 ee3134022308027187e2bf4b87a7b1cd
BLAKE2b-256 4fb0014814f5343681279919b28eea4a98dbc14e81f1462def1bc956060dffaf

File details

Details for the file bm25_jieba-0.1.2-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3e12db7e66e5889bb490b88e3f54e1f1c8a9524ddd6cd2dceb40357fd18708a0
MD5 6f91a49eec9f937b5c15d72245259ee7
BLAKE2b-256 8e458a00307bc9a9785783936eaa22977cd4028c6bdeda448254da14f7dcd62a

File details

Details for the file bm25_jieba-0.1.2-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: bm25_jieba-0.1.2-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 3.3 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_jieba-0.1.2-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 70b7ffddb89dbc98b5a616f8c0125d514d8f89c5d15cb31d8ab42451d280a907
MD5 97813316a0f68df431058c768f9ef5ff
BLAKE2b-256 6d6812da6610971605719134886a922d4dce1a4694b12fba86ed508477efd567

File details

Details for the file bm25_jieba-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7d7fb2937465094b250e7626bfd3e6a9a757dd352d2da92f8bc69d01092c5ee8
MD5 afcaa07898142a5ffbe8877320f530bd
BLAKE2b-256 35556f3c473536082f88d9a58c67e0e3b41b979aa7074d7212e484cf8f376185

File details

Details for the file bm25_jieba-0.1.2-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 57caa6d4c0c8527a996aa3fbaaaf5b6f286a3c270bd0daa562d063be621ce051
MD5 aade3e58f853ff8faa7f6548b7a39aab
BLAKE2b-256 b25edf7b4f5657fbfb03122dda22a9baf21cc4b05516b41e39ee2a3d2453aa5b

File details

Details for the file bm25_jieba-0.1.2-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: bm25_jieba-0.1.2-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 3.3 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_jieba-0.1.2-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 9b19a76d356637521cc52ca03bb9c284dcfd60cb545928b153d5903e9f60b552
MD5 10fb15bea2b68517b9b3afa38d6b0a52
BLAKE2b-256 9f37f2e1235dd119a9d08fe82a472b91236ca19e7652b77414fbdb1e7bad8d0d

File details

Details for the file bm25_jieba-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1ca7e9fcee92d2faa3bb99cbd653683f7f1ddab315a443cacc8e9b78a55d7470
MD5 2a0c9dca8c61fc27084b344119d64388
BLAKE2b-256 23de7c7ac01216f0b3941c5ae2afe6726df2eca7181c27bd1e928176a98fa957

File details

Details for the file bm25_jieba-0.1.2-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2b0a24770002b4ee63a9f823b4568b937ed2cd9bffa218a9cd546576e4db88f0
MD5 22858b32496921467cb84575608f7214
BLAKE2b-256 371edbd663b3d106081c32d048271a237c13466ed6044c007ac11bf356a677e3

File details

Details for the file bm25_jieba-0.1.2-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: bm25_jieba-0.1.2-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 3.3 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_jieba-0.1.2-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8099f4c342fc49c0a94b190719ee204d0670b7ed20cfe9b7cc41f39cf9fc24bb
MD5 042cc989ad766e11c83dcb30c29a3468
BLAKE2b-256 e87060e533d83d594c6449c41e265471282f70f3c968869f81aec3b596aa1713

File details

Details for the file bm25_jieba-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ca0a5251f7119b696b8d06a2bf9af4749704f2b6f43c2dcc72db2a4aa36ee862
MD5 a012e9263146ca81fd0d119db0f24533
BLAKE2b-256 19ad946d398663843fec9baa00e811ce118ba875ee7bdd10b76124db05b243bf

File details

Details for the file bm25_jieba-0.1.2-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for bm25_jieba-0.1.2-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f370d7bedabf7dffa7210a25dc760e9e140133d0e4b957b1a1f6f61f316aefae
MD5 cd0bc546846f956e6300f21bc694513d
BLAKE2b-256 aa6ec38a36a5b1f79cfb41551668039e67b498113da3099f9170f2239f952a0c
