High-performance BM25 Chinese text search with jieba-rs tokenizer (Rust implementation)

BM25-Jieba Chinese Text Search

A high-performance BM25 Chinese text search library built on Rust + PyO3, using jieba-rs for Chinese word segmentation.

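Features

  • 🚀 High performance: implemented in Rust and accelerated with an inverted index plus the Block-Max WAND algorithm; several times faster than pure Python
  • 💾 Persistence: the index can be saved to and loaded from disk (MessagePack format), so there is no need to retrain
  • 🔤 Chinese word segmentation: built-in jieba-rs tokenizer
  • 🎯 Precise search: the classic BM25 algorithm
  • 🆔 Custom IDs: external document IDs can be attached (e.g. u64 database primary keys)
  • 🔠 Mixed case: case-insensitive search is supported
  • 🐍 Python 3.11 ~ 3.14: supports the latest Python versions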
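
The inverted index named in the feature list maps each term to the documents that contain it, so a query only has to visit documents that share at least one term with it. The README does not show the Rust internals, so the following is only a minimal pure-Python sketch of the idea. It assumes pre-tokenized documents, and `build_inverted_index` is a hypothetical helper, not part of bm25_jieba:

```python
from collections import Counter, defaultdict


def build_inverted_index(tokenized_docs):
    """Map each term to a list of (doc_index, term_frequency) postings."""
    index = defaultdict(list)
    for doc_index, tokens in enumerate(tokenized_docs):
        for term, tf in Counter(tokens).items():
            index[term].append((doc_index, tf))
    return dict(index)


# Tokens are written out by hand here; the library obtains them from jieba-rs.
docs = [
    ["机器", "学习", "是", "人工智能", "的", "分支"],
    ["深度", "学习", "是", "机器", "学习", "的", "子领域"],
]
index = build_inverted_index(docs)
print(index["学习"])  # postings for one term: which documents, how often
```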

Installation

# Install in development mode
uv run maturin develop

# Or build a wheel
maturin build --release
pip install target/wheels/*.whl

Quick start

from bm25_jieba import BM25

# Prepare the documents
documents = [
    "Python是一种广泛使用的高级编程语言",
    "机器学习是人工智能的一个分支",
    "深度学习是机器学习的子领域",
]

# Create and train the model (passing custom IDs is optional)
bm25 = BM25(k1=1.5, b=0.75)
bm25.fit(documents, ids=[101, 102, 103])

# Search (returns custom IDs and scores)
results = bm25.search("机器学习", top_k=3)
for doc_id, score in results:
    print(f"ID: {doc_id}, Score: {score:.4f}")

# Save the model (no need to retrain later)
bm25.save("bm25_model.bin")

# Load the model
loaded_bm25 = BM25.load("bm25_model.bin")

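API Reference

BM25(k1=1.5, b=0.75, lowercase=False)

Creates a BM25 instance.

| Parameter | Type | Default | Description |
|---|---|---|---|
| k1 | float | 1.5 | Term-frequency saturation parameter |
| b | float | 0.75 | Document-length normalization parameter |
| lowercase | bool | False | Whether to convert text to lowercase (case-insensitive search) |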
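
To see what k1 and b control, it helps to evaluate the term-frequency factor of the textbook BM25 formula by hand. The README only says the library implements classic BM25, so the sketch below assumes the textbook form, and `tf_component` is a hypothetical name used for illustration:

```python
def tf_component(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Textbook BM25 term-frequency factor for one term in one document."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: in an average-length document each extra occurrence adds less,
# and the factor never exceeds k1 + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(tf_component(tf, doc_len=100, avg_doc_len=100), 3))

# Length normalization: the same tf counts for less in a longer document ...
print(round(tf_component(2, doc_len=200, avg_doc_len=100), 3))
# ... unless b=0, which ignores document length entirely.
print(round(tf_component(2, doc_len=200, avg_doc_len=100, b=0.0), 3))
```

A larger k1 lets repeated terms keep adding to the score for longer; a larger b penalizes long documents more strongly.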

fit(documents: list[str], ids: list[int] = None)

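Trains the model on a corpus of documents.

  • ids: optional; a list of integers (u64) with the same length as documents.
  • If ids is not provided, 0..N is used as the IDs by default.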
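
The two constraints on ids (same length as documents, values that fit in u64) can be checked before calling fit. The README does not say how the library reacts to invalid input, so this is a sketch of a caller-side check only. `check_ids` is a hypothetical helper, and reading 0..N as a Rust-style half-open range (0 to N-1) is an assumption:

```python
U64_MAX = 2**64 - 1


def check_ids(documents, ids=None):
    """Return the ID list to use for `documents`, or raise ValueError."""
    if ids is None:
        # Assumption: the documented default 0..N means positions 0 to N-1.
        return list(range(len(documents)))
    if len(ids) != len(documents):
        raise ValueError(f"expected {len(documents)} ids, got {len(ids)}")
    for doc_id in ids:
        if isinstance(doc_id, bool) or not isinstance(doc_id, int):
            raise ValueError(f"id {doc_id!r} is not an integer")
        if not 0 <= doc_id <= U64_MAX:
            raise ValueError(f"id {doc_id} does not fit in u64")
    return list(ids)


print(check_ids(["a", "b", "c"]))
print(check_ids(["a", "b", "c"], ids=[101, 102, 103]))
```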

search(query: str, top_k: int = None) -> list[tuple[int, float]]

Searches for the most relevant documents and returns a list of (document ID, score) pairs.
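
Because search returns IDs rather than text, the caller usually joins the hits back to its own records. A small sketch of that step follows; `attach_records` is a hypothetical helper, and the records and scores are made-up values standing in for real data:

```python
def attach_records(results, records):
    """Pair each (doc_id, score) hit with its record, skipping unknown IDs."""
    return [
        (records[doc_id], score)
        for doc_id, score in results
        if doc_id in records
    ]


# Hypothetical records keyed by the same IDs that were passed to fit().
records = {101: "Python intro", 102: "ML overview", 103: "DL overview"}
# Made-up scores standing in for the return value of bm25.search(...).
results = [(103, 1.25), (102, 0.98)]
for title, score in attach_records(results, records):
    print(f"{score:.2f}  {title}")
```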

save(path: str)

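Saves the current index and configuration to a file (MessagePack format).

load(path: str) -> BM25

Loads a BM25 model from a file. As in the quick start, it is called on the class: BM25.load(path).

get_scores(query: str) -> list[float]

Returns the BM25 score of every document.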
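
When the full score list is needed anyway, a top-k ranking can be derived from it on the Python side. The sketch below assumes that the scores come back in the same order as the documents passed to fit, which the README does not state. `top_k_from_scores` is a hypothetical helper and the scores are made-up values:

```python
import heapq


def top_k_from_scores(scores, ids, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, zip(ids, scores), key=lambda pair: pair[1])


ids = [101, 102, 103]
scores = [0.0, 0.9, 1.4]  # made-up values standing in for bm25.get_scores(...)
print(top_k_from_scores(scores, ids, k=2))
```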

Development

# Install dependencies
uv sync

# Build and install
uv run maturin develop

# Run the tests
uv run pytest

# Run the example
uv run python examples/demo.py

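Tech stack

| Component | Version | Purpose |
|---|---|---|
| PyO3 | 0.27.2 | Rust-Python bindings |
| maturin | 1.11.5 | Build tool |
| jieba-rs | 0.8.1 | Chinese word segmentation |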

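Benchmarks

Measured on an Apple M1 (10,000 documents, about 100 characters each):

| Metric | Result |
|---|---|
| Indexing speed | ~37,000 docs/s |
| Search throughput | ~1,000,000 QPS |
| Search latency | ~0.001 ms |

Note: thanks to the pruning performed by the Block-Max WAND algorithm, search performance improves by an order of magnitude.

# Run the benchmark
uv run python tests/benchmark.py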
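
The README does not show what tests/benchmark.py does. As a rough sketch of how throughput and latency figures of this kind can be measured, assuming a single-threaded loop, the following times repeated calls to a search function. `measure_qps` and `fake_search` are hypothetical names used only for illustration:

```python
import time


def measure_qps(search_fn, queries, repeats=100):
    """Run every query `repeats` times; return (queries per second, mean latency in ms)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search_fn(query)
    elapsed = time.perf_counter() - start
    total = repeats * len(queries)
    return total / elapsed, elapsed / total * 1000.0


# Stand-in search function so the sketch runs on its own;
# replace it with e.g. `lambda q: bm25.search(q, top_k=10)`.
def fake_search(query):
    return [(0, float(len(query)))]


qps, latency_ms = measure_qps(fake_search, ["机器学习", "深度学习"])
print(f"{qps:,.0f} QPS, {latency_ms:.4f} ms per query")
```

In a single-threaded loop like this one, QPS and mean latency are reciprocals of each other, which is consistent with the table pairing ~1,000,000 QPS with ~0.001 ms.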

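Algorithm validation

Test corpus: 19 documents, 6 queries

| Check | Result | Notes |
|---|---|---|
| Formula correctness | ✅ | Hand calculation matches the implementation |
| Ranking consistency | ✅ | Ranking is identical to rank-bm25 |
| Absolute scores | ⚠️ | Differ slightly because of the IDF +1 correction (as expected) |

# Run the validation
uv sync --group validation
uv run python tests/validate.py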
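
The formula-correctness check compares the library against a hand calculation. The README does not include tests/validate.py, so the following is only a sketch of what such a reference calculation could look like. It assumes the textbook BM25 term weight combined with the +1 IDF described in this README, and pre-tokenized input; `bm25_scores` is a hypothetical helper:

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25 and the +1 IDF variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        norm = 1.0 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
        scores.append(score)
    return scores


# Hand-tokenized toy corpus; real tokens would come from jieba-rs.
docs = [
    ["python", "是", "编程", "语言"],
    ["机器", "学习", "是", "人工智能", "分支"],
    ["深度", "学习", "是", "机器", "学习", "子领域"],
]
scores = bm25_scores(["机器", "学习"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 4) for s in scores])
```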

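Algorithm implementation notes

This implementation uses an IDF formula with a +1 correction:

IDF(t) = ln((N - df + 0.5) / (df + 0.5) + 1)

where N is the number of documents and df is the number of documents that contain the term t.

Difference from standard BM25Okapi:

| Formula | Property |
|---|---|
| Standard: ln((N-df+0.5)/(df+0.5)) | Can produce a negative IDF |
| This implementation: ln(...+1) | Guarantees IDF ≥ 0 |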
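
The difference between the two formulas is easy to check numerically. The sketch below evaluates both for a corpus of 19 documents (the size of the validation corpus); the function names are hypothetical and chosen for illustration:

```python
import math


def idf_standard(n_docs, df):
    """IDF without the +1; negative when df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def idf_plus_one(n_docs, df):
    """The +1 variant from this README; positive for every 0 <= df <= n_docs."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)


n_docs = 19
for df in (1, 5, 10, 19):
    print(df, round(idf_standard(n_docs, df), 4), round(idf_plus_one(n_docs, df), 4))
```

For a term that appears in more than half of the documents (here df = 10 or 19), the standard form goes negative while the +1 form stays positive.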

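Impact

  • Consistent ranking - on the validation corpus, results are ranked the same as by standard implementations such as rank-bm25
  • ⚠️ Different absolute scores - score values differ slightly because of the +1 correction
  • Numerically stable - no negative values, so no extra handling is needed

This variant is fully adequate when only the relative ranking matters, not the absolute scores.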

License

MIT

Download files

Source distribution

bm25_jieba-0.2.2.tar.gz (33.6 kB)

Built distributions

| File | Size | Python | Platform |
|---|---|---|---|
| bm25_jieba-0.2.2-cp314-cp314-win_amd64.whl | 3.4 MB | CPython 3.14 | Windows x86-64 |
| bm25_jieba-0.2.2-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 3.7 MB | CPython 3.14 | manylinux: glibc 2.17+ x86-64 |
| bm25_jieba-0.2.2-cp314-cp314-macosx_11_0_arm64.whl | 3.4 MB | CPython 3.14 | macOS 11.0+ ARM64 |
| bm25_jieba-0.2.2-cp313-cp313-win_amd64.whl | 3.4 MB | CPython 3.13 | Windows x86-64 |
| bm25_jieba-0.2.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 3.7 MB | CPython 3.13 | manylinux: glibc 2.17+ x86-64 |
| bm25_jieba-0.2.2-cp313-cp313-macosx_11_0_arm64.whl | 3.4 MB | CPython 3.13 | macOS 11.0+ ARM64 |
| bm25_jieba-0.2.2-cp312-cp312-win_amd64.whl | 3.4 MB | CPython 3.12 | Windows x86-64 |
| bm25_jieba-0.2.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 3.7 MB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| bm25_jieba-0.2.2-cp312-cp312-macosx_11_0_arm64.whl | 3.4 MB | CPython 3.12 | macOS 11.0+ ARM64 |
| bm25_jieba-0.2.2-cp311-cp311-win_amd64.whl | 3.4 MB | CPython 3.11 | Windows x86-64 |
| bm25_jieba-0.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 3.7 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| bm25_jieba-0.2.2-cp311-cp311-macosx_11_0_arm64.whl | 3.5 MB | CPython 3.11 | macOS 11.0+ ARM64 |

File details

Details for the file bm25_jieba-0.2.2.tar.gz.

File metadata

  • Download URL: bm25_jieba-0.2.2.tar.gz
  • Upload date:
  • Size: 33.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bm25_jieba-0.2.2.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 68297428746c237ad125ec80206c32bdcbd0deaa3e182ebb15c6b8b54182039d |
| MD5 | 53ee2b8f08313b136220a1a555f49eac |
| BLAKE2b-256 | 2da1af6ad6728a023174f4a2676e0fb5e0fff9bc1daa47b370e2d5853fb41e49 |

Provenance

The following attestation bundles were made for bm25_jieba-0.2.2.tar.gz:

Publisher: release.yml on twn39/bm25-jieba

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
