
High-performance BM25 Chinese text search with jieba-rs tokenizer (Rust implementation)

Project description

BM25-Jieba Chinese Text Search


A high-performance BM25 Chinese text search library built with Rust and PyO3, using jieba-rs for Chinese word segmentation.

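Features

  • 🚀 High performance: implemented in Rust, several times faster than pure Python
  • 🔤 Chinese word segmentation: built-in jieba-rs tokenizer
  • 🎯 Precise search: classic BM25 algorithm
  • 🔠 Mixed case: optional case-insensitive search (lowercase=True)
  • 🐍 Python 3.11 to 3.14: supports the latest Python versions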

Installation

# Install in development mode
uv run maturin develop

# Or build a wheel and install it
maturin build --release
pip install target/wheels/*.whl

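Quick start

from bm25_jieba import BM25

# Prepare the documents
documents = [
    "Python是一种广泛使用的高级编程语言",  # "Python is a widely used high-level programming language"
    "机器学习是人工智能的一个分支",  # "Machine learning is a branch of artificial intelligence"
    "深度学习是机器学习的子领域",  # "Deep learning is a subfield of machine learning"
]

# Create the model and fit it on the corpus
bm25 = BM25(k1=1.5, b=0.75)
bm25.fit(documents)

# Search for "机器学习" ("machine learning")
results = bm25.search("机器学习", top_k=3)
for doc_idx, score in results:
    print(f"[{score:.4f}] {documents[doc_idx]}")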

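API reference

BM25(k1=1.5, b=0.75, lowercase=False)

Creates a BM25 instance.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| k1 | float | 1.5 | Term-frequency saturation parameter |
| b | float | 0.75 | Document-length normalization parameter |
| lowercase | bool | False | Whether to convert text to lowercase (case-insensitive matching) |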
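
To make the roles of k1 and b concrete, here is a minimal pure-Python sketch of the textbook BM25 per-term score. It is an illustration only: the function name `bm25_term_score` and its arguments are hypothetical, and the library's Rust internals are not shown in this README, so they may differ in detail.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
    """Textbook BM25 contribution of one query term in one document.

    Hypothetical helper for illustration; not part of the bm25_jieba API.
    """
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the score grows with tf but stays below idf * (k1 + 1).
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


# Same term frequency in an average-length and a double-length document.
short = bm25_term_score(tf=2, doc_len=50, avg_doc_len=50, idf=1.0)
long_ = bm25_term_score(tf=2, doc_len=100, avg_doc_len=50, idf=1.0)
# With b=0 the longer document is no longer penalized.
no_norm = bm25_term_score(tf=2, doc_len=100, avg_doc_len=50, idf=1.0, b=0.0)
```

With the defaults, the longer document scores lower than the average-length one for the same term frequency, and setting b=0 removes that penalty.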

fit(documents: list[str])

Fits the model on a corpus of documents.

search(query: str, top_k: int = None) -> list[tuple[int, float]]

Searches for the most relevant documents and returns a list of (document index, score) tuples.
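
The shape of the search result can be reproduced from a plain list of scores. The sketch below is an assumption about the general technique (top-k selection by score), not the library's code: `top_k_results` is a hypothetical helper, and the README does not say how the library treats ties or zero-score documents when top_k is None.

```python
import heapq


def top_k_results(scores, top_k=None):
    """Return (doc_index, score) pairs ordered by descending score.

    Hypothetical helper; mirrors the documented return type of search().
    """
    indexed = list(enumerate(scores))
    if top_k is None:
        # No limit: rank every document.
        return sorted(indexed, key=lambda pair: pair[1], reverse=True)
    # heapq.nlargest avoids a full sort when only a few results are needed.
    return heapq.nlargest(top_k, indexed, key=lambda pair: pair[1])


results = top_k_results([0.0, 1.2, 0.7, 2.5], top_k=2)
```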

get_scores(query: str) -> list[float]

Returns the BM25 score of every document for the query.

Development

# Install dependencies
uv sync

# Compile and install
uv run maturin develop

# Run the tests
uv run pytest

# Run the example
uv run python examples/demo.py

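Tech stack

| Component | Version | Purpose |
| --- | --- | --- |
| PyO3 | 0.27.2 | Rust-Python bindings |
| maturin | 1.11.5 | Build tool |
| jieba-rs | 0.8.1 | Chinese word segmentation |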

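Benchmarks

Measured on an Apple M1 (10,000 documents, about 50 characters each):

| Metric | Result |
| --- | --- |
| Indexing speed | ~38,000 docs/s |
| Search throughput | ~2,000 QPS |
| Search latency | 0.5-1.2 ms |

# Run the benchmark
uv run python tests/benchmark.py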
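
For readers who want to take similar measurements on their own corpus, throughput and mean latency can be timed around any search callable. This is a rough sketch, not the contents of tests/benchmark.py: `measure_search` and `dummy_search` are hypothetical names, and the stand-in callable only exists so the example runs without the library installed.

```python
import time


def measure_search(search_fn, queries, repeats=100):
    """Time search_fn over all queries; return (qps, mean_latency_ms).

    Assumes at least one query and repeats >= 1.
    """
    start = time.perf_counter()
    calls = 0
    for _ in range(repeats):
        for query in queries:
            search_fn(query)
            calls += 1
    elapsed = time.perf_counter() - start
    return calls / elapsed, 1000.0 * elapsed / calls


# Stand-in for a real search callable such as a fitted model's search method.
def dummy_search(query):
    return sorted(query)


qps, latency_ms = measure_search(dummy_search, ["机器学习", "深度学习"], repeats=50)
```

When the calls run sequentially like this, QPS and mean latency are reciprocals of each other, so the two figures should be read together rather than as independent measurements.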

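Algorithm validation

Test corpus: 19 documents, 6 queries

| Check | Result | Notes |
| --- | --- | --- |
| Formula correctness | Pass | Manual calculation matches the implementation |
| Ranking consistency | Pass | Rankings match rank-bm25 exactly on 6/6 queries |
| Absolute scores | ⚠️ | Differ slightly because of the +1 IDF correction (expected) |

# Run the validation
uv sync --group validation
uv run python tests/validate.py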
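
A ranking-consistency check of this kind compares the order of documents, not the score values. The sketch below shows one way to do that for two score lists; it is not the code in tests/validate.py, the helper names are hypothetical, and the score values are made up for illustration.

```python
def ranking(scores):
    """Document indices ordered by descending score (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def same_ranking(scores_a, scores_b):
    """True when both score lists order the documents identically."""
    return ranking(scores_a) == ranking(scores_b)


# Made-up scores: absolute values differ, document order does not.
ours = [0.0, 1.9, 0.8]
reference = [0.0, 1.4, 0.5]
consistent = same_ranking(ours, reference)
```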

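Notes on the algorithm

This implementation uses an IDF formula with a +1 correction:

IDF(t) = ln((N - df + 0.5) / (df + 0.5) + 1)

Here N is the number of documents in the corpus and df is the number of documents containing term t.

Difference from standard BM25Okapi:

| Formula | Property |
| --- | --- |
| Standard: ln((N - df + 0.5) / (df + 0.5)) | Can yield a negative IDF (when df > N/2) |
| This implementation: ln((N - df + 0.5) / (df + 0.5) + 1) | Guarantees IDF ≥ 0 |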
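
The two formulas can be compared numerically. The sketch below evaluates both for a hypothetical corpus of N = 10 documents; the function names are illustrative and not part of the library.

```python
import math


def idf_standard(n_docs, df):
    """Standard Okapi IDF; negative when the term is in more than half the docs."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def idf_plus_one(n_docs, df):
    """IDF with the +1 correction; the log argument always exceeds 1."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)


# Rare term (df=1), term in exactly half the docs (df=5), common term (df=8).
for df in (1, 5, 8):
    print(df, round(idf_standard(10, df), 4), round(idf_plus_one(10, df), 4))
```

For df = 8 the standard formula goes negative, because the ratio inside the logarithm drops below 1, while the +1 variant stays positive.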

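Impact

  • Consistent ranking - results were ranked the same as rank-bm25 on all 6 validation queries
  • ⚠️ Different absolute scores - score values differ slightly because of the +1 correction
  • Numerically stable - no negative values, so no extra handling is needed

This variant is well suited to use cases that care only about relative ranking, not absolute scores.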

License

MIT
