High-throughput MinHash + LSH toolkit for large-scale text corpus deduplication and dense near-duplicate mining.

These details have not been verified by PyPI

Project links

Repository

Project description

lshcurator

lshcurator 是一款面向大规模语料的近重复（near-duplicate）挖掘与去重工具，本质上依旧是基于 MinHash + LSH 的实现。主要改善了传统 LSH 去重在处理大规模语料上常见的性能瓶颈和硬件利用效率的问题。

背景

传统 LSH 去重的工程的步骤依次为：

对每个文本片段计算 MinHash 签名；
按 band/row 切分；
将 band key 映射到桶；
桶内维护所有签名或索引；
查询候选并验证相似度。

会遇到的典型问题是：

内存瓶颈：在大规模语料上维护大量桶占用大量内存，尤其是在大部分样本唯一的情况下；
计算瓶颈：模板化的语料会制造大量热点桶，桶内候选急剧膨胀，导致效率下降；
硬件利用率低：跨文件、多语料需要流式处理，难以利用多核 CPU 进行加速。

本项目的核心优化思路

1) 解偶 MinHash 计算和 LSH 桶维护的流程，专注高相似密集区

统计阶段只生成和收集 band key，不维护桶以及候选列表，去重阶段只对 hot keys 建桶和维护代表样本。

收益在于可充分利用多核 CPU 并行运算性能同步统计多份语料，同时桶结构规模从“全量 keys”降低到“热点 keys 子集”，避免了大规模唯一样本带来的性能瓶颈和内存压力。

2) 扁平化 band key

将传统 bytes key 压缩为无符号双精度整数（uint64）指纹，降低 key 对象和内容的维护开销。

指纹基于哈希算法不可避免存在极小的碰撞概率，本项目已做碰撞防误伤处理，如担心碰撞风险可通过提升 digest 位数（如 uint128）来进一步降低碰撞可能性，但会增加单 key 内存占用和计算开销，需根据实际需求权衡选择。

3) 有界代表元（Bounded Representatives）

通过限制每个桶的代表样本数量，避免了极热桶“无限增长”的问题。

4) 引入 numpy 进行高效的批量计算和数据处理

避免使用 Python 原生数据结构，节省海量 dict 和 list 的内存开销，同时提升性能和稳定性。

使用方式

生成和收集 band key

单线程实现

from lshcurator import Bucket

# 初始化 Bucket 实例
bucket = Bucket(bands=16, rows_per_band=4, shingle_k=5, shingle_step=1, compute_mode='char')

# 添加文本样本
bucket.insert(text="This is a sample text for LSH de-duplication.")
# 批量添加文本样本
texts = ["Another sample text.", "Yet another example for testing."]
bucket.batch_insert(texts=texts)

# 导出大于或等于指定频次的 band keys
hot_keys: numpy.ndarray = bucket.extract_keys(min_hit_count=2)

多线程实现

from concurrent.futures import ProcessPoolExecutor, as_completed
from lshcurator import Bucket
from pathlib import Path
from typing import Literal


def compute_hot_keys_for_corpus_file(
    corpus_file_path: Path,
    shingle_k: int,
    shingle_step: int,
    bands: int,
    rows_per_band: int,
    compute_mode: Literal['char', 'byte'],
):
    bucket = Bucket(
        shingle_k=shingle_k,
        shingle_step=shingle_step,
        bands=bands,
        rows_per_band=rows_per_band,
        compute_mode=compute_mode
    )

    for text in iter_corpus(file_path=corpus_file_path):  # 迭代读取语料文件中的文本样本（函数实现略）
        bucket.insert(text)

    hot_keys: numpy.ndarray = bucket.extract_keys(min_hit_count=2)  # 只导出出现两次以及以上的 band key
    # 后续步骤略


# 主进程中使用 ProcessPoolExecutor 来并行处理多个语料文件
# 假设 corpus_files 是一个包含多个语料文件路径的列表，max_jobs 是并行处理的最大进程数
with ProcessPoolExecutor(max_workers=max_jobs) as executor:
    futures = [
        executor.submit(
            compute_hot_keys_for_corpus_file,
            corpus_file_path=file_path,
            shingle_k=5,
            shingle_step=1,
            bands=16,
            rows_per_band=8,
            compute_mode='char',
        ) for file_path in corpus_files
    ]
    for future in as_completed(futures):
        future.result()

License

本项目采用 Apache License 2.0 许可证，允许用户在遵守许可证条款的前提下自由使用、修改和分发代码。

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

0.2.3

Apr 5, 2026

0.2.2

Mar 22, 2026

0.2.1

Mar 19, 2026

0.2.0

Mar 17, 2026

This version

0.1.0

Mar 9, 2026

0.0.1

Feb 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lshcurator-0.1.0.tar.gz (22.4 kB view details)

Uploaded Mar 9, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

lshcurator-0.1.0-py3-none-any.whl (23.0 kB view details)

Uploaded Mar 9, 2026 Python 3

File details

Details for the file lshcurator-0.1.0.tar.gz.

File metadata

Download URL: lshcurator-0.1.0.tar.gz
Upload date: Mar 9, 2026
Size: 22.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for lshcurator-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f13660aa50ebbd85548ca30246a876c48175545e7f8b7bba53f8eae79a871f87`
MD5	`2a7a97c25de80d687b8b69fe89a1fc47`
BLAKE2b-256	`284558867ae5783b8473e76acb5d7d54c23ad1a516153c6faa05990b79659858`

See more details on using hashes here.

File details

Details for the file lshcurator-0.1.0-py3-none-any.whl.

File metadata

Download URL: lshcurator-0.1.0-py3-none-any.whl
Upload date: Mar 9, 2026
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for lshcurator-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c89b4a0ba1dfe1b4f2c3a9ccc086d4179889016b015c9ecb12b7d48e353c1fc6`
MD5	`5dacf476e68cd5b1bacdc68df8ba0442`
BLAKE2b-256	`30b7041c280a47759163ee2846cd14a243e34eb037e3748836233f922a7d7543`

See more details on using hashes here.

lshcurator 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

lshcurator

背景

本项目的核心优化思路

1) 解偶 MinHash 计算和 LSH 桶维护的流程，专注高相似密集区

2) 扁平化 band key

3) 有界代表元（Bounded Representatives）

4) 引入 numpy 进行高效的批量计算和数据处理

使用方式

生成和收集 band key

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes