
High-throughput MinHash + LSH toolkit for large-scale text corpus deduplication and dense near-duplicate mining.

Project description

lshcurator

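lshcurator is a near-duplicate mining and deduplication tool for large-scale text corpora. At its core it is still a MinHash + LSH implementation; what it improves is the performance bottlenecks and poor hardware utilization that conventional LSH deduplication commonly hits on large corpora.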

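Background

A conventional LSH deduplication pipeline runs the following steps in order:

  1. Compute a MinHash signature for each text fragment;
  2. Split the signature into bands of rows;
  3. Map each band key to a bucket;
  4. Keep every signature or index inside its bucket;
  5. Query candidates and verify their similarity.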
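
The five steps above can be sketched in a few lines. This is a minimal illustration of the conventional approach, not lshcurator's code: the signature length, band layout, shingle size, hash choice (`blake2b`), threshold, and all function names are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

import numpy as np

NUM_PERM, BANDS, ROWS = 32, 8, 4          # illustrative; BANDS * ROWS == NUM_PERM
PRIME = np.uint64((1 << 61) - 1)

rng = np.random.default_rng(0)
A = rng.integers(1, 1 << 32, size=(NUM_PERM, 1), dtype=np.uint64)
B = rng.integers(0, 1 << 32, size=(NUM_PERM, 1), dtype=np.uint64)


def minhash(text: str, k: int = 5) -> np.ndarray:
    """Step 1: MinHash signature over character k-shingles."""
    shingles = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    base = np.array(
        [int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=4).digest(), "little")
         for s in shingles],
        dtype=np.uint64,
    )
    # a * h + b stays below 2**64 because a, b and h are all below 2**32.
    return ((A * base[None, :] + B) % PRIME).min(axis=1)


def build_buckets(docs):
    """Steps 2-4: cut into bands, map each band key to a bucket, store doc ids."""
    signatures = [minhash(d) for d in docs]
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for band, rows in enumerate(sig.reshape(BANDS, ROWS)):
            buckets[(band, rows.tobytes())].append(doc_id)
    return signatures, buckets


def candidate_pairs(signatures, buckets, threshold=0.8):
    """Step 5: gather candidates from shared buckets, then verify similarity."""
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return {
        (i, j) for i, j in pairs
        if float(np.mean(signatures[i] == signatures[j])) >= threshold
    }


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about numpy arrays",
]
signatures, buckets = build_buckets(docs)
duplicates = candidate_pairs(signatures, buckets)
```

Note that every band key gets a bucket here, including keys seen only once, which is where the memory cost on mostly unique corpora comes from.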

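It typically runs into the following problems:

  • Memory bottleneck: maintaining a huge number of buckets over a large corpus consumes a lot of memory, especially when most samples are unique;
  • Compute bottleneck: templated text produces many hot buckets whose candidate lists balloon, which drags down throughput;
  • Low hardware utilization: working across multiple files and corpora requires streaming, which makes it hard to use multi-core CPUs for acceleration.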

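Core optimization ideas

1) Decouple MinHash computation from LSH bucket maintenance and focus on dense high-similarity regions

The statistics phase only generates and collects band keys; it maintains neither buckets nor candidate lists. The deduplication phase builds buckets and keeps representative samples only for hot keys.

The benefit is twofold. The statistics for several corpora can be computed concurrently, making full use of multi-core CPU parallelism. The bucket structure also shrinks from "all keys" to "the hot-key subset", which avoids the performance bottleneck and memory pressure caused by large numbers of unique samples.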
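
A rough sketch of this two-phase idea follows. It is not lshcurator's API: the function names, the `(doc_id, band_keys)` shard layout, and the "seen at least twice" definition of a hot key are assumptions for illustration. Band keys are shown as small integers for readability.

```python
from collections import Counter, defaultdict


def count_keys(shard):
    """Statistics pass for one shard: count band keys, keep nothing else."""
    counts = Counter()
    for _doc_id, keys in shard:
        counts.update(keys)
    return counts


def hot_keys(shards, min_count=2):
    """Merge per-shard counts; a key is hot if it occurs at least min_count times."""
    total = Counter()
    # Shards are independent, so this map could be handed to a process pool.
    for partial in map(count_keys, shards):
        total.update(partial)
    return {k for k, c in total.items() if c >= min_count}


def dedup(shards, hot):
    """Deduplication pass: only hot keys ever get a bucket."""
    buckets = defaultdict(list)
    kept, dropped = [], []
    for shard in shards:
        for doc_id, keys in shard:
            mine = [k for k in keys if k in hot]
            if any(k in buckets for k in mine):
                dropped.append(doc_id)   # a real pipeline would verify similarity here
            else:
                kept.append(doc_id)
                for k in mine:
                    buckets[k].append(doc_id)
    return kept, dropped, buckets


shards = [
    [("a", [1, 2, 3]), ("b", [4, 5, 6])],
    [("c", [1, 7, 8]), ("d", [9, 10, 11])],
]
hot = hot_keys(shards)
kept, dropped, buckets = dedup(shards, hot)
```

In this toy input only key `1` is shared, so it is the only key that gets a bucket; the ten unique keys never reach the bucket structure.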

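2) Flattened band keys

Conventional bytes keys are compressed into 64-bit unsigned integer (uint64) fingerprints, which reduces the overhead of maintaining key objects and their contents.

Because the fingerprints are hash-based, a very small collision probability is unavoidable. The project already guards against collisions causing false removals. If collisions are still a concern, you can widen the digest (for example to uint128) to lower the probability further, at the cost of more memory per key and more computation; weigh this against your actual needs.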
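
As an illustration of the flattening step, the sketch below hashes one band of a signature down to a fixed-width integer and gives a birthday-bound estimate of how many colliding key pairs to expect. The hash function (`blake2b`), the band-index prefix, and both function names are assumptions; the project's actual fingerprint scheme may differ.

```python
import hashlib

import numpy as np


def band_fingerprint(band_index: int, band_rows: np.ndarray, digest_bytes: int = 8) -> int:
    """Hash (band index, band rows) to an integer of digest_bytes * 8 bits."""
    payload = band_index.to_bytes(2, "little") + np.ascontiguousarray(
        band_rows, dtype=np.uint64
    ).tobytes()
    digest = hashlib.blake2b(payload, digest_size=digest_bytes).digest()
    return int.from_bytes(digest, "little")


def expected_collisions(n_keys: int, bits: int = 64) -> float:
    """Birthday-bound estimate of colliding pairs among n_keys distinct keys."""
    return n_keys * (n_keys - 1) / 2 / 2 ** bits


rows = np.array([11, 22, 33, 44], dtype=np.uint64)
fp64 = band_fingerprint(0, rows)                     # fits in a uint64
fp128 = band_fingerprint(0, rows, digest_bytes=16)   # wider digest, fewer collisions
packed = np.array([fp64], dtype=np.uint64)           # 8 bytes per key once stored in numpy
risk_64 = expected_collisions(10 ** 9, bits=64)
risk_128 = expected_collisions(10 ** 9, bits=128)
```

Under this estimate, a billion distinct 64-bit keys yield well under one expected colliding pair, and the 128-bit variant lowers that by a further factor of 2**64 while doubling the storage per key.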

3) Bounded representatives

Capping the number of representative samples kept per bucket prevents extremely hot buckets from growing without bound.
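
One way to picture the cap is a bucket map that stops accepting new representatives once a bucket is full. The class name, the `max_reps` parameter, and the first-come retention policy below are hypothetical; the source does not say how lshcurator chooses which samples to keep.

```python
from collections import defaultdict


class BoundedBuckets:
    """Bucket map that keeps at most max_reps representatives per key."""

    def __init__(self, max_reps: int = 4):
        self.max_reps = max_reps
        self._buckets = defaultdict(list)

    def add(self, key: int, doc_id: int) -> bool:
        """Store doc_id as a representative; return False if the bucket is full."""
        reps = self._buckets[key]
        if len(reps) >= self.max_reps:
            return False
        reps.append(doc_id)
        return True

    def representatives(self, key: int) -> list:
        """Return the stored representatives for key (empty if unseen)."""
        return list(self._buckets.get(key, ()))


buckets = BoundedBuckets(max_reps=4)
accepted = [buckets.add(key=42, doc_id=i) for i in range(10)]
reps = buckets.representatives(42)
```

With the cap in place, a new document is compared against at most `max_reps` stored samples per hot key, so the verification cost per key stays constant however many templated documents share it.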

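4) numpy for efficient batch computation and data processing

Avoiding native Python data structures saves the memory overhead of huge numbers of dicts and lists, and also improves performance and stability.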
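
For example, hot-key detection over flattened uint64 band keys can be done with array operations alone. This is a sketch under assumptions: the variable names and the "count of at least 2" threshold are illustrative and not taken from the project.

```python
import numpy as np

# Band keys collected during the statistics pass, stored as a flat uint64 array.
band_keys = np.array([5, 7, 5, 9, 7, 5], dtype=np.uint64)

# Count occurrences of every key in one vectorized call.
unique_keys, counts = np.unique(band_keys, return_counts=True)

# Keys seen at least twice are treated as hot in this sketch.
hot = unique_keys[counts >= 2]

# Boolean mask marking which collected keys belong to the hot set.
is_hot = np.isin(band_keys, hot)

# Each key costs exactly 8 bytes in the array, with no per-object overhead.
total_bytes = band_keys.nbytes
```

The same counts held in a Python dict would need a separate int object and a hash-table entry for each key, which is the overhead the array layout avoids.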

