Fast fuzzy text search

Project description

Narrow Down - Efficient near-duplicate search

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Narrow Down offers a flexible but easy-to-use Python API for finding duplicate or similar documents, even in very large datasets. It reduces the O(n²) problem of comparing all strings with each other to linear scale by using approximation algorithms like Locality-Sensitive Hashing (LSH).
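
To illustrate how Locality-Sensitive Hashing avoids comparing every pair of strings, here is a minimal pure-Python sketch of Minhash LSH with banding. It is not Narrow Down's implementation or API; the names `shingles`, `minhash_signature` and `LshIndex`, as well as the parameter values, are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of a string (lowercased)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(tokens, num_perm=64):
    """For each of num_perm salted hash functions, keep the smallest hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return signature


class LshIndex:
    """Illustrative Minhash LSH index (hypothetical, not the library's class)."""

    def __init__(self, num_perm=64, bands=16):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)  # (band number, band values) -> document ids

    def _band_keys(self, text):
        signature = minhash_signature(shingles(text), self.num_perm)
        for b in range(self.bands):
            yield b, tuple(signature[b * self.rows:(b + 1) * self.rows])

    def insert(self, doc_id, text):
        # One bucket entry per band: the cost per document is constant.
        for key in self._band_keys(text):
            self.buckets[key].add(doc_id)

    def query(self, text):
        # Candidates are documents that share at least one complete band.
        candidates = set()
        for key in self._band_keys(text):
            candidates |= self.buckets.get(key, set())
        return candidates


index = LshIndex()
index.insert("a", "Narrow Down finds near-duplicate documents")
index.insert("b", "1234567890 0987654321 1357924680")
print(index.query("Narrow Down finds near duplicate documents"))
```

Each insert and each query touches a fixed number of buckets, so indexing n documents costs on the order of n bucket operations instead of n² pairwise comparisons. The result is a candidate set: similar documents are found with high probability rather than with certainty, and candidates are usually checked with an exact similarity measure afterwards.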

Features

  • Document indexing and search based on the Minhash LSH algorithm
  • High performance thanks to a native extension module in Rust
  • Easy-to-use API with automated parameter tuning
  • Works with exchangeable storage backends. Currently implemented:
    • In-Memory
    • Cassandra / ScyllaDB
    • SQLite
    • User-defined backends (by implementing a small interface)
  • Native asyncio interface
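
The idea of exchangeable storage backends behind an asyncio interface can be sketched as follows. This is a hypothetical interface written for illustration, not the interface that Narrow Down actually defines; `BucketStore`, its two methods, and the table layout are all assumptions.

```python
import asyncio
import sqlite3
from abc import ABC, abstractmethod


class BucketStore(ABC):
    """Hypothetical minimal backend interface (NOT the library's real one)."""

    @abstractmethod
    async def add(self, bucket: str, doc_id: int) -> None:
        """Remember that a document falls into an LSH bucket."""

    @abstractmethod
    async def get(self, bucket: str) -> set:
        """Return the ids of all documents stored in a bucket."""


class InMemoryBucketStore(BucketStore):
    def __init__(self):
        self._buckets = {}

    async def add(self, bucket, doc_id):
        self._buckets.setdefault(bucket, set()).add(doc_id)

    async def get(self, bucket):
        return set(self._buckets.get(bucket, ()))


class SQLiteBucketStore(BucketStore):
    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS buckets ("
            "bucket TEXT, doc_id INTEGER, PRIMARY KEY (bucket, doc_id))"
        )

    async def add(self, bucket, doc_id):
        # sqlite3 calls block the event loop; acceptable for a sketch only.
        self._conn.execute(
            "INSERT OR IGNORE INTO buckets VALUES (?, ?)", (bucket, doc_id)
        )
        self._conn.commit()

    async def get(self, bucket):
        rows = self._conn.execute(
            "SELECT doc_id FROM buckets WHERE bucket = ?", (bucket,)
        )
        return {row[0] for row in rows}


async def demo(store: BucketStore) -> set:
    """The calling code is identical for every backend."""
    await store.add("band0:abc", 1)
    await store.add("band0:abc", 2)
    await store.add("band1:xyz", 3)
    return await store.get("band0:abc")


for backend in (InMemoryBucketStore(), SQLiteBucketStore()):
    print(type(backend).__name__, asyncio.run(demo(backend)))
```

Because the indexing code depends only on the abstract interface, switching from memory to a database is a matter of passing a different object. A production backend would use a non-blocking driver instead of calling `sqlite3` directly inside a coroutine.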

Installation

The Python package can be installed with pip:

pip install narrow-down

Extras

Some of the heavier functionality is available as an extra:

pip install narrow-down[scylladb]   # Cassandra / ScyllaDB storage backend

Similar projects

  • pylsh offers a good implementation of the classic Minhash LSH scheme in Python and Cython. If this is all you need and you don't need a database backend, it can be a good choice.
  • Datasketch implements an interesting collection of data sketching algorithms for similarity matching, cardinality estimation and k-nearest-neighbour search. The implementation is not highly optimized but is very usable, the documentation is rich, and multiple database backends can be used for some of the sketches.
  • Milvus offers a full database stack for vector search, a different approach to fast searching. It can also be applied to text search when an embedding like Word2Vec or BERT is used to vectorize the text.

Credits

This package was created with Cookiecutter and the fedejaure/cookiecutter-modern-pypackage project template.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

narrow_down-1.0.1-cp37-abi3-win_amd64.whl (263.2 kB)

Uploaded for CPython 3.7+, Windows x86-64

narrow_down-1.0.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

Uploaded for CPython 3.7+, manylinux: glibc 2.17+ x86-64

narrow_down-1.0.1-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (749.6 kB)

Uploaded for CPython 3.7+, macOS 10.9+ universal2 (ARM64, x86-64), macOS 10.9+ x86-64, macOS 11.0+ ARM64

File details

Details for the file narrow_down-1.0.1-cp37-abi3-win_amd64.whl.

File hashes

Hashes for narrow_down-1.0.1-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 d87de0c112c706cc9f971a9c367010778ef47b51f115befaa00781128e4bc927
MD5 c10a5c960f02c0a0be180097af54ae1c
BLAKE2b-256 ec4c6e2e944e0126dcae7a7b3bd290bbac0d32b66782cabdf0795a244e846f5b

See more details on using hashes here.
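
A downloaded wheel can be checked against the published SHA256 digest with the Python standard library. The sketch below is one way to do this; the helper names and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Usage, assuming the wheel was saved to the current directory:
# ok = matches_published_hash(
#     "narrow_down-1.0.1-cp37-abi3-win_amd64.whl",
#     "d87de0c112c706cc9f971a9c367010778ef47b51f115befaa00781128e4bc927",
# )
```

Reading in chunks keeps memory use constant regardless of file size. pip can enforce the same check at install time when hashes are pinned in a requirements file and `--require-hashes` is used.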

File details

Details for the file narrow_down-1.0.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for narrow_down-1.0.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 951a88905c73cfb27eb31c7d2ef1fc40dd4e0c8ca3c4ebd485da46e6c244056d
MD5 e76cfedacd4bf73f3fc156e1f7add200
BLAKE2b-256 f30e21568cba6cbb1571797e136d899d37c1c19528aba255547b56195513fa1e

File details

Details for the file narrow_down-1.0.1-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.

File hashes

Hashes for narrow_down-1.0.1-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 6536cdad1778dd60a362036adcbb4d5d1a836ba369bf4dd6e4adedb83123406f
MD5 78b47bf689d404569a07e757542eeb64
BLAKE2b-256 db694dab34e42ecf3715671e2346252713bdc1dfc45d1a92b59147a34f6ada96
