Skip to main content

String algorithms

Project description

PyPI version Downloads Test Build and upload codecov DOI

pydivsufsort: bindings to libdivsufsort

pydivsufsort prebuilds libdivsufsort as a shared library and includes it in a Python package with bindings. Wheels are built for Linux, macOS and Windows (32 and 64 bits) using cibuildwheel and GitHub Actions. Basically, you should be able to install it with pip install pydivsufsort on any system and it should work out of the box. If it doesn't, please create an issue.

Features:

  • bindings to divsufsort that return numpy arrays
  • handle string, bytes and almost any integer data type (e.g. int64) and not only char
  • algorithms work even for non char inputs
  • additional string algorithms coded in Cython

Installation

On Linux, macOS and Windows:

python -m pip install pydivsufsort

We provide precompiled wheels for common systems using cibuildwheel, and a source distribution for Unix systems. Manual compilation on Windows might require some tweaking, please create an issue.

Features

All methods support string, bytes and numpy array inputs, including datatypes greater than uint8_t (e.g. uint64_t). Below are the signatures of all methods exposed by pydivsufsort. To import a method, just do from pydivsufsort import method_name. All methods are documented in the docstrings. You can display them with help(method_name).

A nicer interface to reuse computations lazily is provided in WonderString but currently undocumented. Please create an issue if you are interested.

Methods exposed from libdivsufsort

  • divsufsort(string): suffix array
  • bw_transform(string): Burrows-Wheeler transform
  • inverse_bw_transform(idx, string): inverse Burrows-Wheeler transform
  • sa_search(string, suffix_array, pattern): search for a pattern in a suffix array

Additional string algorithms

  • kasai(string, suffix_array=None): LCP array computation (lazily computes the suffix array if not provided)
  • lcp_segtree(string, suffix_array=None, lcp=None): build a segment tree for LCP queries (lazily computes the suffix array and LCP array if not provided)
  • lcp_query(segtree, queries): query a segment tree for LCP queries. Queries are pairs of indices.
  • levenshtein(string1, string2): Levenshtein distance
  • most_frequent_substrings(lcp, length, limit=0, minimum_count=1): most frequent substrings. See the docstring for details.
  • common_substrings(string1, string2, limit=25): common substrings between two strings.
  • min_rotation(string): minimum rotation of a string

Example usage

from pydivsufsort import divsufsort, kasai

string_inp = "banana$"
string_suffix_array = divsufsort(string_inp)
string_lcp_array = kasai(string_inp, string_suffix_array)
print(string_suffix_array, string_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

# You can also convert the string input to integers first

import numpy as np

int_inp = np.unique(np.array(list(string_inp)), return_inverse=True)[1]
int_suffix_array = divsufsort(int_inp)
int_lcp_array = kasai(int_inp, int_suffix_array)
print(int_suffix_array, int_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

Development

You can install locally with

pip install -e .

A useful command to iterate quickly when changing Cython code is

python setup.py build_ext --inplace && pytest -s

Profiling

Profiling can be activated with the environment variable PROFILE:

PROFILE=1 python setup.py build_ext --inplace && pytest -s

Here is an example with line_profiler (requires pip install "line_profiler<4"):

import line_profiler
from pydivsufsort import common_substrings
from pydivsufsort.stringalg import (
    _common_substrings,
    repeated_substrings,
)

s1 = "banana" * 10000
s2 = "ananas" * 10000

func = common_substrings
profile = line_profiler.LineProfiler(func)
profile.add_function(_common_substrings)
profile.add_function(repeated_substrings)
profile.runcall(func, s1, s2, limit=15)
profile.print_stats()

Testing

pytest

Technical details (for performance tweaks)

libdivsufsort is compiled in both 32 and 64 bits, as the 32 bits version is faster. pydivsufsort automatically chooses to use the 32 bits version when possible (aka when the input size is less than 2**31-1).

For best performance, use contiguous arrays. If you have a sliced array, pydivsufsort converts it automatically with numpy.ascontiguousarray.

The precompiled libraries use OpenMP. You can disable it by setting the env variable OMP_NUM_THREADS=1, and it will yield the same performance as the version compiled without OpenMP

The original libdivsufsort only supports char as the base type. pydivsufsort can handle arrays of any integer type (even signed), by encoding each element as multiple chars, which makes the computation slower. If your values use an integer type that is bigger than required, but they span over a small contiguous range, pydivsufsort will automatically change their type (see #6).

Acknowledgements

Citing

If you have used this software in a scientific publication, please cite it using the following BibLaTeX code:

@software{pydivsufsort,
  author       = {Louis Abraham},
  title        = {pydivsufsort},
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7932458},
  url          = {https://github.com/louisabraham/pydivsufsort}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydivsufsort-0.0.15.tar.gz (302.2 kB view details)

Uploaded Source

Built Distributions

pydivsufsort-0.0.15-cp312-cp312-win_amd64.whl (264.9 kB view details)

Uploaded CPython 3.12 Windows x86-64

pydivsufsort-0.0.15-cp312-cp312-win32.whl (220.4 kB view details)

Uploaded CPython 3.12 Windows x86

pydivsufsort-0.0.15-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.15-cp312-cp312-macosx_11_0_arm64.whl (268.3 kB view details)

Uploaded CPython 3.12 macOS 11.0+ ARM64

pydivsufsort-0.0.15-cp312-cp312-macosx_10_9_x86_64.whl (298.0 kB view details)

Uploaded CPython 3.12 macOS 10.9+ x86-64

pydivsufsort-0.0.15-cp311-cp311-win_amd64.whl (267.7 kB view details)

Uploaded CPython 3.11 Windows x86-64

pydivsufsort-0.0.15-cp311-cp311-win32.whl (222.1 kB view details)

Uploaded CPython 3.11 Windows x86

pydivsufsort-0.0.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.15-cp311-cp311-macosx_11_0_arm64.whl (265.2 kB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

pydivsufsort-0.0.15-cp311-cp311-macosx_10_9_x86_64.whl (295.3 kB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

pydivsufsort-0.0.15-cp310-cp310-win_amd64.whl (266.5 kB view details)

Uploaded CPython 3.10 Windows x86-64

pydivsufsort-0.0.15-cp310-cp310-win32.whl (222.3 kB view details)

Uploaded CPython 3.10 Windows x86

pydivsufsort-0.0.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.15-cp310-cp310-macosx_11_0_arm64.whl (265.2 kB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

pydivsufsort-0.0.15-cp310-cp310-macosx_10_9_x86_64.whl (294.9 kB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

pydivsufsort-0.0.15-cp39-cp39-win_amd64.whl (267.1 kB view details)

Uploaded CPython 3.9 Windows x86-64

pydivsufsort-0.0.15-cp39-cp39-win32.whl (222.8 kB view details)

Uploaded CPython 3.9 Windows x86

pydivsufsort-0.0.15-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.15-cp39-cp39-macosx_11_0_arm64.whl (265.8 kB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

pydivsufsort-0.0.15-cp39-cp39-macosx_10_9_x86_64.whl (295.5 kB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

File details

Details for the file pydivsufsort-0.0.15.tar.gz.

File metadata

  • Download URL: pydivsufsort-0.0.15.tar.gz
  • Upload date:
  • Size: 302.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for pydivsufsort-0.0.15.tar.gz
Algorithm Hash digest
SHA256 4b049dd3d30eae32829386ffbdd58e46144991571369771f063deec72fcdd3c1
MD5 3bff902e2fc40d854b2a156567947667
BLAKE2b-256 95e9702f83f6ab485b4c8367296656a67622b009a9f2795a288dd4e704239276

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 ad3ba22fbf7c945694b07518c7e7a1c3ce48ee0518fd073f1ca9166de5fcc510
MD5 96e5ec64ebcdddfac4689503c706ea75
BLAKE2b-256 346ab728d0e87f9d0bb87fa20520b90b0623224d0c677cd68edca8da59954484

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp312-cp312-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp312-cp312-win32.whl
Algorithm Hash digest
SHA256 448e6523bee12cc5b1252499246505c42f581e2e96e426c5abb0547397498513
MD5 68844addb70263ffa066c8b4d9204afb
BLAKE2b-256 e23c1822b8480118ab91a3ae2e00a9877bb85a80baa15b008d62f26d0df597eb

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 22ac61e2caf2c16bb26969c8dcfdb55e694342d94792da98db6f666848ab9d7e
MD5 2aa830f587193913ec0afa0c705e54ed
BLAKE2b-256 b3d123b5703a0f491ea12af1a5f9f2f3fca79828d0807fe26d911155960df34b

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0a1d2e8d1da4c8c14d531f48f8b22016047744c68d8edd024b8385cc4fe7ad27
MD5 ac0fe2b485945dc61e6cadc4cf1f3b49
BLAKE2b-256 775cae9cf770f76121a56b61a158b8dff8d5c311e5cdce5ce00bab31f6f260df

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp312-cp312-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 c67d72c76289e5ee498c1aaafd7ae18ec733b1fbd77af22edfc4b3e05d1afa20
MD5 e97150e416295b78fbe53ee5d5eb40ed
BLAKE2b-256 794e4db53aa49e2859053580eef57cef6ca1f58856241b44b459a65837a9cb6e

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 c372905e8827031fd207bf7b669604f022a0b3ab453636fa44a04e09f4438658
MD5 ffda08868a10c87b1a30ca8422155cd8
BLAKE2b-256 4a01f4f787d5d9f910391d8cc3b50e91bb239bc0be3e9c69437d79cae6facc97

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp311-cp311-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 37c8cda5c2e886384e6b7957f2585e376338648c301be5d524a67e9a31b6a286
MD5 aa569da14351c074f1e83db278cf031b
BLAKE2b-256 2a39b7186df1558de3b4fbe0390fbcf4cffe804ba20e01e737aafc65299b2e6a

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 bbeb4fa197eabe02df51a65b1b4d5c71800e0ac9a3b4fb70dfde48e067db5b57
MD5 0f6be44bc0c5eb8934ba2275bef5545f
BLAKE2b-256 6b131b59ed5c8fa6dbe2840846bc31d6f3267325dc49a27c3142317d3b1282d9

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 80d645aa036f43e987235165b2ab37fd7e6930227c0dfd97762bf1bf9447779f
MD5 e538d08557f868d5f0274f75ef09ddfe
BLAKE2b-256 ed8261ed1d8267766e7832d450ed7ea0d5f99fa15ccc9712d252a7a229e2f504

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 00c23a67d36d1c3e7ed195d4e92da4f44bea80d6174adeb5a5a893f70cb95a8a
MD5 9123c868b483c9c96814937d82539b0f
BLAKE2b-256 3aee68e7203683ba70ea5369498ef097c80822a55761328df9bb3a7bcc3ead21

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 0066e95a2f4cdc5b62b8eb409c59aaf4fa0f4865954a9b99ec34062dfc1a360d
MD5 ec324d293c173b474252d48de5ebe0bd
BLAKE2b-256 45c6a4a871fcfbb71e591701c3131b1604965f34f36ac56795c5b08fa73935df

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp310-cp310-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp310-cp310-win32.whl
Algorithm Hash digest
SHA256 7a37334ae67a906be6832dd615df838b1eec96f85e9fed9c6c8ed0723f7174da
MD5 ee40ea4de2cb7837c46a7224197d3a24
BLAKE2b-256 78c26897d652a812646406cd479799ff85698368ae818a56d0df07858112ccd4

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a0bf1e74c33e87146e583a59283dc5a56d55e2e5c5db7e030fd73bf0f4572b98
MD5 f85a42c0bc569a3feb2cd78ce302f846
BLAKE2b-256 736981c5baeaaf519e22b028ff2ddaf2ea692c1ee3115c8254d8ea0fad5d0d20

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d4d79883e43ea901be54720db554cba74e3069a9cf886eac256c1ba1079e4bfe
MD5 7dc6ad477879a21b3bb4091f847d3361
BLAKE2b-256 3d7c3715342e66ea3adefc1d5a7eacdf8811d8db2f7977290e33a200e3e4da8f

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 92e047947ecea5fce4d936490707bfb42f778d7cc7b0679dc946953e9ede3897
MD5 d882110261a72473ce868ad2dbe89e50
BLAKE2b-256 45b2258480e0621381503219b3189f8722e57aceeeb08d9983bf0d8fb21fd5d6

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 eb421ec96663177436d521029760c680ee91ff4b57974afb7727841a7c58f015
MD5 81887db72a83c1145efa3aed8108e866
BLAKE2b-256 686889ed87b7558a7ce881da8e74813dfda1d8502c77752178e85a7ee11be3b9

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp39-cp39-win32.whl.

File metadata

  • Download URL: pydivsufsort-0.0.15-cp39-cp39-win32.whl
  • Upload date:
  • Size: 222.8 kB
  • Tags: CPython 3.9, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for pydivsufsort-0.0.15-cp39-cp39-win32.whl
Algorithm Hash digest
SHA256 bd3e646df3f93bced13b037b3d83e3f179978f27e163a052618c3f5f9032c48b
MD5 1d1462dd4a8b3377b560a67ae96ef5a3
BLAKE2b-256 d3b80aa463fb59bc9659ace60fa1dcafdb84029eca1b77c8f3d385891b2ec9c1

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7413daaa4f60a688621b21f06174027e2bbb5cc8bf0937ced32554174c47fd11
MD5 3338075a81c3d3e817e169b801e3c804
BLAKE2b-256 5fc927130b8660f21e00bc5d9a7b568733846d04a705033ed9ee730fc452aea0

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 668a6c6d8f50c68d7594bfc994ce810ab740fa03f411c231cbc91a903c161113
MD5 d5a015381f4f4af960a34d2ef3f2b241
BLAKE2b-256 0bd604bc2f4a5ef6e3d2e762409e698913cf2ad0a3973f8f4cbda0ac74b1b402

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.15-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.15-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 bc5b64cb18b76ae82014dc9cae69e2bef48dbd3b8913c7ef615035294a970da1
MD5 66761f0e6dfefb162bb985305351bd79
BLAKE2b-256 1eb2da458bbcdb8afd73669c31e6f06f8eb8bf56d68b35d7e798b88418f3f384

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page