Skip to main content

String algorithms

Project description

PyPI version Downloads Test Build and upload codecov DOI

pydivsufsort: bindings to libdivsufsort

pydivsufsort prebuilds libdivsufsort as a shared library and includes it in a Python package with bindings. Wheels are built for Linux, macOS and Windows (32 and 64 bits) using cibuildwheel and GitHub Actions. Basically, you should be able to install it with pip install pydivsufsort on any system and it should work out of the box. If it doesn't, please create an issue.

Features:

  • bindings to divsufsort that return numpy arrays
  • handle string, bytes and almost any integer data type (e.g. int64) and not only char
  • algorithms work even for non char inputs
  • additional string algorithms coded in Cython

Installation

On Linux, macOS and Windows:

python -m pip install pydivsufsort

We provide precompiled wheels for common systems using cibuildwheel, and a source distribution for Unix systems. Manual compilation on Windows might require some tweaking, please create an issue.

Features

All methods support string, bytes and numpy array inputs, including datatypes greater than uint8_t (e.g. uint64_t). Below are the signatures of all methods exposed by pydivsufsort. To import a method, just do from pydivsufsort import method_name. All methods are documented in the docstrings. You can display them with help(method_name).

A nicer interface to reuse computations lazily is provided in WonderString but currently undocumented. Please create an issue if you are interested.

Methods exposed from libdivsufsort

  • divsufsort(string): suffix array
  • bw_transform(string): Burrows-Wheeler transform
  • inverse_bw_transform(idx, string): inverse Burrows-Wheeler transform
  • sa_search(string, suffix_array, pattern): search for a pattern in a suffix array

Additional string algorithms

  • kasai(string, suffix_array=None): LCP array computation (lazily computes the suffix array if not provided)
  • lcp_segtree(string, suffix_array=None, lcp=None): build a segment tree for LCP queries (lazily computes the suffix array and LCP array if not provided)
  • lcp_query(segtree, queries): query a segment tree for LCP queries. Queries are pairs of indices.
  • levenshtein(string1, string2): Levenshtein distance
  • most_frequent_substrings(lcp, length, limit=0, minimum_count=1): most frequent substrings. See the docstring for details.
  • common_substrings(string1, string2, limit=25): common substrings between two strings.
  • min_rotation(string): minimum rotation of a string

Example usage

from pydivsufsort import divsufsort, kasai

string_inp = "banana$"
string_suffix_array = divsufsort(string_inp)
string_lcp_array = kasai(string_inp, string_suffix_array)
print(string_suffix_array, string_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

# You can also convert the string input to integers first

import numpy as np

int_inp = np.unique(np.array(list(string_inp)), return_inverse=True)[1]
int_suffix_array = divsufsort(int_inp)
int_lcp_array = kasai(int_inp, int_suffix_array)
print(int_suffix_array, int_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

Development

You can install locally with

pip install -e .

A useful command to iterate quickly when changing Cython code is

python setup.py build_ext --inplace && pytest -s

Profiling

Profiling can be activated with the environment variable PROFILE:

PROFILE=1 python setup.py build_ext --inplace && pytest -s

Here is an example with line_profiler (requires pip install "line_profiler<4"):

import line_profiler
from pydivsufsort import common_substrings
from pydivsufsort.stringalg import (
    _common_substrings,
    repeated_substrings,
)

s1 = "banana" * 10000
s2 = "ananas" * 10000

func = common_substrings
profile = line_profiler.LineProfiler(func)
profile.add_function(_common_substrings)
profile.add_function(repeated_substrings)
profile.runcall(func, s1, s2, limit=15)
profile.print_stats()

Testing

pytest

Technical details (for performance tweaks)

libdivsufsort is compiled in both 32 and 64 bits, as the 32 bits version is faster. pydivsufsort automatically chooses to use the 32 bits version when possible (aka when the input size is less than 2**31-1).

For best performance, use contiguous arrays. If you have a sliced array, pydivsufsort converts it automatically with numpy.ascontiguousarray.

The precompiled libraries use OpenMP. You can disable it by setting the env variable OMP_NUM_THREADS=1, and it will yield the same performance as the version compiled without OpenMP

The original libdivsufsort only supports char as the base type. pydivsufsort can handle arrays of any integer type (even signed), by encoding each element as multiple chars, which makes the computation slower. If your values use an integer type that is bigger than required, but they span over a small contiguous range, pydivsufsort will automatically change their type (see #6).

Acknowledgements

Citing

If you have used this software in a scientific publication, please cite it using the following BibLaTeX code:

@software{pydivsufsort,
  author       = {Louis Abraham},
  title        = {pydivsufsort},
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7932458},
  url          = {https://github.com/louisabraham/pydivsufsort}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydivsufsort-0.0.0.tar.gz (302.2 kB view details)

Uploaded Source

Built Distributions

pydivsufsort-0.0.0-cp312-cp312-win_amd64.whl (264.9 kB view details)

Uploaded CPython 3.12 Windows x86-64

pydivsufsort-0.0.0-cp312-cp312-win32.whl (220.3 kB view details)

Uploaded CPython 3.12 Windows x86

pydivsufsort-0.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.0-cp312-cp312-macosx_11_0_arm64.whl (268.3 kB view details)

Uploaded CPython 3.12 macOS 11.0+ ARM64

pydivsufsort-0.0.0-cp312-cp312-macosx_10_9_x86_64.whl (298.0 kB view details)

Uploaded CPython 3.12 macOS 10.9+ x86-64

pydivsufsort-0.0.0-cp311-cp311-win_amd64.whl (267.7 kB view details)

Uploaded CPython 3.11 Windows x86-64

pydivsufsort-0.0.0-cp311-cp311-win32.whl (222.1 kB view details)

Uploaded CPython 3.11 Windows x86

pydivsufsort-0.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.0-cp311-cp311-macosx_11_0_arm64.whl (265.1 kB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

pydivsufsort-0.0.0-cp311-cp311-macosx_10_9_x86_64.whl (295.3 kB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

pydivsufsort-0.0.0-cp310-cp310-win_amd64.whl (266.5 kB view details)

Uploaded CPython 3.10 Windows x86-64

pydivsufsort-0.0.0-cp310-cp310-win32.whl (222.3 kB view details)

Uploaded CPython 3.10 Windows x86

pydivsufsort-0.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.0-cp310-cp310-macosx_11_0_arm64.whl (265.1 kB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

pydivsufsort-0.0.0-cp310-cp310-macosx_10_9_x86_64.whl (294.8 kB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

pydivsufsort-0.0.0-cp39-cp39-win_amd64.whl (267.1 kB view details)

Uploaded CPython 3.9 Windows x86-64

pydivsufsort-0.0.0-cp39-cp39-win32.whl (222.8 kB view details)

Uploaded CPython 3.9 Windows x86

pydivsufsort-0.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.0-cp39-cp39-macosx_11_0_arm64.whl (265.8 kB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

pydivsufsort-0.0.0-cp39-cp39-macosx_10_9_x86_64.whl (295.5 kB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

File details

Details for the file pydivsufsort-0.0.0.tar.gz.

File metadata

  • Download URL: pydivsufsort-0.0.0.tar.gz
  • Upload date:
  • Size: 302.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for pydivsufsort-0.0.0.tar.gz
Algorithm Hash digest
SHA256 485ae849e9d45fc1684aeff678ac8c8fde25ec0f3bc730e059500792db3333db
MD5 d8329ef76ed64253d54fe48cd92402cd
BLAKE2b-256 64035c56306e70ad76736fb28169291709f3089f0255444973c99885a2ed9b71

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 df72d59932d58e94b0a5abc247209d8fe5d7c7dba1726df3bc81e34da45640f9
MD5 b1ca946af355faccbd980b1f188b1d37
BLAKE2b-256 ac47348470ccc485121e6e28b97ff382c3c5c66d30abb22230d0bffb524420f2

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp312-cp312-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp312-cp312-win32.whl
Algorithm Hash digest
SHA256 d9fcb8b64af9fde5f51fc7f80c7106d2afd21f832402023ea21b1e0cc5be9be1
MD5 122cffce8f77c8282896adc5fd631f0d
BLAKE2b-256 d577172c047ab259e60d691818da8be0028c305924c4017ab93dfe4f9406f996

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 cc3591283e64af3d8d5950d2547da5a6e1ba504acd1d6e9170e6a9c1db80024a
MD5 0511e3f4199009b8979b82a6b306fe60
BLAKE2b-256 98692f6f4198791e53d85630acd8109c57fee4f1be9c30a14c66269855cfdd0c

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a60aa0f91c4f9c5b16b25bdb0ecca4aa23f9b8b08310a16401ddbade904391ea
MD5 a195a758aa8df7fc721717eec20f7636
BLAKE2b-256 e9e01bcc7ef8ff6831bcfe8dd73037b3162e83b011f62c54c737aeb3b74bf8be

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp312-cp312-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 6d8b2b3774eef2a777abbd23574e65cc1b48ccc726714790e7229774817aced1
MD5 d8694b155553525e4b5d579687697a5d
BLAKE2b-256 6a083af2a46b6e4ac76a1718ca387d8751f1ad0f628b340df7afd4b02a869bb3

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 49121b49f0b79311f8bdc35a7a79827b41034d3363c7b40967d20bc93d776524
MD5 f6e61c950eaea874da472d06770c16a0
BLAKE2b-256 a64207745a934060331c3916928ff124638436fafe58a630b2ace0b7878cd6c9

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp311-cp311-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 9a264198718c6e8140fb8a3728895b5d89f17cb8bd5e4efac71842d39e054edc
MD5 e9b41aa39b4ab33140e6d4fa2f0d2c24
BLAKE2b-256 ec4eab778db476974eea411a3e13cb2f95c0869e9929959878b610bf726f7d62

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7ae2d75b91cb3f6244d95da16f3766c2c812d945b014ba3cf08360f9d8830ce0
MD5 b06692f4bb57db98c2ea757ada573cbb
BLAKE2b-256 a12df20984cf33e38d9e184c84e2b723799b4a555245f746266db2207fd8df7c

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 eb461dd41ea2bf0f066653a46946a822065cdf7ecabb3766a62ad5a27bc3e1d4
MD5 b17445564a0a746adc6ae85b198082a1
BLAKE2b-256 a93942527355d83b9ac9117f1e499bb23127632f486f149122ec613fbed90497

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 88f26d2ec906adc18ca755ea91888dab7089c46e8b82f619d4ab1ad647f2376b
MD5 38d51ac1883b07ceea34f6e15afd6c6f
BLAKE2b-256 5b776feafd19cca944d80364cd255dbea4f3660fd39efe01af7ae4a850b937e8

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 6046f22f7a5b9910e731bc97e60cc558b5fe63d4bdb2a27549e9d0a94e2389db
MD5 ce00ce48424405fb2c0719259056df76
BLAKE2b-256 fc322d0ac49277ccfeed3653662e5d898af9a4eb497b6a6419ef267778bf5dea

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp310-cp310-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp310-cp310-win32.whl
Algorithm Hash digest
SHA256 8fcdfe5179a14a498a75948db765436f1e1677a83ba3ab4a1e1a97707f5af1eb
MD5 e432967754d6c28acb25248bbf262ad1
BLAKE2b-256 b6bc9fd07c659d929ff1e9c5287e8f5ed97984b2874116714e4dd20c794b3a9b

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a29a2138576ae981661ff422094804d775be699adf5dc719d083bf5bc82b801b
MD5 e131d020dd07388b48676575c9650ab1
BLAKE2b-256 1d5d435ac2a0c334f265690416ca8c59fb3915b22d55dc2969adcfc2309f63e7

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fc3c908c9c85de4e6e84e92b036203b8d0951966b4cbdcf58694a181cbc372d7
MD5 f88cbb973d09d4707ad62acfd1993bf1
BLAKE2b-256 0e322f2e19622098bcf66c9f3d3b1f96a7cc9d6345ee428ed2a758fe82d76c20

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 58e9a8f463be02fbc78af1719bff9b1b0927fdef2e2266c67213144087f0b2c3
MD5 2884b0e0c34f5f813ff548bd4c5661f5
BLAKE2b-256 4057c687d94579132ae425bfd82cceca236658ca3295ebfce779ee9af4e85d2a

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 ae85215f98f477e072352cac33823438eefba29da7b09b7788e1d6c4fb9eb635
MD5 c5de1f443de0eba4f370704262a12ff4
BLAKE2b-256 2b7025fd3194f38f4a1a0dd604b53e153dfc1f2fd1f0a8155b8e893b5af30371

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp39-cp39-win32.whl.

File metadata

  • Download URL: pydivsufsort-0.0.0-cp39-cp39-win32.whl
  • Upload date:
  • Size: 222.8 kB
  • Tags: CPython 3.9, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for pydivsufsort-0.0.0-cp39-cp39-win32.whl
Algorithm Hash digest
SHA256 f7ba108dad3ba84135ee5cf217fc5a632ff4d30d58ed5a420f44db3d81a64b91
MD5 8f10a0a840536cb054e06d114c85a9f0
BLAKE2b-256 edab3c23c9c950e53cc35d684bc2c8bdb1264d72dcfadb01dfae7ae6aba89dda

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ed8b7f024f56e86f50306009d2d900e87a6673246e18094ef74ad35b4ff44a22
MD5 2a188066e0aa3d9a088842ae644a44cc
BLAKE2b-256 a2923db43eb0db504aa0f915a23c8f2845e6d77ff0c32458d38c7ad407bd34e1

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 676c2d183661c8732afa96fa554e30e15e735f40741ee0ec4d312b4764a830c4
MD5 9810c2c546b2ca59eba31930648aa9b9
BLAKE2b-256 aeb07b12343f90684e7e1b40c776dba88ce4148d80a23437edee0a08d91d9b0e

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 00dfd28b05665d73304e95fe8017d6f85354e4787afdac4a15ee4c6dc1c79c66
MD5 4958721d5e438b778e6c76af1ed6ea0b
BLAKE2b-256 8b37c3092f327af23c9d53abee5fe6ff4c9946e64b4c4ecde38f32517e994853

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page