Skip to main content

String algorithms

Project description

PyPI version Downloads Test Build and upload codecov DOI

pydivsufsort: bindings to libdivsufsort

pydivsufsort prebuilds libdivsufsort as a shared library and includes it in a Python package with bindings. Wheels are built for Linux, macOS and Windows (32 and 64 bits) using cibuildwheel and GitHub Actions. Basically, you should be able to install it with pip install pydivsufsort on any system and it should work out of the box. If it doesn't, please create an issue.

Features:

  • bindings to divsufsort that return numpy arrays
  • handle string, bytes and almost any integer data type (e.g. int64) and not only char
  • algorithms work even for non char inputs
  • additional string algorithms coded in Cython

Installation

On Linux, macOS and Windows:

python -m pip install pydivsufsort

We provide precompiled wheels for common systems using cibuildwheel, and a source distribution for Unix systems. Manual compilation on Windows might require some tweaking, please create an issue.

Features

All methods support string, bytes and numpy array inputs, including datatypes greater than uint8_t (e.g. uint64_t). Below are the signatures of all methods exposed by pydivsufsort. To import a method, just do from pydivsufsort import method_name. All methods are documented in the docstrings. You can display them with help(method_name).

A nicer interface to reuse computations lazily is provided in WonderString but currently undocumented. Please create an issue if you are interested.

Methods exposed from libdivsufsort

  • divsufsort(string): suffix array
  • bw_transform(string): Burrows-Wheeler transform
  • inverse_bw_transform(idx, string): inverse Burrows-Wheeler transform
  • sa_search(string, suffix_array, pattern): search for a pattern in a suffix array

Additional string algorithms

  • kasai(string, suffix_array=None): LCP array computation (lazily computes the suffix array if not provided)
  • lcp_segtree(string, suffix_array=None, lcp=None): build a segment tree for LCP queries (lazily computes the suffix array and LCP array if not provided)
  • lcp_query(segtree, queries): query a segment tree for LCP queries. Queries are pairs of indices.
  • levenshtein(string1, string2): Levenshtein distance
  • most_frequent_substrings(lcp, length, limit=0, minimum_count=1): most frequent substrings. See the docstring for details.
  • common_substrings(string1, string2, limit=25): common substrings between two strings.
  • min_rotation(string): minimum rotation of a string
  • longest_previous_factor(string, suffix_array=None, lcp=None): longest previous factor array (used in the Lempel-Ziv factorization)
  • lempel_zif_factorization(lpf, complexity: bool = False): Lempel-Ziv factorization
  • lempel_zif_complexity(string, suffix_array=None, lcp=None): Lempel-Ziv complexity

Example usage

from pydivsufsort import divsufsort, kasai

string_inp = "banana$"
string_suffix_array = divsufsort(string_inp)
string_lcp_array = kasai(string_inp, string_suffix_array)
print(string_suffix_array, string_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

# You can also convert the string input to integers first

import numpy as np

int_inp = np.unique(np.array(list(string_inp)), return_inverse=True)[1]
int_suffix_array = divsufsort(int_inp)
int_lcp_array = kasai(int_inp, int_suffix_array)
print(int_suffix_array, int_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

Development

You can install locally with

pip install -e .

A useful command to iterate quickly when changing Cython code is

python setup.py build_ext --inplace && pytest -s

Profiling

Profiling can be activated with the environment variable PROFILE:

PROFILE=1 python setup.py build_ext --inplace && pytest -s

Here is an example with line_profiler (requires pip install "line_profiler<4"):

import line_profiler
from pydivsufsort import common_substrings
from pydivsufsort.stringalg import (
    _common_substrings,
    repeated_substrings,
)

s1 = "banana" * 10000
s2 = "ananas" * 10000

func = common_substrings
profile = line_profiler.LineProfiler(func)
profile.add_function(_common_substrings)
profile.add_function(repeated_substrings)
profile.runcall(func, s1, s2, limit=15)
profile.print_stats()

Testing

pytest

Technical details (for performance tweaks)

libdivsufsort is compiled in both 32 and 64 bits, as the 32 bits version is faster. pydivsufsort automatically chooses to use the 32 bits version when possible (aka when the input size is less than 2**31-1).

For best performance, use contiguous arrays. If you have a sliced array, pydivsufsort converts it automatically with numpy.ascontiguousarray.

The precompiled libraries use OpenMP. You can disable it by setting the env variable OMP_NUM_THREADS=1, and it will yield the same performance as the version compiled without OpenMP

The original libdivsufsort only supports char as the base type. pydivsufsort can handle arrays of any integer type (even signed), by encoding each element as multiple chars, which makes the computation slower. If your values use an integer type that is bigger than required, but they span over a small contiguous range, pydivsufsort will automatically change their type (see #6).

Acknowledgements

Citing

If you have used this software in a scientific publication, please cite it using the following BibLaTeX code:

@software{pydivsufsort,
  author       = {Louis Abraham},
  title        = {pydivsufsort},
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7932458},
  url          = {https://github.com/louisabraham/pydivsufsort}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydivsufsort-0.0.16.tar.gz (343.4 kB view details)

Uploaded Source

Built Distributions

pydivsufsort-0.0.16-cp312-cp312-win_amd64.whl (333.6 kB view details)

Uploaded CPython 3.12 Windows x86-64

pydivsufsort-0.0.16-cp312-cp312-win32.whl (276.3 kB view details)

Uploaded CPython 3.12 Windows x86

pydivsufsort-0.0.16-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.16-cp312-cp312-macosx_11_0_arm64.whl (346.6 kB view details)

Uploaded CPython 3.12 macOS 11.0+ ARM64

pydivsufsort-0.0.16-cp312-cp312-macosx_10_9_x86_64.whl (384.6 kB view details)

Uploaded CPython 3.12 macOS 10.9+ x86-64

pydivsufsort-0.0.16-cp311-cp311-win_amd64.whl (347.3 kB view details)

Uploaded CPython 3.11 Windows x86-64

pydivsufsort-0.0.16-cp311-cp311-win32.whl (280.4 kB view details)

Uploaded CPython 3.11 Windows x86

pydivsufsort-0.0.16-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.16-cp311-cp311-macosx_11_0_arm64.whl (343.7 kB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

pydivsufsort-0.0.16-cp311-cp311-macosx_10_9_x86_64.whl (383.2 kB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

pydivsufsort-0.0.16-cp310-cp310-win_amd64.whl (345.0 kB view details)

Uploaded CPython 3.10 Windows x86-64

pydivsufsort-0.0.16-cp310-cp310-win32.whl (280.6 kB view details)

Uploaded CPython 3.10 Windows x86

pydivsufsort-0.0.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.16-cp310-cp310-macosx_11_0_arm64.whl (343.7 kB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

pydivsufsort-0.0.16-cp310-cp310-macosx_10_9_x86_64.whl (382.0 kB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

pydivsufsort-0.0.16-cp39-cp39-win_amd64.whl (345.5 kB view details)

Uploaded CPython 3.9 Windows x86-64

pydivsufsort-0.0.16-cp39-cp39-win32.whl (281.0 kB view details)

Uploaded CPython 3.9 Windows x86

pydivsufsort-0.0.16-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

pydivsufsort-0.0.16-cp39-cp39-macosx_11_0_arm64.whl (344.3 kB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

pydivsufsort-0.0.16-cp39-cp39-macosx_10_9_x86_64.whl (382.8 kB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

File details

Details for the file pydivsufsort-0.0.16.tar.gz.

File metadata

  • Download URL: pydivsufsort-0.0.16.tar.gz
  • Upload date:
  • Size: 343.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for pydivsufsort-0.0.16.tar.gz
Algorithm Hash digest
SHA256 ae63410608ba1a31aa308a54eb1ddf556cddb57108ad2fff0d6493991231caab
MD5 634b4ced8bb9cf2115e88e057e20f0b0
BLAKE2b-256 e4fa5cfe634b98b1abef266393583577ba94f4dc5ad3ce0fb7ecc4f5e421f3bc

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 2b7e227fa34f07e4ac9c426ec219091a95154f45434df01c42d3d6549b9a13b6
MD5 eb3951aa19530f535d4d7f74f4843d7e
BLAKE2b-256 3200be0fd05c5a6bb49baae4315d4a4d5a0414deecc0d5390f307f1810fc9efd

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp312-cp312-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp312-cp312-win32.whl
Algorithm Hash digest
SHA256 5763ddfcd7ce91c989967e687355bff12e21bc97b4ced4b288966c59944bb462
MD5 2fea41a837fd1c42baf98740935545d2
BLAKE2b-256 26f54088f1ca2df28d52c0520d6561124c949b1b3015329d417b2dcdf63bf2ad

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 eb6d491d8bb76d00176090af770c328f3cc5c2569fe5f0cd64116a98a850dcba
MD5 f4d29dceaea1c563ca5efa4cc25aa1cf
BLAKE2b-256 0d3aaf3dfd8af7d77af6f2b1cf089c518bb3a4f61129f1b383cffafc22d2a611

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fb87b975baf85b7e970f26b259dc58ef9da423f6e41bba83dc8001225eb87262
MD5 800d068d083a6e1e819b07733971b2fd
BLAKE2b-256 20e5d189e4f75bf5eb823a070c5799a58b41b6b43d6529421f8e6eb05e344e95

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp312-cp312-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 566d5c11dd3139cd4d4f7678d9bf2536527baffd716ba0c26a869ace13d69d3e
MD5 b59c8049c0bcaa39cbcedf4f964bb3c9
BLAKE2b-256 342203f3348687a92dcef131cd4a6c02e131542c59a7a43fbccf57317442741d

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 05a83efecea7c1c39de3728c35aaebe6f37003e6fdeb7f042084b84b9c6b35e3
MD5 d205714c3c0ad3df566db1832434b6c3
BLAKE2b-256 7fe8c04ef450851cad336974a0a8f7212dc7a77e99235a7dea93a16976a620a7

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp311-cp311-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 6b3e45c361f978f6e3861f3b00d33ecf459120084d6adbe3055ac6a8021aefbe
MD5 8fd7521698908a4d144b5befedd1f6e2
BLAKE2b-256 4b2eb2d5db64e491248517cf83c36dd3659cd152e75006590f49a52ef6ac8899

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d0bc6044a632a101110a782c13da5278512b942b30b8a24dc7d3d9f9c07b2567
MD5 0e9c0bd518daa63f9b03e74bff8badd3
BLAKE2b-256 d91d1c64ce2462d4ae14dc674504ada2e8cb5efee466b16e32444590bd1ab0bc

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2a9dcaf1d6c1c2d24efd325fee77af87a073f0fa02ce018e5ec044356c1f05a7
MD5 8038066e6988671b8bac59b0509f3434
BLAKE2b-256 798ac02efa5fb9776518c55278212f67f817800fe58841fd740989622958288d

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f606bb39c5a674ff15e6d7a8c2ba717b3677896e1e268b6677b9c2d2f0921852
MD5 ebec31ad367cd0e86d93da3df88c855f
BLAKE2b-256 926753c403606849979dfb6d27704633e94ae642b29a842ba04c4150c02f66e6

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 59c0db816b020c8d3c675df4c83c079a318776b0d8addbef1250d52a4ed1e553
MD5 29cfda2cccddff392da1b790951032e1
BLAKE2b-256 b7585f96a5bc6efe97c8a2b20776bf1a17b05067301b2f0bfb1f9bc8c9ab9f5a

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp310-cp310-win32.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp310-cp310-win32.whl
Algorithm Hash digest
SHA256 3a2f73861724f93b3e2db403a6cb3947963f6cbc73505f9d8cb0b4e12e8915b9
MD5 da218c67faa8db1c241fbe8bf8532bec
BLAKE2b-256 7568fdf9fa3a6d74f4f92027654eeb0fa60513a04f54c971340d2089b3287abf

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f6eb64e8ec2c8b77ab98fe0fe0b93e5abed2f242def6201cba144f7acbc11c76
MD5 5a48f4cb52ee63661934d956719f3950
BLAKE2b-256 781265171ebf4a6abf9158faf17e6ace3a2cb669f2033b432cf2286e79f8466c

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0a40772ef2db2591ce1095d63163a9aa2dd07db4b9f06d775804b9c3a03cec81
MD5 11e8635e27915e72a3101f34d815ab53
BLAKE2b-256 54264757cf52257bc8cf58c58998b18ed604577a09f12b6681c8a67bd2770fb3

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 1b5cd74c0f07f49fc48e3c96f066256217451428ee350c16ccf9837a907c4550
MD5 217ccde438d188aecde4f66c28b2f8ab
BLAKE2b-256 cf02f94d8304d5cace83350cdc9e4f8023f8948a0d62a873a3829fa32ae0b471

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 d3e13121916718d035db29b33e8fc5c3e3590076c09e2ac5935552e32e7f6656
MD5 d9dcc5eef2480be2cfed2192ecb3285e
BLAKE2b-256 6da1f6709330090593f4bd5401fd39fc0216490ebced87ea7a75ed307d983dc1

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp39-cp39-win32.whl.

File metadata

  • Download URL: pydivsufsort-0.0.16-cp39-cp39-win32.whl
  • Upload date:
  • Size: 281.0 kB
  • Tags: CPython 3.9, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for pydivsufsort-0.0.16-cp39-cp39-win32.whl
Algorithm Hash digest
SHA256 a6e3922239043767248e919dabaf5474a73fae55dd32f0e032f424dc8b0bdd5e
MD5 efac4cb00f48948062686a21b32586be
BLAKE2b-256 cce9e31836b3eadbb46030e99179656f17283bce869c12d44eff41369093ff06

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3fddb69d5976bca8bed621613786440e2c8865fd7d467d31034dae8b536fa06b
MD5 5c7563d466db4673d6c49f439b13f3bb
BLAKE2b-256 460fba5b56d8d5b1313f96afffb5e43e571a7527fc7e4c0ec0f7c57560227c2a

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3999dcbe28e7ccc5ae4e5d7bdbd54b0402365782580a22e6bd533844f3ea4ffa
MD5 33a1afc565f737960ca41c95bc30403f
BLAKE2b-256 2c5f1e5a6e0aa3ff3f9c3308e765994c151217de5b26684784c99304501c6610

See more details on using hashes here.

File details

Details for the file pydivsufsort-0.0.16-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pydivsufsort-0.0.16-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 c00764741fa6ce790c0fe4dd349a7d0cd19de6aa70c57dc942296c4631843b0a
MD5 81599775fff49f44b350c0d5412bfa95
BLAKE2b-256 2d914f936e31ee7fc274878697f6119fe72795fc20326c9e7e1fcfcb5f7c9817

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page