Skip to main content

String algorithms

Project description

PyPI version Build Status

pydivsufsort: bindings to libdivsufsort

pydivsufsort prebuilds libdivsufsort as a shared library and includes it in a Python package with bindings.

Features:

  • bindings to divsufsort that return numpy arrays
  • handle almost any integer data type (e.g. int64) and not only char
  • additional string algorithms

Installation

On Linux, macOS and Windows:

python -m pip install pydivsufsort

We provide precompiled wheels for common systems, and a source distribution for Unix systems. Manual compilation on Windows might require some tweaking, please create an issue.

Usage

Using String Inputs

import numpy as np
from pydivsufsort import divsufsort, kasai

string_inp = "banana$"
string_suffix_array = divsufsort(string_inp)
string_lcp_array = kasai(string_inp, string_suffix_array)
print(string_suffix_array, string_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

Using Integer Inputs

import numpy as np
from pydivsufsort import divsufsort, kasai

string_inp = "banana$"

# Convert the string input to integers first
int_inp = np.unique(np.array(list(string_inp)), return_inverse=True)[1]
int_suffix_array = divsufsort(int_inp)
int_lcp_array = kasai(int_inp, int_suffix_array)
print(int_suffix_array, int_lcp_array)
# [6 5 3 1 0 4 2] [0 1 3 0 0 2 0]

Using Multiple Sentinel Characters Witin A String

import numpy as np
from pydivsufsort import divsufsort, kasai

sentinel_inp = "a$banana#and@a*bandana+"
sentinel_suffix_array = divsufsort(sentinel_inp)
sentinel_lcp_array = kasai(sentinel_inp, sentinel_suffix_array)
print(sentinel_suffix_array, sentinel_lcp_array)
# [ 8  1 14 22 12  7  0 13 21  5 19  3  9 16  2 15 11 18  6 20  4 10 17] [0 0 0 0 0 1 1 1 1 3 3 2 3 0 3 0 1 0 2 2 1 2 0]

Testing

pytest

Technical details (for performance tweaks)

libdivsufsort is compiled in both 32 and 64 bits, as the 32 bits version is faster. pydivsufsort automatically chooses to use the 32 bits version when possible (aka when the input size is less than 2**31-1).

For best performance, use contiguous arrays. If you have a sliced array, pydivsufsort converts it automatically with numpy.ascontiguousarray.

The precompiled libraries use OpenMP. You can disable it by setting the env variable OMP_NUM_THREADS=1, and it will yield the same performance as the version compiled without OpenMP

The original libdivsufsort only supports char as the base type. pydivsufsort can handle arrays of any integer type (even signed), by encoding each element as multiple chars, which makes the computation slower. If your values use an integer type that is bigger than required, but they span over a small contiguous range, pydivsufsort will automatically change their type (see #6).

Acknowledgements

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for pydivsufsort, version 0.0.5
Filename, size File type Python version Upload date Hashes
Filename, size pydivsufsort-0.0.5-cp36-cp36m-macosx_10_13_x86_64.whl (266.1 kB) File type Wheel Python version cp36 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp36-cp36m-manylinux2010_i686.whl (930.9 kB) File type Wheel Python version cp36 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp36-cp36m-manylinux2010_x86_64.whl (1.2 MB) File type Wheel Python version cp36 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp36-cp36m-win32.whl (184.5 kB) File type Wheel Python version cp36 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp36-cp36m-win_amd64.whl (223.9 kB) File type Wheel Python version cp36 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp37-cp37m-macosx_10_13_x86_64.whl (266.5 kB) File type Wheel Python version cp37 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp37-cp37m-manylinux2010_i686.whl (926.3 kB) File type Wheel Python version cp37 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp37-cp37m-manylinux2010_x86_64.whl (1.2 MB) File type Wheel Python version cp37 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp37-cp37m-win32.whl (184.5 kB) File type Wheel Python version cp37 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp37-cp37m-win_amd64.whl (224.0 kB) File type Wheel Python version cp37 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp38-cp38-macosx_10_13_x86_64.whl (275.5 kB) File type Wheel Python version cp38 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp38-cp38-manylinux2010_i686.whl (1.1 MB) File type Wheel Python version cp38 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp38-cp38-manylinux2010_x86_64.whl (1.4 MB) File type Wheel Python version cp38 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp38-cp38-win32.whl (189.0 kB) File type Wheel Python version cp38 Upload date Hashes View
Filename, size pydivsufsort-0.0.5-cp38-cp38-win_amd64.whl (232.2 kB) File type Wheel Python version cp38 Upload date Hashes View
Filename, size pydivsufsort-0.0.5.tar.gz (222.0 kB) File type Source Python version None Upload date Hashes View

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring DigiCert DigiCert EV certificate Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page