Skip to main content

Python library for fast substring/pattern search written in C++ leveraging Suffix Array Algorithm

Project description

Logo

Python library for fast substring/pattern search written in C++ leveraging Suffix Array Algorithm

license Python Build PyPi

Table of Contents

About The Project

PySubstringSearch is a library intended for searching over an index file for substring patterns. The library is written in C++ to achieve speed and efficiency. The library also uses Msufsort suffix array construction library for string indexing. The created index consists of the original text and a 32bit suffix array structs. The library relies on a proprietary container protocol to hold the original text along with the index in chunks of 512mb to evade the limitation of the Suffix Array Construction implementation.

The module implements two methods, search_sequential & search_parallel. search_sequential searches through the inner chunks one by one where search_parallel searches concurrently. When dealing with big indices, bigger than 1gb for example, search_parallel would function faster. I advice to check them both with the resulted index to find which one fits better.

Built With

Performance

Library Text Size Function Time #Results Improvement Factor
ripgrepy 500mb Ripgrepy('text_one', '500mb').run().as_string.split('\n') 127 ms ± 694 µs per loop 12553 1.0x
PySubstringSearch 500mb reader.search_sequential('text_one') 2.48 ms ± 53.4 µs per loop 12553 51.2x
PySubstringSearch 500mb reader.search_parallel('text_one') 3.78 ms ± 350 µs per loop 12553 33.6x
ripgrepy 500mb Ripgrepy('text_two', '500mb').run().as_string.split('\n') 127 ms ± 623 µs per loop 769 1.0x
PySubstringSearch 500mb reader.search_sequential('text_two') 156 µs ± 916 ns per loop 769 814.0x
PySubstringSearch 500mb reader.search_parallel('text_two') 251 µs ± 80.2 µs per loop 769 506.0x
ripgrepy 6gb Ripgrepy('text_one', '6gb').run().as_string.split('\n') 1.38 s ± 3.82 ms 206884 1.0x
PySubstringSearch 6gb reader.search_sequential('text_one') 93.7 ms ± 2.16 ms per loop 206884 15.3x
PySubstringSearch 6gb reader.search_parallel('text_one') 34.3 ms ± 321 µs per loop 206884 40.5x
ripgrepy 6gb Ripgrepy('text_two', '6gb').run().as_string.split('\n') 1.61 s ± 37.2 ms per loop 6921 1.0x
PySubstringSearch 6gb reader.search_sequential('text_two') 2.22 ms ± 79.3 µs per loop 6921 725.2x
PySubstringSearch 6gb reader.search_parallel('text_two') 1.38 ms ± 26 µs per loop 6921 1166.6x

Prerequisites

In order to compile this package you should have GCC & Python development package installed.

  • Fedora
sudo dnf install python3-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev g++-8

Installation

pip3 install PySubstringSearch

Usage

Create an index

import pysubstringsearch

# creating a new index file
# if a file with this name is already exists, it will be overwritten
writer = pysubstringsearch.Writer(
    index_file_path='output.idx',
)

# adding entries to the new index
writer.add_entry('some short string')
writer.add_entry('another but now a longer string')
writer.add_entry('more text to add')

# making sure the data is dumped to the file
writer.finalize()

Search a substring within an index

import pysubstringsearch

# opening an index file for searching
reader = pysubstringsearch.Reader(
    index_file_path='output.idx',
)

# lookup for a substring sequentially
reader.search_sequential('short')
>>> ['some short string']

# lookup for a substring sequentially
reader.search_sequential('string')
>>> ['some short string', 'another but now a longer string']

# lookup for a substring concurrently
reader.search_parallel('short')
>>> ['some short string']

# lookup for a substring concurrently
reader.search_parallel('string')
>>> ['some short string', 'another but now a longer string']

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - wavenator@gmail.com

Project Link: https://github.com/wavenator/PySubstringSearch

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PySubstringSearch-0.2.2.tar.gz (75.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page