Python library for fast substring/pattern search written in C++ leveraging Suffix Array Algorithm
About The Project
PySubstringSearch is a library for searching an index file for substring patterns. It is written in C++ for speed and efficiency, and it uses the Msufsort suffix array construction library to index strings. The resulting index consists of the original text and 32-bit suffix array structures. The library stores the original text together with its index in a proprietary container format, split into chunks of 512 MB, to work around the size limitation of the suffix array construction implementation.
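To make the idea concrete, here is a minimal pure-Python sketch of substring lookup over a suffix array. It is not the library's implementation: the naive sort stands in for Msufsort, the function names (`build_suffix_array`, `find_occurrences`, `search_entries`) are hypothetical, and joining entries with a newline is an assumption made only for illustration, so the real on-disk layout may differ.

```python
from bisect import bisect_right

SEPARATOR = "\n"  # assumed entry delimiter, for illustration only


def build_suffix_array(text: str) -> list[int]:
    """Return suffix start offsets ordered by the suffix they begin.

    Naive construction (sorting whole suffixes); real libraries such as
    Msufsort use far faster algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return every offset where pattern starts, found by binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


def search_entries(entries: list[str], pattern: str) -> list[str]:
    """Return the entries containing pattern, in their original order.

    Assumes neither entries nor pattern contain SEPARATOR.
    """
    text = SEPARATOR.join(entries)

    # Record where each entry starts inside the joined text.
    starts, offset = [], 0
    for entry in entries:
        starts.append(offset)
        offset += len(entry) + len(SEPARATOR)

    suffix_array = build_suffix_array(text)
    hits = find_occurrences(text, suffix_array, pattern)

    # Map each hit offset back to the entry that contains it.
    entry_indices = sorted({bisect_right(starts, hit) - 1 for hit in hits})
    return [entries[i] for i in entry_indices]
```

For example, `find_occurrences("banana", build_suffix_array("banana"), "ana")` returns `[1, 3]`. Because all suffixes sharing a prefix sit next to each other in the sorted array, each lookup needs only two binary searches, regardless of how many entries the text holds.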
Performance
Benchmarks were measured on a file containing 500 MB of text.
High number of results
| Library | Function | Time | #Results | Improvement Factor |
|---|---|---|---|---|
| ripgrepy | `Ripgrepy('text', '500mb').run().as_string` | 82.1 ms ± 1.15 ms per loop | 10737 | 1.0x |
| PySubstringSearch | `reader.search('text')` | 2.31 ms ± 142 µs per loop | 10737 | 35.5x |
Low number of results
| Library | Function | Time | #Results | Improvement Factor |
|---|---|---|---|---|
| ripgrepy | `Ripgrepy('text', '500mb').run().as_string` | 101 ms ± 526 µs per loop | 251 | 1.0x |
| PySubstringSearch | `reader.search('text')` | 55.9 µs ± 464 ns per loop | 251 | 1803.0x |
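The "per loop" timings above resemble the output of IPython's `%timeit`. As a rough sketch of how comparable numbers could be collected in a plain script, the following uses the standard `timeit` module. The helper names `mean_seconds_per_call` and `improvement_factor` are hypothetical and are not part of PySubstringSearch; the benchmark setup actually used for the tables is not described in this document.

```python
import timeit


def mean_seconds_per_call(func, number=1000, repeat=5):
    """Return the mean wall-clock time of a single call to func, in seconds."""
    # timeit.repeat returns one total (for `number` calls) per repetition.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return sum(totals) / (number * repeat)


def improvement_factor(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Example with the first table's figures: 82.1 ms versus 2.31 ms.
factor = improvement_factor(82.1e-3, 2.31e-3)
print(f"{factor:.1f}x")
```

To time a real lookup, one could pass a callable such as `lambda: reader.search('text')` to `mean_seconds_per_call`, where `reader` is an opened `pysubstringsearch.Reader`.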
Prerequisites
To compile this package, you need GCC and the Python development package installed.
- Fedora
sudo dnf install python3-devel gcc-c++
- Ubuntu 18.04
sudo apt install python3-dev g++-8
Installation
pip3 install PySubstringSearch
Usage
Create an index
import pysubstringsearch
# creating a new index file
# if a file with this name already exists, it will be overwritten
writer = pysubstringsearch.Writer(
    index_file_path='output.idx',
)
# adding entries to the new index
writer.add_entry('some short string')
writer.add_entry('another but now a longer string')
writer.add_entry('more text to add')
# making sure the data is dumped to the file
writer.finalize()
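When the entries come from a larger source, such as the lines of a text file, the add-then-finalize steps above can be wrapped in a small helper. This is only a sketch: `index_entries` is a hypothetical name, it relies solely on the `add_entry` and `finalize` methods shown above, and skipping empty lines is an assumption about what a caller would want.

```python
def index_entries(writer, entries):
    """Add every non-empty entry to writer, finalize it, and return the count.

    `writer` is any object exposing add_entry(str) and finalize(), such as
    the pysubstringsearch.Writer shown above.
    """
    count = 0
    for entry in entries:
        entry = entry.rstrip("\n")  # drop the line ending when reading from a file
        if not entry:
            continue  # assumption: empty lines are not worth indexing
        writer.add_entry(entry)
        count += 1

    # make sure the data is dumped to the file
    writer.finalize()
    return count


# Possible usage ('corpus.txt' is a hypothetical input file):
#
#     import pysubstringsearch
#     writer = pysubstringsearch.Writer(index_file_path='output.idx')
#     with open('corpus.txt', encoding='utf-8') as corpus:
#         index_entries(writer, corpus)
```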
Search a substring within an index
import pysubstringsearch
# opening an index file for searching
reader = pysubstringsearch.Reader(
    index_file_path='output.idx',
)
# look up a substring that appears in a single entry
reader.search('short')
# returns ['some short string']
# look up a substring that appears in several entries
reader.search('string')
# returns ['some short string', 'another but now a longer string']
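Since `search` returns a plain list of matching entries, several lookups can be combined in ordinary Python. The sketch below is built only on the `search` method shown above; the helper names `search_many` and `entries_matching_all` are hypothetical and not part of the library.

```python
def search_many(reader, patterns):
    """Map each pattern to the list of entries returned by reader.search."""
    return {pattern: reader.search(pattern) for pattern in patterns}


def entries_matching_all(reader, patterns):
    """Return the entries that contain every pattern.

    The order follows the result list of the first pattern, and duplicates
    are dropped.
    """
    result_lists = [reader.search(pattern) for pattern in patterns]
    if not result_lists:
        return []

    # Entries present in every result list.
    common = set(result_lists[0]).intersection(*result_lists[1:])

    ordered, seen = [], set()
    for entry in result_lists[0]:
        if entry in common and entry not in seen:
            seen.add(entry)
            ordered.append(entry)
    return ordered
```

With the index created above, `entries_matching_all(reader, ['short', 'string'])` would be expected to return only `'some short string'`, since that is the one entry containing both substrings.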
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - wavenator@gmail.com
Project Link: https://github.com/wavenator/PySubstringSearch