Boyer-moore in pure python, search for unicode strings in large files very quickly
Project description
Boyer-Moore in pure python: search for unicode strings quickly in large files
This is an implementation of the Boyer-Moore substring search algorithm in pure python.
It is a shameless copy-paste of the python reference code provided here, with modifications to support the following additional features:
Searching in files without reading the whole file into memory, allowing handling of large files
Full unicode support
See the API documentation for more details.
Installing
Install from pip.
pip install boyermoore
Searching for all occurences of a substring in a file
>>> from boyermoore import search_file >>> >>> offsets = search_file("pattern!", "file.txt") # Find all occurrences of "pattern!" in file "file.txt" >>> offsets # Display found occurrences [12, 456, 10422] # Pattern occurs at byte offsets 12, 456, and 104222
Searching for the first occurence of a substring in a file
>>> from boyermoore import search_file >>> >>> offsets = search_file("pattern!", "file.txt", greedy=False) # Find the first occurrence of "pattern!" in file "file.txt" >>> offsets # Display found occurrences [12] # First occurrence of pattern is at byte offset 12
Performance / Speed test
The following section illustrates the average speed of the boyermoore.search_file function when searching for a unicode string in files of sizes ranging from 32MB to 4GB.
Test environment
The test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz and 32 GB of RAM.
Test methodology
The test searches for all occurrences of a fixed unicode string in a series of test files. The unicode string is:
Hello नमस्ते Привет こんにちは
(“Hello” in English, followed by the Hindi translation, followed by the Russian translation, followed by the Japanese translation)
Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0) and one at the very end (byte offset of [file_length - pattern_length]).
Test results
The following table shows the times taken to search for all occurences of the unicode string “Hello नमस्ते Привет こんにちは” inside test files of various sizes.
File size |
Time (seconds) |
---|---|
32 MB |
0.31 |
64 MB |
0.55 |
128 MB |
1.07 |
256 MB |
1.96 |
512 MB |
3.87 |
1 GB |
7.56 |
4 GB |
31.01 |
Contributions
Contributions are welcome, please open a pull request at https://github.com/eriknyquist/boyermoore and ensure that:
All existing unit tests pass (run tests via python setup.py test)
New unit tests are added to cover any modified/new functionality (run python code_coverage.py to ensure that coverage is above 98%)
You will need to install packages required for development, these are listed in dev_requirements.txt:
pip install -r dev_requirements.txt
If you have any questions about / need help with contributions or unit tests, please contact Erik at eknyquist@gmail.com.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Hashes for boyermoore-1.0.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0dd4e7b29b8cabeb1114452a88241b2e5ad66bf914240151c132158dcfd87ddd |
|
MD5 | ead0a2b5b11abc52f461c8dbbbe558b2 |
|
BLAKE2b-256 | aeeaefa5a23bdb04044b8931d9ac5b00d7cb5831cce0560c167b413ab14dc761 |