A high-performance text pattern matching library built with Rust
Voluta
A high-performance Python library for searching text patterns using the Aho-Corasick algorithm. Built with Rust for blazing fast processing.
Features
- Memory-mapped file processing for optimal performance with large files
- Parallel processing option for multi-core utilization
- Configurable chunk sizes for memory management and performance tuning
- Direct byte matching for maximum control and performance
- Returns full match information (start and end positions)
- Case insensitive matching
- Support for overlapping pattern matches
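For readers unfamiliar with Aho-Corasick, the following is a minimal pure-Python sketch of the idea: build a trie of the patterns, add failure links, then scan the input once and report every pattern that ends at each position. It is an illustration only, not Voluta's implementation (which wraps the Rust aho-corasick crate). The names `build_automaton` and `find_all` are hypothetical, and case folding via `str.lower()` assumes ASCII-like text where lowercasing does not change length.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for non-empty patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first pass: depth-1 states keep fail = 0; deeper states
    # follow the parent's failure chain and inherit its outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(patterns, text, case_insensitive=True):
    """Return (start, end, pattern) for every match, overlaps included."""
    if case_insensitive:
        patterns = [p.lower() for p in patterns]
        text = text.lower()
    goto, fail, out = build_automaton(patterns)
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            # end is exclusive, so end - start == len(pattern)
            matches.append((i + 1 - len(pattern), i + 1, pattern))
    return matches
```

The input is scanned once regardless of how many patterns there are, which is why this family of algorithms suits searching large logs for many keywords at the same time.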
Using in your project
pip install voluta
Usage
Basic usage
import voluta
# Create a TextMatcher with patterns to search for
# Case insensitivity and overlapping matching are enabled by default
matcher = voluta.TextMatcher(["error", "warning", "critical"])
# Match patterns in a file (line-by-line)
# Returns (line_num, start_pos, end_pos, pattern)
matches = matcher.match_file("path/to/large.log")
for line_num, start, end, pattern in matches:
    print(f"Found '{pattern}' on line {line_num}, positions {start}-{end}")
# Using memory-mapped matching (faster for large files)
# Returns (byte_offset, end_offset, pattern)
matches = matcher.match_file_memmap("path/to/large.log", None) # use default chunk size
for start, end, pattern in matches:
    print(f"Found '{pattern}' at byte positions {start}-{end}")
# Using parallel memory-mapped matching (maximum performance)
matches = matcher.match_file_memmap_parallel("path/to/large.log", None, None)
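The memory-mapped methods report byte offsets, while match_file reports line numbers. If you need line numbers for byte-offset matches, one option is to index the line starts once and look each offset up with a binary search. The sketch below is independent of Voluta. The helper names `line_starts` and `line_of` are hypothetical, and the 1-based line numbering is an assumption of this example, not a statement about what match_file returns.

```python
from bisect import bisect_right


def line_starts(data: bytes) -> list[int]:
    """Return the byte offset at which each line begins."""
    starts = [0]
    pos = data.find(b"\n")
    while pos != -1:
        starts.append(pos + 1)
        pos = data.find(b"\n", pos + 1)
    return starts


def line_of(offset: int, starts: list[int]) -> tuple[int, int]:
    """Map a byte offset to (1-based line number, 0-based byte column)."""
    idx = bisect_right(starts, offset) - 1
    return idx + 1, offset - starts[idx]


data = b"ok\nerror here\nwarning\n"
starts = line_starts(data)
print(line_of(data.find(b"error"), starts))    # (2, 0)
print(line_of(data.find(b"warning"), starts))  # (3, 0)
```

Building the index costs one pass over the data. Each lookup afterwards is logarithmic in the number of lines.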
Advanced usage
# Specify chunk size (in bytes)
chunk_size = 8 * 1024 * 1024 # 8MB
matches = matcher.match_file_memmap("path/to/large.log", chunk_size)
# Specify chunk size and number of threads
chunk_size = 4 * 1024 * 1024 # 4MB
n_threads = 8
matches = matcher.match_file_memmap_parallel("path/to/large.log", chunk_size, n_threads)
# Direct byte matching for maximum performance
with open("path/to/large.log", "rb") as f:
    content = f.read() # Or load bytes from any source
matches = matcher.match_bytes(content)
for start, end, pattern in matches:
    print(f"Found '{pattern}' at positions {start}-{end}")
# Simple example of finding specific text patterns
text = "The fox jumped over the fence. The fox is quick."
matcher = voluta.TextMatcher(["fox", "jump", "quick"])
matches = matcher.match_bytes(text.encode())
for start, end, pattern in matches:
    context = text[max(0, start-5):min(len(text), end+5)]
    print(f"Found '{pattern}' at {start}-{end}: '...{context}...'")
# Finding overlapping patterns
text = "abcdefgh"
# Overlapping matches are enabled by default to find all possible matches
matcher = voluta.TextMatcher(["abcd", "bcde", "cdef"])
matches = matcher.match_bytes(text.encode())
for start, end, pattern in matches:
    print(f"Found '{pattern}' at {start}-{end}")
# Disable overlapping matches if needed
matcher = voluta.TextMatcher(["abcd", "bcde", "cdef"], overlapping=False)
# Case sensitivity control
text = "Hello WORLD"
# By default, case insensitivity is enabled
matcher = voluta.TextMatcher(["hello", "world"]) # Will match both Hello and WORLD
# Disable case insensitivity if needed
matcher = voluta.TextMatcher(["hello", "world"], case_insensitive=False) # Will only match exact case
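One caveat when slicing context around a match: positions from byte matching index into the encoded bytes, which only line up with str indices for pure-ASCII text. A small sketch of slicing the bytes instead, assuming UTF-8 input and assuming the reported positions are byte offsets with an exclusive end; the helper name `context` is hypothetical and not part of Voluta.

```python
def context(data: bytes, start: int, end: int, width: int = 5) -> str:
    """Decode the bytes around a match.

    A slice boundary may fall inside a multi-byte character, so
    undecodable fragments are replaced rather than raising an error.
    """
    lo = max(0, start - width)
    hi = min(len(data), end + width)
    return data[lo:hi].decode("utf-8", errors="replace")


text = "Café: the fox jumped"
data = text.encode("utf-8")
start = data.find(b"fox")   # byte offset 11; the str index would be 10
end = start + len(b"fox")
print(repr(context(data, start, end)))  # ' the fox jump'
```

Slicing `text` with the same numbers would be off by one here, because "é" occupies two bytes in UTF-8.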
Installation
Prerequisites
- Rust (latest stable)
- Python 3.12
- uv
- just
Building from source
# Clone repository
git clone https://github.com/trustshield/voluta.git && cd voluta
# Setup environment
uv venv
source .venv/bin/activate
uv sync --dev
# Build
just build
# Test
just test
Installing the wheel
After building, you can install the wheel in another project:
# The wheel file will be in target/wheels/
pip install /path/to/voluta/target/wheels/voluta-*.whl
# Alternatively, install directly from GitHub
pip install git+https://github.com/trustshield/voluta.git
Performance
The memory-mapped approach is significantly faster than line-by-line processing, especially for large files. For optimal performance:
- Use match_file_memmap_parallel for multi-core systems
- For maximum control and performance, use match_bytes with pre-loaded content
- Test different chunk sizes for your specific hardware (typically 4-16MB works well)
- For files under 100MB, the performance difference may be less noticeable
- Note that enabling overlapping matches may impact performance
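To compare chunk sizes on your own hardware, a small timing loop is usually enough. The sketch below is a generic harness, not part of Voluta: `time_chunk_sizes` is a hypothetical helper that accepts any callable taking `(path, chunk_size_in_bytes)`, which is the argument order shown for match_file_memmap in the usage examples.

```python
import time
from typing import Callable, Iterable


def time_chunk_sizes(
    match_fn: Callable[[str, int], object],
    path: str,
    chunk_sizes_mb: Iterable[int] = (4, 8, 16),
    repeats: int = 3,
) -> dict[int, float]:
    """Return the best wall-clock time in seconds per chunk size (in MB)."""
    results = {}
    for size_mb in chunk_sizes_mb:
        best = float("inf")
        for _ in range(repeats):
            started = time.perf_counter()
            match_fn(path, size_mb * 1024 * 1024)
            best = min(best, time.perf_counter() - started)
        results[size_mb] = best
    return results
```

With a matcher in hand, a call such as `time_chunk_sizes(matcher.match_file_memmap, "path/to/large.log")` would return a dict mapping each chunk size to its fastest run. Taking the minimum over a few repeats reduces noise from disk caching and other processes.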
Metrics
On a MacBook Pro M1 Pro with 16GB RAM:
% just stress 1 50 32 8
python tests/benchmark/stress.py --size 1 --patterns 50 --chunk 32 --threads 8
Generating 50 random search patterns...
Generating 1.0GB test file with 50 search patterns...
Progress: 100% complete
Created test file at /var/folders/65/6343wbc565jcmgj3mpvktl880000gp/T/tmpl0uwzhss.txt, size: 1.00GB
Inserted 1024247 pattern instances
Running stress test with 50 patterns:
- File size: 1.00GB
- Chunk size: 32MB
- Threads: 8
Testing memory-mapped matching...
Memory-mapped matching: 1107062 matches in 4.59 seconds
Processing speed: 223.13MB/s
Testing parallel memory-mapped matching...
Parallel memory-mapped matching: 1107062 matches in 0.63 seconds
Processing speed: 1629.94MB/s
Parallel processing is 7.30x faster than single-threaded
Sample matches:
• 'b37lBbWUl4u' found at byte positions 790320349-790320360
• 'OsoI' found at byte positions 619636284-619636288
• 'KGcWelcw6Awl7d4' found at byte positions 952973106-952973121
• 'YlvzcXcF' found at byte positions 481316276-481316284
• 'BvK' found at byte positions 909977231-909977234
Stress test completed successfully!
Cleaning up temporary test file: /var/folders/65/6343wbc565jcmgj3mpvktl880000gp/T/tmpl0uwzhss.txt
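As a sanity check on the figures above, throughput and speedup can be recomputed from the printed file size and timings. This sketch assumes 1 GB = 1024 MB, which is consistent with the reported speeds; `throughput_mb_s` is a hypothetical helper, not part of the benchmark script.

```python
def throughput_mb_s(size_gb: float, seconds: float) -> float:
    """Throughput in MB/s, taking 1 GB = 1024 MB."""
    return size_gb * 1024 / seconds


single = throughput_mb_s(1.0, 4.59)    # memory-mapped, single-threaded
parallel = throughput_mb_s(1.0, 0.63)  # memory-mapped, 8 threads
speedup = 4.59 / 0.63
print(f"{single:.2f} MB/s, {parallel:.2f} MB/s, {speedup:.2f}x")
```

The recomputed values land close to, but not exactly on, the reported 223.13MB/s, 1629.94MB/s, and 7.30x. The small gap is presumably because the script divides by unrounded timings while the output prints them to two decimals.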
Thanks
This library is a wrapper of BurntSushi/aho-corasick.
License
MIT