A high-performance text pattern matching library built with Rust

Voluta

A Python library for searching text for many patterns at once using the Aho-Corasick algorithm. The matching core is written in Rust for speed.
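For readers unfamiliar with Aho-Corasick, the following is a minimal pure-Python sketch of the algorithm. It is only an illustration of the technique: the names build_automaton and find_all are hypothetical, it does not use voluta, and it is far slower than the Rust implementation the library wraps. Patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and per-state outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns that end at each state

    # Phase 1: insert every pattern into a trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (suffix matches).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return every (start, end, pattern) match, overlaps included."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i + 1 - len(pat), i + 1, pat))
    return matches


print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 4, 'she'), (2, 4, 'he'), (2, 6, 'hers')]
```

The text is scanned once regardless of how many patterns there are, which is why the algorithm suits log scanning with large pattern sets.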

Features

  • Memory-mapped file processing for optimal performance with large files
  • Parallel processing option for multi-core utilization
  • Configurable chunk sizes for memory management and performance tuning
  • Direct byte matching for maximum control and performance
  • Returns full match information (start and end positions)
  • Case-insensitive matching (enabled by default)
  • Support for overlapping pattern matches (enabled by default)
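Chunked processing raises one practical question: what happens to a match that straddles two chunks? The sketch below shows one general way to handle it, by letting each window read a short overlap tail past its chunk. This is an assumption about the technique in general, not a description of voluta's internals; find_in_chunk and scan_chunked are hypothetical names, and the naive search stands in for a real matcher.

```python
def find_in_chunk(chunk, patterns):
    """Naive overlapping search over bytes; stands in for a real matcher."""
    hits = []
    for pat in patterns:
        pos = chunk.find(pat)
        while pos != -1:
            hits.append((pos, pos + len(pat), pat))
            pos = chunk.find(pat, pos + 1)
    return hits


def scan_chunked(data, patterns, chunk_size):
    """Scan `data` in windows of `chunk_size` bytes plus an overlap tail."""
    # A match that starts on the last byte of a chunk needs at most
    # (longest pattern - 1) extra bytes to complete.
    overlap = max(len(p) for p in patterns) - 1
    matches = []
    for base in range(0, len(data), chunk_size):
        window = data[base:base + chunk_size + overlap]
        for start, end, pat in find_in_chunk(window, patterns):
            # Keep a hit only if it starts inside this chunk proper, so a
            # match that starts in the tail is reported once, by the next chunk.
            if start < chunk_size:
                matches.append((base + start, base + end, pat))
    return sorted(matches)


data = b"xxerrorxxwarningxx"
patterns = [b"error", b"warning"]
print(scan_chunked(data, patterns, chunk_size=4))
# [(2, 7, b'error'), (9, 16, b'warning')]
```

With a 4-byte chunk both patterns cross chunk boundaries, yet each is reported exactly once at its absolute byte offset.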

Using in your project

pip install voluta

Usage

Basic usage

import voluta

# Create a TextMatcher with patterns to search for
# Case insensitivity and overlapping matching are enabled by default
matcher = voluta.TextMatcher(["error", "warning", "critical"])

# Match patterns in a file (line-by-line)
# Returns (line_num, start_pos, end_pos, pattern)
matches = matcher.match_file("path/to/large.log")
for line_num, start, end, pattern in matches:
    print(f"Found '{pattern}' on line {line_num}, positions {start}-{end}")

# Using memory-mapped matching (faster for large files)
# Returns (start_offset, end_offset, pattern), with offsets in bytes
matches = matcher.match_file_memmap("path/to/large.log", None)  # use default chunk size
for start, end, pattern in matches:
    print(f"Found '{pattern}' at byte positions {start}-{end}")

# Using parallel memory-mapped matching (maximum performance)
# None, None: default chunk size and default thread count
matches = matcher.match_file_memmap_parallel("path/to/large.log", None, None)
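The memory-mapped methods report byte offsets rather than line numbers. If line numbers are needed afterwards, they can be recovered from the file contents. The helper below is a sketch under stated assumptions: line_starts and locate are hypothetical names, lines are assumed to end with "\n", and the 1-based numbering is a choice made here, not a statement about what match_file returns.

```python
from bisect import bisect_right


def line_starts(data):
    """Byte offset at which each line begins (the first line starts at 0)."""
    starts = [0]
    pos = data.find(b"\n")
    while pos != -1:
        starts.append(pos + 1)
        pos = data.find(b"\n", pos + 1)
    return starts


def locate(starts, offset):
    """Return (1-based line number, 0-based byte column) for a byte offset."""
    line = bisect_right(starts, offset)
    return line, offset - starts[line - 1]


data = b"ok\nerror here\nwarning\n"
starts = line_starts(data)
print(locate(starts, 3))   # (2, 0): "error" begins line 2
print(locate(starts, 14))  # (3, 0): "warning" begins line 3
```

Building the index costs one pass over the data; each lookup is then a binary search, so converting many offsets stays cheap.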

Advanced usage

# Specify chunk size (in bytes)
chunk_size = 8 * 1024 * 1024  # 8MB
matches = matcher.match_file_memmap("path/to/large.log", chunk_size)

# Specify chunk size and number of threads
chunk_size = 4 * 1024 * 1024  # 4MB
n_threads = 8
matches = matcher.match_file_memmap_parallel("path/to/large.log", chunk_size, n_threads)

# Direct byte matching for maximum performance
with open("path/to/large.log", "rb") as f:
    content = f.read()  # Or load bytes from any source
    matches = matcher.match_bytes(content)
    for start, end, pattern in matches:
        print(f"Found '{pattern}' at positions {start}-{end}")

# Simple example of finding specific text patterns
text = "The fox jumped over the fence. The fox is quick."
data = text.encode()
matcher = voluta.TextMatcher(["fox", "jump", "quick"])
matches = matcher.match_bytes(data)
for start, end, pattern in matches:
    # Offsets are byte positions, so slice the encoded bytes rather than the str
    context = data[max(0, start - 5):end + 5].decode(errors="replace")
    print(f"Found '{pattern}' at {start}-{end}: '...{context}...'")

# Finding overlapping patterns
text = "abcdefgh"
# Overlapping matches are enabled by default to find all possible matches
matcher = voluta.TextMatcher(["abcd", "bcde", "cdef"])
matches = matcher.match_bytes(text.encode())
for start, end, pattern in matches:
    print(f"Found '{pattern}' at {start}-{end}")

# Disable overlapping matches if needed
matcher = voluta.TextMatcher(["abcd", "bcde", "cdef"], overlapping=False)

# Case sensitivity control
text = "Hello WORLD"
# By default, case insensitivity is enabled
matcher = voluta.TextMatcher(["hello", "world"])  # Will match both Hello and WORLD
# Disable case insensitivity if needed
matcher = voluta.TextMatcher(["hello", "world"], case_insensitive=False)  # Will only match exact case
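To make the difference between overlapping and non-overlapping results concrete, here is a small pure-Python illustration. The greedy rule below is one common reading of "non-overlapping"; it is an assumption, not necessarily the rule voluta applies when overlapping=False, and drop_overlaps is a hypothetical helper.

```python
def drop_overlaps(matches):
    """Greedy left-to-right filter: keep a match only if it starts at or
    after the end of the last match that was kept."""
    kept = []
    last_end = 0
    for start, end, pat in sorted(matches):
        if start >= last_end:
            kept.append((start, end, pat))
            last_end = end
    return kept


# All three patterns occur in "abcdefgh", each overlapping the next.
all_matches = [(0, 4, "abcd"), (1, 5, "bcde"), (2, 6, "cdef")]
print(drop_overlaps(all_matches))
# [(0, 4, 'abcd')]
```

With overlaps allowed, all three matches are reported; under this greedy rule only the first survives, because the other two begin before it ends.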

Installation

Prerequisites

  • Rust (latest stable)
  • Python 3.12
  • uv
  • just

Building from source

# Clone repository
git clone https://github.com/trustshield/voluta.git && cd voluta

# Setup environment
uv venv
source .venv/bin/activate
uv sync --dev

# Build
just build

# Test
just test

Installing the wheel

After building, you can install the wheel in another project:

# The wheel file will be in target/wheels/
pip install /path/to/voluta/target/wheels/voluta-*.whl

# Alternatively, install directly from GitHub
pip install git+https://github.com/trustshield/voluta.git

Performance

The memory-mapped approach is significantly faster than line-by-line processing, especially for large files. For optimal performance:

  • Use match_file_memmap_parallel for multi-core systems
  • For maximum control and performance, use match_bytes with pre-loaded content
  • Test different chunk sizes for your specific hardware (typically 4-16MB works well)
  • For files under 100MB, the performance difference may be less noticeable
  • Note that enabling overlapping matches may impact performance
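One way to act on the chunk-size advice above is a small timing harness. The sketch below is generic: time_chunk_sizes is a hypothetical helper that accepts any callable taking a chunk size in bytes, so it runs here with a dummy workload. The commented line shows how it might be pointed at match_file_memmap as used earlier in this README.

```python
import time


def time_chunk_sizes(run, chunk_sizes_mb, repeats=3):
    """Return {chunk size in MB: best wall-clock seconds} for run(chunk_bytes)."""
    results = {}
    for mb in chunk_sizes_mb:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            run(mb * 1024 * 1024)
            best = min(best, time.perf_counter() - t0)
        results[mb] = best
    return results


def dummy_workload(chunk_bytes):
    """Placeholder so the sketch runs without voluta or a log file."""
    sum(range(10_000))


# With voluta installed, the callable could instead be:
#   run = lambda size: matcher.match_file_memmap("path/to/large.log", size)
for mb, seconds in time_chunk_sizes(dummy_workload, [4, 8, 16]).items():
    print(f"{mb:>2}MB chunks: {seconds:.4f}s")
```

Taking the best of several repeats reduces noise from file-system caching and other processes; the first run on a cold cache is often slower than the rest.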

Metrics

On a MacBook Pro M1 Pro with 16GB RAM:

% just stress 1 50 32 8
python tests/benchmark/stress.py --size 1 --patterns 50 --chunk 32 --threads 8
Generating 50 random search patterns...
Generating 1.0GB test file with 50 search patterns...
Progress: 100% complete
Created test file at /var/folders/65/6343wbc565jcmgj3mpvktl880000gp/T/tmpl0uwzhss.txt, size: 1.00GB
Inserted 1024247 pattern instances

Running stress test with 50 patterns:
  - File size: 1.00GB
  - Chunk size: 32MB
  - Threads: 8

Testing memory-mapped matching...
Memory-mapped matching: 1107062 matches in 4.59 seconds
Processing speed: 223.13MB/s

Testing parallel memory-mapped matching...
Parallel memory-mapped matching: 1107062 matches in 0.63 seconds
Processing speed: 1629.94MB/s

Parallel processing is 7.30x faster than single-threaded

Sample matches:
   'b37lBbWUl4u' found at byte positions 790320349-790320360
   'OsoI' found at byte positions 619636284-619636288
   'KGcWelcw6Awl7d4' found at byte positions 952973106-952973121
   'YlvzcXcF' found at byte positions 481316276-481316284
   'BvK' found at byte positions 909977231-909977234

Stress test completed successfully!

Cleaning up temporary test file: /var/folders/65/6343wbc565jcmgj3mpvktl880000gp/T/tmpl0uwzhss.txt
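The throughput and speedup figures in the log can be sanity-checked from the rounded timings. This sketch assumes 1.00GB means 1024 MB; throughput is a hypothetical helper, not part of the benchmark script.

```python
def throughput(size_mb, seconds):
    """Processing speed in MB/s."""
    return size_mb / seconds


size_mb = 1.00 * 1024                # assumes 1.00GB == 1024 MB
single_s, parallel_s = 4.59, 0.63    # rounded timings taken from the log

print(f"single:   {throughput(size_mb, single_s):.1f} MB/s")    # 223.1 MB/s
print(f"parallel: {throughput(size_mb, parallel_s):.1f} MB/s")  # 1625.4 MB/s
print(f"speedup:  {single_s / parallel_s:.2f}x")                # 7.29x
```

These values sit close to the logged 223.13MB/s, 1629.94MB/s, and 7.30x. The small gaps are what one would expect if the script divides by unrounded timings, though that is an inference from the numbers rather than something the log states.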

Thanks

This library is a wrapper around the BurntSushi/aho-corasick Rust crate.

License

MIT


Built Distributions

  • voluta-0.2.2-cp310-cp310-win_amd64.whl (904.7 kB): CPython 3.10, Windows x86-64
  • voluta-0.2.2-cp310-cp310-manylinux_2_34_x86_64.whl (501.8 kB): CPython 3.10, manylinux (glibc 2.34+), x86-64
  • voluta-0.2.2-cp310-cp310-manylinux_2_34_aarch64.whl (466.6 kB): CPython 3.10, manylinux (glibc 2.34+), ARM64
  • voluta-0.2.2-cp310-cp310-macosx_10_14_x86_64.whl (455.4 kB): CPython 3.10, macOS 10.14+ x86-64
  • voluta-0.2.2-cp310-cp310-macosx_10_14_x86_64.macosx_11_0_arm64.macosx_10_14_universal2.whl (864.4 kB): CPython 3.10, macOS 10.14+ universal2 (ARM64, x86-64)
