A high-performance text pattern matching library built with Rust

Voluta

A high-performance Python library that searches text for many patterns at once using the Aho-Corasick algorithm. The matching core is written in Rust for speed.
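To make the algorithm concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is for illustration only and is not how Voluta is implemented; the names `build_automaton` and `find_all` are made up for this example, and the exclusive end position is an assumption of the sketch. It reports every occurrence, including overlapping ones.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns that end at this state

    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns, case_insensitive=True):
    """Return (start, end, pattern) for every match; end is exclusive."""
    if case_insensitive:
        # lower() is fine for ASCII input; a few Unicode characters
        # change length when lowered, which would shift positions.
        text = text.lower()
        patterns = [p.lower() for p in patterns]
    goto, fail, out = build_automaton(patterns)
    matches = []
    state = 0
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i + 1 - len(pat), i + 1, pat))
    return matches
```

The text is scanned once regardless of how many patterns there are, which is why the algorithm suits large pattern sets and large inputs.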

Features

  • Memory-mapped file processing for optimal performance with large files
  • Parallel processing option for multi-core utilization
  • Configurable chunk sizes for memory management and performance tuning
  • Direct byte matching for maximum control and performance
  • Returns full match information (start and end positions)
  • Case insensitive matching
  • Support for overlapping pattern matches
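As a rough picture of what memory-mapped, chunked processing involves, the sketch below scans a file through `mmap` in fixed-size chunks using only the Python standard library. It is not Voluta's implementation: `scan_chunks` is a hypothetical name, the search is a naive `bytes.find`, and the overlap between windows is one plausible way to catch matches that straddle a chunk boundary.

```python
import mmap


def scan_chunks(path, patterns, chunk_size=8 * 1024 * 1024):
    """Yield (start, end, pattern) byte offsets; end is exclusive.

    `patterns` must be a non-empty list of non-empty bytes objects,
    and the file must not be empty (mmap rejects empty files).
    """
    # A match can spill at most (longest pattern - 1) bytes past a chunk.
    overlap = max(len(p) for p in patterns) - 1
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), 0, access=mmap.ACCESS_READ
    ) as mm:
        size = len(mm)
        for base in range(0, size, chunk_size):
            # Extend the window so boundary-crossing matches are visible.
            window = mm[base:min(size, base + chunk_size + overlap)]
            for pat in patterns:
                pos = window.find(pat)
                # Report only matches that start inside this chunk;
                # later starts belong to the next chunk's window.
                while pos != -1 and pos < chunk_size:
                    yield base + pos, base + pos + len(pat), pat
                    pos = window.find(pat, pos + 1)
```

Smaller chunks bound the amount of data handled at once; larger chunks reduce per-chunk overhead. Which side wins depends on the machine, which is why the chunk size is configurable.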

Installation

Prerequisites

  • Rust (latest stable)
  • Python 3.13
  • uv
  • just

Building from source

# Clone repository
git clone https://github.com/trustshield/voluta.git && cd voluta

# Setup environment
uv venv
source .venv/bin/activate
uv sync --dev

# Build
just build

# Test
just test

Installing the wheel

After building, you can install the wheel in another project:

# The wheel file will be in target/wheels/
pip install /path/to/voluta/target/wheels/voluta-*.whl

# Alternatively, install directly from GitHub
pip install git+https://github.com/trustshield/voluta.git

Usage

Basic usage

import voluta

# Create a TextMatcher with patterns to search for
# Case insensitivity and overlapping matching are enabled by default
matcher = voluta.TextMatcher(["error", "warning", "critical"])

# Match patterns in a file (line-by-line)
# Returns (line_num, start_pos, end_pos, pattern)
matches = matcher.match_file("path/to/large.log")
for line_num, start, end, pattern in matches:
    print(f"Found '{pattern}' on line {line_num}, positions {start}-{end}")

# Using memory-mapped matching (faster for large files)
# Returns (start_offset, end_offset, pattern), with offsets in bytes
matches = matcher.match_file_memmap("path/to/large.log", None)  # use default chunk size
for start, end, pattern in matches:
    print(f"Found '{pattern}' at byte positions {start}-{end}")

# Using parallel memory-mapped matching (maximum performance)
# Arguments: path, chunk size, thread count (None leaves the choice to the library)
matches = matcher.match_file_memmap_parallel("path/to/large.log", None, None)
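The memory-mapped methods report byte offsets rather than line numbers. If you need line numbers as well, one option is to index the newline positions once and look each offset up with a binary search. This is a sketch under assumptions: `line_starts` and `offset_to_line` are hypothetical helpers, offsets are taken to be 0-based from the start of the file, and line numbers are 1-based.

```python
import bisect


def line_starts(data: bytes) -> list[int]:
    """Return the byte offset at which each line begins."""
    starts = [0]
    pos = data.find(b"\n")
    while pos != -1:
        starts.append(pos + 1)
        pos = data.find(b"\n", pos + 1)
    return starts


def offset_to_line(starts: list[int], offset: int) -> tuple[int, int]:
    """Map a byte offset to (1-based line number, byte column in that line)."""
    idx = bisect.bisect_right(starts, offset) - 1
    return idx + 1, offset - starts[idx]


data = b"ok\nerror here\nwarning\n"
starts = line_starts(data)
print(offset_to_line(starts, data.find(b"warning")))
```

With Voluta, the `start` value from each `(start, end, pattern)` tuple would be passed as `offset`.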

Advanced usage

# Specify chunk size (in bytes)
chunk_size = 8 * 1024 * 1024  # 8MB
matches = matcher.match_file_memmap("path/to/large.log", chunk_size)

# Specify chunk size and number of threads
chunk_size = 4 * 1024 * 1024  # 4MB
n_threads = 8
matches = matcher.match_file_memmap_parallel("path/to/large.log", chunk_size, n_threads)

# Direct byte matching for maximum performance
with open("path/to/large.log", "rb") as f:
    content = f.read()  # Or load bytes from any source
    matches = matcher.match_bytes(content)
    for start, end, pattern in matches:
        print(f"Found '{pattern}' at positions {start}-{end}")

# Simple example of finding specific text patterns
text = "The fox jumped over the fence. The fox is quick."
matcher = voluta.TextMatcher(["fox", "jump", "quick"])
matches = matcher.match_bytes(text.encode())
for start, end, pattern in matches:
    # Positions refer to the encoded bytes; slicing the str works here because the text is ASCII
    context = text[max(0, start - 5):end + 5]
    print(f"Found '{pattern}' at {start}-{end}: '...{context}...'")

# Finding overlapping patterns
text = "abcdefgh"
# Overlapping matches are enabled by default to find all possible matches
matcher = voluta.TextMatcher(["abcd", "bcde", "cdef"])
matches = matcher.match_bytes(text.encode())
for start, end, pattern in matches:
    print(f"Found '{pattern}' at {start}-{end}")

# Disable overlapping matches if needed
matcher = voluta.TextMatcher(["abcd", "bcde", "cdef"], overlapping=False)

# Case sensitivity control
text = "Hello WORLD"
# By default, case insensitivity is enabled
matcher = voluta.TextMatcher(["hello", "world"])  # Will match both Hello and WORLD
# Disable case insensitivity if needed
matcher = voluta.TextMatcher(["hello", "world"], case_insensitive=False)  # Will only match exact case
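The fox example above slices the original `str` with positions obtained from encoded bytes. That works for ASCII, but byte offsets and character indices diverge as soon as the text contains multi-byte UTF-8 characters. A safer pattern is to slice the bytes and decode the slice. The sketch below assumes the positions are byte offsets with an exclusive end; `context` is a hypothetical helper, and `bytes.find` stands in for a matcher so the snippet runs on its own.

```python
def context(data: bytes, start: int, end: int, width: int = 5) -> str:
    """Decode the match plus up to `width` bytes of context on each side."""
    lo = max(0, start - width)
    hi = min(len(data), end + width)
    # errors="replace" keeps this safe if the window cuts a multi-byte character.
    return data[lo:hi].decode("utf-8", errors="replace")


text = "Überraschung: the fox is quick."
data = text.encode("utf-8")
start = data.find(b"fox")  # stand-in for an offset returned by a matcher
end = start + len(b"fox")

print(repr(text[start:end]))            # slicing the str is off by one here
print(repr(context(data, start, end)))  # slicing the bytes stays aligned
```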

Performance

The memory-mapped approach is significantly faster than line-by-line processing, especially for large files. For optimal performance:

  • Use match_file_memmap_parallel for multi-core systems
  • For maximum control and performance, use match_bytes with pre-loaded content
  • Test different chunk sizes for your specific hardware (typically 4-16MB works well)
  • For files under 100MB, the performance difference may be less noticeable
  • Note that enabling overlapping matches may impact performance
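To compare chunk sizes on your own hardware, a small timing harness is usually enough. The helpers below are hypothetical and use only the standard library; `match_fn` is any callable that takes `(path, chunk_size)`, which is the call shape shown for `match_file_memmap` in the usage examples.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best


def sweep_chunk_sizes(match_fn, path, sizes_mb=(4, 8, 16)):
    """Time match_fn(path, chunk_size) for each chunk size given in MB."""
    return {mb: time_call(match_fn, path, mb * 1024 * 1024) for mb in sizes_mb}
```

For example, `sweep_chunk_sizes(matcher.match_file_memmap, "path/to/large.log")` returns a dict mapping each chunk size in MB to its best time. Taking the best of several runs reduces noise from disk caching on the first pass.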

Thanks

This library is a wrapper around the BurntSushi/aho-corasick Rust crate.

License

MIT

Built Distributions

  • voluta-0.0.1-cp313-cp313-manylinux_2_34_x86_64.whl (499.5 kB): CPython 3.13, manylinux (glibc 2.34+), x86-64
  • voluta-0.0.1-cp313-cp313-manylinux_2_34_aarch64.whl (462.3 kB): CPython 3.13, manylinux (glibc 2.34+), ARM64
  • voluta-0.0.1-cp313-cp313-macosx_10_14_x86_64.whl (450.0 kB): CPython 3.13, macOS 10.14+, x86-64
  • voluta-0.0.1-cp313-cp313-macosx_10_14_x86_64.macosx_11_0_arm64.macosx_10_14_universal2.whl (855.8 kB): CPython 3.13, macOS 10.14+ universal2 (ARM64, x86-64)
