Utilities for rapid text file processing using Vectorscan/Hyperscan in Python

VectorGrep

os: linux | python: 3.10+ | python style: google | imports: isort | code style: black | code style: pycodestyle | doc style: pydocstyle | static typing: mypy | linting: pylint | testing: pytest | security: bandit | license: MIT

VectorGrep is a high-performance (Vectorized) Global Regular Expression "Processing" library for Python. It uses Vectorscan (a portable fork of Intel Hyperscan) to maximize performance, and can be used in multithreaded or multiprocess applications. VectorGrep is also home to vectorgrep (Vectorized Global Regular Expression Printer), a multithreaded, multi-file "grep" command that searches many files in parallel. It can often be used as a drop-in replacement for grep, egrep, zgrep, and similar commands.

While a standard "grep" is designed to "print", VectorGrep is designed to allow full control over "processing". It scans compressed or uncompressed text files for regular expressions, and lets you customize the action taken when a match is found. For full information about the performance of Vectorscan (and Hyperscan), refer to:
  • VectorCamp: Vectorscan
  • Intel: Hyperscan

VectorGrep is also the successor to HyperGrep. It is designed to be a drop-in replacement for the original during initial releases. Refer to the FAQ for more information about this change.

Key Features

  • Simplicity
    • No experience with Vectorscan/Hyperscan required. Provides "grep" styled interfaces.
    • No external dependencies, and no building required (on natively supported platforms).
    • Built-in support for compressed and uncompressed files.
  • Speed
    • Uses Vectorscan/Hyperscan, a high-performance multiple regex matching library.
    • Performs read and regex operations outside Python.
    • Batches results for Python, reducing overhead (customizable).
  • Parallelism
    • Bypasses GIL (Global Interpreter Lock) during read and regex operations to allow proper multithreading.
    • Python consumer threads (callbacks) are able to handle many producer threads (readers).
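
The batching point above can be illustrated without VectorGrep at all. The following is a minimal sketch using only the standard library `re` module; `scan_in_batches` and `batch_size` are hypothetical names chosen for this illustration, and this is not how VectorGrep is implemented internally. It only shows why delivering matches in groups means fewer Python callback invocations.

```python
import re
from typing import Callable, Iterable


def scan_in_batches(
    lines: Iterable[str],
    pattern: str,
    on_batch: Callable[[list], None],
    batch_size: int = 3,
) -> int:
    """Deliver matching lines to on_batch in groups; return the number of callback calls."""
    regex = re.compile(pattern)
    batch = []
    calls = 0
    for line in lines:
        if regex.search(line):
            batch.append(line)
            if len(batch) == batch_size:
                on_batch(batch)
                calls += 1
                batch = []  # Start a fresh list so delivered batches stay intact.
    if batch:  # Flush the final, partially filled batch.
        on_batch(batch)
        calls += 1
    return calls


lines = [f"error {n}" for n in range(7)] + ["ok"]
received = []
calls = scan_in_batches(lines, r"error", received.append, batch_size=3)
print(calls, [len(batch) for batch in received])
```

Seven matching lines reach the callback in three calls instead of seven. A larger batch size reduces the call count further, at the cost of holding more results before they are processed.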

Compatibility

  • Supports Python 3.10+
  • Supports Linux systems with x86_64 architecture
    • Ubuntu Focal (20.04), Debian Bullseye (11), and above out of the box
    • Ubuntu Trusty (14.04) and above with gcc-9/g++-9 installed
    • Other Operating System configurations may work, but are not tested/guaranteed
      • Linux distros other than Debian/Ubuntu should work, assuming GLIBC is high enough
      • May be able to be built on Windows/OSX manually
      • More platforms are planned to be supported (natively) in the future
  • Some regex constructs are not supported by Vectorscan/Hyperscan in order to guarantee stable performance
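
The natively supported configuration listed above can be checked at runtime before relying on the bundled libraries. This is a rough sketch using only the standard library; `is_natively_supported` is a hypothetical helper, not part of VectorGrep, and it does not check the GLIBC version or the distribution release.

```python
import platform
import sys


def is_natively_supported(system: str, machine: str, version: tuple) -> bool:
    """Return True for Linux on x86_64 with Python 3.10 or newer."""
    return system == "Linux" and machine == "x86_64" and version >= (3, 10)


supported = is_natively_supported(
    platform.system(), platform.machine(), sys.version_info[:2]
)
if supported:
    print("Linux x86_64 with Python 3.10+: matches the natively supported platform")
else:
    print("Not Linux x86_64 with Python 3.10+: a custom build may be required")
```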

Getting Started

Installation

  • Install VectorGrep via pip:

    pip install vectorgrep
    
  • Or via git clone:

    git clone <path to fork>
    cd vectorgrep
    pip install .
    
  • Or build and install from wheel:

    # Build locally.
    git clone <path to fork>
    cd vectorgrep
    make wheel
    
    # Push dist/vectorgrep*.tar.gz to environment where it will be installed.
    pip install dist/vectorgrep*.tar.gz
    

Examples

  • Read one file with the example single threaded command:

    # vectorgrep/scanner.py <regex> <file>
    vectorgrep/scanner.py pattern ./vectorgrep/scanner.py
    
  • Read multiple files with the multithreaded command (a drop-in replacement for grep where patterns are compatible):

    # From install:
    # vectorgrep <regex> <file(s)>
    vectorgrep pattern ./vectorgrep/scanner.py
    
    # From package:
    # vectorgrep/multiscanner.py <regex> <file>
    vectorgrep/multiscanner.py pattern ./vectorgrep/scanner.py
    
  • Collect all matches from a file, similar to grep, and perform a custom operation on results:

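    import vectorgrep
    
    file = './vectorgrep/scanner.py'
    pattern = 'pattern'
    
    # grep() takes a list of patterns and returns the results plus a return code.
    results, return_code = vectorgrep.grep(file, [pattern])
    # Each result unpacks into an (index, line) pair.
    for index, line in results:
        print(f'{index}: {line}')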
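
    
The same call shape extends to several files. The sketch below is an illustration only: `count_matches` is a hypothetical helper, and it assumes nothing about `vectorgrep.grep` beyond what the example above shows (it takes a file and a list of patterns, and returns an iterable of results plus a return code). The grep function is passed in as a parameter so the sketch runs without VectorGrep installed; `fake_grep` is a stand-in with made-up data.

```python
from typing import Callable, Iterable


def count_matches(files: Iterable[str], patterns: list, grep_func: Callable) -> dict:
    """Map each file to the number of results grep_func returns for it."""
    counts = {}
    for file in files:
        # The return code is not interpreted here; only the results are counted.
        results, _return_code = grep_func(file, patterns)
        counts[file] = len(list(results))
    return counts


def fake_grep(file: str, patterns: list) -> tuple:
    """Stand-in with the same call shape as the example above (illustrative data)."""
    data = {"a.log": [(0, "pattern one"), (4, "pattern two")], "b.log": []}
    return data[file], 0


print(count_matches(["a.log", "b.log"], ["pattern"], fake_grep))
# With VectorGrep installed: count_matches(files, ["pattern"], vectorgrep.grep)
```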
    
  • Manually scan a file and perform a custom operation on match:

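    import vectorgrep
    
    file = './vectorgrep/scanner.py'
    pattern = 'pattern'
    
    def on_match(matches: list, count: int) -> None:
        # Process only the first `count` entries of `matches`.
        for index in range(count):
            match = matches[index]
            # match.line is bytes; decode it before printing.
            line = match.line.decode(errors='ignore')
            print(f'Custom print: {line.rstrip()}')
    
    # scan() calls on_match with the matches found for the given patterns.
    vectorgrep.scan(file, [pattern], on_match)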
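
    
A callback does not have to print; it can also accumulate results. The following sketch assumes only the callback signature shown above, `(matches, count)`, with each match exposing a bytes `line` attribute. `LineCollector` is a hypothetical class, and the lock is a precaution in case one callback object is shared across threads; whether that is required is not stated here. The fake match objects exist only so the sketch runs without VectorGrep.

```python
import threading
from types import SimpleNamespace


class LineCollector:
    """Callback object that stores decoded matching lines."""

    def __init__(self) -> None:
        self.lines = []
        self._lock = threading.Lock()

    def __call__(self, matches: list, count: int) -> None:
        # Read only the first `count` entries, mirroring the example above.
        decoded = [
            matches[index].line.decode(errors="ignore").rstrip()
            for index in range(count)
        ]
        with self._lock:
            self.lines.extend(decoded)


collector = LineCollector()
fake_matches = [
    SimpleNamespace(line=b"first match\n"),
    SimpleNamespace(line=b"second match\n"),
    SimpleNamespace(line=b"not counted\n"),
]
collector(fake_matches, 2)
print(collector.lines)
# With VectorGrep installed: vectorgrep.scan(file, [pattern], collector)
```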
    
  • Override the libhs and/or libzstd libraries to use files outside the package. This must be called before any other use of vectorgrep:

    import vectorgrep
    
    vectorgrep.configure_libraries(
        libhs='/home/myuser/libhs.so.mybuild',
        libzstd='/home/myuser/libzstd.so.mybuild',
    )
    

Contributing

Refer to the Contributing Guide for information on how to contribute to this project.

Advanced Guides

Refer to How Tos for more advanced topics, such as building the shared library objects.

FAQ

Q: How does VectorGrep compare to other Vectorscan/Hyperscan python libraries?

A: VectorGrep has a specific goal: provide a high-performance "grep"-like interface in Python, but with more control. It is not intended to be a full set of bindings to Vectorscan/Hyperscan. If you need full control over the low-level backend, there are other Python libraries intended for that use case. Here are a few of the reasons for the focused goal of this library:

  • Simplify developer integration.
    • No experience with Vectorscan/Hyperscan required.
    • Familiarity with grep variants beneficial, but not required.
  • Avoid messy subprocess chains common in "parallel grep" implementations.
    • Commands like zgrep are actually a zcat + grep. This can lead to 3+ processes per file read.
    • Subprocessing is messy in general; it is best to minimize its use as much as possible.
  • Optimize performance.
    • Reduce callbacks to/from Python to reduce overhead.
    • Allow true multithreading during read and regex matching.
    • Provide the pattern matched in multi-regex searches, without having to repeat the search in Python.
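
To make the subprocess point concrete, here is a minimal sketch of searching a gzip file inside a single process, with no zcat or grep child processes. It uses only the standard library `gzip` and `re` modules and is not VectorGrep's implementation (which performs the read and regex work outside Python); `grep_gzip` is a hypothetical name.

```python
import gzip
import os
import re
import tempfile


def grep_gzip(path: str, pattern: str) -> list:
    """Return (line_number, line) pairs for matching lines, decompressing in-process."""
    regex = re.compile(pattern)
    hits = []
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


# Build a small compressed sample so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.log.gz")
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        handle.write("alpha\npattern here\nbeta\n")
    hits = grep_gzip(sample, "pattern")

print(hits)
```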

The following benchmark illustrates the benefit of this design. Because of the performance of Vectorscan/Hyperscan, VectorGrep is also often faster than native grep variants, even though it is driven from Python. Scenario setup:

  • 2.10GHz Intel x86_64 Processor
  • ~17M line file (~300M gzip compressed, ~3G uncompressed).
  • ~800 PCRE patterns.
  • Counting only, no extra processing of lines.
  • Each job was run 5 times and the results averaged (lower is better).

| Scenario | Description (uncompressed timings in parentheses) | VectorGrep | Full bindings | zgrep (grep) |
|---|---|---|---|---|
| 1 | ~90K matches, 1 pattern | 8.2s (2.5s) | 22.8s (15.5s) | 12.5s (5.2s) |
| 2 | ~900K matches, 10 patterns | 9.7s (3.8s) | 25.7s (16.8s) | 19.8s (17.3s) |
| 3 | ~15M matches, ~800 patterns | 44.2s (38.1s) | 73.5s (57.7s) | * |
| 4 | Scenario #3 (x4 files), 1 process (4 threads) | 49.6s (46.8s) | 1432.6s (1302.2s) | * |

* GNU grep does not allow multiple PCRE patterns natively, and concatenation via "or" failed.
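
The relative differences are easier to read as ratios. The short sketch below copies the compressed-file timings from the table above and divides each by the VectorGrep time for the same scenario; `speedups` is a hypothetical helper, and the zgrep entries for scenarios 3 and 4 are omitted because no timing was recorded.

```python
# Compressed-file timings in seconds, copied from the table above.
timings = {
    1: {"VectorGrep": 8.2, "Full bindings": 22.8, "zgrep": 12.5},
    2: {"VectorGrep": 9.7, "Full bindings": 25.7, "zgrep": 19.8},
    3: {"VectorGrep": 44.2, "Full bindings": 73.5},
    4: {"VectorGrep": 49.6, "Full bindings": 1432.6},
}


def speedups(row: dict) -> dict:
    """Return how many times longer each alternative took than VectorGrep."""
    base = row["VectorGrep"]
    return {
        name: round(seconds / base, 1)
        for name, seconds in row.items()
        if name != "VectorGrep"
    }


for scenario, row in timings.items():
    print(scenario, speedups(row))
```

In scenario 1, for example, the full bindings took roughly 2.8 times as long as VectorGrep and zgrep roughly 1.5 times as long; in scenario 4 the ratio for the full bindings rises to roughly 28.9.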

Q: How do I make a custom build for a system other than Linux x86_64?

A: Refer to How To: Build the libraries for different architectures

Q: I only have an ARM CPU, can I build/run the x86_64 libraries?

A: It depends. The current production build supports native x86_64 CPUs, as well as virtualized ones (in most scenarios). For example, if you are on a Mac M1/M2/etc., you can use Docker and a supported image in x64 mode with --platform linux/amd64. Performance may vary, however, because the code runs through virtual machine emulation. This process can also be used to build new libraries if your system is set up properly for emulation. Refer to How To: Build the libraries for different architectures for more information about supporting additional environments (natively or through emulation) besides Linux x86_64.

Q: Why was Vectorscan forked from Hyperscan?

A: Vectorscan was originally created to provide a portable fork of Hyperscan, and allow running on other architectures such as ARM. Intel changed the license of Hyperscan from BSD to IPL (Intel Proprietary License) starting in 5.5, while Vectorscan continues to provide updates and remain fully open source. For more information:
Vectorscan: Why was there a need for a fork?
Vectorscan: Hyperscan license change

Q: Why is VectorGrep not a fork of HyperGrep?

A: HyperGrep receives maintenance updates, but over time it will diverge from VectorGrep and eventually stop being updated, due to the licensing changes Intel made to Hyperscan. To keep the responsibilities of each project clearly separated, and to avoid any confusion about backports or feature requests, it was decided to make a "clean cut" from HyperGrep instead of using a "fork". There are no plans to backport any features from VectorGrep to HyperGrep. VectorGrep starts from HyperGrep commit 9c6f2b2. The original commit history can be found in HyperGrep History.
