
Build Efficient RegExp Parsing Engines. Quickly process massive structured and semi-structured text files.


PyReParse

  • Python Report Parser

PyReParse is a library that eases the development of parsing engines for large, complex structured text reports. I had such a need when I was tasked with parsing a financial institution's archived transaction reports after the databases that held the data no longer existed; the report forms were the only remaining source from which to re-create the original database. Regular expressions were used to find and capture field values, and validation calculations ensured that the data going into the database was complete and accurate.

PyReParse has been used to process large legacy archived reports in commercial banking.

Regexp pattern-based events trigger callbacks to...

  • Perform accumulations by customer.
  • Validate values.
  • Report on customers with specific report-value conditions.

Changes in v0.0.4

  • Added Money Handling

    • Money should be handled using Decimal rather than float to remove the possibility of rounding errors.
    • To use the money type for a field that is captured during parsing...
      • Use the m_flds[] dictionary when defining the field name and what it captures: m_flds['nsf_fee'] = ...
  • Added Section and nested Subsection detection and counting.

  • Added a validate_re_defs() function that validates the data structure of patterns that PyReParse uses to parse structured text.

  • Added parallel execution by section

    • parse_file()
    • parse_file_parallel(file_path: str, max_workers: int = 4, parallel_depth: int = 1) -> List[Dict[str, Any]]
  • Added regexp caching so that regexps are compiled only once.

  • Implemented a streaming API mode for low-memory systems

    • This overcomes a potential memory limitation of parse_file() and parse_file_parallel(): those functions collect captured data into an in-memory structure, and for extremely large files that structure can itself become quite large. stream_matches() does not accumulate a data structure; instead, it operates on the data as it streams in, much as the PyReParse.match() function already does. When using stream_matches(), implement all processing logic in callbacks.

Benefits...

  • The use of a standard data structure for holding regular expressions.

Associated with each regexp are additional flags and fields that help reduce the number of times a given regexp is executed.

  • Regexp processing can be expensive. The goal is to run regexp matches only when they are needed. So if you know that the pattern for regexp A always occurs before regexp B, you can use the data structure to specify that regexp B should not be used until after regexp A triggers.
  • All regular expressions and their associated properties are in one data structure.
  • Additional benefits include the ability to cross-check a non-matching line with a simpler regexp that can catch lines that should have matched but did not, whether because the main regexp needs tweaking or because the input line is corrupt.
  • Logic for counting report lines and sections within a report.
  • PyReParse uses named-capture-groups and returns captured values in a dictionary. This eases the ability to capture values for transformation and storage.
  • One can associate a RegExp pattern to a callback so that one can perform custom calculations, validations, and transformations to the captured values of interest.
  • Supports exact decimal arithmetic for financial data via money2decimal to avoid float precision errors.
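The named-capture-group and quick-check ideas above can be illustrated with the standard `re` module alone (a standalone sketch, not PyReParse's API; the pattern, field name, and sample lines are invented for illustration):

```python
import re

# Main pattern: named capture groups land in a dictionary via groupdict().
main_re = re.compile(r'^NSF\s+FEE\s+(?P<nsf_fee>[\d,]+\.\d{2})$')

# Simpler quick-check: catches lines that *look* like fee lines but
# failed the stricter main pattern (a regexp bug or a corrupt line).
quick_re = re.compile(r'NSF\s+FEE')

def check_line(line):
    m = main_re.match(line)
    if m:
        return m.groupdict()          # e.g. {'nsf_fee': '1,234.56'}
    if quick_re.search(line):
        print(f'WARNING: possible missed match: {line!r}')
    return None

check_line('NSF FEE 1,234.56')        # captured
check_line('NSF FEE  $1,234.56')      # quick-check warns: '$' breaks main_re
check_line('BALANCE 99.00')           # neither matches; silently ignored
```

PyReParse wires the same idea into its pattern data structure, so the warning fires automatically during parsing.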

Installation...

# Use pip...
pip install pyreparse

# Pipenv...
pipenv install pyreparse

Hint: An LLM can help you get started...

As creating the parsing-rules data structure can be a bit daunting for someone new to regular expressions, it makes sense to feed PyReParse's docs and example program to an LLM, along with samples of report text and a description of the fields and calculations you want to perform. That is a great way to get started.

Basic Usage Pattern

  1. Set up the data structure of named regexp patterns with named capture groups, along with associated properties (see example in test code...):
    1. Flags:
      1. Only once
      2. Start of Section
    2. Trigger ON/OFF
      1. trigger matching on or off based on another named regexp
    3. Optional Quick-Check RegExp
      1. If the current named regexp fails to match the given line, the quick-check regexp is run; if it matches, PyReParse warns that a regexp may have missed a line. Either the named regexp is wrong, or the quick-check produced a false positive.
    4. callback(<PyReParse_Instance>, <regexp_pattern_name>)
      1. On match, run the stated callback function to perform validations and processing logic. In fact, all processing logic can be implemented within callbacks.
      2. The Callback function is called when a match occurs and after fields have been captured.
      3. Callbacks can be used for field validation and event correlation, as the PyReParse instance (which contains the state of all regexps/fields) is available to the callback.
    5. Write the document processing logic...
      1. If all processing logic is implemented as callbacks, the main logic would look like...
        1. # Import PyReParse
          from pyreparse import PyReParse as PRP
          
          # Define callback functions...
          def on_pattern_001(prp_instance, pat_name):
             if pat_name != 'pattern_001':
                print(f'Got wrong pattern name [{pat_name}].')
          
          # Define our Regular Expression Patterns Data Structure...
          regexp_pats = {
             'pattern_001': {
                INDEX_RE_STRING: r'^Test\s+Pattern\s+(?P<pat_val>\d+)',
                <INDEX_RE...>: 'value',
                <INDEX_RE...>: 'value',
                INDEX_RE_CALLBACK: on_pattern_001,
                   ...
             },
             ...
          }
          
          # Create an instance of PyReParse
          prp = PRP(regexp_pats)
          
          # Open the input file...
          with open(file_path, 'r') as txt_file:
          
             # Process each line of the input file...
             for line in txt_file:
          
                # This calls prp.match(<input_line>) to process the line
                # against our data structure of regexp patterns.
                match_def, matched_fields = prp.match(line)
          
      2. With or without a callback, you can trigger logic when named-regexp fields match, using (see tests as an example)...
        1. ...
          
          # Open the input file...
          with open(file_path, 'r') as txt_file:
          
             # Process each line of the input file...
             for line in txt_file:
          
                # This calls prp.match(<input_line>) to process the line
                # against our data structure of regexp patterns.
                match_def, matched_fields = prp.match(line)
          
                # Then, we have logic based on which pattern matched,
                # and/or values in captured fields...
                if match_def[0] == 'pattern_001':
                   ...
                elif match_def[0] == 'pattern_002':
                   ...
          

Please check out pyreparse_example.py; you can use this code as a template to guide you in creating your own parsing engine.

Precise Money Handling with Decimal

For precise money handling, use money2decimal() instead of money2float() to convert captured strings to decimal.Decimal. Import Decimal (from decimal import Decimal) and update sums and validations accordingly.

Example:

elif match_def == ['tx_line']:
    m_flds = matched_fields
    m_flds['nsf_fee'] = prp.money2decimal('nsf_fee', m_flds['nsf_fee'])
    # Similar for other fields like 'tx_amt', 'balance'
    txn_lines.append(m_flds)

# Validation sum
nsf_tot = Decimal('0')
for flds in txn_lines:
    nsf_tot += flds['nsf_fee']
if nsf_tot == grand_total:
    print(f'*** Section [{prp.section_count}] Parsing Completed.')
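The motivation for Decimal over float can be seen directly in the standard library (nothing PyReParse-specific here):

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so repeated additions drift away from the expected total...
float_total = sum([0.10] * 3)
print(float_total == 0.30)               # False (0.30000000000000004)

# Decimal arithmetic is exact for decimal amounts, so money sums
# compare cleanly against report grand totals.
dec_total = sum([Decimal('0.10')] * 3, Decimal('0'))
print(dec_total == Decimal('0.30'))      # True
```

This is why a validation like the nsf_tot comparison above is only reliable when every term is a Decimal.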

Parallel Section Processing

For large reports with many independent sections (e.g., 2500+ NSF sections), use parse_file_parallel(file_path, max_workers=4, parallel_depth=1):

  • Automatically detects section boundaries (NEW_SECTION/END_OF_SECTION).
  • Processes chunks in a ThreadPoolExecutor (order preserved).
  • Returns a List[Dict] with section_start, fields_list (all matches), and a totals stub.
  • parallel_depth=1: top-level sections run in parallel (subsections serial); >1: recurses into subsections.

CLI in example: python src/pyreparse/example/pyreparse_example.py file.txt --parallel-sections 1

Example:

sections = prp.parse_file_parallel('report.txt')
for sec in sections:
    print(f"Section {sec['section_start']}: {len(sec['fields_list'])} matches")

Performance: expect a 2-4x speedup on multi-core systems. Tests verify that serial and parallel results are identical.

Streaming for Large Files

For very large files where loading the entire report into memory is impractical, use streaming methods like stream_matches() or parse_file_stream() to process line-by-line or section-by-section without buffering the full content.

  • stream_matches(file_path, callback=None): Yields individual (match_def, fields) tuples for each line, or calls a provided callback. Ideal for real-time processing or low-memory event-driven parsing.
  • parse_file_stream(file_path, callback=None): Yields complete sections as dicts (similar to parse_file()), or calls a callback per section. Processes boundaries serially but streams content.

Memory benefits: these methods read the file iteratively rather than loading it whole, so memory usage stays roughly constant regardless of file size, unlike parse_file(), which builds full in-memory lists. Use them for GB-scale reports.

Example:

# Stream individual matches
for match_def, fields in prp.stream_matches('large_report.txt'):
    if match_def:
        print(f"Matched {match_def}: {fields}")

# Stream sections with callback
def process_section(sec):
    print(f"Section {sec['section_start']}: {len(sec['fields_list'])} items")

list(prp.parse_file_stream('large_report.txt', callback=process_section))

CLI in example: python src/pyreparse/example/pyreparse_example.py file.txt --stream

The PyReParse Data Structure of Patterns


Patterns Validation

PyReParse automatically validates the patterns dictionary in load_re_lines() via validate_re_defs():

Checks Performed:

  • Each pattern requires INDEX_RE_STRING (non-empty string).
  • INDEX_RE_FLAGS: Must be non-negative integer using only defined flags.
  • INDEX_RE_TRIGGER_ON/INDEX_RE_TRIGGER_OFF: Valid Python syntax after symbol/variable replacement (AST-checked).
  • Trigger dependencies: No cycles in {pattern_name} graph (DAG enforced).
  • FLAG_NEW_SUBSECTION: trigger_on must contain {parent_pattern} reference.

Errors: raises ValueError for structural, flag, cycle, or orphan-subsection problems, and TriggerDefException for trigger-syntax errors.

See tests/test_pyreparse.py::TestPyReParse.test_validate_re_defs* for examples.

This ensures robust configuration before compilation/processing.
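The cycle check on trigger dependencies can be pictured as a standard depth-first search for back edges (a simplified sketch of the idea only, not PyReParse's internal code; the pattern names and the shape of the `deps` mapping are invented):

```python
# Each pattern maps to the set of pattern names its triggers reference.
# A cycle (A waits on B, B waits on A) could never fire, so the
# dependency graph must be a DAG.
def find_cycle(deps):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {name: WHITE for name in deps}

    def visit(name):
        color[name] = GRAY
        for dep in deps.get(name, ()):
            if color.get(dep) == GRAY:      # back edge -> cycle
                return True
            if color.get(dep) == WHITE and visit(dep):
                return True
        color[name] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)

ok_deps  = {'header': set(), 'tx_line': {'header'}, 'total': {'tx_line'}}
bad_deps = {'a': {'b'}, 'b': {'a'}}
print(find_cycle(ok_deps))   # False
print(find_cycle(bad_deps))  # True
```

validate_re_defs() performs this kind of check up front so a bad trigger graph fails fast, before any lines are parsed.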

Flags

Coding Triggers...

A trigger is a line of logic that references counters or pattern names. Triggers can use the full depth of Python expressions and are compiled to callback functions for efficiency. A trigger's sole purpose is to return True or False. For trigger-on, the expression should return True when the regexp pattern should be evaluated against the current and following lines. For trigger-off, the expression should return True when the pattern should no longer be evaluated against the current and subsequent lines.

< Counters >

Counters are symbolic names that are enclosed in less-than and greater-than signs.

The following is the current list of supported report counters...

  • <REPORT_LINE>
    • The <REPORT_LINE> counter increments by 1 for each line that the match() method is called on.
  • <SECTION_NUMBER>
    • The <SECTION_NUMBER> counter increments by 1 for each time a match() occurs on a pattern that has the flag PyReParse.FLAG_NEW_SECTION.
  • <SECTION_LINE>
    • The <SECTION_LINE> counter increments by 1 for each line that the match() method is called on that is part of a section.

All counters start at 0.

{Pattern_Names}

Pattern names are symbolic references to the regexp patterns in the current PyReParse data structure. Each pattern may be associated with triggers that tell the matcher when or when not to attempt a match on a given pattern. Triggers improve the efficiency of regexp processing by reducing the number of regular expressions executed on any given line, which can be very effective when processing a huge number of documents. In a trigger expression, a pattern name evaluates to True if the pattern has been matched since the last new section, and False otherwise. A new section occurs when a pattern with the flag PyReParse.FLAG_NEW_SECTION matches the current line, which resets the section counters.
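To make the counter and pattern-name substitution concrete, here is a minimal sketch of how such a trigger expression could be evaluated (an illustration of the mechanism only; PyReParse compiles triggers differently, and the expression and state names shown are invented):

```python
import re

def eval_trigger(expr, counters, matched):
    """Replace <COUNTER> and {pattern_name} symbols with Python
    names, then evaluate the resulting boolean expression."""
    env = dict(counters)
    env.update(matched)
    py = re.sub(r'[<{](\w+)[>}]', r'\1', expr)   # <X> or {x} -> bare name
    return bool(eval(py, {'__builtins__': {}}, env))

# Example trigger: start matching transaction lines only after the
# section header has matched and we are past line 2 of the section.
trigger_on = '{section_header} and <SECTION_LINE> > 2'
state = {'REPORT_LINE': 10, 'SECTION_LINE': 3}
print(eval_trigger(trigger_on, state, {'section_header': True}))   # True
print(eval_trigger(trigger_on, state, {'section_header': False}))  # False
```

In PyReParse itself, the substitution and compilation happen once up front (see validate_re_defs() above), so the per-line cost is just a function call.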

Coding Callbacks...

You may also code callbacks that are executed when a pattern matches. The callback function is called after the fields have been captured and is passed the PyReParse instance and the name of the pattern that matched. The callback can then use the instance to access any currently captured fields and perform processing logic or field-value updates.


License...

Apache 2.0

