Build Efficient RegExp Parsing Engines. Quickly process massive structured and semi-structured text files.
PyReParse
- Python Report Parser
PyReParse is a library that eases the development of parsers for huge, complex structured reports.
It helps you create parsing engines for formatted text reports. I had such a need when I was tasked with parsing a financial institution's archived transaction reports after the databases that held the data no longer existed. The report forms were the only data available for re-creating the original database. Regular expressions were used to find and capture field values, and validation calculations were needed to ensure that the data going into the database was complete and accurate.
It has been used for processing large legacy archived reports in commercial banking.
Regexp pattern-based events trigger call-backs for...
- Accumulations by customer.
- Validation of values.
- Reporting on customers with specific report-value conditions.
Changes in v0.0.4
- Added Money Handling
  - Money should be handled using Decimal rather than float to remove the possibility of rounding errors.
  - To use the money type for a field that is captured during parsing...
    - Use the m_flds[] dictionary when defining the field name and what it captures:
      m_flds['nsf_fee'] = ...
- Added Section and nested Subsection detection and counting.
- Added the validate_re_defs() function, which validates the data structure of patterns that PyReParse uses to parse the structured text.
- Added parallel execution by section
  - parse_file()
  - parse_file_parallel(file_path: str, max_workers: int = 4, parallel_depth: int = 1) -> List[Dict[str, Any]]
- Added regexp caching so that regexps are compiled only once.
- Implemented a streaming API mode for low-memory systems
  - This overcomes the potential memory limitation of parse_file() and parse_file_parallel(), which collect the captured data into an in-memory structure. If the files are extremely large and contain a lot of data, that structure can become quite large. stream_matches(), by contrast, does not accumulate the data in memory; it operates on the data as it streams in, much like PyReParse.match() already does. When using stream_matches(), all processing logic should therefore run as the matches stream in, for example in callbacks.
Benefits...
- The benefits of using PyReParse include...
- The use of a standard data structure for holding regular expressions.
Associated to the regexp are additional flags and fields that help to reduce the number of times a given regexp is executed.
- Regexp processing can be expensive. The goal is to run regexp matches only when they are needed. So if you know that the pattern for regexp A always occurs before regexp B, you can use the data structure to specify that regexp B should not be used until after regexp A triggers.
- All regular expressions and their associated properties are in one data structure.
- Additional benefits include the ability to cross-check a non-matching line with a simpler regexp that can catch lines that should have matched but did not, due to a need to tweak the main regexp, or possibly a corrupt input line.
- Logic for counting report lines and sections within a report.
- PyReParse uses named-capture-groups and returns captured values in a dictionary. This eases the ability to capture values for transformation and storage.
- One can associate a RegExp pattern to a callback so that one can perform custom calculations, validations, and transformations to the captured values of interest.
- Supports exact decimal arithmetic for financial data via money2decimal() to avoid float precision errors.
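The trigger idea in the list above can be illustrated with nothing but the standard `re` module. This is a minimal sketch, not PyReParse's implementation: the pattern names, the `PATTERNS` layout, and the `after` key are hypothetical and exist only to show how a regexp can be skipped until another pattern has matched, and how named capture groups come back as a dictionary.

```python
import re

# Hypothetical rule table: 'after' names a pattern that must match first.
PATTERNS = {
    "header": {"re": re.compile(r"^ACCOUNT\s+(?P<acct>\d+)"), "after": None},
    "tx_line": {
        "re": re.compile(r"^\s+(?P<date>\d{2}/\d{2})\s+(?P<amt>[\d.]+)"),
        "after": "header",
    },
}

def match_line(line, seen):
    """Try only the patterns whose prerequisite has already matched."""
    for name, rule in PATTERNS.items():
        if rule["after"] is not None and rule["after"] not in seen:
            continue  # gate closed: the regexp is not executed at all
        m = rule["re"].match(line)
        if m:
            seen.add(name)
            return name, m.groupdict()  # named groups as a dict
    return None, {}

seen = set()
lines = ["  01/02  15.00", "ACCOUNT 12345", "  01/03  20.50"]
results = [match_line(line, seen) for line in lines]
```

The first line looks like a transaction but is never tested against `tx_line`, because `header` has not matched yet; that skipped work is the saving the trigger mechanism aims for.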
Installation...
# Use pip...
pip install pyreparse
# Pipenv...
pipenv install pyreparse
Hint: An LLM can help you get started...
Creating the parsing-rules data structure can be a bit daunting for someone new to regular expressions. It makes sense to feed PyReParse's docs and example program to an LLM, along with examples of the report text and a description of the details and calculations that you want to perform. That would be a great way to get started.
Basic Usage Pattern
- Set up the data structure of named regexp patterns (with named capture groups), along with their associated properties (see the example in the test code...):
  - Flags:
    - Only once
    - Start of Section
  - Trigger ON/OFF
    - Turn matching on or off based on another named regexp.
  - Optional Quick-Check RegExp
    - If the current named regexp fails to match the given line, the quick-check regexp is run. If it matches, a warning is issued that a regexp may have missed a line. Either the named regexp is wrong, or the quick-check produced a false positive.
  - callback(<PyReParse_Instance>, <regexp_pattern_name>)
    - On match, run the stated callback function to perform validations and processing logic. In fact, all processing logic can be implemented within callbacks.
    - The callback function is called when a match occurs and after fields have been captured.
    - Callbacks can be used for field validation and event correlation, as the PyReParse instance (which contains the states of all regexps/fields) is available to the callback.
- Write the document processing logic...
  - If all processing logic is implemented as callbacks, the main logic would look like...

        # Import PyReParse
        from pyreparse import PyReParse as PRP

        # Define callback functions...
        def on_pattern_001(prp_instance, pat_name):
            if pat_name != 'pattern_001':
                print(f'Got wrong pattern name [{pat_name}].')

        # Define our Regular Expression Patterns Data Structure...
        regexp_pats = {
            'pattern_001': {
                PRP.INDEX_RE_STRING: r'^Test\s+Pattern\s+(?P<pat_val>\d+)',
                <INDEX_RE...>: 'value',
                <INDEX_RE...>: 'value',
                PRP.INDEX_RE_CALLBACK: on_pattern_001,
                ...
            },
            ...
        }

        # Create an instance of PyReParse
        prp = PRP(regexp_pats)

        # Open the input file...
        with open(file_path, 'r') as txt_file:
            # Process each line of the input file...
            for line in txt_file:
                # Call prp.match(<input_line>) to process the line
                # against our data structure of regexp patterns.
                match_def, matched_fields = prp.match(line)
  - With or without callbacks, you can trigger logic when named-regexp fields match (see the tests for an example)...

        ...
        # Open the input file...
        with open(file_path, 'r') as txt_file:
            # Process each line of the input file...
            for line in txt_file:
                # Call prp.match(<input_line>) to process the line
                # against our data structure of regexp patterns.
                match_def, matched_fields = prp.match(line)

                # Skip lines that matched no pattern.
                if not match_def:
                    continue

                # Then, we have logic based on which pattern matched,
                # and/or values in captured fields...
                if match_def[0] == 'pattern_001':
                    ...
                elif match_def[0] == 'pattern_002':
                    ...
Please check out pyreparse_example.py; you can use this code as a template to guide you in creating your own parsing engine.
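The quick-check idea from the usage pattern can be sketched independently of PyReParse. The two patterns and the `match_with_quick_check` helper below are hypothetical and use only the standard `re` module; they show how a loose second regexp can flag a line that the strict regexp should probably have matched.

```python
import re

# Hypothetical strict pattern and a looser quick-check for the same line type.
FULL_RE = re.compile(r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<amt>\d+\.\d{2})$")
QUICK_RE = re.compile(r"^\d{2}/\d{2}/")

def match_with_quick_check(line):
    """Return (fields, warning); warning is set when only the quick-check hits."""
    m = FULL_RE.match(line)
    if m:
        return m.groupdict(), None
    if QUICK_RE.match(line):
        # Looks like a transaction line, but the strict regexp rejected it.
        return None, f"possible missed line: {line!r}"
    return None, None

ok = match_with_quick_check("01/02/2020 15.00")
missed = match_with_quick_check("01/02/2020 15.0O")  # letter O instead of zero
other = match_with_quick_check("PAGE 3")
```

A warning here means either the strict regexp needs adjusting or the input line is corrupt; unrelated lines such as page headers produce neither a match nor a warning.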
Precise Money Handling with Decimal
For precise money handling, use money2decimal() instead of money2float() to convert captured strings to decimal.Decimal. Import from decimal import Decimal and update sums/validations accordingly.
Example:
    from decimal import Decimal

    ...
    elif match_def == ['tx_line']:
        m_flds = matched_fields
        m_flds['nsf_fee'] = prp.money2decimal('nsf_fee', m_flds['nsf_fee'])
        # Similar for other fields like 'tx_amt', 'balance'
        txn_lines.append(m_flds)
    ...

    # Validation sum
    nsf_tot = Decimal('0')
    for flds in txn_lines:
        nsf_tot += flds['nsf_fee']
    if nsf_tot == grand_total:
        print(f'*** Section [{prp.section_count}] Parsing Completed.')
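To see why Decimal matters for validation sums, here is a small standalone sketch using only the standard library. The `parse_money` helper is hypothetical and is not PyReParse's `money2decimal()`; it only illustrates the string-to-Decimal step.

```python
from decimal import Decimal

def parse_money(text):
    """Hypothetical helper: '$1,234.50' -> Decimal('1234.50')."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)

fees = ["$0.10", "$0.20"]

# Float arithmetic: 0.1 and 0.2 have no exact binary representation.
float_total = sum(float(f.replace("$", "")) for f in fees)

# Decimal arithmetic: the sum is exact.
decimal_total = sum((parse_money(f) for f in fees), Decimal("0"))

print(float_total == 0.3)                 # False
print(decimal_total == Decimal("0.30"))   # True
```

An equality check against a report's grand total, like the one in the example above, is only reliable in the second form.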
Parallel Section Processing
For large reports with many independent sections (e.g., 2500+ NSF sections), use parse_file_parallel(file_path, max_workers=4, parallel_depth=1):
- Automatically detects section boundaries (NEW_SECTION/END_OF_SECTION).
- Processes chunks in ThreadPoolExecutor (order preserved).
- Returns List[Dict] with section_start, fields_list (all matches), and a totals stub.
- parallel_depth=1: top-level sections run in parallel (subsections serial); >1: recurse into subsections.
CLI in example: python src/pyreparse/example/pyreparse_example.py file.txt --parallel-sections 1
Example:
    sections = prp.parse_file_parallel('report.txt')
    for sec in sections:
        print(f"Section {sec['section_start']}: {len(sec['fields_list'])} matches")
Performance: 2-4x speedup on multi-core machines. The tests verify that serial and parallel parsing produce the same results.
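The general shape of section-parallel processing can be sketched with the standard library alone. This is an illustration under stated assumptions, not the internals of parse_file_parallel(): the `SECTION` marker, `split_sections`, and `process_section` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def split_sections(lines, start_marker="SECTION"):
    """Group lines into sections; a line starting with start_marker opens a new one."""
    sections = []
    for line in lines:
        if line.startswith(start_marker) or not sections:
            sections.append([])
        sections[-1].append(line)
    return sections

def process_section(section):
    """Stand-in for per-section parsing: report the header and line count."""
    return {"section_start": section[0], "line_count": len(section)}

lines = ["SECTION A", "a1", "a2", "SECTION B", "b1"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map returns results in input order, whatever the completion order.
    results = list(pool.map(process_section, split_sections(lines)))
```

Because `Executor.map` yields results in the order of its inputs, the output list lines up with the sections as they appear in the report, which is what "order preserved" refers to.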
Streaming for Large Files
For very large files where loading the entire report into memory is impractical, use streaming methods like stream_matches() or parse_file_stream() to process line-by-line or section-by-section without buffering the full content.
- stream_matches(file_path, callback=None): Yields individual (match_def, fields) tuples for each line, or calls a provided callback. Ideal for real-time processing or low-memory event-driven parsing.
- parse_file_stream(file_path, callback=None): Yields complete sections as dicts (similar to parse_file()), or calls a callback per section. Processes boundaries serially but streams content.
Memory benefits: These methods read the file iteratively (via open() and readlines() slices or line-by-line), avoiding full file loads. Use for GB-scale reports; memory usage stays constant regardless of file size, unlike parse_file() which builds full in-memory lists.
Example:
    # Stream individual matches
    for match_def, fields in prp.stream_matches('large_report.txt'):
        if match_def:
            print(f"Matched {match_def}: {fields}")

    # Stream sections with callback
    def process_section(sec):
        print(f"Section {sec['section_start']}: {len(sec['fields_list'])} items")

    list(prp.parse_file_stream('large_report.txt', callback=process_section))
CLI in example: python src/pyreparse/example/pyreparse_example.py file.txt --stream
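The memory behavior described above comes from the generator style of processing. The following is a minimal sketch of that style in plain Python, not the implementation of stream_matches(); `TX_RE` and `stream_fields` are hypothetical names.

```python
import io
import re

TX_RE = re.compile(r"^(?P<cust>\w+)\s+(?P<amt>\d+\.\d{2})$")  # hypothetical pattern

def stream_fields(handle):
    """Yield one dict of captured fields per matching line, one line at a time."""
    for line in handle:  # file objects iterate lazily, line by line
        m = TX_RE.match(line.rstrip("\n"))
        if m:
            yield m.groupdict()

# io.StringIO stands in for an open file so the example is self-contained.
report = io.StringIO("alice 10.00\nnot a match\nbob 2.50\n")
count = 0
customers = []
for fields in stream_fields(report):
    count += 1
    customers.append(fields["cust"])
```

Only the current line and its captured fields are held at any moment; whatever the consumer chooses to keep (here, a short list of names) is the only state that grows.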
The PyReParse Data Structure of Patterns
Patterns Validation
PyReParse automatically validates the patterns dictionary in load_re_lines() via validate_re_defs():
Checks Performed:
- Each pattern requires INDEX_RE_STRING (non-empty string).
- INDEX_RE_FLAGS: Must be a non-negative integer using only defined flags.
- INDEX_RE_TRIGGER_ON/INDEX_RE_TRIGGER_OFF: Valid Python syntax after symbol/variable replacement (AST-checked).
- Trigger dependencies: No cycles in the {pattern_name} graph (DAG enforced).
- FLAG_NEW_SUBSECTION: trigger_on must contain a {parent_pattern} reference.
Errors: Raises ValueError (structural/flags/cycles/orphan subs) or TriggerDefException (syntax).
See tests/test_pyreparse.py::TestPyReParse.test_validate_re_defs* for examples.
This ensures robust configuration before compilation/processing.
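Two of the checks listed above, the AST syntax check and the cycle check, can be sketched with the standard library. This is an illustration of the techniques only, not the code of validate_re_defs(): the function names are hypothetical, and the sketch raises SyntaxError rather than the library's own exceptions.

```python
import ast
import re

def check_trigger_syntax(expr):
    """Replace <COUNTER> and {pattern} symbols with plain names, then parse."""
    plain = re.sub(r"<(\w+)>|\{(\w+)\}", lambda m: m.group(1) or m.group(2), expr)
    ast.parse(plain, mode="eval")  # raises SyntaxError on invalid expressions

def find_cycle(deps):
    """Return True if the {pattern: [patterns it depends on]} graph has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {name: WHITE for name in deps}

    def visit(node):
        color[node] = GREY
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)

check_trigger_syntax("<SECTION_LINE> > 2 and {header}")   # parses cleanly
has_cycle = find_cycle({"a": ["b"], "b": ["a"]})          # a -> b -> a
no_cycle = find_cycle({"header": [], "tx_line": ["header"]})
```

Running both checks before any line is parsed means a malformed trigger or a circular dependency surfaces at load time instead of silently disabling a pattern mid-report.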
Flags
Coding Triggers...
A trigger is a line of logic that references counters or pattern names. Triggers can use the full range of Python expressions, and are compiled to a callback function for efficiency. The purpose of a trigger is simply to return True or False. A trigger-on expression should return True when the RegExp pattern is to be evaluated against the current and following lines. A trigger-off expression should return True when the pattern should no longer be evaluated against the current and subsequent lines.
< Counters >
Counters are symbolic names that are enclosed in less-than and greater-than signs.
The following is the current list of supported report counters...
- <REPORT_LINE>
- The <REPORT_LINE> counter increments by 1 for each line that the match() method is called on.
- <SECTION_NUMBER>
- The <SECTION_NUMBER> counter increments by 1 for each time a match() occurs on a pattern that has the flag PyReParse.FLAG_NEW_SECTION.
- <SECTION_LINE>
- The <SECTION_LINE> counter increments by 1 for each line that the match() method is called on that is part of a section.
All counters start at 0.
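The counter rules above can be modeled in a few lines. This is a hypothetical stand-in, not PyReParse's bookkeeping; in particular, it assumes that the line which opens a section counts as that section's first line, which the text above does not state.

```python
class Counters:
    """Hypothetical stand-in for the three report counters; all start at 0."""

    def __init__(self):
        self.report_line = 0
        self.section_number = 0
        self.section_line = 0

    def on_line(self, starts_new_section):
        """Update the counters for one processed line."""
        self.report_line += 1
        if starts_new_section:
            self.section_number += 1
            self.section_line = 0  # section counters reset on a new section
        if self.section_number > 0:
            self.section_line += 1  # only lines inside a section are counted

c = Counters()
# One preamble line, then a three-line section, then a two-line section.
for is_new in [False, True, False, False, True, False]:
    c.on_line(is_new)
```

After the loop, the report counter has seen all six lines, two sections have started, and the section-line counter reflects only the second section.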
{Pattern_Names}
Pattern names are symbolic references to the RegExp patterns in the current PyReParse data structure. Each pattern may be associated with triggers that tell the matcher when, or when not, to execute a match on that pattern. Triggers improve the efficiency of RegExp processing by reducing the number of regular expressions that are executed on any given line. This can be very effective when processing a huge number of documents. A pattern name evaluates to True if the pattern has been matched, and False if it has not been matched, since the last "NEW_SECTION". A "New Section" occurs when a pattern that has the flag PyReParse.FLAG_NEW_SECTION matches the current line, and it triggers the reset of section counters.
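One way to turn a trigger string with <COUNTER> and {pattern_name} symbols into a callable is textual substitution followed by `compile()`. This is a sketch of that general technique, not PyReParse's trigger compiler; `compile_trigger` and the two dictionaries are hypothetical. It uses `eval`, so it is only appropriate for trigger strings that come from trusted configuration.

```python
import re

def compile_trigger(expr):
    """Turn a trigger such as '{header} and <SECTION_LINE> > 1' into a function."""
    # <NAME> becomes a counter lookup, {name} becomes a matched-state lookup.
    code = re.sub(r"<(\w+)>", r"counters['\1']", expr)
    code = re.sub(r"\{(\w+)\}", r"matched.get('\1', False)", code)
    compiled = compile(code, "<trigger>", "eval")  # compile once, evaluate often

    def trigger(counters, matched):
        return bool(eval(compiled, {"counters": counters, "matched": matched}))

    return trigger

trigger_on = compile_trigger("{header} and <SECTION_LINE> > 1")
matched = {"header": True}
print(trigger_on({"SECTION_LINE": 2}, matched))  # True
matched.clear()  # what a new section would do to the matched state
print(trigger_on({"SECTION_LINE": 2}, matched))  # False
```

Compiling the expression once and reusing the code object keeps the per-line cost of evaluating a trigger low, which matters when the trigger exists to save regexp executions.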
Coding Callbacks...
You may also code callbacks that are executed when a pattern matches. The callback function is called when a pattern matches, and after the fields have been captured. The callback function is passed the PyReParse instance and the name of the pattern that matched. It can then use the PyReParse instance to access any currently captured fields, and perform any processing logic or field value updates.
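The calling convention described above (capture first, then call back with the parser instance and the pattern name) can be shown with a tiny stand-in parser. `MiniParser`, its `fields` attribute, and `on_fee` are hypothetical and exist only for this illustration; they are not PyReParse's classes or attributes.

```python
import re

class MiniParser:
    """Hypothetical, stripped-down parser used only to illustrate callbacks."""

    def __init__(self, patterns):
        # patterns: {name: (compiled_regexp, callback_or_None)}
        self.patterns = patterns
        self.fields = {}  # latest captured fields, keyed by pattern name

    def match(self, line):
        for name, (regexp, callback) in self.patterns.items():
            m = regexp.match(line)
            if m:
                self.fields[name] = m.groupdict()  # capture first...
                if callback:
                    callback(self, name)           # ...then call back
                return name, self.fields[name]
        return None, {}

def on_fee(parser, pat_name):
    """Callback: validate and transform the captured value in place."""
    flds = parser.fields[pat_name]
    flds["fee"] = int(flds["fee"])
    if flds["fee"] < 0:
        raise ValueError("negative fee")

parser = MiniParser({"fee_line": (re.compile(r"^FEE\s+(?P<fee>-?\d+)$"), on_fee)})
name, fields = parser.match("FEE 35")
```

Because the callback receives the parser instance, it can also read fields captured by other patterns, which is what makes cross-field validation and event correlation possible.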
License...
Apache 2.0