SQL MATCH_RECOGNIZE for Pandas DataFrames

These details have not been verified by PyPI

Project links

Project description

SQL MATCH_RECOGNIZE on Pandas

Overview

This project brings SQL’s powerful MATCH_RECOGNIZE clause—used for pattern matching in sequences and event streams—directly to Pandas DataFrames. Our implementation allows users to run complex sequence detection logic in-memory within Python, removing the need for external databases like Trino, Oracle, or Flink.

It supports the SQL:2016 standard for MATCH_RECOGNIZE, including advanced features such as:

PARTITION BY, ORDER BY
Regex-style pattern syntax
DEFINE conditions
AFTER MATCH SKIP options
Support for anchors, quantifiers, alternation, and PERMUTE patterns

Motivation

Existing platforms like Oracle, Trino, and Flink offer robust implementations of MATCH_RECOGNIZE but come with significant complexity, licensing, or deployment overhead. Python's Pandas, despite its widespread use, lacks direct support for expressive pattern queries.

This project aims to close that gap by enabling SQL-native pattern detection in Pandas without sacrificing performance or expressiveness.

Key Features

🧠 SQL Query Parsing with ANTLR4 Fully customized SQL grammar extended from Trino to support all aspects of the MATCH_RECOGNIZE clause.
🌲 AST Construction SQL queries are parsed and transformed into abstract syntax trees for easier validation and execution.
⚙️ Finite Automata Engine
- Patterns are tokenized and translated to NFAs using Thompson’s construction.
- NFAs are converted to DFAs for efficient row-by-row evaluation.
- DFA optimizations include state minimization and prioritization.
📊 Execution on Pandas
- Data is partitioned and ordered per query.
- Patterns are matched directly on DataFrames.
- Results are formatted to resemble SQL query output.
🧪 Safety and Expressiveness
- Custom error listener for precise SQL diagnostics.
- SQL-to-Python conversion uses the ast module to safely evaluate expressions.

Architecture

flowchart TD
    SQL[SQL Query]
    Parse[ANTLR4 Parser]
    AST[AST Builder]
    Tokenize[Pattern Tokenizer]
    NFA[NFA Generator]
    DFA[DFA Optimizer]
    Executor[Match Executor]
    Output[Final DataFrame Output]

    SQL --> Parse --> AST --> Tokenize --> NFA --> DFA --> Executor --> Output

Example SQL Query

SELECT customer_id, start_price, bottom_price, final_price, start_date, final_date
FROM orders
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY order_date
    MEASURES
        START.price AS start_price,
        LAST(DOWN.price) AS bottom_price,
        LAST(UP.price) AS final_price,
        START.order_date AS start_date,
        LAST(UP.order_date) AS final_date
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (START DOWN+ UP+)
    DEFINE
        DOWN AS price < PREV(price),
        UP AS price > PREV(price)
);

🚀 Installation

Prerequisites

Python 3.8+
pandas >= 1.0.0
numpy >= 1.18.0
antlr4-python3-runtime >= 4.9.0

📦 Install from Source (Development)

Clone the repository:

git clone https://github.com/MonierAshraf/Row_match_recognize.git
cd Row_match_recognize

Install dependencies:
```
pip install -r requirements.txt
```
Install the package in editable mode:
```
pip install -e .
```

📚 Install from PyPI (Production)

pip install pandas-match-recognize

🧪 Install from TestPyPI (Latest Development)

pip install -i https://test.pypi.org/simple/ pandas-match-recognize

📥 Install from Local Build

# Build the package yourself
python -m build
pip install dist/pandas_match_recognize-0.1.0-py3-none-any.whl

🔧 Verify Installation

# Test the installation
# Note: Package is installed as 'pandas-match-recognize' but imported as 'match_recognize'
from match_recognize import match_recognize
print("✅ Installation successful!")

💡 Quick Start Usage

Note: The package is installed as pandas-match-recognize (with hyphens) from PyPI, but imported as match_recognize (with underscores) in Python code. This is standard practice since Python module names cannot contain hyphens.

Customer Order Pattern Analysis

# Import the match_recognize function (installed from pandas-match-recognize package)
from match_recognize import match_recognize
import pandas as pd

# Customer order data
data = [
    ('cust_1', '2020-05-11', 100),
    ('cust_1', '2020-05-12', 200),
    ('cust_2', '2020-05-13',   8),
    ('cust_1', '2020-05-14', 100),
    ('cust_2', '2020-05-15',   4),
    ('cust_1', '2020-05-16',  50),
    ('cust_1', '2020-05-17', 100),
    ('cust_2', '2020-05-18',   6),
]

# Create DataFrame
df = pd.DataFrame(data, columns=['customer_id', 'order_date', 'price'])
df['order_date'] = pd.to_datetime(df['order_date'])

# Find V-shaped price patterns: START → DOWN+ → UP+
sql = """
SELECT customer_id, start_price, bottom_price, final_price, start_date, final_date
FROM orders
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY order_date
    MEASURES
        START.price AS start_price,
        LAST(DOWN.price) AS bottom_price,
        LAST(UP.price) AS final_price,
        START.order_date AS start_date,
        LAST(UP.order_date) AS final_date
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (START DOWN+ UP+)
    DEFINE
        DOWN AS price < PREV(price),
        UP AS price > PREV(price)
);
"""

# Execute the query
result = match_recognize(sql, df)
print(result)

Output:

  customer_id  start_price  bottom_price  final_price start_date  final_date
0      cust_1          200            50          100 2020-05-12  2020-05-17
1      cust_2            8             4            6 2020-05-13  2020-05-18

🛠 Development Setup

For Contributors

Fork and clone:

git fork https://github.com/MonierAshraf/Row_match_recognize.git
git clone https://github.com/YOUR_USERNAME/Row_match_recognize.git
cd Row_match_recognize

Create virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install development dependencies:

pip install -e .
pip install -r test_requirements.txt  # Testing dependencies

Run tests:

python -m pytest tests/test_anchor_patterns.py tests/test_back_reference.py tests/test_case_sensitivity.py tests/test_complete_java_reference.py tests/test_empty_cycle.py tests/test_empty_matches.py tests/test_exponential_protection.py tests/test_fixed_failing_cases.py tests/test_in_predicate.py tests/test_match_recognize.py tests/test_missing_critical_cases.py tests/test_multiple_match_recognize.py tests/test_navigation_and_conditions.py tests/test_output_layout.py tests/test_pattern_cache.py tests/test_pattern_tokenizer.py tests/test_permute_patterns.py tests/test_production_aggregates.py tests/test_scalar_functions.py tests/test_sql2016_compliance.py tests/test_subqueries.py --tb=short

🗑 Uninstallation

📚 Uninstall PyPI Installation:

pip uninstall pandas-match-recognize

🧪 Uninstall TestPyPI Installation:

pip uninstall pandas-match-recognize

📥 Uninstall Local Build Installation:

pip uninstall pandas-match-recognize

📦 Remove Editable/Development Installation:

# For editable installations from source
pip uninstall pandas-match-recognize row-match-recognize

# Remove local development files
rm -rf match_recognize/
rm -rf *.egg-info/
rm -rf build/
rm -rf dist/

🧹 Complete Cleanup:

# Clear pip cache
pip cache purge

# Remove any remaining configuration files
rm -rf ~/.cache/pip/wheels/match*

Verify uninstallation:

# Test from different directory
cd /tmp
python -c "
try:
    from match_recognize import match_recognize
    print('❌ Package still installed')
except ImportError:
    print('✅ Package successfully uninstalled')
"

📋 Troubleshooting

Common Issues

Package vs Module Name Confusion:

# ❌ WRONG - This will fail
from pandas_match_recognize import match_recognize  # Package name has hyphens

# ✅ CORRECT - Use underscores for import
from match_recognize import match_recognize  # Python module name

Import Error:

# If you get ModuleNotFoundError
import sys
import os
sys.path.append(os.path.join(os.getcwd(), 'src'))
from executor.match_recognize import match_recognize

Performance Issues:

Limit dataset size to < 1000 rows for optimal performance
Use specific PARTITION BY clauses to reduce processing overhead
Avoid overly complex nested patterns with multiple quantifiers

Memory Issues:

# Monitor memory usage for large patterns
import psutil
print(f"Memory usage: {psutil.virtual_memory().percent}%")

Conclusion and Future Work

Current Limitations

Despite the system's comprehensive capabilities, several limitations remain. First, complex pattern and quantifier interactions: although the system supports concatenation, alternation, grouping, and standard quantifiers (*, +, ?, {n,m}), certain combinations—particularly multiple greedy quantifiers nested within groups (e.g., (A+B*)+C?)—can trigger exponential state-space growth during automata construction. This issue primarily arises with three or more levels of nesting combined with unbounded quantifiers; by contrast, simpler patterns and bounded quantifiers behave efficiently. Second, limited support for aggregate functions: while a wide range of built-in aggregates (including conditional and statistical functions) is supported, the current implementation offers only limited support for user-defined aggregate functions.

Future Work

We plan to extend the engine through some updates:

Performance on Large Datasets: The system performs efficiently on moderate-sized datasets but may require additional optimizations for large datasets.

Memory Usage for Large Patterns: Patterns with many variables and complex quantifiers can generate large automata that increase memory consumption.

Integration with Query Optimizers: Because the pattern-matching engine currently operates independently of database query optimizers, it may miss plan-level optimization opportunities.

Conclusion

We presented a SQL-in-pandas engine for executing MATCH_RECOGNIZE queries over DataFrames. This provides SQL:2016 MATCH_RECOGNIZE functionality for pandas DataFrames, bridging the gap between the expressiveness of relational queries and the flexibility of in-memory analytics, bringing SQL pattern matching capabilities to Python data science workflows. This opens the door to unified and portable pipelines that preserve both semantics and developer productivity.

MATCH_RECOGNIZE allows data scientists and analysts to use powerful pattern-matching semantics directly within their familiar Pandas environment, without the need for complex Python code or external SQL engine dependencies. This reduces development complexity and enhances productivity for sequential data analysis across domains, including financial analysis, log processing, and time series pattern detection.

By addressing the identified limitations and implementing the future enhancements, our goal is to develop a more adaptable and efficient solution that can handle complex pattern-matching scenarios across various data processing environments. Future work will focus on addressing current limitations through enhanced SQL clause support, distributed processing capabilities, and advanced analytics integration. This development roadmap outlined above provides a clear path for improving performance for the current implementation.

📚 References

🤝 Contributing

Pull requests and feedback are welcome! Please ensure your code is tested and documented.

📝 License

This project is licensed under the MIT License.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.7

Sep 24, 2025

0.1.6

Sep 24, 2025

0.1.5

Sep 24, 2025

0.1.4

Sep 24, 2025

0.1.3

Sep 24, 2025

0.1.2

Sep 24, 2025

This version

0.1.0

Sep 24, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas_match_recognize-0.1.0.tar.gz (2.5 MB view details)

Uploaded Sep 24, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pandas_match_recognize-0.1.0-py3-none-any.whl (10.6 kB view details)

Uploaded Sep 24, 2025 Python 3

File details

Details for the file pandas_match_recognize-0.1.0.tar.gz.

File metadata

Download URL: pandas_match_recognize-0.1.0.tar.gz
Upload date: Sep 24, 2025
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for pandas_match_recognize-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`df58e2ef2a6a8d65c9096daa187f02f059883d8d0e0abcbf0c2fff3ad684c1fd`
MD5	`ea686f0399d949a83c6ab5a15c2a4a2a`
BLAKE2b-256	`c04c1048f2ec966640a2acd80276329a3bb1efdefb1b9d20109b7ef6db2bb83f`

See more details on using hashes here.

File details

Details for the file pandas_match_recognize-0.1.0-py3-none-any.whl.

File metadata

Download URL: pandas_match_recognize-0.1.0-py3-none-any.whl
Upload date: Sep 24, 2025
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for pandas_match_recognize-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0e5105f384a1eaac0505213714a85a99e83f310a15f84c2142d5d36e92b175bb`
MD5	`d2218ede999ef3d28f90e6fda3f231fe`
BLAKE2b-256	`fb011983410be45118888a05ce1c61bdd9c1505d8fb7c3b774d414ff9c496319`

See more details on using hashes here.

pandas-match-recognize 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SQL MATCH_RECOGNIZE on Pandas

Overview

Motivation

Key Features

Architecture

Example SQL Query

🚀 Installation

Prerequisites

📦 Install from Source (Development)

📚 Install from PyPI (Production)

🧪 Install from TestPyPI (Latest Development)

📥 Install from Local Build

🔧 Verify Installation

💡 Quick Start Usage

Customer Order Pattern Analysis

🛠 Development Setup

For Contributors

🗑 Uninstallation

📚 Uninstall PyPI Installation:

🧪 Uninstall TestPyPI Installation:

📥 Uninstall Local Build Installation:

📦 Remove Editable/Development Installation:

🧹 Complete Cleanup:

Verify uninstallation:

📋 Troubleshooting

Common Issues

Conclusion and Future Work

Current Limitations

Future Work

Conclusion

📚 References

🤝 Contributing

📝 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes