Regex-learner
This project provides a tool/library implementing an automated regular expression building mechanism.
This project takes inspiration from the paper by Ilyas et al. [1].
This repository contains code and examples for learning regular expressions from columns of data.
This is a basic readme. It will be completed as the prototype grows.
Installation
The project can be installed via pip:
pip install regex-learner
Examples of usage
The following example learns a date pattern from 100 randomly sampled dates in the format DD-MM-YYYY. It uses the Faker package to generate the sample data.
from xsystem import XTructure
from faker import Faker

fake = Faker()
x = XTructure()  # Create basic XTructure class

for _ in range(100):
    d = fake.date(pattern=r"%d-%m-%Y")  # Create example of data - date in the format DD-MM-YYYY
    x.learn_new_word(d)  # Add example to XSystem and learn new features

print(str(x))  # e.g. ([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])
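The learned expression is an ordinary regular expression, so it can be checked with Python's standard `re` module. The sketch below hard-codes the sample output shown in the example (a real run will likely produce a different pattern, since the dates are random). The helper name `matches_learned_pattern` is hypothetical and not part of the package.

```python
import re

# Sample output copied from the example; a real run may yield a different pattern.
LEARNED_PATTERN = r"([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])"

_compiled = re.compile(LEARNED_PATTERN)


def matches_learned_pattern(value: str) -> bool:
    """Return True if the whole string matches the learned pattern."""
    return _compiled.fullmatch(value) is not None


if __name__ == "__main__":
    # Try a DD-MM-YYYY date, a YYYY-MM-DD date, and a date in April.
    for candidate in ("25-12-1999", "1999-12-25", "15-04-2001"):
        print(candidate, matches_learned_pattern(candidate))
```

With this particular sample pattern, `15-04-2001` is rejected even though it is a valid date, because the digit 4 does not appear in the character class for the second month position. This suggests the learned expression reflects only the examples it has seen, so it is worth testing it against held-out values.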
Similarly, the tool can be used directly from the command line through the regex-learner CLI, which is installed with the package.
The tool has several options, as described by the help message:
> regex-learner -h
usage: regex-learner [-h] [-i INPUT] [-o OUTPUT] [--max-branch MAX_BRANCH] [--alpha ALPHA] [--branch-threshold BRANCH_THRESHOLD]

A simple tool to learn human readable a regular expression from examples

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Path to the input source, defaults to stdin
  -o OUTPUT, --output OUTPUT
                        Path to the output file, defaults to stdout
  --max-branch MAX_BRANCH
                        Maximum number of branches allowed, defaults to 8
  --alpha ALPHA         Weight for fitting tuples, defaults to 1/5
  --branch-threshold BRANCH_THRESHOLD
                        Branching threshold, defaults to 0.85, relative to the fitting score alpha
Assuming a data file containing the examples to learn from is called EXAMPLE_FILE, and assuming one is interested in a very simple regular expression, the tool can be used as follows:
cat EXAMPLE_FILE | regex-learner --max-branch 2
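The same invocation can be scripted from Python with the standard `subprocess` module. This is a minimal sketch that uses only flags listed in the help message. The helper names `build_command` and `learn_from_file` are hypothetical, and the code assumes that the `regex-learner` executable is on the `PATH` and that the CLI accepts plain decimal numbers for its numeric options.

```python
import subprocess
from typing import List, Optional


def build_command(max_branch: Optional[int] = None,
                  alpha: Optional[float] = None,
                  branch_threshold: Optional[float] = None) -> List[str]:
    """Build the argument list; omitted options fall back to the CLI defaults."""
    cmd = ["regex-learner"]
    if max_branch is not None:
        cmd += ["--max-branch", str(max_branch)]
    if alpha is not None:
        # Assumption: the CLI parses a decimal string such as "0.2".
        cmd += ["--alpha", str(alpha)]
    if branch_threshold is not None:
        cmd += ["--branch-threshold", str(branch_threshold)]
    return cmd


def learn_from_file(path: str, **options) -> str:
    """Feed the file to the CLI on stdin and return what it writes to stdout."""
    with open(path, "rb") as source:
        result = subprocess.run(
            build_command(**options),
            stdin=source,
            capture_output=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    return result.stdout.decode()


# Equivalent to `cat EXAMPLE_FILE | regex-learner --max-branch 2`:
#     print(learn_from_file("EXAMPLE_FILE", max_branch=2))
```

Keeping the argument construction in its own function makes it possible to inspect or log the exact command before running it.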
Note
This project is not based on the original implementation of the paper, which is available at [2].
References
[1] Ilyas, Andrew, et al. "Extracting syntactical patterns from databases." 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
[2] https://github.com/mitdbg/XSystem