Regex-learner
This project provides a tool/library implementing an automated regular expression building mechanism.
This project takes inspiration from the paper by Ilyas et al. [1].
This repository contains code and examples to assist in learning regular expressions from columns of data.
This is a basic readme. It will be completed as the prototype grows.
Installation
The project can be installed via pip:
pip install regex-learner
Examples of usage
The following example learns a date pattern from 100 randomly sampled dates in the format DD-MM-YYYY.
from xsystem import XTructure
from faker import Faker
fake = Faker()
x = XTructure() # Create basic XTructure class
for _ in range(100):
    d = fake.date(pattern=r"%d-%m-%Y") # Create example of data - date in the format DD-MM-YYYY
    x.learn_new_word(d) # Add example to XSystem and learn new features
print(str(x)) # ([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])
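The printed pattern is an ordinary regular expression, so it can be checked with Python's standard `re` module. The sketch below is only an illustration: it copies the pattern from the comment above (a fresh run will most likely print a different one, because the training dates are random), and the helper name `matches` is hypothetical, not part of the package.

```python
import re

# Pattern copied from the comment in the example above. It came from one
# random sample of 100 dates, so another run may learn different classes.
learned = r"([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])"
pattern = re.compile(learned)


def matches(value: str) -> bool:
    """Return True if the whole string fits the learned pattern (hypothetical helper)."""
    return pattern.fullmatch(value) is not None


print(matches("25-12-2019"))  # True: a date in the learned DD-MM-YYYY shape
print(matches("2019-12-25"))  # False: different field order
print(matches("39-19-2099"))  # True: right shape, but not a valid calendar date
print(matches("25-04-2019"))  # False: this sample never saw a month ending in 4
```

The last two calls show that the learned pattern describes the character classes seen at each position, not calendar validity, and that it only covers characters that actually appeared in the sample.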
Similarly, the tool can be used directly from the command line through the regex-learner CLI, which is installed together with the package.
The tool has several options, as described by the help message:
> regex-learner -h
usage: regex-learner [-h] [-i INPUT] [-o OUTPUT] [--max-branch MAX_BRANCH] [--alpha ALPHA] [--branch-threshold BRANCH_THRESHOLD]

A simple tool to learn human readable a regular expression from examples

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Path to the input source, defaults to stdin
  -o OUTPUT, --output OUTPUT
                        Path to the output file, defaults to stdout
  --max-branch MAX_BRANCH
                        Maximum number of branches allowed, defaults to 8
  --alpha ALPHA         Weight for fitting tuples, defaults to 1/5
  --branch-threshold BRANCH_THRESHOLD
                        Branching threshold, defaults to 0.85, relative to the fitting score alpha
Assuming a data file containing the examples to learn from is called EXAMPLE_FILE, and assuming one is interested in a very simple regular expression, the tool can be used as follows:
cat EXAMPLE_FILE | regex-learner --max-branch 2
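For readers who want to drive the CLI from a script, the following is a minimal sketch. It assumes that the input holds one example per line (this README does not state the expected layout), it generates dates with the standard library instead of Faker, and the helper `write_examples` is a hypothetical name. Only flags listed in the help message above are used, and the CLI call is skipped when `regex-learner` is not on the PATH.

```python
import random
import shutil
import subprocess
import tempfile
from datetime import date, timedelta
from pathlib import Path


def write_examples(path: Path, count: int = 100, seed: int = 0) -> None:
    """Write `count` random DD-MM-YYYY dates, one per line (assumed input layout)."""
    rng = random.Random(seed)
    start = date(1970, 1, 1)
    lines = [
        (start + timedelta(days=rng.randrange(20000))).strftime("%d-%m-%Y")
        for _ in range(count)
    ]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


example_file = Path(tempfile.gettempdir()) / "EXAMPLE_FILE"
write_examples(example_file)

if shutil.which("regex-learner"):
    # Same request as the shell pipeline above, passing the file with -i instead of stdin.
    result = subprocess.run(
        ["regex-learner", "-i", str(example_file), "--max-branch", "2"],
        capture_output=True,
        text=True,
        check=False,
    )
    print(result.stdout)
else:
    print("regex-learner not found on PATH; examples written to", example_file)
```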
Note
This project is not based on the original implementation of the paper, which is available at [2].
References
[1] Ilyas, Andrew, et al. "Extracting syntactical patterns from databases." 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
[2] https://github.com/mitdbg/XSystem
File details
Details for the file regex-learner-0.0.4.tar.gz.
File metadata
- Download URL: regex-learner-0.0.4.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f92d9d918616bcf360f64aecb0384c39701886c514f4d548735aa31497a4bee8 |
| MD5 | d3b376a32ec88ed598e17ec872a963dc |
| BLAKE2b-256 | d14fa0f85e09fdfa431080d97949a2e71e26e2a01a08b48eb684d06726adfce3 |
File details
Details for the file regex_learner-0.0.4-py2.py3-none-any.whl.
File metadata
- Download URL: regex_learner-0.0.4-py2.py3-none-any.whl
- Upload date:
- Size: 10.9 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a46f7983421a73faf2d80de1a57a1100b34126b043a8a029d516e9451fe86e0 |
| MD5 | 6ba07de961480c7c6ccb9cb7d8068736 |
| BLAKE2b-256 | 015e78c0f07a08f285f3fdadf55288ea30966c61006970a64ffabedf695abc60 |