Regex-learner

This project provides a tool/library implementing an automated regular expression building mechanism.

This project takes inspiration from the paper by Ilyas et al. [1]:

Ilyas, Andrew, Joana M. F. da Trindade, Raul Castro Fernandez, and Samuel Madden. 2018. "Extracting Syntactical Patterns from Databases."

This repository contains code and examples for learning regular expressions from columns of data.

This is a basic readme. It will be completed as the prototype grows.

Installation

The project can be installed via pip:

pip install regex-learner

Examples of usage

The following example learns a date pattern from 100 randomly sampled dates in the format DD-MM-YYYY.

from xsystem import XTructure
from faker import Faker

fake = Faker()
x = XTructure() # Create basic XTructure class

for _ in range(100):
    d = fake.date(pattern=r"%d-%m-%Y") # Create example of data - date in the format DD-MM-YYYY
    x.learn_new_word(d) # Add example to XSystem and learn new features

print(str(x)) # Example output (varies between runs): ([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])
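
The learned expression in the example above is printed in ordinary regular expression syntax, so it can presumably be checked with Python's standard re module. The following is a minimal sketch under that assumption. It hard-codes the sample output shown above instead of calling XTructure, because the learned pattern changes from run to run. The names LEARNED and matches are illustrative and are not part of the package.

```python
import re

# Sample pattern copied from the example output above; a real run will differ.
LEARNED = r"([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])"
pattern = re.compile(LEARNED)


def matches(value):
    """Return True when the whole string fits the learned pattern."""
    return pattern.fullmatch(value) is not None


print(matches("25-12-1999"))  # True: every character was seen at its position
print(matches("1999-12-25"))  # False: different field order
print(matches("25-04-1999"))  # False: this sample never saw "4" as the second month digit
```

The last case shows that a learned pattern only covers characters observed in the training examples, so a small sample can reject values that are valid dates.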

Similarly, the tool can be used directly from the command line through the regex-learner CLI that is installed with the package.

The tool has several options, as described by the help message:

> regex-learner -h
usage: regex-learner [-h] [-i INPUT] [-o OUTPUT] [--max-branch MAX_BRANCH] [--alpha ALPHA] [--branch-threshold BRANCH_THRESHOLD]

A simple tool to learn human readable a regular expression from examples

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Path to the input source, defaults to stdin
  -o OUTPUT, --output OUTPUT
                        Path to the output file, defaults to stdout
  --max-branch MAX_BRANCH
                        Maximum number of branches allowed, defaults to 8
  --alpha ALPHA         Weight for fitting tuples, defaults to 1/5
  --branch-threshold BRANCH_THRESHOLD
                        Branching threshold, defaults to 0.85, relative to the fitting score alpha

Assuming a data file containing the examples to learn from is called EXAMPLE_FILE, and assuming one is interested in a very simple regular expression, the tool can be used as follows:

cat EXAMPLE_FILE | regex-learner --max-branch 2
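
The same invocation can also be driven from Python. The sketch below uses only the -i, -o and --max-branch flags listed in the help message. It assumes that the input file holds one example per line, which the help message does not state. The helpers write_examples and build_command are hypothetical names introduced for illustration, and the command is executed only if regex-learner is found on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_examples(path, examples):
    """Write one example per line (assumed input format)."""
    Path(path).write_text("\n".join(examples) + "\n", encoding="utf-8")


def build_command(input_path, output_path, max_branch=2):
    """Build the argument list using flags from the CLI help message."""
    return [
        "regex-learner",
        "-i", str(input_path),
        "-o", str(output_path),
        "--max-branch", str(max_branch),
    ]


workdir = Path(tempfile.mkdtemp())
input_path = workdir / "examples.txt"
output_path = workdir / "pattern.txt"

write_examples(input_path, ["01-02-2003", "15-11-1998", "30-06-2021"])
command = build_command(input_path, output_path, max_branch=2)

if shutil.which("regex-learner"):
    # Run the CLI and show the pattern it wrote to the output file.
    subprocess.run(command, check=True)
    print(output_path.read_text(encoding="utf-8"))
else:
    print("regex-learner not found; would run:", " ".join(command))
```

Passing the command as a list avoids shell quoting issues when file paths contain spaces.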

Note

Note that this project is not based on the original implementation of the paper, which is available at [2].

References

  1. Ilyas, Andrew, et al. "Extracting syntactical patterns from databases." 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
  2. https://github.com/mitdbg/XSystem
