Regular Expression Builder

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
- System Administrators
License
- OSI Approved :: MIT License
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3

Project description

regularize - Easily compose regular expressions

Github Actions Status

Motivation

Writing complex regular expressions can prove to be difficult and error-prone. This library aims to provide a simple interface for constructing patterns, finding matches and performing substitutions.

Key Features

Pattern Builder: a clean and robust API to build complex regular expressions.
Flag Interface: easily add and remove flags using a friendly interface.
Immutable Pattern Objects: in order to increase composability and reusability, Pattern instances do not modify internal state, but instead return copies with the modifications.
Find/Replace with LRU cache: using a shared cache, different pattern instances that compile to the same regular expression can benefit from the same cache entries.

Examples

Match compressed / uncompressed log filenames

from regularize import pattern, finder

# Start a new pattern
logfile_pattern = pattern()

# Create a base pattern for the logfile names
logfile_pattern = logfile_pattern.\
                  literal('application.').\
                  any_number().\
                  quantify(minimum=1).case_insensitive()

uncompressed_logfile = logfile_pattern.literal('.log').end_anchor()
compressed_logfile = logfile_pattern.literal('.log.gz').end_anchor()

print(uncompressed_logfile)
# Expression: /application\.[0-9]+\.log$/

print(compressed_logfile)
# Expression: /application\.[0-9]+\.log\.gz$/

print(finder(uncompressed_logfile).match('application.1.log'))
# <re.Match object; span=(0, 17), match='application.1.log'>
print(finder(compressed_logfile).match('application.1.log.gz'))
# <re.Match object; span=(0, 20), match='application.1.log.gz'>

Match and extract URL components

from urllib.parse import urlparse

from regularize import pattern

# Valid characters for DNS names
ascii_alphanumeric = pattern().lowercase_ascii_letters(). \
    uppercase_ascii_letters().any_number()

domain_pattern = \
    ascii_alphanumeric.close_bracket() + \
    ascii_alphanumeric.literal('-').quantify(1, 61)

# At least one alphanumeric character before the dot and after the dash
domain_pattern += ascii_alphanumeric.close_bracket()

# Add TLD
tld_pattern = pattern().lowercase_ascii_letters(closed=False). \
    uppercase_ascii_letters(). \
    quantify(minimum=2)

# Add optional subdomain group
subdomain_pattern = domain_pattern.\
    group(name='subdomain', optional=True).\
    literal('.').\
    group(optional=True)

# Full domain pattern
domain_pattern = subdomain_pattern + domain_pattern.literal('.') + tld_pattern

# Match HTTP or HTTPS scheme
scheme_pattern = pattern().literal('http').any_of('s').\
    quantify(minimum=0, maximum=1).\
    group('scheme').\
    literal('://')

# Match the URL path (if any exists)
path_pattern = pattern().literal('/').any_number().\
    lowercase_ascii_letters().literal('%-_').\
    quantify(minimum=1).match_all()

# Compose the complete pattern
url_pattern = (scheme_pattern + domain_pattern.group('domain') +
               path_pattern.group(name='path', optional=True)).case_insensitive()

url = 'https://www.example.com/p/1'

compiled_url_pattern = url_pattern.compile()
url_regex_matches = compiled_url_pattern.match(url).groupdict()

parsed_url = urlparse(url)

print(url_regex_matches)
# {'scheme': 'https', 'domain': 'www.example.com', 'subdomain': 'www', 'path': '/p/1'}
print(parsed_url)
# ParseResult(scheme='https', netloc='www.example.com', path='/p/1', params='', query='', fragment='')
assert parsed_url.scheme == url_regex_matches['scheme']
assert parsed_url.hostname == url_regex_matches['domain']
assert parsed_url.path == url_regex_matches['path']
assert url_regex_matches['subdomain'] == 'www'

Parse HTTP Logs

The following example is taken from the common format sample of the Apache web server combined log.

from regularize.expression import Pattern, pattern

apache_webserver_combined_log = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0"'
)

ip = pattern().any_of('.', Pattern.ANY_NUMBER).quantify(minimum=7).group('ip')
identd_client_id = pattern().literal('-')
http_auth_user = pattern().any_of(Pattern.ANY_ASCII_CHARACTER, '_', '.').\
    at_least_one().group('http_auth_user')
time = pattern().literal('[').none_of(']').quantify(minimum=26).literal(']')
http_verb = pattern().literal('"').group('http_verb',
                                         pattern=pattern().uppercase_ascii_letters().at_least_one())
url = pattern().group(name='url',
                      pattern=pattern().none_of(Pattern.ANY_WHITESPACE).at_least_one())
http_version = pattern().literal('HTTP/').any_of('1', '2').literal('.').\
    any_of('0', '1').group('http_version').literal('"')
http_status_code = pattern().group(name='http_status_code',
                                   pattern=pattern().any_of(Pattern.ANY_NUMBER).exactly(3))
response_bytes = pattern().group(name='response_bytes_without_headers',
                                 pattern=pattern().any_of(Pattern.ANY_NUMBER).at_least_one())
# Note the repetition here. For multiple groups using the same expression,
# we can create a lambda, e.g:
# lambda name: pattern().literal('"').group(name=name, pattern=pattern().none_of('"').at_least_one()).literal('"')
referer = pattern().literal('"').\
    group(name='referer', pattern=pattern().none_of('"').at_least_one()).literal('"')
user_agent = pattern().literal('"').\
    group(name='user_agent', pattern=pattern().none_of('"').at_least_one())

p = Pattern.join(
    pattern().whitespace(),
    [ip, identd_client_id, http_auth_user, time,
     http_verb, url, http_version, http_status_code,
     response_bytes, referer, user_agent]
)
assert {'ip': '127.0.0.1', 'http_auth_user': 'frank', 'http_verb': 'GET', 'url': '/apache_pb.gif',
             'http_version': 'HTTP/1.0', 'http_status_code': '200', 'response_bytes_without_headers': '2326',
             'user_agent': 'http://www.example.com/start.html'} == \
        p.compile().match(apache_webserver_combined_log).groupdict()

Strip HTML tags

from regularize import pattern
from regularize.replace import substitution

html = '''<h1>Article Title</h1>
<p>This is a <b>blog post</b></p>'''
p = pattern().literal('<').any_of('/').quantify(minimum=0).ascii_letters().any_number().at_least_one().literal('>')
s = substitution(p)
text = s.replace(html)
print(text)
'''
Article Title
    This is a blog post
'''

API

Pattern Builder

Finder

Substitution (Replace)

Extending

Writing Extensions

Commonly used patterns can be easily added either by creating a sub-class of the Pattern class, or by using the extension registry.

Using a Pattern sub-class

There are two prerequisites for new pattern builder methods:

The return value should be a Pattern instance.
Internal state is not modified, but instead all changes are applied to an instance clone.

from regularize.expression import Pattern

class MyPattern(Pattern):
    def html_tag(self, opening=True):
        if opening:
            new = self.literal('<')
        else:
            new = self.literal('</')
        return new.any_of(Pattern.ANY_ASCII_CHARACTER).at_least_one().literal('>')

Registering an extension

from regularize.expression import Pattern

class HTMLTag(Pattern):
    def __call__(self, opening=True):
        if opening:
            new = self.literal('<')
        else:
            new = self.literal('</')
        return new.any_of(Pattern.ANY_ASCII_CHARACTER). \
            quantify(minimum=1).literal('>')


p = Pattern()
# The registry is attached to the Pattern class:
Pattern.registry.add('html_tag', HTMLTag)
# But is also accessible through the instance for convenience:
p.extensions.registry.add('html_tag', HTMLTag)
# We can now call the pattern wrapper by its given alias, through the `ext` object:
p = p.ext.html_tag()

print(p.build())
# <[a-z]+>

Project details

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
- System Administrators
License
- OSI Approved :: MIT License
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.0.7

Mar 28, 2021

This version

0.0.5

Mar 26, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

regularize-0.0.5.tar.gz (12.6 kB view details)

Uploaded Mar 26, 2021 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

regularize-0.0.5-py3-none-any.whl (11.2 kB view details)

Uploaded Mar 26, 2021 Python 3

File details

Details for the file regularize-0.0.5.tar.gz.

File metadata

Download URL: regularize-0.0.5.tar.gz
Upload date: Mar 26, 2021
Size: 12.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for regularize-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`96dfa24a964bc51ff588e739dd9ade9fc4c3918de309a4e08a6c0981d1b0a7e8`
MD5	`0a0f280b2893aa34c33e9b96292e02ca`
BLAKE2b-256	`02d1f921141dec8642ed18511d2cf5bbef5b08430e1c587613d143a4840bc1eb`

See more details on using hashes here.

File details

Details for the file regularize-0.0.5-py3-none-any.whl.

File metadata

Download URL: regularize-0.0.5-py3-none-any.whl
Upload date: Mar 26, 2021
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for regularize-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11990d403a09f4d159cc7bb09d9349d3f90c8090e506d9281fa985cb3deeae4d`
MD5	`09646c4e71319f014460817b8c863e6f`
BLAKE2b-256	`8f6f1573e5978c04ea14080515cca9c92b85812bba76a576ac95784a7fffce6c`

See more details on using hashes here.

regularize 0.0.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

regularize - Easily compose regular expressions

Motivation

Key Features

Examples

Match compressed / uncompressed log filenames

Match and extract URL components

Parse HTTP Logs

Strip HTML tags

API

Pattern Builder

Finder

Substitution (Replace)

Extending

Writing Extensions

Using a Pattern sub-class

Registering an extension

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes