Regular Expression Builder
regularize - Easily compose regular expressions
Motivation
Writing complex regular expressions is difficult and error-prone. This library aims to provide a simple interface for constructing patterns, finding matches, and performing substitutions.
Key Features
- Pattern Builder: a clean and robust API to build complex regular expressions.
- Flag Interface: easily add and remove flags using a friendly interface.
- Immutable Pattern Objects: in order to increase composability and reusability, Pattern instances do not modify internal state, but instead return copies with the modifications.
- Find/Replace with LRU cache: using a shared cache, different pattern instances that compile to the same regular expression can benefit from the same cache entries.
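The shared-cache feature can be illustrated with the standard library alone. This is a minimal sketch and not regularize's implementation: `compile_cached` is a hypothetical helper that keys an LRU cache on the final expression string and its flags, so two builders that end up with the same expression reuse one cache entry.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=128)
def compile_cached(expression: str, flags: int = 0) -> re.Pattern:
    """Hypothetical helper: compile once per (expression, flags) pair."""
    return re.compile(expression, flags)


# Two independently assembled strings with the same value share one entry.
first = compile_cached(r"application\.[0-9]+\.log$", re.IGNORECASE)
second = compile_cached("application" + r"\." + "[0-9]+" + r"\.log$", re.IGNORECASE)

print(first is second)              # True: the second call was a cache hit
print(compile_cached.cache_info())  # reports one miss (first call) and one hit
```

Keying on the built expression rather than on the builder object is what lets distinct pattern instances benefit from each other's cache entries.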
Examples
Match compressed / uncompressed log filenames
from regularize import pattern, finder
# Start a new pattern
logfile_pattern = pattern()
# Create a base pattern for the logfile names
logfile_pattern = logfile_pattern.\
    literal('application.').\
    any_number().\
    quantify(minimum=1).case_insensitive()
uncompressed_logfile = logfile_pattern.literal('.log').end_anchor()
compressed_logfile = logfile_pattern.literal('.log.gz').end_anchor()
print(uncompressed_logfile)
# Expression: /application\.[0-9]+\.log$/
print(compressed_logfile)
# Expression: /application\.[0-9]+\.log\.gz$/
print(finder(uncompressed_logfile).match('application.1.log'))
# <re.Match object; span=(0, 17), match='application.1.log'>
print(finder(compressed_logfile).match('application.1.log.gz'))
# <re.Match object; span=(0, 20), match='application.1.log.gz'>
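For comparison, the two printed expressions can be checked directly with the standard `re` module. This is only a sketch: the expression strings are copied from the output shown in the example, and `re.IGNORECASE` is assumed to correspond to the builder's `case_insensitive()` call.

```python
import re

# Expressions copied from the printed builder output; IGNORECASE is an
# assumed stand-in for case_insensitive().
uncompressed = re.compile(r"application\.[0-9]+\.log$", re.IGNORECASE)
compressed = re.compile(r"application\.[0-9]+\.log\.gz$", re.IGNORECASE)

print(uncompressed.match("application.1.log"))
print(compressed.match("application.1.log.gz"))

# The end anchor keeps the uncompressed pattern from matching a .gz file.
print(uncompressed.match("application.1.log.gz"))  # None
```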
Match and extract URL components
from urllib.parse import urlparse
from regularize import pattern
# Valid characters for DNS names
ascii_alphanumeric = pattern().lowercase_ascii_letters(). \
    uppercase_ascii_letters().any_number()
domain_pattern = \
    ascii_alphanumeric.close_bracket() + \
    ascii_alphanumeric.literal('-').quantify(1, 61)
# At least one alphanumeric character before the dot and after the dash
domain_pattern += ascii_alphanumeric.close_bracket()
# Add TLD
tld_pattern = pattern().lowercase_ascii_letters(closed=False). \
    uppercase_ascii_letters(). \
    quantify(minimum=2)
# Add optional subdomain group
subdomain_pattern = domain_pattern.\
    group(name='subdomain', optional=True).\
    literal('.').\
    group(optional=True)
# Full domain pattern
domain_pattern = subdomain_pattern + domain_pattern.literal('.') + tld_pattern
# Match HTTP or HTTPS scheme
scheme_pattern = pattern().literal('http').any_of('s').\
    quantify(minimum=0, maximum=1).\
    group('scheme').\
    literal('://')
# Match the URL path (if any exists)
path_pattern = pattern().literal('/').any_number().\
    lowercase_ascii_letters().literal('%-_').\
    quantify(minimum=1).match_all()
# Compose the complete pattern
url_pattern = (scheme_pattern + domain_pattern.group('domain') +
               path_pattern.group(name='path', optional=True)).case_insensitive()
url = 'https://www.example.com/p/1'
compiled_url_pattern = url_pattern.compile()
url_regex_matches = compiled_url_pattern.match(url).groupdict()
parsed_url = urlparse(url)
print(url_regex_matches)
# {'scheme': 'https', 'domain': 'www.example.com', 'subdomain': 'www', 'path': '/p/1'}
print(parsed_url)
# ParseResult(scheme='https', netloc='www.example.com', path='/p/1', params='', query='', fragment='')
assert parsed_url.scheme == url_regex_matches['scheme']
assert parsed_url.hostname == url_regex_matches['domain']
assert parsed_url.path == url_regex_matches['path']
assert url_regex_matches['subdomain'] == 'www'
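To see what a composed pattern of this shape looks like when written by hand, here is a rough standard-library equivalent. It is an assumption-laden sketch, not the expression regularize generates: the label and path character classes are simplified guesses, and only the group names (`scheme`, `domain`, `subdomain`, `path`) are taken from the example.

```python
import re
from urllib.parse import urlparse

# Hand-written approximation; the character classes are assumptions.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
URL_RE = re.compile(
    r"(?P<scheme>https?)://"
    r"(?P<domain>(?:(?P<subdomain>" + LABEL + r")\.)?" + LABEL + r"\.[a-z]{2,})"
    r"(?P<path>/[a-z0-9%_./-]*)?",
    re.IGNORECASE,
)

url = "https://www.example.com/p/1"
groups = URL_RE.match(url).groupdict()
parsed = urlparse(url)

print(groups)
assert groups["scheme"] == parsed.scheme
assert groups["domain"] == parsed.hostname
assert groups["path"] == parsed.path
```

The hand-written form is compact but hard to read and reuse, which is the trade-off the builder API is meant to address.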
Parse HTTP Logs
The following example is taken from the common format sample of the Apache web server combined log.
from regularize.expression import Pattern, pattern
apache_webserver_combined_log = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0"'
)
ip = pattern().any_of('.', Pattern.ANY_NUMBER).quantify(minimum=7).group('ip')
identd_client_id = pattern().literal('-')
http_auth_user = pattern().any_of(Pattern.ANY_ASCII_CHARACTER, '_', '.').\
    at_least_one().group('http_auth_user')
time = pattern().literal('[').none_of(']').quantify(minimum=26).literal(']')
http_verb = pattern().literal('"').group('http_verb',
    pattern=pattern().uppercase_ascii_letters().at_least_one())
url = pattern().group(name='url',
    pattern=pattern().none_of(Pattern.ANY_WHITESPACE).at_least_one())
http_version = pattern().literal('HTTP/').any_of('1', '2').literal('.').\
    any_of('0', '1').group('http_version').literal('"')
http_status_code = pattern().group(name='http_status_code',
    pattern=pattern().any_of(Pattern.ANY_NUMBER).exactly(3))
response_bytes = pattern().group(name='response_bytes_without_headers',
    pattern=pattern().any_of(Pattern.ANY_NUMBER).at_least_one())
# Note the repetition here. For multiple groups using the same expression,
# we can create a lambda, e.g.:
# lambda name: pattern().literal('"').group(name=name, pattern=pattern().none_of('"').at_least_one()).literal('"')
referer = pattern().literal('"').\
    group(name='referer', pattern=pattern().none_of('"').at_least_one()).literal('"')
user_agent = pattern().literal('"').\
    group(name='user_agent', pattern=pattern().none_of('"').at_least_one())
p = Pattern.join(
    pattern().whitespace(),
    [ip, identd_client_id, http_auth_user, time,
     http_verb, url, http_version, http_status_code,
     response_bytes, referer, user_agent]
)
assert {'ip': '127.0.0.1', 'http_auth_user': 'frank', 'http_verb': 'GET', 'url': '/apache_pb.gif',
        'http_version': 'HTTP/1.0', 'http_status_code': '200', 'response_bytes_without_headers': '2326',
        'referer': 'http://www.example.com/start.html',
        'user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0'} == \
    p.compile().match(apache_webserver_combined_log).groupdict()
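As a cross-check, the same log line can be parsed with a hand-written expression using only the standard `re` module. This sketch mirrors the group names from the example, but the expression itself is an assumption written for illustration (for instance, `[\w.]+` stands in for the user-name character set) and is not the output of the builder.

```python
import re

LOG_LINE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0"'
)

# One fragment per field, separated by single spaces as in the log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>[.0-9]{7,}) '
    r'- '
    r'(?P<http_auth_user>[\w.]+) '
    r'\[[^\]]{26,}\] '
    r'"(?P<http_verb>[A-Z]+) '
    r'(?P<url>\S+) '
    r'(?P<http_version>HTTP/[12]\.[01])" '
    r'(?P<http_status_code>[0-9]{3}) '
    r'(?P<response_bytes_without_headers>[0-9]+) '
    r'"(?P<referer>[^"]+)" '
    r'"(?P<user_agent>[^"]+)'
)

fields = COMBINED_LOG.match(LOG_LINE).groupdict()
for name, value in fields.items():
    print(f'{name}: {value}')
```

Every named group appears in `groupdict()`, so both `referer` and `user_agent` are present in the result alongside the request fields.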
Strip HTML tags
from regularize import pattern
from regularize.replace import substitution
html = '''<h1>Article Title</h1>
<p>This is a <b>blog post</b></p>'''
p = pattern().literal('<').any_of('/').quantify(minimum=0).ascii_letters().any_number().at_least_one().literal('>')
s = substitution(p)
text = s.replace(html)
print(text)
'''
Article Title
This is a blog post
'''
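The same substitution can be sketched with `re.sub` from the standard library. The expression below is a hand-written assumption of what a pattern like this amounts to (an optional slash followed by letters or digits between angle brackets), not the string regularize builds.

```python
import re

html = '''<h1>Article Title</h1>
<p>This is a <b>blog post</b></p>'''

# Assumed equivalent: "<", optional "/", one or more letters or digits, ">".
TAG = re.compile(r"</?[A-Za-z0-9]+>")

text = TAG.sub("", html)
print(text)
```

Note that an expression of this shape only removes bare tags; tags carrying attributes such as `<a href="...">` would be left in place.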
API
Pattern Builder
Finder
Substitution (Replace)
Extending
Writing Extensions
Commonly used patterns can be easily added either by creating a sub-class of the Pattern class,
or by using the extension registry.
Using a Pattern sub-class
There are two prerequisites for new pattern builder methods:
- The return value should be a Pattern instance.
- Internal state is not modified, but instead all changes are applied to an instance clone.
from regularize.expression import Pattern
class MyPattern(Pattern):
    def html_tag(self, opening=True):
        if opening:
            new = self.literal('<')
        else:
            new = self.literal('</')
        return new.any_of(Pattern.ANY_ASCII_CHARACTER).at_least_one().literal('>')
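The two prerequisites (return a pattern instance, apply changes to a clone) can be shown in isolation with a tiny stand-in class. `MiniPattern` and its `_clone_with` helper are hypothetical names invented for this sketch; they do not reflect how regularize's `Pattern` is implemented internally.

```python
from __future__ import annotations

import copy
import re


class MiniPattern:
    """Hypothetical stand-in for a builder; not the regularize Pattern class."""

    def __init__(self) -> None:
        self._parts = []

    def _clone_with(self, part: str) -> MiniPattern:
        new = copy.copy(self)              # shallow copy of the instance
        new._parts = self._parts + [part]  # fresh list, original untouched
        return new

    def literal(self, text: str) -> MiniPattern:
        # Builder methods return the modified clone, never self.
        return self._clone_with(re.escape(text))

    def build(self) -> str:
        return "".join(self._parts)


base = MiniPattern().literal("application.")
log = base.literal("log")
gz = base.literal("log.gz")

print(base.build())  # application\.
print(log.build())   # application\.log
print(gz.build())    # application\.log\.gz
```

Because `base` is never mutated, it can safely serve as the starting point for both derived patterns, which is the composability benefit the rules are meant to protect.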
Registering an extension
from regularize.expression import Pattern
class HTMLTag(Pattern):
    def __call__(self, opening=True):
        if opening:
            new = self.literal('<')
        else:
            new = self.literal('</')
        return new.any_of(Pattern.ANY_ASCII_CHARACTER). \
            quantify(minimum=1).literal('>')
p = Pattern()
# The registry is attached to the Pattern class:
Pattern.registry.add('html_tag', HTMLTag)
# But is also accessible through the instance for convenience:
p.extensions.registry.add('html_tag', HTMLTag)
# We can now call the pattern wrapper by its given alias, through the `ext` object:
p = p.ext.html_tag()
print(p.build())
# <[a-z]+>