Tools for creating RE and RE2 expressions
Project description
Regex-Toolkit
Regex-Toolkit provides tools for creating RE and RE2 expressions.
Requirements:
Regex-Toolkit requires Python 3.10 or higher, is platform independent, and has no outside dependencies.
Issue reporting
If you discover an issue with Regex-Toolkit, please report it at https://github.com/Phosmic/regex-toolkit/issues.
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Installing
Most stable version from PyPI:
python3 -m pip install regex-toolkit
Development version from GitHub:
git clone https://github.com/Phosmic/regex-toolkit.git
cd regex-toolkit
python3 -m pip install -e .
Usage
To use the toolkit, import the necessary packages:
import re
import re2
import regex_toolkit as rtk
Why Use regex_toolkit?
Regex definitions vary across languages and versions. The toolkit helps you build expressions with more consistent and comprehensive Unicode support. It is especially useful for supplementing base Unicode sets with the latest definitions from other languages and standards.
RE2 Overview
RE2 focuses on safely processing regular expressions, particularly those from untrusted inputs. It ensures both linear match time and efficient memory usage. To do so, it intentionally omits features that depend solely on backtracking, such as backreferences and look-around assertions. It might not always surpass other engines in speed.
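To make the omitted features concrete, the sketch below uses Python's built-in `re` module (a backtracking engine, not RE2) to run one backreference and one look-ahead. Both are constructs that RE2 leaves out, so patterns like these would need restructuring before being used with it.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured.
repeated_word = re.compile(r"\b(\w+) \1\b")

# Look-ahead: match "foo" only when "bar" follows, without consuming "bar".
foo_before_bar = re.compile(r"foo(?=bar)")

doubled = repeated_word.search("it was was a typo").group(1)
lookahead_hit = foo_before_bar.search("foobar").group()
lookahead_miss = foo_before_bar.search("foobaz")
```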
A brief rundown of RE2 terminology:
- BitState: An execution engine that uses backtracking search.
- bytecode: The set of instructions that form an automaton.
- DFA: The engine for Deterministic Finite Automaton searches.
- NFA: Implements the Nondeterministic Finite Automaton search method.
- OnePass: A one-pass search execution engine.
- pattern: The textual form of a regex.
- Prog: The compiled version of a regex.
- Regexp: The parsed version of a regex.
- Rune: A character in terms of encoding, essentially a code point.
For an in-depth exploration, please refer to the RE2 documentation.
Library
regex_toolkit.utils
ord_to_cpoint
def ord_to_cpoint(ordinal: int, *, zfill: int | None = 8) -> str
Character ordinal to character codepoint.
Produces a hexadecimal ([0-9A-F]) representation of the ordinal.
By default the codepoint is zero-padded to 8 characters, which is the maximum number of characters in a codepoint.
Example:
import regex_toolkit as rtk
rtk.ord_to_cpoint(128054)
# Output: '0001F436'
# Disable zero-padding by setting `zfill` to `0` or `None`.
rtk.ord_to_cpoint(128054, zfill=0)
# Output: '1F436'
Arguments:
- ordinal (int) - Character ordinal.
- zfill (int | None, optional) - Number of characters to zero-pad the codepoint to. Defaults to 8.
Returns:
- str - Character codepoint.
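The conversion can be approximated with the standard library alone. The helper below, `ordinal_to_hex`, is a hypothetical stand-in written for illustration and is not the library's implementation. It assumes uppercase hexadecimal output and treats `zfill=None` the same as `0`, as the example above describes.

```python
from typing import Optional


def ordinal_to_hex(ordinal: int, *, zfill: Optional[int] = 8) -> str:
    """Hypothetical stand-in: format an ordinal as zero-padded uppercase hex."""
    # format(..., "X") yields uppercase hexadecimal digits without a prefix;
    # str.zfill(0) leaves the string unchanged.
    return format(ordinal, "X").zfill(zfill or 0)


padded = ordinal_to_hex(128054)
unpadded = ordinal_to_hex(128054, zfill=None)
```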
cpoint_to_ord
def cpoint_to_ord(cpoint: str) -> int
Character codepoint to character ordinal.
Example:
import regex_toolkit as rtk
rtk.cpoint_to_ord("0001F436")
# Output: 128054
rtk.cpoint_to_ord("1f436")
# Output: 128054
Arguments:
- cpoint (str) - Character codepoint.
Returns:
- int - Character ordinal.
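Going the other way is a base-16 parse. As a sketch using only built-ins (not the library's code), `int(..., 16)` accepts both padded and lowercase input, which is consistent with the two calls shown above, and `chr` recovers the character.

```python
# int() with base 16 ignores leading zeros and letter case.
padded_ordinal = int("0001F436", 16)
lowercase_ordinal = int("1f436", 16)

# chr() maps the ordinal back to its character.
character = chr(padded_ordinal)
```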
char_to_cpoint
def char_to_cpoint(char: str, *, zfill: int | None = 8) -> str
Character to character codepoint.
Produces a hexadecimal ([0-9A-F]) representation of the character.
By default the codepoint is zero-padded to 8 characters, which is the maximum number of characters in a codepoint.
Example:
import regex_toolkit as rtk
rtk.char_to_cpoint("🐶")
# Output: '0001F436'
# Disable zero-padding by setting `zfill` to `0` or `None`.
rtk.char_to_cpoint("🐶", zfill=0)
# Output: '1F436'
Arguments:
- char (str) - Character.
- zfill (int | None, optional) - Number of characters to zero-pad the codepoint to. Defaults to 8.
Returns:
- str - Character codepoint.
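For a single character, the same result can be reached by composing `ord` with hexadecimal formatting. This is an illustrative sketch with a hypothetical helper name, `char_to_hex`. Note that `ord` raises `TypeError` for strings longer than one character; how the library itself handles such input is not covered here.

```python
def char_to_hex(char: str, zfill: int = 8) -> str:
    """Hypothetical stand-in: character -> uppercase hex codepoint."""
    return format(ord(char), "X").zfill(zfill)


dog = char_to_hex("\N{DOG FACE}")  # U+1F436
letter = char_to_hex("a", zfill=0)

# ord() only accepts a single character.
try:
    char_to_hex("ab")
    multi_char_rejected = False
except TypeError:
    multi_char_rejected = True
```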
to_nfc
def to_nfc(text: str) -> str
Normalize a Unicode string to Normalization Form C (NFC).
Form C favors the use of fully combined (precomposed) characters.
Example:
import regex_toolkit as rtk
rtk.to_nfc("é")
# Output: 'é'
Arguments:
- text (str) - String to normalize.
Returns:
- str - Normalized string.
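Precomposed and decomposed forms render identically, so the effect of NFC is easier to see with explicit escapes. The sketch below calls the standard library's `unicodedata.normalize` directly, on the assumption that `to_nfc` behaves like a thin wrapper around it.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# The two strings look the same when printed but differ in length.
lengths = (len(decomposed), len(composed))
```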
iter_char_range
def iter_char_range(first_char: str, last_char: str) -> Generator[str, None, None]
Iterate all characters within a range of characters (inclusive).
Example:
import regex_toolkit as rtk
tuple(rtk.iter_char_range("a", "c"))
# Output: ('a', 'b', 'c')
tuple(rtk.iter_char_range("c", "a"))
# Output: ('c', 'b', 'a')
tuple(rtk.iter_char_range("🐶", "🐺"))
# Output: ('🐶', '🐷', '🐸', '🐹', '🐺')
Arguments:
- first_char (str) - Starting (first) character.
- last_char (str) - Ending (last) character.
Yields:
- str - Characters within a range of characters.
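One way to get this behavior, including the descending case, is to walk the ordinals with a step of +1 or -1. The generator below is a sketch under that assumption; `walk_chars` is a hypothetical name and not part of the library.

```python
from typing import Iterator


def walk_chars(first_char: str, last_char: str) -> Iterator[str]:
    """Yield every character from first_char to last_char, inclusive."""
    first, last = ord(first_char), ord(last_char)
    step = 1 if first <= last else -1
    # Extend the stop by one step so the last character is included.
    for ordinal in range(first, last + step, step):
        yield chr(ordinal)


ascending = tuple(walk_chars("a", "c"))
descending = tuple(walk_chars("c", "a"))
single = tuple(walk_chars("a", "a"))
```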
char_range
def char_range(first_char: str, last_char: str) -> tuple[str, ...]
Get all characters within a range of characters (inclusive).
Example:
import regex_toolkit as rtk
rtk.char_range("a", "d")
# Output: ('a', 'b', 'c', 'd')
rtk.char_range("d", "a")
# Output: ('d', 'c', 'b', 'a')
rtk.char_range("🐶", "🐺")
# Output: ('🐶', '🐷', '🐸', '🐹', '🐺')
Arguments:
- first_char (str) - First character (inclusive).
- last_char (str) - Last character (inclusive).
Returns:
- tuple[str, ...] - Characters within a range of characters.
mask_span
def mask_span(text: str, span: Sequence[int], mask: str | None = None) -> str
Slice and mask a string using a single span.
Example:
import regex_toolkit as rtk
rtk.mask_span("example", (0, 2))
# Output: 'ample'
rtk.mask_span("This is a example", (10, 10), "insert ")
# Output: 'This is a insert example'
rtk.mask_span("This is a example", (5, 7), "replaces part of")
# Output: 'This replaces part of a example'
Todo:
- Consider alternate behavior for a span that is out of bounds.
Arguments:
- text (str) - String to slice.
- span (Sequence[int]) - Span to slice (start is inclusive, end is exclusive).
- mask (str, optional) - String to replace the span with. Defaults to None.
Returns:
- str - String with span replaced with the mask text.
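The examples are consistent with plain slicing: keep everything before the span start, add the mask, then keep everything from the span end onward. A sketch under that assumption follows; `replace_span` is a hypothetical name, and a `None` mask is treated as an empty string.

```python
from typing import Optional, Sequence


def replace_span(text: str, span: Sequence[int], mask: Optional[str] = None) -> str:
    """Replace text[start:end] with mask (start inclusive, end exclusive)."""
    start, end = span
    return text[:start] + (mask or "") + text[end:]


removed = replace_span("example", (0, 2))
inserted = replace_span("This is a example", (10, 10), "insert ")
replaced = replace_span("This is a example", (5, 7), "replaces part of")
```

An empty span such as `(10, 10)` removes nothing, so the mask is inserted at that position.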
mask_spans
def mask_spans(text: str, spans: Sequence[Sequence[int]], masks: Sequence[str] | None = None) -> str
Slice and mask a string using multiple spans.
Example:
import regex_toolkit as rtk
rtk.mask_spans(
    text="This is a example",
    masks=["replaces part of", "insert "],
    spans=[(5, 7), (10, 10)],
)
# Output: 'This replaces part of a insert example'
Todo:
- Consider alternate behavior for spans that overlap.
- Consider alternate behavior for spans that are out of order.
- Consider alternate behavior for spans that are out of bounds.
Arguments:
- text (str) - String to slice.
- spans (Sequence[Sequence[int]]) - Spans to slice (start is inclusive, end is exclusive).
- masks (Sequence[str], optional) - Strings to replace the spans with. Defaults to None.
Returns:
- str - String with all spans replaced with the mask text.
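With several spans, each replacement shifts the offsets of everything to its right. A common way around this, sketched below, is to apply the replacements from right to left so that earlier offsets stay valid. `replace_spans` is a hypothetical helper, not the library's implementation, and it assumes the spans do not overlap.

```python
from typing import Optional, Sequence


def replace_spans(
    text: str,
    spans: Sequence[Sequence[int]],
    masks: Optional[Sequence[str]] = None,
) -> str:
    """Replace each span with its mask, assuming non-overlapping spans."""
    if masks is None:
        masks = [""] * len(spans)
    # Pair each span with its mask, then work from the rightmost span back
    # so that replacements never shift offsets that are still to be used.
    pairs = sorted(zip(spans, masks), key=lambda pair: tuple(pair[0]), reverse=True)
    for (start, end), mask in pairs:
        text = text[:start] + mask + text[end:]
    return text


result = replace_spans(
    "This is a example",
    spans=[(5, 7), (10, 10)],
    masks=["replaces part of", "insert "],
)
```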
regex_toolkit.base
escape
def escape(char: str, flavor: int | None = None) -> str
Create a regex expression that exactly matches a character.
Example:
import regex_toolkit as rtk
rtk.escape("a")
# Output: 'a'
rtk.escape(".")
# Output: '\.'
rtk.escape("/")
# Output: '/'
rtk.escape(".", flavor=2)
# Output: '\.'
rtk.escape("a", flavor=2)
# Output: 'a'
rtk.escape("/", flavor=2)
# Output: '\x{002f}'
Arguments:
- char (str) - Character to match.
- flavor (int | None, optional) - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
Returns:
- str - Expression that exactly matches the original character.
Raises:
- ValueError - Invalid regex flavor.
- TypeError - Invalid type for char.
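A quick way to check any escaping routine is to confirm that the escaped form matches the original character and nothing else. The sketch below does that with the standard library's `re.escape`, used here only as a stand-in; its output format may differ from `rtk.escape`, particularly for the RE2 flavor.

```python
import re

dot_pattern = re.escape(".")

# The escaped pattern matches a literal dot only.
matches_dot = re.fullmatch(dot_pattern, ".") is not None
matches_letter = re.fullmatch(dot_pattern, "a") is not None

# Without escaping, "." is a wildcard and also matches a letter.
unescaped_matches_letter = re.fullmatch(".", "a") is not None
```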
string_as_exp
def string_as_exp(text: str, flavor: int | None = None) -> str
Create a regex expression that exactly matches a string.
Example:
import regex_toolkit as rtk
rtk.string_as_exp("https://example.com")
# Output: 'https\:\/\/example\.com'
rtk.string_as_exp("https://example.com", flavor=2)
# Output: 'https\x{003a}\x{002f}\x{002f}example\.com'
Arguments:
- text (str) - String to match.
- flavor (int | None, optional) - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
Returns:
- str - Expression that exactly matches the original string.
Raises:
- ValueError - Invalid regex flavor.
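Escaping matters for whole strings because an unescaped `.` matches any character. As an illustration with the standard library's `re.escape` (a stand-in whose exact output may differ from `string_as_exp`), a raw URL used as a pattern accepts a look-alike string, while the escaped pattern does not.

```python
import re

url = "https://example.com"
lookalike = "https://exampleXcom"

# Used as a pattern without escaping, the "." in the URL is a wildcard.
raw_accepts_lookalike = re.fullmatch(url, lookalike) is not None

# After escaping, only the exact string matches.
escaped = re.escape(url)
escaped_accepts_lookalike = re.fullmatch(escaped, lookalike) is not None
escaped_accepts_url = re.fullmatch(escaped, url) is not None
```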
strings_as_exp
def strings_as_exp(texts: Iterable[str], flavor: int | None = None) -> str
Create a regex expression that exactly matches any one string.
Example:
import regex_toolkit as rtk
rtk.strings_as_exp(["apple", "banana", "cherry"])
# Output: 'banana|cherry|apple'
rtk.strings_as_exp(["apple", "banana", "cherry"], flavor=2)
# Output: 'banana|cherry|apple'
Arguments:
- texts (Iterable[str]) - Strings to match.
- flavor (int | None, optional) - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
Returns:
- str - Expression that exactly matches any one of the original strings.
Raises:
- ValueError - Invalid regex flavor.
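In a backtracking engine such as Python's `re`, alternation tries branches left to right, so a short string listed first can shadow a longer one that shares its prefix. The outputs above place longer strings first; the sketch below assumes that ordering is deliberate and reproduces it with a hypothetical helper, `any_of`, built on `re.escape` rather than on the library.

```python
import re
from typing import Iterable


def any_of(texts: Iterable[str]) -> str:
    """Join escaped strings with "|", longest first (stable for ties)."""
    ordered = sorted(texts, key=len, reverse=True)
    return "|".join(re.escape(text) for text in ordered)


naive = "app|apple"
longest_first = any_of(["app", "apple"])

# The naive order stops at "app"; longest-first captures the whole word.
naive_match = re.match(naive, "apple").group()
ordered_match = re.match(longest_first, "apple").group()

fruit = any_of(["apple", "banana", "cherry"])
```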
make_exp
def make_exp(chars: Iterable[str], flavor: int | None = None) -> str
Create a regex expression that exactly matches a list of characters.
The characters are sorted and grouped into ranges where possible. The expression is not anchored, so it can be used as part of a larger expression.
Example:
import regex_toolkit as rtk
"[" + rtk.make_exp(["a", "b", "c", "z", "y", "x"]) + "]"
# Output: '[a-cx-z]'
"[" + rtk.make_exp(["a", "b", "c", "z", "y", "x"], flavor=2) + "]"
# Output: '[a-cx-z]'
Arguments:
- chars (Iterable[str]) - Characters to match.
- flavor (int | None, optional) - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
Returns:
- str - Expression that exactly matches the original characters.
Raises:
- ValueError - Invalid regex flavor.
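The sort-and-group step can be sketched as follows: sort the distinct ordinals, collect runs of consecutive values, and write each run as a single character, a pair, or a `first-last` range. `class_body` is a hypothetical helper that uses `re.escape` for escaping, so it illustrates the idea for Python's `re` only and is not the library's implementation.

```python
import re
from typing import Iterable


def class_body(chars: Iterable[str]) -> str:
    """Collapse characters into a character-class body such as 'a-cx-z'."""
    ordinals = sorted({ord(char) for char in chars})
    if not ordinals:
        return ""

    # Collect (start, end) runs of consecutive ordinals.
    runs = []
    start = prev = ordinals[0]
    for ordinal in ordinals[1:]:
        if ordinal != prev + 1:
            runs.append((start, prev))
            start = ordinal
        prev = ordinal
    runs.append((start, prev))

    parts = []
    for low, high in runs:
        first, last = re.escape(chr(low)), re.escape(chr(high))
        if low == high:
            parts.append(first)
        elif high == low + 1:
            parts.append(first + last)  # a two-character run needs no dash
        else:
            parts.append(f"{first}-{last}")
    return "".join(parts)


body = class_body(["a", "b", "c", "z", "y", "x"])
pattern = re.compile("[" + body + "]")
```

Because the result has no surrounding brackets, it can be combined with other class members before being wrapped in `[...]`.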
Download files
File details
Details for the file regex_toolkit-0.1.0.tar.gz.
File metadata
- Download URL: regex_toolkit-0.1.0.tar.gz
- Upload date:
- Size: 55.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6ab98648d0fb9c38348eead028e5571f6cf9a61a5d6b6be39f9366ade4f0afd7
MD5 | df219d3743f0101ee4c3dbc25ac7e583
BLAKE2b-256 | cd4719198088934cd134cbac41a797f2f2d5a785d424676be750d220693bbfdf
File details
Details for the file regex_toolkit-0.1.0-py3-none-any.whl.
File metadata
- Download URL: regex_toolkit-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 88e15d38e621f337e1cb1c60923d3c03c0a8f9d310ab249ea49b134f44628d79
MD5 | 9f3e8fce86c02295c09132d819403dff
BLAKE2b-256 | 4ed7eae618d729db8cb2ade9f2b33e84d1b24441babc5ee469747adea5adae4f