Regex to C code generator
emmtrix Regex-to-C Code Generator (emx-regex-cgen) compiles regular expressions into portable, static C code for embedded and performance-critical applications.
Key Characteristics
- Two backend engines: choose between a table-driven DFA and a bit-parallel NFA (--engine dfa, --engine bitnfa, or the default --engine auto, which tries DFA first and falls back to bitnfa when the DFA state limit is exceeded).
- Table-driven DFA: every regex is compiled to a minimised deterministic finite automaton; the generated C code performs a single linear scan over the input.
- Bit-parallel NFA: the regex is compiled to a Thompson NFA and simulated with precomputed bitmasks; the hot loop is fully unrolled with no loops over words. The variant (uint8, uint16, uint32, or uint32_t[N]) is selected automatically based on the number of NFA positions.
- No dynamic memory allocation: all data is static const; no malloc, no free.
- Branch-free inner loop: the matching loop contains a single table lookup per byte (DFA) or unrolled bitwise operations (bitnfa).
- re2 feature set: supports the same subset of regular expressions as Google's re2 library (no back-references, no look-around).
- Fullmatch semantics: the generated function checks whether the entire input matches the pattern.
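Fullmatch semantics can be illustrated with Python's standard re module. This is only an analogy and not the engine behind the generated code; the helper names are made up for this sketch, and re.ASCII is added so that \d stays within 0-9.

```python
import re

# Illustrative pattern; re.ASCII restricts \d to the digits 0-9.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)


def is_full_match(text: str) -> bool:
    """True only when the whole string matches (fullmatch semantics)."""
    return DATE.fullmatch(text) is not None


def contains_match(text: str) -> bool:
    """Search semantics, shown for contrast only."""
    return DATE.search(text) is not None


print(is_full_match("2024-01-31"))      # True
print(is_full_match("on 2024-01-31"))   # False
print(contains_match("on 2024-01-31"))  # True
```

A matcher with fullmatch semantics rejects the second input because of the leading "on ", even though the date itself is present.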
Installation
pip install emx-regex-cgen
The distribution name is emx-regex-cgen. The Python import path is
emx_regex_cgen.
After installation, you can use the package either as a CLI tool via
emx-regex-cgen or as a Python library via
from emx_regex_cgen import generate.
Quick Start
Python Library
generate() returns a GeneratedCode object with four parts.
Call .render() to get the combined C source string, or access individual
fields to embed them in an existing code-generation pipeline.
from emx_regex_cgen import generate
# generate() returns a GeneratedCode object
result = generate(r"\d{4}-\d{2}-\d{2}")
# .render() assembles the complete C source as a single string
print(result.render())
# Access individual parts:
result.includes # ['stddef.h', 'stdbool.h', 'stdint.h']
result.globals # static const transition table (and optional maps)
result.match_function # bool regex_match(const char *input, size_t len) { … }
result.main_function # int main(…) { … } — None when emit_main=False
# Include a main() for standalone testing
result = generate(r"[a-z]+", emit_main=True)
# Bytes mode: '.' matches any single byte, classes work on raw byte values
result = generate(r"[\x80-\xff]+", encoding="bytes")
# Use the bit-parallel NFA backend instead of DFA
result = generate(r"hello", engine="bitnfa")
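The difference between the two encodings comes down to what counts as one character. The sketch below uses Python's re module on str and bytes inputs as a rough analogy; it does not call emx-regex-cgen, and how the generator handles UTF-8 internally is not shown here.

```python
import re

text = "é"                   # one Unicode code point
raw = text.encode("utf-8")   # two bytes: 0xC3 0xA9

# Unicode-aware view: '.' consumes the whole code point.
print(re.fullmatch(r".", text) is not None)              # True

# Raw byte view: '.' consumes exactly one byte, so two bytes do not match.
print(re.fullmatch(rb".", raw) is not None)              # False

# A byte class over 0x80-0xFF accepts both bytes of the encoded character.
print(re.fullmatch(rb"[\x80-\xff]+", raw) is not None)   # True
```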
generate() options
generate(
pattern,
flags="",
*,
emit_main=False,
prefix="regex",
encoding="utf8",
engine="auto",
row_dedup="auto",
alphabet_compression="auto",
size_threshold=8192,
early_exit=False,
)
Library arguments map directly to the generator settings:
| Argument | Type | Default | Meaning |
|---|---|---|---|
| pattern | str | - | Regular expression to compile. |
| flags | str | "" | Regex flags: i (ignore case), s (dot-all), m (multiline), x (verbose syntax accepted for compatibility). |
| emit_main | bool | False | Include a standalone main() function in the generated C code. |
| prefix | str | "regex" | Prefix for generated C identifiers such as regex_match. |
| encoding | str | "utf8" | "utf8" for Unicode-aware UTF-8 input, or "bytes" for raw byte semantics. |
| engine | str | "auto" | Backend engine: "auto" (DFA first, bitnfa fallback), "dfa", or "bitnfa". |
| row_dedup | str | "auto" | DFA only: "yes", "no", or "auto" for transition-row deduplication. |
| alphabet_compression | str | "auto" | DFA only: "yes", "no", or "auto" for byte equivalence-class compression. |
| size_threshold | int | 8192 | DFA only: threshold used by "auto" for row_dedup and alphabet_compression. |
| early_exit | bool | False | DFA only: break out of the loop once the dead state is reached. |
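The effect of early_exit can be pictured with a tiny hand-written DFA. The sketch below is not generator output: the automaton (for the pattern ab*), the state numbering with 0 as the dead state, and the step counter are all assumptions made for illustration.

```python
DEAD, START, ACCEPT = 0, 1, 2


def step(state: int, byte: int) -> int:
    """Transition function of a hand-built DFA for the pattern ab*."""
    if state == START and byte == ord("a"):
        return ACCEPT
    if state == ACCEPT and byte == ord("b"):
        return ACCEPT
    return DEAD  # everything else leads to the dead state, which is a sink


def match_counting(data: bytes, early_exit: bool) -> tuple[bool, int]:
    """Return (matched, number of bytes processed)."""
    state, steps = START, 0
    for byte in data:
        state = step(state, byte)
        steps += 1
        if early_exit and state == DEAD:
            break  # no later byte can leave the dead state
    return state == ACCEPT, steps


print(match_counting(b"xbbbb", early_exit=False))  # (False, 5)
print(match_counting(b"xbbbb", early_exit=True))   # (False, 1)
```

The result is the same either way; the option trades an extra comparison per byte for the chance to stop early on non-matching input.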
Example with multiple options:
from emx_regex_cgen import generate
result = generate(
r"[a-z]+\d+",
flags="i",
prefix="user_id",
emit_main=True,
engine="dfa",
encoding="utf8",
row_dedup="yes",
alphabet_compression="auto",
size_threshold=4096,
early_exit=True,
)
print(result.render())
CLI
# Write generated C code to stdout
emx-regex-cgen '[a-z]+\d+'
# Write to a file with a main() function
emx-regex-cgen '[a-z]+\d+' --emit-main -o matcher.c
# Compile and test
gcc -O2 -o matcher matcher.c
./matcher "hello42" # exit 0 (match)
./matcher "HELLO" # exit 1 (no match)
# Bytes mode: match any sequence of high bytes
emx-regex-cgen --encoding bytes '[\x80-\xff]+' --emit-main -o byte_matcher.c
# Use the bit-parallel NFA backend
emx-regex-cgen --engine bitnfa 'hello' --emit-main -o bitnfa_matcher.c
CLI Reference
usage: emx-regex-cgen [-h] [-o OUTPUT] [--emit-main] [--prefix PREFIX]
[--flags FLAGS] [--encoding {utf8,bytes}]
[--engine {auto,dfa,bitnfa}]
[--row-dedup {yes,no,auto}]
[--alphabet-compression {yes,no,auto}]
[--size-threshold SIZE_THRESHOLD]
[--early-exit {yes,no}] pattern
Generate C code that performs a fullmatch for a regular expression.
positional arguments:
pattern Regular expression pattern
options:
-o, --output Output file (default: stdout)
--emit-main Also emit a main() function (exit 0=match, 1=no match, 2=error)
--prefix PREFIX Prefix for all generated C identifiers (default: regex)
--flags FLAGS Regex flags: i (case-insensitive), s (dot-all), m (multiline)
--encoding {utf8,bytes}
Input encoding: utf8 (default, Unicode-aware) or bytes
(raw byte semantics)
--engine {auto,dfa,bitnfa}
Backend engine: auto (try DFA first, fall back to bitnfa; default),
dfa (table-driven minimised DFA),
bitnfa (bit-parallel NFA)
--row-dedup {yes,no,auto}
Transition-row deduplication: yes (always), no (never),
auto (when table exceeds --size-threshold; default). DFA only.
--alphabet-compression {yes,no,auto}
Alphabet compression into equivalence classes: yes (always),
no (never), auto (when table exceeds --size-threshold; default).
DFA only.
--size-threshold N Table-size threshold (cells = states × 256) for auto mode
(default: 8192). DFA only.
--early-exit {yes,no}
Emit early-exit check in DFA loop: yes (break when dead state
is reached), no (always process full input; default). DFA only.
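The auto setting for --row-dedup and --alphabet-compression depends on the table size in cells (states × 256). A minimal sketch of that decision follows; the function names are hypothetical, and reading "exceeds" as a strict greater-than comparison is an assumption.

```python
BYTE_VALUES = 256  # one column per possible input byte


def table_cells(num_states: int) -> int:
    """Size of the uncompressed transition table in cells."""
    return num_states * BYTE_VALUES


def optimisation_enabled(mode: str, num_states: int, size_threshold: int = 8192) -> bool:
    """Resolve a yes/no/auto setting for a DFA with num_states states."""
    if mode == "yes":
        return True
    if mode == "no":
        return False
    if mode == "auto":
        # Assumption: "exceeds" is taken to mean strictly greater than.
        return table_cells(num_states) > size_threshold
    raise ValueError(f"unknown mode: {mode!r}")


print(table_cells(32))                     # 8192
print(optimisation_enabled("auto", 32))    # False
print(optimisation_enabled("auto", 33))    # True
```

Under this reading, the default threshold of 8192 corresponds to a DFA with more than 32 states.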
Supported Features
Regex Features
CLI Options
| Option | Effect | Golden reference |
|---|---|---|
| --flags i | Case-insensitive matching | flag_ignorecase.c · bitnfa |
| --flags s | Dot matches \n (dot-all) | flag_dotall.c · bitnfa |
| --flags m | Multiline anchors | flag_multiline.c · bitnfa |
| --flags x | Verbose / free-spacing mode | flag_verbose.c · bitnfa |
| --encoding bytes | Raw byte semantics (no UTF-8) | encoding_bytes.c · bitnfa |
| --prefix NAME | Custom identifier prefix | prefix.c · bitnfa |
| --emit-main | Include standalone main() | emit_main.c · bitnfa |
| --alphabet-compression yes | Byte equivalence-class compression | alphabet_compression.c |
| --row-dedup yes | Transition-row deduplication | row_dedup.c |
| --early-exit yes | Break DFA loop on dead state (early exit) | early_exit.c |
Bit-NFA Variants
The bitnfa engine automatically selects the narrowest integer type
that can hold all NFA positions:
| Variant | NFA positions | Golden reference |
|---|---|---|
| uint8_t | ≤ 8 | bitnfa_uint8.c |
| uint16_t | ≤ 16 | bitnfa_uint16.c |
| uint32_t | ≤ 32 | bitnfa_uint32.c |
| uint32_t[N] | > 32 | bitnfa_uint32_array.c |
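The selection rule in this table can be written as a small function. This is a sketch, not the generator's code: the function name is hypothetical, and computing N as the number of 32-bit words needed to hold all positions is an assumption.

```python
def select_variant(num_positions: int) -> str:
    """Pick the narrowest state type that holds one bit per NFA position."""
    if num_positions < 1:
        raise ValueError("an NFA needs at least one position")
    for width in (8, 16, 32):
        if num_positions <= width:
            return f"uint{width}_t"
    # Assumption: N is the position count rounded up to whole 32-bit words.
    words = -(-num_positions // 32)
    return f"uint32_t[{words}]"


print(select_variant(10))  # uint16_t
print(select_variant(33))  # uint32_t[2]
```

Ten positions select uint16_t, which agrees with the regex_trans[10][256] table of type uint16_t in the Bit-NFA example output.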
Generated Code Structure
DFA Engine (--engine dfa, tried first by the default --engine auto)
The generator produces:
- An optional alphabet map (static const uint8_t regex_alphabet[256]) mapping each byte to its equivalence class (emitted when alphabet compression is active).
- A transition table (static const uint8_t regex_transitions[M][C]) mapping (row, column) → next_state. Dimensions depend on active optimisations: M is the number of unique rows (with row dedup) or states; C is the number of equivalence classes (with alphabet compression) or 256.
- An optional row map (static const uint8_t regex_row_map[N]) mapping state → row index (emitted when row deduplication is active).
- A match function with the signature bool regex_match(const char *input, size_t len);
- Optionally a main() function for standalone executables.
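How an alphabet map and a row map could be derived from a full states × 256 table is sketched below. The construction is an assumption for illustration (bytes with identical columns share a class, states with identical rows share a row) and is not taken from the generator's source; the hand-built DFA for hello and all function names are hypothetical.

```python
def build_full_table(word: bytes) -> list[list[int]]:
    """Full DFA table for a literal: 0 = dead, 1 = start, last state = accept."""
    table = [[0] * 256 for _ in range(len(word) + 2)]
    for i, byte in enumerate(word):
        table[i + 1][byte] = i + 2
    return table


def compress_alphabet(table):
    """Group bytes whose columns are identical into equivalence classes."""
    class_of_column = {}
    alphabet = []
    for byte in range(256):
        column = tuple(row[byte] for row in table)
        alphabet.append(class_of_column.setdefault(column, len(class_of_column)))
    compact = [[0] * len(class_of_column) for _ in table]
    for column, cls in class_of_column.items():
        for state, next_state in enumerate(column):
            compact[state][cls] = next_state
    return alphabet, compact


def dedup_rows(table):
    """Store each distinct row once and map every state to its row index."""
    index_of_row, row_map, unique = {}, [], []
    for row in table:
        key = tuple(row)
        if key not in index_of_row:
            index_of_row[key] = len(unique)
            unique.append(list(row))
        row_map.append(index_of_row[key])
    return row_map, unique


alphabet, compact = compress_alphabet(build_full_table(b"hello"))
row_map, unique = dedup_rows(compact)
print(len(unique), len(unique[0]))  # 6 5
print(row_map)                      # [0, 1, 2, 3, 4, 5, 0]
```

For this literal the 7 × 256 table shrinks to 6 rows of 5 classes, because the dead state and the accepting state both have an all-zero row.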
Example Output (pattern hello, --alphabet-compression yes --row-dedup yes)
static const uint8_t regex_alphabet[256] = { /* byte → class */ };
static const uint8_t regex_transitions[6][5] = {
/* states 0, 6 */ { 0 },
/* state 1 */ { [2] = 4 },
...
};
static const uint8_t regex_row_map[7] = { 0, 1, 2, 3, 4, 5, 0 };
bool regex_match(const char *input, size_t len) {
uint8_t state = 1;
for (size_t i = 0; i < len; i++) {
state = regex_transitions[regex_row_map[state]][regex_alphabet[(unsigned char)input[i]]];
}
return state >= 6;
}
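The same scan can be traced in Python. The tables below are hand-written for the literal hello and use their own state and class numbering, so they are not the generator's actual output; only the loop structure mirrors the C function above.

```python
# Byte classes (assumed numbering): 0 = any other byte, then e, h, l, o.
ALPHABET = [0] * 256
for cls, ch in enumerate("ehlo", start=1):
    ALPHABET[ord(ch)] = cls

# One row per unique transition row; columns are byte classes.
TRANSITIONS = [
    [0, 0, 0, 0, 0],  # states 0 (dead) and 6 (accept): no way forward
    [0, 0, 2, 0, 0],  # state 1: 'h' -> 2
    [0, 3, 0, 0, 0],  # state 2: 'e' -> 3
    [0, 0, 0, 4, 0],  # state 3: 'l' -> 4
    [0, 0, 0, 5, 0],  # state 4: 'l' -> 5
    [0, 0, 0, 0, 6],  # state 5: 'o' -> 6
]
ROW_MAP = [0, 1, 2, 3, 4, 5, 0]  # state -> row index


def dfa_match(data: bytes) -> bool:
    """One table lookup per input byte, then a check for the accepting state."""
    state = 1
    for byte in data:
        state = TRANSITIONS[ROW_MAP[state]][ALPHABET[byte]]
    return state >= 6


print(dfa_match(b"hello"))   # True
print(dfa_match(b"helloo"))  # False
```

The trailing o in the second input sends the accepting state to the dead state, which is what enforces fullmatch semantics.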
Bit-NFA Engine (--engine bitnfa)
The generator produces:
- A transition mask table (static const uint16_t regex_trans[P][256]), where P is the number of NFA positions. Each entry is a bitmask of destination positions for the given (source position, input byte) pair. The integer width (uint8_t/uint16_t/uint32_t/uint32_t[N]) is chosen automatically.
- A match function with a fully unrolled inner loop: one if (state & bit) next |= table[pos][b]; per active position.
- Optionally a main() function for standalone executables.
Example Output (pattern hello, --engine bitnfa)
static const uint16_t regex_trans[10][256] = {
/* position 0 */ { ['h'] = 0x0006u },
/* position 1 */ { 0 },
/* position 2 */ { ['e'] = 0x0018u },
...
};
bool regex_match(const char *input, size_t len) {
uint16_t state = 0x0001u;
for (size_t i = 0; i < len; i++) {
unsigned char b = (unsigned char)input[i];
uint16_t next = 0;
if (state & 0x0001u) next |= regex_trans[0][b];
if (state & 0x0004u) next |= regex_trans[2][b];
if (state & 0x0010u) next |= regex_trans[4][b];
if (state & 0x0040u) next |= regex_trans[6][b];
if (state & 0x0100u) next |= regex_trans[8][b];
state = next;
}
return (state & 0x0200u) != 0;
}
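A bit-parallel simulation is easier to follow when several positions are live at once, which never happens for a literal such as hello. The sketch below therefore uses a hand-built six-position automaton for (a|b)*abb; the position numbering and masks are assumptions for illustration, not generator output, and the inner loop is written as a loop where the generated C unrolls it.

```python
NUM_POSITIONS = 6           # 0 = start, 1-2 = loop (a|b)*, 3-5 = suffix abb
A, B = ord("a"), ord("b")

# TRANS[pos][byte] is a bitmask of destination positions.
TRANS = [[0] * 256 for _ in range(NUM_POSITIONS)]
for pos in (0, 1, 2):       # start and loop positions behave alike
    TRANS[pos][A] = (1 << 1) | (1 << 3)  # stay in the loop or begin the suffix
    TRANS[pos][B] = 1 << 2               # stay in the loop
TRANS[3][B] = 1 << 4
TRANS[4][B] = 1 << 5
ACCEPT_MASK = 1 << 5


def bitnfa_match(data: bytes) -> bool:
    """Advance every live position at once using OR-ed transition masks."""
    state = 1 << 0
    for byte in data:
        nxt = 0
        for pos in range(NUM_POSITIONS):
            if state & (1 << pos):
                nxt |= TRANS[pos][byte]
        state = nxt
    return (state & ACCEPT_MASK) != 0


print(bitnfa_match(b"abb"))    # True
print(bitnfa_match(b"babb"))   # True
print(bitnfa_match(b"abba"))   # False
```

After reading ab, positions 2 and 4 are both set, so the automaton tracks "still in the loop" and "two characters into the suffix" in a single integer.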
Testing
Tests are parameterised from re2_compat_results.json, which contains
2 500+ patterns extracted from the PCRE2 test suite and validated against
Google re2.
# Run all tests (parallel)
pytest -n auto -q
# Run linter
ruff check src/ tests/
Test Strategy
- Generate C code from the regex pattern.
- Compile with gcc -O2.
- Execute the binary with each test subject as argv[1].
- Compare the exit code against the expected match/no-match result.
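The compile, execute, and compare steps above can be sketched as follows. This is not the project's test harness: the function names are hypothetical, gcc is assumed to be on the PATH, and the exit codes follow the --emit-main description (0 = match, 1 = no match, 2 = error).

```python
import subprocess
from pathlib import Path

EXIT_MATCH, EXIT_NO_MATCH = 0, 1


def compile_and_run(c_file: Path, subject: str) -> int:
    """Compile a generated C file with gcc -O2 and run it on one subject."""
    binary = c_file.with_suffix("").resolve()
    subprocess.run(["gcc", "-O2", "-o", str(binary), str(c_file)], check=True)
    return subprocess.run([str(binary), subject]).returncode


def verdict(exit_code: int, expected_match: bool) -> bool:
    """True when the observed exit code agrees with the expected result."""
    if exit_code == EXIT_MATCH:
        matched = True
    elif exit_code == EXIT_NO_MATCH:
        matched = False
    else:
        raise RuntimeError(f"matcher reported an error (exit code {exit_code})")
    return matched == expected_match


# Example: an exit code of 0 for a subject that is expected to match.
print(verdict(0, expected_match=True))  # True
```

A real suite would avoid recompiling for every subject; the sketch keeps both steps in one function only for brevity.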
Development
# Clone
git clone --recurse-submodules https://github.com/emmtrix/emx-regex-cgen.git
cd emx-regex-cgen
# Install
pip install emx-regex-cgen
# Or, for local development
pip install -e ".[dev]"
# Test
pytest -n auto -q
# Lint
ruff check src/ tests/
License
MIT — Copyright (c) 2026 emmtrix Technologies GmbH