emx-regex-cgen

Regex to C code generator

emmtrix Regex-to-C Code Generator (emx-regex-cgen) compiles regular expressions into portable, static C code for embedded and performance-critical applications.

Key Characteristics

Two backend engines: choose between a table-driven DFA (--engine dfa) and a bit-parallel NFA (--engine bitnfa). The default, --engine auto, tries the DFA first and falls back to bitnfa when the DFA state limit is exceeded.
Table-driven DFA: the regex is compiled to a minimised deterministic finite automaton; the generated C code performs a single linear scan over the input.
Bit-parallel NFA: the regex is compiled to a Thompson NFA and simulated with precomputed bitmasks; the per-byte update is fully unrolled, so there is no inner loop over state words. The state type (uint8_t, uint16_t, uint32_t, or uint32_t[N]) is selected automatically from the number of NFA positions.
No dynamic memory allocation: all data is static const; no malloc, no free.
Branch-free inner loop: per input byte, the matching loop performs a single transition-table lookup (DFA) or a fixed sequence of unrolled bitwise operations (bitnfa).
re2 feature set: supports the same subset of regular expressions as Google's re2 library (no back-references, no look-around).
Fullmatch semantics: the generated function checks whether the entire input matches the pattern, not whether the pattern occurs somewhere inside it.
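
To make the fullmatch behaviour concrete, the following sketch contrasts it with substring search. It uses Python's standard re module purely as a stand-in for illustration; the generated C code is not involved, and the helper names are made up for this example.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"


def is_fullmatch(pattern: str, text: str) -> bool:
    """True only if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def is_found(pattern: str, text: str) -> bool:
    """True if the pattern occurs anywhere in the text (search semantics)."""
    return re.search(pattern, text) is not None


print(is_fullmatch(DATE, "2024-01-31"))     # whole input is a date
print(is_fullmatch(DATE, "on 2024-01-31"))  # extra prefix, so no fullmatch
print(is_found(DATE, "on 2024-01-31"))      # search would still find it
```

A generated matcher corresponds to the first helper: any leading or trailing bytes that the pattern does not cover make the match fail.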

Installation

pip install emx-regex-cgen

The distribution name is emx-regex-cgen. The Python import path is emx_regex_cgen.

After installation, you can use the package either as a CLI tool via emx-regex-cgen or as a Python library via from emx_regex_cgen import generate.

Quick Start

Python Library

generate() returns a GeneratedCode object with four parts. Call .render() to get the combined C source string, or access individual fields to embed them in an existing code-generation pipeline.

from emx_regex_cgen import generate

# generate() returns a GeneratedCode object
result = generate(r"\d{4}-\d{2}-\d{2}")

# .render() assembles the complete C source as a single string
print(result.render())

# Access individual parts:
result.includes        # ['stddef.h', 'stdbool.h', 'stdint.h']
result.globals         # static const transition table (and optional maps)
result.match_function  # bool regex_match(const char *input, size_t len) { … }
result.main_function   # int main(…) { … }  — None when emit_main=False

# Include a main() for standalone testing
result = generate(r"[a-z]+", emit_main=True)

# Bytes mode: '.' matches any single byte, classes work on raw byte values
result = generate(r"[\x80-\xff]+", encoding="bytes")

# Use the bit-parallel NFA backend instead of DFA
result = generate(r"hello", engine="bitnfa")
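
When the individual parts are embedded in another code generator, they have to be joined in the right order. The sketch below shows one way this might look. It assumes that includes is a list of bare header names (as in the comment above) and that the other three fields are plain strings; the README does not state their types, so treat this as an assumption. A SimpleNamespace with placeholder C snippets stands in for a real GeneratedCode object, and assemble is a hypothetical helper, not part of the library.

```python
from types import SimpleNamespace


def assemble(parts) -> str:
    """Join the four parts into one C translation unit."""
    include_lines = "\n".join(f"#include <{header}>" for header in parts.includes)
    sections = [include_lines, parts.globals, parts.match_function]
    if parts.main_function is not None:  # None when emit_main=False
        sections.append(parts.main_function)
    return "\n\n".join(sections) + "\n"


# Stand-in for a GeneratedCode object; the C snippets are placeholders only.
fake = SimpleNamespace(
    includes=["stddef.h", "stdbool.h", "stdint.h"],
    globals="static const uint8_t regex_transitions[1][256] = { { 0 } };",
    match_function=(
        "bool regex_match(const char *input, size_t len) "
        "{ (void)input; return len == 0; }"
    ),
    main_function=None,
)

unit = assemble(fake)
print(unit)
```

For the common case, result.render() already produces the combined source, so a helper like this is only needed when the parts go to different places in a larger file.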

`generate()` options

generate(
    pattern,
    flags="",
    *,
    emit_main=False,
    prefix="regex",
    encoding="utf8",
    engine="auto",
    row_dedup="auto",
    alphabet_compression="auto",
    size_threshold=8192,
    early_exit=False,
)

Library arguments map directly to the generator settings:

Argument	Type	Default	Meaning
`pattern`	`str`	-	Regular expression to compile.
`flags`	`str`	`""`	Regex flags: `i` (ignore case), `s` (dot-all), `m` (multiline), `x` (verbose syntax accepted for compatibility).
`emit_main`	`bool`	`False`	Include a standalone `main()` function in the generated C code.
`prefix`	`str`	`"regex"`	Prefix for generated C identifiers such as `regex_match`.
`encoding`	`str`	`"utf8"`	`"utf8"` for Unicode-aware UTF-8 input, or `"bytes"` for raw byte semantics.
`engine`	`str`	`"auto"`	Backend engine: `"auto"` (DFA first, bitnfa fallback), `"dfa"`, or `"bitnfa"`.
`row_dedup`	`str`	`"auto"`	DFA only: `"yes"`, `"no"`, or `"auto"` for transition-row deduplication.
`alphabet_compression`	`str`	`"auto"`	DFA only: `"yes"`, `"no"`, or `"auto"` for byte equivalence-class compression.
`size_threshold`	`int`	`8192`	DFA only: threshold used by `"auto"` for `row_dedup` and `alphabet_compression`.
`early_exit`	`bool`	`False`	DFA only: break out of the loop once the dead state is reached.

Example with multiple options:

from emx_regex_cgen import generate

result = generate(
    r"[a-z]+\d+",
    flags="i",
    prefix="user_id",
    emit_main=True,
    engine="dfa",
    encoding="utf8",
    row_dedup="yes",
    alphabet_compression="auto",
    size_threshold=4096,
    early_exit=True,
)

print(result.render())

CLI

# Write generated C code to stdout
emx-regex-cgen '[a-z]+\d+'

# Write to a file with a main() function
emx-regex-cgen '[a-z]+\d+' --emit-main -o matcher.c

# Compile and test
gcc -O2 -o matcher matcher.c
./matcher "hello42"   # exit 0 (match)
./matcher "HELLO"     # exit 1 (no match)

# Bytes mode: match any sequence of high bytes
emx-regex-cgen --encoding bytes '[\x80-\xff]+' --emit-main -o byte_matcher.c

# Use the bit-parallel NFA backend
emx-regex-cgen --engine bitnfa 'hello' --emit-main -o bitnfa_matcher.c

CLI Reference

usage: emx-regex-cgen [-h] [-o OUTPUT] [--emit-main] [--prefix PREFIX]
                  [--flags FLAGS] [--encoding {utf8,bytes}]
                  [--engine {auto,dfa,bitnfa}]
                  [--row-dedup {yes,no,auto}]
                  [--alphabet-compression {yes,no,auto}]
                  [--size-threshold SIZE_THRESHOLD]
                  [--early-exit {yes,no}] pattern

Generate C code that performs a fullmatch for a regular expression.

positional arguments:
  pattern               Regular expression pattern

options:
  -o, --output          Output file (default: stdout)
  --emit-main           Also emit a main() function (exit 0=match, 1=no match, 2=error)
  --prefix PREFIX       Prefix for all generated C identifiers (default: regex)
  --flags FLAGS         Regex flags: i (case-insensitive), s (dot-all), m (multiline)
  --encoding {utf8,bytes}
                        Input encoding: utf8 (default, Unicode-aware) or bytes
                        (raw byte semantics)
  --engine {auto,dfa,bitnfa}
                        Backend engine: auto (try DFA first, fall back to bitnfa; default),
                        dfa (table-driven minimised DFA),
                        bitnfa (bit-parallel NFA)
  --row-dedup {yes,no,auto}
                        Transition-row deduplication: yes (always), no (never),
                        auto (when table exceeds --size-threshold; default). DFA only.
  --alphabet-compression {yes,no,auto}
                        Alphabet compression into equivalence classes: yes (always),
                        no (never), auto (when table exceeds --size-threshold; default).
                        DFA only.
  --size-threshold N    Table-size threshold (cells = states × 256) for auto mode
                        (default: 8192). DFA only.
  --early-exit {yes,no}
                        Emit early-exit check in DFA loop: yes (break when dead state
                        is reached), no (always process full input; default). DFA only.
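
The auto rule for --row-dedup and --alphabet-compression can be sketched as a size check. The help text defines the table size as cells = states × 256 and says auto applies when the table exceeds the threshold; the function names below and the strict "greater than" comparison are assumptions for illustration, not the generator's actual code.

```python
def table_cells(num_states: int) -> int:
    """Size of an uncompressed transition table: one cell per (state, byte)."""
    return num_states * 256


def auto_applies(num_states: int, size_threshold: int = 8192) -> bool:
    """Assumed auto rule: optimise only when the table exceeds the threshold."""
    return table_cells(num_states) > size_threshold


# With the default threshold of 8192 cells, the cut-over is at 32 states.
print(auto_applies(32))  # 32 * 256 = 8192 cells, does not exceed 8192
print(auto_applies(33))  # 33 * 256 = 8448 cells, exceeds 8192
```

Under this reading, lowering --size-threshold makes the optimisations kick in for smaller automata, while yes and no bypass the check entirely.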

Supported Features

Regex Features

Feature	Example pattern	DFA	bitnfa
Literal string	`hello`	literal.c	literal_bitnfa.c
Character class	`[a-z0-9_]+`	char_class.c	char_class_bitnfa.c
Negated class	`[^aeiou]+`	negated_class.c	negated_class_bitnfa.c
Dot (any char except `\n`)	`.+`	dot.c	dot_bitnfa.c
Alternation	`cat|dog|fish`	alternation.c	alternation_bitnfa.c
Star quantifier `*`	`ab*c`	quantifier_star.c	quantifier_star_bitnfa.c
Plus quantifier `+`	`ab+c`	quantifier_plus.c	quantifier_plus_bitnfa.c
Optional quantifier `?`	`colou?r`	quantifier_optional.c	quantifier_optional_bitnfa.c
Bounded repeat `{m,n}`	`a{2,4}`	quantifier_repeat.c	quantifier_repeat_bitnfa.c
Digit escape `\d`	`\d{4}-\d{2}-\d{2}`	escape_digit.c	escape_digit_bitnfa.c
Word escape `\w`	`\w+`	escape_word.c	escape_word_bitnfa.c
Space escape `\s`	`\s+`	escape_space.c	escape_space_bitnfa.c
Unicode / UTF-8	`\x{00e9}+`	unicode.c	unicode_bitnfa.c
Anchors `^` / `$`	`^start.*end$`	anchors.c	anchors_bitnfa.c
Word boundary `\b` / `\B`	`\bword\b`	word_boundary.c	word_boundary_bitnfa.c
Unicode property `\p{…}` / `\P{…}`	`\p{Nd}+`	unicode_property.c	unicode_property_bitnfa.c

CLI Options

Option	Effect	Golden reference
`--flags i`	Case-insensitive matching	flag_ignorecase.c · bitnfa
`--flags s`	Dot matches `\n` (dot-all)	flag_dotall.c · bitnfa
`--flags m`	Multiline anchors	flag_multiline.c · bitnfa
`--flags x`	Verbose / free-spacing mode	flag_verbose.c · bitnfa
`--encoding bytes`	Raw byte semantics (no UTF-8)	encoding_bytes.c · bitnfa
`--prefix NAME`	Custom identifier prefix	prefix.c · bitnfa
`--emit-main`	Include standalone `main()`	emit_main.c · bitnfa
`--alphabet-compression yes`	Byte equivalence-class compression	alphabet_compression.c
`--row-dedup yes`	Transition-row deduplication	row_dedup.c
`--early-exit yes`	Break DFA loop on dead state (early exit)	early_exit.c

Bit-NFA Variants

The bitnfa engine automatically selects the narrowest integer type that can hold all NFA positions:

Variant	NFA positions	Golden reference
`uint8_t`	≤ 8	bitnfa_uint8.c
`uint16_t`	≤ 16	bitnfa_uint16.c
`uint32_t`	≤ 32	bitnfa_uint32.c
`uint32_t[N]`	> 32	bitnfa_uint32_array.c
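
The selection rule in the table can be written down as a small function. This is a sketch of the rule as tabulated, not the generator's code; in particular, the array length N is assumed here to be the number of 32-bit words needed to hold all positions, which the table does not spell out.

```python
def bitnfa_state_type(positions: int) -> str:
    """Pick the narrowest state type for a given number of NFA positions."""
    if positions < 1:
        raise ValueError("an NFA has at least one position")
    for bits in (8, 16, 32):
        if positions <= bits:
            return f"uint{bits}_t"
    words = (positions + 31) // 32  # assumption: N = ceil(positions / 32)
    return f"uint32_t[{words}]"


for n in (8, 10, 32, 33):
    print(n, bitnfa_state_type(n))
```

The 10-position case is the one that applies to the hello example in the Bit-NFA Engine section, which uses uint16_t.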

Generated Code Structure

DFA Engine (default)

The generator produces:

An optional alphabet map (static const uint8_t regex_alphabet[256]) mapping each byte to its equivalence class (emitted when alphabet compression is active).
A transition table (static const uint8_t regex_transitions[M][C]) mapping (row, column) → next_state. Dimensions depend on active optimisations: M is the number of unique rows (with row dedup) or states; C is the number of equivalence classes (with alphabet compression) or 256.
An optional row map (static const uint8_t regex_row_map[N]) mapping state → row index (emitted when row deduplication is active).

A match function with the signature:

bool regex_match(const char *input, size_t len);

Optionally a main() function for standalone executables.

Example Output (pattern `hello`, `--alphabet-compression yes --row-dedup yes`)

static const uint8_t regex_alphabet[256] = { /* byte → class */ };

static const uint8_t regex_transitions[6][5] = {
    /* states 0, 6 */ { 0 },
    /* state 1 */     { [2] = 4 },
    ...
};

static const uint8_t regex_row_map[7] = { 0, 1, 2, 3, 4, 5, 0 };

bool regex_match(const char *input, size_t len) {
    uint8_t state = 1;
    for (size_t i = 0; i < len; i++) {
        state = regex_transitions[regex_row_map[state]][regex_alphabet[(unsigned char)input[i]]];
    }
    return state >= 6;
}
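
The lookup chain in the loop above can be modelled in a few lines of Python for experimentation. This is a hand-written sketch, not generator output: the state numbering and the class assignment (h=1, e=2, l=3, o=4, everything else 0) are assumptions chosen to fit the dimensions shown above (6 rows, 5 classes, 7 states), and the early_exit parameter mimics what --early-exit yes is described as doing.

```python
# Alphabet map: byte -> equivalence class (0 = any byte not in the pattern).
ALPHABET = [0] * 256
for cls, byte in enumerate(b"helo", start=1):
    ALPHABET[byte] = cls

DEAD, START, ACCEPT = 0, 1, 6

# One row per unique transition row; columns are equivalence classes.
TRANSITIONS = [
    [0, 0, 0, 0, 0],  # row 0: shared by dead state 0 and accepting state 6
    [0, 2, 0, 0, 0],  # row 1: state 1 --h--> 2
    [0, 0, 3, 0, 0],  # row 2: state 2 --e--> 3
    [0, 0, 0, 4, 0],  # row 3: state 3 --l--> 4
    [0, 0, 0, 5, 0],  # row 4: state 4 --l--> 5
    [0, 0, 0, 0, 6],  # row 5: state 5 --o--> 6
]

# Row map: state -> row index (states 0 and 6 share row 0).
ROW_MAP = [0, 1, 2, 3, 4, 5, 0]


def regex_match(data: bytes, early_exit: bool = False) -> bool:
    state = START
    for byte in data:
        state = TRANSITIONS[ROW_MAP[state]][ALPHABET[byte]]
        if early_exit and state == DEAD:
            break  # no suffix can recover from the dead state
    return state >= ACCEPT


print(regex_match(b"hello"), regex_match(b"hellox"), regex_match(b"hell"))
```

Because the accepting state has no outgoing transitions, it can share the all-zero row with the dead state, which is why 7 states need only 6 rows.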

Bit-NFA Engine (`--engine bitnfa`)

The generator produces:

A transition mask table (static const uint16_t regex_trans[P][256]) where P is the number of NFA positions. Each entry is a bitmask of destination positions for the given (source position, input byte) pair. The integer width (uint8_t / uint16_t / uint32_t / uint32_t[N]) is chosen automatically.
A match function with a fully unrolled inner loop — one if (state & bit) next |= table[pos][b]; per active position.
Optionally a main() function for standalone executables.

Example Output (pattern `hello`, `--engine bitnfa`)

static const uint16_t regex_trans[10][256] = {
    /* position 0 */ { ['h'] = 0x0006u },
    /* position 1 */ { 0 },
    /* position 2 */ { ['e'] = 0x0018u },
    ...
};

bool regex_match(const char *input, size_t len) {
    uint16_t state = 0x0001u;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)input[i];
        uint16_t next = 0;
        if (state & 0x0001u) next |= regex_trans[0][b];
        if (state & 0x0004u) next |= regex_trans[2][b];
        if (state & 0x0010u) next |= regex_trans[4][b];
        if (state & 0x0040u) next |= regex_trans[6][b];
        if (state & 0x0100u) next |= regex_trans[8][b];
        state = next;
    }
    return (state & 0x0200u) != 0;
}
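
The same bit-parallel idea can be tried out in Python. The sketch below is a simplified model, not generator output: it uses one position per pattern character plus an accept position (6 positions for hello), whereas the generated code above uses a 10-position Thompson layout. The inner for loop stands in for the unrolled if statements.

```python
PATTERN = b"hello"
ACCEPT_BIT = 1 << len(PATTERN)  # bit set once every character has matched

# TRANS[pos][byte] is the bitmask of positions reachable from pos on byte.
TRANS = [[0] * 256 for _ in PATTERN]
for pos, byte in enumerate(PATTERN):
    TRANS[pos][byte] = 1 << (pos + 1)


def regex_match(data: bytes) -> bool:
    state = 1  # bit 0: the start position is active
    for byte in data:
        nxt = 0
        for pos in range(len(PATTERN)):  # unrolled in the generated C
            if state & (1 << pos):
                nxt |= TRANS[pos][byte]
        state = nxt
    return (state & ACCEPT_BIT) != 0


print(regex_match(b"hello"), regex_match(b"hell"), regex_match(b"helloo"))
```

Each bit of state marks one active NFA position, so several alternatives can be tracked at once without backtracking; for a plain literal only one bit is ever set.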

Testing

Tests are parameterised from re2_compat_results.json, which contains 2,500+ patterns extracted from the PCRE2 test suite and validated against Google re2.

# Run all tests (parallel)
pytest -n auto -q

# Run linter
ruff check src/ tests/

Test Strategy

Generate C code from the regex pattern.
Compile with gcc -O2.
Execute the binary with each test subject as argv[1].
Compare the exit code against the expected match/no-match result.
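
The four steps above can be sketched as a small harness. This is an illustrative outline, not the project's actual test code: the helper names are made up, gcc is assumed to be on the PATH, and the C source is assumed to come from generate(pattern, emit_main=True).render(). The exit-code mapping follows the --emit-main help text (0 = match, 1 = no match, 2 = error).

```python
import subprocess
from pathlib import Path


def verdict(exit_code: int) -> bool:
    """Translate the exit code of the generated main() into a match result."""
    if exit_code == 0:
        return True
    if exit_code == 1:
        return False
    raise RuntimeError(f"matcher reported an error (exit code {exit_code})")


def build(c_source: str, workdir: Path) -> Path:
    """Write the generated C source to workdir and compile it with gcc -O2."""
    src = workdir / "matcher.c"
    exe = workdir / "matcher"
    src.write_text(c_source)
    subprocess.run(["gcc", "-O2", "-o", str(exe), str(src)], check=True)
    return exe


def check(exe: Path, subject: str, expected: bool) -> bool:
    """Run the binary with the subject as argv[1] and compare the outcome."""
    proc = subprocess.run([str(exe), subject], check=False)
    return verdict(proc.returncode) == expected
```

A test case then reduces to building the binary once per pattern and calling check for each subject, for example check(exe, "hello42", True) for the pattern [a-z]+\d+.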

Development

# Clone
git clone --recurse-submodules https://github.com/emmtrix/emx-regex-cgen.git
cd emx-regex-cgen

# Install
pip install emx-regex-cgen

# Or, for local development
pip install -e ".[dev]"

# Test
pytest -n auto -q

# Lint
ruff check src/ tests/

License

Maintained by

emmtrix Technologies GmbH

Release history

0.2.0 (this version): Mar 11, 2026
0.1.3: Mar 10, 2026
0.1.1: Mar 10, 2026
