Regex to C code generator

emmtrix Regex-to-C Code Generator (emx-regex-cgen) compiles regular expressions into portable, static C code for embedded and performance-critical applications.

Key Characteristics

  • Two backend engines — choose between a table-driven DFA and a bit-parallel NFA (--engine dfa or --engine bitnfa).
  • Table-driven DFA (default) — every regex is compiled to a minimised deterministic finite automaton; the generated C code performs a single linear scan over the input.
  • Bit-parallel NFA — the regex is compiled to a Thompson NFA and simulated with precomputed bitmasks; the body of the hot loop is fully unrolled, with no inner loops over positions or state words. The state variant (uint8_t, uint16_t, uint32_t, or uint32_t[N]) is selected automatically based on the number of NFA positions.
  • No dynamic memory allocation — all data is static const; no malloc, no free.
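  • Branch-free inner loop — by default the DFA loop performs only table lookups per byte (one transition lookup, plus one each for the optional alphabet map and row map) and has no data-dependent branch unless --early-exit yes is set; the bitnfa loop consists of unrolled bitwise operations.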
  • re2 feature set — supports the same subset of regular expressions as Google's re2 library (no back-references, no look-around).
  • Fullmatch semantics — the generated function checks whether the entire input matches the pattern.
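
To make the fullmatch semantics concrete, the sketch below contrasts a full match with a substring search. It uses Python's built-in re module purely as a stand-in for illustration; that is an assumption of this example, not the engine the generator emits, and the helper names are hypothetical.

```python
import re

PATTERN = r"\d{4}-\d{2}-\d{2}"


def fullmatch(subject: str) -> bool:
    """True only if the entire subject matches (the semantics described above)."""
    return re.fullmatch(PATTERN, subject) is not None


def search(subject: str) -> bool:
    """True if the pattern occurs anywhere in the subject (shown for contrast only)."""
    return re.search(PATTERN, subject) is not None


if __name__ == "__main__":
    for subject in ["2024-01-31", "on 2024-01-31", "2024-01-31\n"]:
        print(repr(subject), fullmatch(subject), search(subject))
```

Only the first subject is a full match; the other two contain the date but also carry extra characters, so a fullmatch check rejects them.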

Installation

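# From PyPI
pip install emx-regex-cgen

# From a source checkout (editable install with development dependencies)
pip install -e ".[dev]"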

The distribution name is emx-regex-cgen. The Python import path is emx_regex_cgen.

Quick Start

Python Library

generate() returns a GeneratedCode object with four parts. Call .render() to get the combined C source string, or access individual fields to embed them in an existing code-generation pipeline.

from emx_regex_cgen import generate

# generate() returns a GeneratedCode object
result = generate(r"\d{4}-\d{2}-\d{2}")

# .render() assembles the complete C source
print(result.render())

# Access individual parts:
result.includes        # ['stddef.h', 'stdbool.h', 'stdint.h']
result.globals         # static const transition table (and optional maps)
result.match_function  # bool regex_match(const char *input, size_t len) { … }
result.main_function   # int main(…) { … }  — None when emit_main=False

# Include a main() for standalone testing
result = generate(r"[a-z]+", emit_main=True)

# Bytes mode: '.' matches any single byte, classes work on raw byte values
result = generate(r"[\x80-\xff]+", encoding="bytes")

# Use the bit-parallel NFA backend instead of DFA
result = generate(r"hello", engine="bitnfa")

CLI

# Write generated C code to stdout
emx-regex-cgen '[a-z]+\d+'

# Write to a file with a main() function
emx-regex-cgen '[a-z]+\d+' --emit-main -o matcher.c

# Compile and test
gcc -O2 -o matcher matcher.c
./matcher "hello42"   # exit 0 (match)
./matcher "HELLO"     # exit 1 (no match)

# Bytes mode: match any sequence of high bytes
emx-regex-cgen --encoding bytes '[\x80-\xff]+' --emit-main -o byte_matcher.c

# Use the bit-parallel NFA backend
emx-regex-cgen --engine bitnfa 'hello' --emit-main -o bitnfa_matcher.c

CLI Reference

usage: emx-regex-cgen [-h] [-o OUTPUT] [--emit-main] [--prefix PREFIX]
                  [--flags FLAGS] [--encoding {utf8,bytes}]
                  [--engine {dfa,bitnfa}]
                  [--row-dedup {yes,no,auto}]
                  [--alphabet-compression {yes,no,auto}]
                  [--size-threshold SIZE_THRESHOLD]
                  [--early-exit {yes,no}] pattern

Generate C code that performs a fullmatch for a regular expression.

positional arguments:
  pattern               Regular expression pattern

options:
  -o, --output OUTPUT   Output file (default: stdout)
  --emit-main           Also emit a main() function (exit 0=match, 1=no match, 2=error)
  --prefix PREFIX       Prefix for all generated C identifiers (default: regex)
  --flags FLAGS         Regex flags: i (case-insensitive), s (dot-all), m (multiline),
                        x (verbose)
  --encoding {utf8,bytes}
                        Input encoding: utf8 (default, Unicode-aware) or bytes
                        (raw byte semantics)
  --engine {dfa,bitnfa}
                        Backend engine: dfa (table-driven minimised DFA; default),
                        bitnfa (bit-parallel NFA)
  --row-dedup {yes,no,auto}
                        Transition-row deduplication: yes (always), no (never),
                        auto (when table exceeds --size-threshold; default). DFA only.
  --alphabet-compression {yes,no,auto}
                        Alphabet compression into equivalence classes: yes (always),
                        no (never), auto (when table exceeds --size-threshold; default).
                        DFA only.
  --size-threshold SIZE_THRESHOLD
                        Table-size threshold (cells = states × 256) for auto mode
                        (default: 8192). DFA only.
  --early-exit {yes,no}
                        Emit early-exit check in DFA loop: yes (break when dead state
                        is reached), no (always process full input; default). DFA only.
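
The arithmetic behind --size-threshold can be worked through as follows. This is a minimal sketch: the function names are hypothetical, and the strict "greater than" comparison is an assumption read from the wording "when table exceeds --size-threshold".

```python
BYTE_VALUES = 256          # one table column per possible input byte
DEFAULT_THRESHOLD = 8192   # documented default of --size-threshold, in cells


def table_cells(num_states: int) -> int:
    """Size of the uncompressed transition table: states x 256."""
    return num_states * BYTE_VALUES


def auto_enabled(num_states: int, threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Assumed reading of 'auto': optimise only when the table exceeds the threshold."""
    return table_cells(num_states) > threshold


if __name__ == "__main__":
    for states in (7, 32, 33):
        print(states, table_cells(states), auto_enabled(states))
```

Under these assumptions the default threshold corresponds to 32 states (32 × 256 = 8192 cells), so a DFA with 33 or more states would trigger the auto optimisations.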

Supported Features

Regex Features

Feature                   Example pattern    DFA golden reference   bitnfa golden reference
Literal string            hello              literal.c              literal_bitnfa.c
Character class           [a-z0-9_]+         char_class.c           char_class_bitnfa.c
Negated class             [^aeiou]+          negated_class.c        negated_class_bitnfa.c
Dot (any char except \n)  .+                 dot.c                  dot_bitnfa.c
Alternation               cat|dog|fish       alternation.c          alternation_bitnfa.c
Star quantifier *         ab*c               quantifier_star.c      quantifier_star_bitnfa.c
Plus quantifier +         ab+c               quantifier_plus.c      quantifier_plus_bitnfa.c
Optional quantifier ?     colou?r            quantifier_optional.c  quantifier_optional_bitnfa.c
Bounded repeat {m,n}      a{2,4}             quantifier_repeat.c    quantifier_repeat_bitnfa.c
Digit escape \d           \d{4}-\d{2}-\d{2}  escape_digit.c         escape_digit_bitnfa.c
Word escape \w            \w+                escape_word.c          escape_word_bitnfa.c
Space escape \s           \s+                escape_space.c         escape_space_bitnfa.c
Unicode / UTF-8           \x{00e9}+          unicode.c              unicode_bitnfa.c
Anchors ^ / $             ^start.*end$       anchors.c              anchors_bitnfa.c
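
The Unicode / UTF-8 row is easier to follow with the byte sequence written out. Because the generated matcher consumes input one byte at a time, a non-ASCII code point such as U+00E9 has to be recognised as a multi-byte sequence in utf8 mode, whereas bytes mode treats every byte as its own symbol. The sketch below only shows the encoding involved; how the generator lays out its states for such sequences is not described here, and the helper name is hypothetical.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of byte values."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    e_acute = "\u00e9"  # the code point used by the example pattern \x{00e9}+
    encoded = utf8_bytes(e_acute)
    print([hex(b) for b in encoded])

    # In bytes mode a class such as [\x80-\xff] is tested against each byte
    # individually; both bytes of this sequence fall inside that range.
    print(all(0x80 <= b <= 0xFF for b in encoded))
```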

CLI Options

Option                      Effect                                     Golden reference
--flags i                   Case-insensitive matching                  flag_ignorecase.c · bitnfa
--flags s                   Dot matches \n (dot-all)                   flag_dotall.c · bitnfa
--flags m                   Multiline anchors                          flag_multiline.c · bitnfa
--flags x                   Verbose / free-spacing mode                flag_verbose.c · bitnfa
--encoding bytes            Raw byte semantics (no UTF-8)              encoding_bytes.c · bitnfa
--prefix NAME               Custom identifier prefix                   prefix.c · bitnfa
--emit-main                 Include standalone main()                  emit_main.c · bitnfa
--alphabet-compression yes  Byte equivalence-class compression         alphabet_compression.c
--row-dedup yes             Transition-row deduplication               row_dedup.c
--early-exit yes            Break DFA loop on dead state (early exit)  early_exit.c
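
The effect of --early-exit can be sketched with a small hand-written DFA. The automaton below (for the pattern ab*c) and all names are illustrative assumptions, not generator output; the point is only that once the dead state is reached no later byte can change the result, so the loop may stop.

```python
DEAD, START, ACCEPT = 0, 1, 3

# Hand-written DFA for ab*c; any missing entry leads to the dead state.
TRANSITIONS = {
    1: {ord("a"): 2},
    2: {ord("b"): 2, ord("c"): 3},
    3: {},
}


def match(data: bytes, early_exit: bool = False) -> tuple[bool, int]:
    """Return (matched, number of bytes consumed)."""
    state = START
    consumed = 0
    for byte in data:
        state = TRANSITIONS.get(state, {}).get(byte, DEAD)
        consumed += 1
        if early_exit and state == DEAD:
            break  # the dead state has no way out, so the answer is already known
    return state == ACCEPT, consumed


if __name__ == "__main__":
    subject = b"x" + b"a" * 100
    print(match(subject))                   # scans the whole input
    print(match(subject, early_exit=True))  # stops after the first byte
```

The trade-off is one extra comparison per byte in exchange for skipping the remainder of inputs that fail early, which is consistent with the option being off by default.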

Bit-NFA Variants

The bitnfa engine automatically selects the narrowest integer type that can hold all NFA positions:

Variant      NFA positions  Golden reference
uint8_t      ≤ 8            bitnfa_uint8.c
uint16_t     ≤ 16           bitnfa_uint16.c
uint32_t     ≤ 32           bitnfa_uint32.c
uint32_t[N]  > 32           bitnfa_uint32_array.c
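
The selection rule in the table can be written down as a short function. This is a sketch with a hypothetical name; in particular, computing N as the number of 32-bit words needed to hold all positions is an assumption, since the table only states that the array variant is used above 32 positions.

```python
def select_variant(num_positions: int) -> str:
    """Pick the narrowest state type that has one bit per NFA position."""
    if num_positions < 1:
        raise ValueError("an NFA needs at least one position")
    for bits in (8, 16, 32):
        if num_positions <= bits:
            return f"uint{bits}_t"
    words = (num_positions + 31) // 32  # assumed: ceil(positions / 32) words
    return f"uint32_t[{words}]"


if __name__ == "__main__":
    for n in (8, 10, 32, 33, 100):
        print(n, select_variant(n))
```

For example, an automaton with 10 positions needs more than 8 bits but fits into 16, so it would use uint16_t.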

Generated Code Structure

DFA Engine (default)

The generator produces:

  1. An optional alphabet map (static const uint8_t regex_alphabet[256]) mapping each byte to its equivalence class (emitted when alphabet compression is active).
  2. A transition table (static const uint8_t regex_transitions[M][C]) mapping (row, column) → next_state. Dimensions depend on active optimisations: M is the number of unique rows (with row dedup) or states; C is the number of equivalence classes (with alphabet compression) or 256.
  3. An optional row map (static const uint8_t regex_row_map[N]) mapping state → row index (emitted when row deduplication is active).
  4. A match function with the signature:
    bool regex_match(const char *input, size_t len);
    
  5. Optionally a main() function for standalone executables.
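
How an alphabet map and a row map relate to the full transition table can be illustrated with a small model. The sketch below starts from an uncompressed, hand-written DFA for ab*c and derives both maps; it is an assumption-laden illustration with hypothetical function names, not the generator's actual algorithm.

```python
def build_full_table() -> list[list[int]]:
    """Uncompressed DFA for ab*c: 4 states x 256 byte columns (state 0 is dead)."""
    table = [[0] * 256 for _ in range(4)]
    table[1][ord("a")] = 2
    table[2][ord("b")] = 2
    table[2][ord("c")] = 3
    return table


def compress_alphabet(table):
    """Bytes whose columns are identical share one equivalence class."""
    class_of_column = {}
    alphabet = []
    for byte in range(256):
        column = tuple(row[byte] for row in table)
        alphabet.append(class_of_column.setdefault(column, len(class_of_column)))
    compact = [[0] * len(class_of_column) for _ in table]
    for byte, cls in enumerate(alphabet):
        for state, row in enumerate(table):
            compact[state][cls] = row[byte]
    return alphabet, compact


def dedup_rows(table):
    """States whose rows are identical share one stored row."""
    index_of_row = {}
    row_map, rows = [], []
    for row in table:
        key = tuple(row)
        if key not in index_of_row:
            index_of_row[key] = len(rows)
            rows.append(list(row))
        row_map.append(index_of_row[key])
    return row_map, rows


def match(data: bytes, alphabet, row_map, rows, start=1, accept=3) -> bool:
    """Same lookup chain as the generated loop: state -> row, byte -> class."""
    state = start
    for byte in data:
        state = rows[row_map[state]][alphabet[byte]]
    return state == accept


if __name__ == "__main__":
    alphabet, compact = compress_alphabet(build_full_table())
    row_map, rows = dedup_rows(compact)
    print(len(compact[0]), "classes,", len(rows), "unique rows, row map", row_map)
    print(match(b"abbc", alphabet, row_map, rows))
```

In this model the 4 × 256 table shrinks to 3 rows of 4 classes each, because the dead state and the accepting state both have an all-zero row and only 'a', 'b', 'c' and "everything else" behave differently.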

Example Output (pattern hello, --alphabet-compression yes --row-dedup yes)

static const uint8_t regex_alphabet[256] = { /* byte → class */ };

static const uint8_t regex_transitions[6][5] = {
    /* states 0, 6 */ { 0 },
    /* state 1 */     { [2] = 4 },
    ...
};

static const uint8_t regex_row_map[7] = { 0, 1, 2, 3, 4, 5, 0 };

bool regex_match(const char *input, size_t len) {
    uint8_t state = 1;
    for (size_t i = 0; i < len; i++) {
        state = regex_transitions[regex_row_map[state]][regex_alphabet[(unsigned char)input[i]]];
    }
    return state >= 6;
}

Bit-NFA Engine (--engine bitnfa)

The generator produces:

  1. A transition mask table (static const uint16_t regex_trans[P][256]) where P is the number of NFA positions. Each entry is a bitmask of destination positions for the given (source position, input byte) pair. The integer width (uint8_t / uint16_t / uint32_t / uint32_t[N]) is chosen automatically.
  2. A match function with a fully unrolled inner loop — one if (state & bit) next |= table[pos][b]; per active position.
  3. Optionally a main() function for standalone executables.
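
The bit-parallel idea can be modelled in a few lines. The sketch below handles only literal patterns and uses one position per character, which is a simplification assumed for illustration (the generator's own position numbering may differ); the function names are hypothetical. The inner loop over positions is what the generated C code unrolls into one if statement per position.

```python
def build_masks(literal: bytes) -> tuple[list[list[int]], int]:
    """trans[pos][byte] is a bitmask of the positions reachable from pos on byte."""
    n = len(literal)
    trans = [[0] * 256 for _ in range(n)]
    for pos, byte in enumerate(literal):
        trans[pos][byte] = 1 << (pos + 1)  # the expected byte advances one position
    accept_mask = 1 << n                   # bit of the final position
    return trans, accept_mask


def match(data: bytes, trans: list[list[int]], accept_mask: int) -> bool:
    state = 1  # only position 0 (the start) is active
    for byte in data:
        nxt = 0
        for pos, row in enumerate(trans):  # unrolled in the generated C code
            if state & (1 << pos):
                nxt |= row[byte]
        state = nxt
    return (state & accept_mask) != 0


if __name__ == "__main__":
    trans, accept_mask = build_masks(b"hello")
    for subject in (b"hello", b"hell", b"helloo"):
        print(subject, match(subject, trans, accept_mask))
```

Because every active position is a bit in a single integer, patterns with several simultaneously active positions (alternations, repeats) are handled by the same OR-accumulation without any backtracking.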

Example Output (pattern hello, --engine bitnfa)

static const uint16_t regex_trans[10][256] = {
    /* position 0 */ { ['h'] = 0x0006u },
    /* position 1 */ { 0 },
    /* position 2 */ { ['e'] = 0x0018u },
    ...
};

bool regex_match(const char *input, size_t len) {
    uint16_t state = 0x0001u;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)input[i];
        uint16_t next = 0;
        if (state & 0x0001u) next |= regex_trans[0][b];
        if (state & 0x0004u) next |= regex_trans[2][b];
        if (state & 0x0010u) next |= regex_trans[4][b];
        if (state & 0x0040u) next |= regex_trans[6][b];
        if (state & 0x0100u) next |= regex_trans[8][b];
        state = next;
    }
    return (state & 0x0200u) != 0;
}

Testing

Tests are parameterised from re2_compat_results.json, which contains over 2,500 patterns extracted from the PCRE2 test suite and validated against Google re2.

# Run all tests (parallel)
pytest -n auto -q

# Run linter
ruff check src/ tests/

Test Strategy

  1. Generate C code from the regex pattern.
  2. Compile with gcc -O2.
  3. Execute the binary with each test subject as argv[1].
  4. Compare the exit code against the expected match/no-match result.
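
The four steps above can be sketched as a small helper. This is not the project's actual test code; the function names are hypothetical, and the sketch assumes gcc is on the PATH and that the C source was generated with a main() function.

```python
import subprocess
import tempfile
from pathlib import Path


def gcc_command(source: Path, binary: Path) -> list[str]:
    """Step 2: the compiler invocation, using the documented -O2 flag."""
    return ["gcc", "-O2", "-o", str(binary), str(source)]


def expected_exit_code(should_match: bool) -> int:
    """Exit-code contract of the emitted main(): 0 = match, 1 = no match."""
    return 0 if should_match else 1


def run_case(c_source: str, subject: str, should_match: bool) -> bool:
    """Compile c_source, run the binary on subject, and compare the exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "matcher.c"
        binary = Path(tmp) / "matcher"
        source.write_text(c_source)
        subprocess.run(gcc_command(source, binary), check=True)
        result = subprocess.run([str(binary), subject])  # subject becomes argv[1]
        return result.returncode == expected_exit_code(should_match)
```

With the library installed, step 1 would supply the source, for example run_case(generate(r"[a-z]+\d+", emit_main=True).render(), "hello42", True).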

Development

# Clone
git clone --recurse-submodules https://github.com/emmtrix/emx-regex-cgen.git
cd emx-regex-cgen

# Install
pip install -e ".[dev]"

# Test
pytest -n auto -q

# Lint
ruff check src/ tests/

License

MIT — Copyright (c) 2026 emmtrix Technologies GmbH

Maintained by

emmtrix Technologies GmbH
