Regex to C code generator

emmtrix Regex-to-C Code Generator (emx-regex-cgen) compiles regular expressions into portable, static C code for embedded and performance-critical applications.

Key Characteristics

  • Two backend engines — choose between a table-driven DFA and a bit-parallel NFA (--engine dfa, --engine bitnfa, or the default --engine auto which tries DFA first and falls back to bitnfa when the DFA state limit is exceeded).
  • Table-driven DFA — every regex is compiled to a minimised deterministic finite automaton; the generated C code performs a single linear scan over the input.
  • Bit-parallel NFA — the regex is compiled to a Thompson NFA and simulated with precomputed bitmasks; the per-byte update is fully unrolled, with no loop over positions or words. The variant (uint8_t, uint16_t, uint32_t, or uint32_t[N]) is selected automatically based on the number of NFA positions.
  • No dynamic memory allocation — all data is static const; no malloc, no free.
  • Branch-free inner loop — the matching loop contains a single table lookup per byte (DFA) or unrolled bitwise operations (bitnfa).
  • re2 feature set — supports the same subset of regular expressions as Google's re2 library (no back-references, no look-around).
  • Fullmatch semantics — the generated function checks whether the entire input matches the pattern.
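
To illustrate the fullmatch semantics from the last bullet, the sketch below uses Python's standard re module purely as a stand-in. It is not part of emx-regex-cgen, and its syntax differs from re2 in places. The point is only that a pattern found inside a longer input does not count as a match:

```python
import re

# Same date pattern as in the Quick Start example.
PATTERN = r"\d{4}-\d{2}-\d{2}"


def is_full_match(text: str) -> bool:
    """Return True only if the whole string matches PATTERN."""
    return re.fullmatch(PATTERN, text) is not None


def contains_match(text: str) -> bool:
    """Return True if PATTERN matches anywhere inside the string."""
    return re.search(PATTERN, text) is not None


print(is_full_match("2024-01-31"))       # True
print(is_full_match("due 2024-01-31"))   # False: extra leading text
print(contains_match("due 2024-01-31"))  # True: search semantics differ
```

The generated regex_match function corresponds to the first helper, not the second.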

Installation

pip install emx-regex-cgen

The distribution name is emx-regex-cgen. The Python import path is emx_regex_cgen.

After installation, you can use the package either as a CLI tool via emx-regex-cgen or as a Python library via from emx_regex_cgen import generate.

Quick Start

Python Library

generate() returns a GeneratedCode object with four parts. Call .render() to get the combined C source string, or access individual fields to embed them in an existing code-generation pipeline.

from emx_regex_cgen import generate

# generate() returns a GeneratedCode object
result = generate(r"\d{4}-\d{2}-\d{2}")

# .render() assembles the complete C source
print(result.render())

# Access individual parts:
result.includes        # ['stddef.h', 'stdbool.h', 'stdint.h']
result.globals         # static const transition table (and optional maps)
result.match_function  # bool regex_match(const char *input, size_t len) { … }
result.main_function   # int main(…) { … }  — None when emit_main=False

# Include a main() for standalone testing
result = generate(r"[a-z]+", emit_main=True)

# Bytes mode: '.' matches any single byte, classes work on raw byte values
result = generate(r"[\x80-\xff]+", encoding="bytes")

# Use the bit-parallel NFA backend instead of DFA
result = generate(r"hello", engine="bitnfa")

generate() options

generate(
    pattern,
    flags="",
    *,
    emit_main=False,
    prefix="regex",
    encoding="utf8",
    engine="auto",
    row_dedup="auto",
    alphabet_compression="auto",
    size_threshold=8192,
    early_exit=False,
)

Library arguments map directly to the generator settings:

Argument              Type  Default  Meaning
pattern               str   -        Regular expression to compile.
flags                 str   ""       Regex flags: i (ignore case), s (dot-all), m (multiline), x (verbose syntax accepted for compatibility).
emit_main             bool  False    Include a standalone main() function in the generated C code.
prefix                str   "regex"  Prefix for generated C identifiers such as regex_match.
encoding              str   "utf8"   "utf8" for Unicode-aware UTF-8 input, or "bytes" for raw byte semantics.
engine                str   "auto"   Backend engine: "auto" (DFA first, bitnfa fallback), "dfa", or "bitnfa".
row_dedup             str   "auto"   DFA only: "yes", "no", or "auto" for transition-row deduplication.
alphabet_compression  str   "auto"   DFA only: "yes", "no", or "auto" for byte equivalence-class compression.
size_threshold        int   8192     DFA only: threshold used by "auto" for row_dedup and alphabet_compression.
early_exit            bool  False    DFA only: break out of the loop once the dead state is reached.

Example with multiple options:

from emx_regex_cgen import generate

result = generate(
    r"[a-z]+\d+",
    flags="i",
    prefix="user_id",
    emit_main=True,
    engine="dfa",
    encoding="utf8",
    row_dedup="yes",
    alphabet_compression="auto",
    size_threshold=4096,
    early_exit=True,
)

print(result.render())

CLI

# Write generated C code to stdout
emx-regex-cgen '[a-z]+\d+'

# Write to a file with a main() function
emx-regex-cgen '[a-z]+\d+' --emit-main -o matcher.c

# Compile and test
gcc -O2 -o matcher matcher.c
./matcher "hello42"   # exit 0 (match)
./matcher "HELLO"     # exit 1 (no match)

# Bytes mode: match any sequence of high bytes
emx-regex-cgen --encoding bytes '[\x80-\xff]+' --emit-main -o byte_matcher.c

# Use the bit-parallel NFA backend
emx-regex-cgen --engine bitnfa 'hello' --emit-main -o bitnfa_matcher.c

CLI Reference

usage: emx-regex-cgen [-h] [-o OUTPUT] [--emit-main] [--prefix PREFIX]
                  [--flags FLAGS] [--encoding {utf8,bytes}]
                  [--engine {auto,dfa,bitnfa}]
                  [--row-dedup {yes,no,auto}]
                  [--alphabet-compression {yes,no,auto}]
                  [--size-threshold SIZE_THRESHOLD]
                  [--early-exit {yes,no}] pattern

Generate C code that performs a fullmatch for a regular expression.

positional arguments:
  pattern               Regular expression pattern

options:
  -o, --output          Output file (default: stdout)
  --emit-main           Also emit a main() function (exit 0=match, 1=no match, 2=error)
  --prefix PREFIX       Prefix for all generated C identifiers (default: regex)
  --flags FLAGS         Regex flags: i (case-insensitive), s (dot-all), m (multiline)
  --encoding {utf8,bytes}
                        Input encoding: utf8 (default, Unicode-aware) or bytes
                        (raw byte semantics)
  --engine {auto,dfa,bitnfa}
                        Backend engine: auto (try DFA first, fall back to bitnfa; default),
                        dfa (table-driven minimised DFA),
                        bitnfa (bit-parallel NFA)
  --row-dedup {yes,no,auto}
                        Transition-row deduplication: yes (always), no (never),
                        auto (when table exceeds --size-threshold; default). DFA only.
  --alphabet-compression {yes,no,auto}
                        Alphabet compression into equivalence classes: yes (always),
                        no (never), auto (when table exceeds --size-threshold; default).
                        DFA only.
  --size-threshold N    Table-size threshold (cells = states × 256) for auto mode
                        (default: 8192). DFA only.
  --early-exit {yes,no}
                        Emit early-exit check in DFA loop: yes (break when dead state
                        is reached), no (always process full input; default). DFA only.
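
As a worked example of the auto mode described for --row-dedup, --alphabet-compression and --size-threshold, the sketch below computes the table size in cells (states × 256) and compares it with the threshold. The function names are hypothetical, and reading "exceeds" as a strict greater-than comparison is an assumption; the generator's actual decision logic may differ.

```python
BYTE_VALUES = 256  # one table column per possible input byte


def table_cells(num_states: int) -> int:
    """Size of an uncompressed transition table in cells."""
    return num_states * BYTE_VALUES


def auto_enables(num_states: int, size_threshold: int = 8192) -> bool:
    """Assumed rule: 'auto' switches on once the table exceeds the threshold."""
    return table_cells(num_states) > size_threshold


print(table_cells(32), auto_enables(32))  # 8192 False
print(table_cells(33), auto_enables(33))  # 8448 True
```

Under this reading, the default threshold of 8192 leaves tables of up to 32 states uncompressed.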

Supported Features

Regex Features

Feature                         Example pattern    DFA                    bitnfa
Literal string                  hello              literal.c              literal_bitnfa.c
Character class                 [a-z0-9_]+         char_class.c           char_class_bitnfa.c
Negated class                   [^aeiou]+          negated_class.c        negated_class_bitnfa.c
Dot (any char except \n)        .+                 dot.c                  dot_bitnfa.c
Alternation                     cat|dog|fish       alternation.c          alternation_bitnfa.c
Star quantifier *               ab*c               quantifier_star.c      quantifier_star_bitnfa.c
Plus quantifier +               ab+c               quantifier_plus.c      quantifier_plus_bitnfa.c
Optional quantifier ?           colou?r            quantifier_optional.c  quantifier_optional_bitnfa.c
Bounded repeat {m,n}            a{2,4}             quantifier_repeat.c    quantifier_repeat_bitnfa.c
Digit escape \d                 \d{4}-\d{2}-\d{2}  escape_digit.c         escape_digit_bitnfa.c
Word escape \w                  \w+                escape_word.c          escape_word_bitnfa.c
Space escape \s                 \s+                escape_space.c         escape_space_bitnfa.c
Unicode / UTF-8                 \x{00e9}+          unicode.c              unicode_bitnfa.c
Anchors ^ / $                   ^start.*end$       anchors.c              anchors_bitnfa.c
Word boundary \b / \B           \bword\b           word_boundary.c        word_boundary_bitnfa.c
Unicode property \p{…} / \P{…}  \p{Nd}+            unicode_property.c     unicode_property_bitnfa.c

CLI Options

Option                      Effect                                     Golden reference
--flags i                   Case-insensitive matching                  flag_ignorecase.c · bitnfa
--flags s                   Dot matches \n (dot-all)                   flag_dotall.c · bitnfa
--flags m                   Multiline anchors                          flag_multiline.c · bitnfa
--flags x                   Verbose / free-spacing mode                flag_verbose.c · bitnfa
--encoding bytes            Raw byte semantics (no UTF-8)              encoding_bytes.c · bitnfa
--prefix NAME               Custom identifier prefix                   prefix.c · bitnfa
--emit-main                 Include standalone main()                  emit_main.c · bitnfa
--alphabet-compression yes  Byte equivalence-class compression         alphabet_compression.c
--row-dedup yes             Transition-row deduplication               row_dedup.c
--early-exit yes            Break DFA loop on dead state (early exit)  early_exit.c

Bit-NFA Variants

The bitnfa engine automatically selects the narrowest integer type that can hold all NFA positions:

Variant      NFA positions  Golden reference
uint8_t      ≤ 8            bitnfa_uint8.c
uint16_t     ≤ 16           bitnfa_uint16.c
uint32_t     ≤ 32           bitnfa_uint32.c
uint32_t[N]  > 32           bitnfa_uint32_array.c
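
The selection rule in the table above can be sketched as a small function. The name select_variant is hypothetical, and computing N as the number of 32-bit words needed to hold all positions is an assumption, since the table does not say how N is derived.

```python
def select_variant(positions: int) -> str:
    """Pick the narrowest state type for a given number of NFA positions."""
    if positions < 1:
        raise ValueError("an NFA needs at least one position")
    if positions <= 8:
        return "uint8_t"
    if positions <= 16:
        return "uint16_t"
    if positions <= 32:
        return "uint32_t"
    # Assumption: N is the position count rounded up to whole 32-bit words.
    words = (positions + 31) // 32
    return f"uint32_t[{words}]"


print(select_variant(10))  # uint16_t
print(select_variant(33))  # uint32_t[2]
```

The first call matches the hello example in the Bit-NFA Engine section, where 10 positions lead to a uint16_t table.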

Generated Code Structure

DFA Engine (--engine dfa; tried first by the default --engine auto)

The generator produces:

  1. An optional alphabet map (static const uint8_t regex_alphabet[256]) mapping each byte to its equivalence class (emitted when alphabet compression is active).
  2. A transition table (static const uint8_t regex_transitions[M][C]) mapping (row, column) → next_state. Dimensions depend on active optimisations: M is the number of unique rows (with row dedup) or states; C is the number of equivalence classes (with alphabet compression) or 256.
  3. An optional row map (static const uint8_t regex_row_map[N]) mapping state → row index (emitted when row deduplication is active).
  4. A match function with the signature:
    bool regex_match(const char *input, size_t len);
    
  5. Optionally a main() function for standalone executables.

Example Output (pattern hello, --alphabet-compression yes --row-dedup yes)

static const uint8_t regex_alphabet[256] = { /* byte → class */ };

static const uint8_t regex_transitions[6][5] = {
    /* states 0, 6 */ { 0 },
    /* state 1 */     { [2] = 4 },
    ...
};

static const uint8_t regex_row_map[7] = { 0, 1, 2, 3, 4, 5, 0 };

bool regex_match(const char *input, size_t len) {
    uint8_t state = 1;
    for (size_t i = 0; i < len; i++) {
        state = regex_transitions[regex_row_map[state]][regex_alphabet[(unsigned char)input[i]]];
    }
    return state >= 6;
}
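
To show where the three tables come from, here is a minimal Python model of alphabet compression and row deduplication for the literal hello. It is a sketch, not the generator's implementation: the helper names are hypothetical, the DFA is hand-built instead of minimised from a regex, and the state numbering is an assumption (0 = dead, 1 = start, 6 = accept), so table entries may differ from the real output.

```python
def build_literal_dfa(word: bytes):
    """Dense table for a literal: state 0 is dead, 1 is start, len(word)+1 accepts."""
    table = [[0] * 256 for _ in range(len(word) + 2)]
    for i, byte in enumerate(word):
        table[i + 1][byte] = i + 2
    return table


def compress_alphabet(table):
    """Give bytes with identical columns one class; return (alphabet, compact table)."""
    class_of_column = {}
    representatives = []  # first byte seen for each class
    alphabet = []
    for byte in range(256):
        column = tuple(row[byte] for row in table)
        if column not in class_of_column:
            class_of_column[column] = len(representatives)
            representatives.append(byte)
        alphabet.append(class_of_column[column])
    compact = [[row[byte] for byte in representatives] for row in table]
    return alphabet, compact


def dedup_rows(table):
    """Store each distinct row once; return (unique rows, state -> row index)."""
    index_of_row = {}
    unique = []
    row_map = []
    for row in table:
        key = tuple(row)
        if key not in index_of_row:
            index_of_row[key] = len(unique)
            unique.append(row)
        row_map.append(index_of_row[key])
    return unique, row_map


WORD = b"hello"
ALPHABET, COMPACT = compress_alphabet(build_literal_dfa(WORD))
TRANSITIONS, ROW_MAP = dedup_rows(COMPACT)
ACCEPT = len(WORD) + 1


def match(data: bytes) -> bool:
    """Same double lookup as the C loop: row map, then alphabet class."""
    state = 1
    for byte in data:
        state = TRANSITIONS[ROW_MAP[state]][ALPHABET[byte]]
    return state == ACCEPT


print(len(TRANSITIONS), len(TRANSITIONS[0]), ROW_MAP)  # 6 5 [0, 1, 2, 3, 4, 5, 0]
print(match(b"hello"), match(b"hell"), match(b"hello!"))  # True False False
```

The dimensions agree with the example output: five byte classes (h, e, l, o, and everything else) and six unique rows, because the dead state and the accepting state both have all-zero rows and share row 0.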

Bit-NFA Engine (--engine bitnfa)

The generator produces:

  1. A transition mask table (static const uint16_t regex_trans[P][256]) where P is the number of NFA positions. Each entry is a bitmask of destination positions for the given (source position, input byte) pair. The integer width (uint8_t / uint16_t / uint32_t / uint32_t[N]) is chosen automatically.
  2. A match function with a fully unrolled inner loop — one if (state & bit) next |= table[pos][b]; per active position.
  3. Optionally a main() function for standalone executables.

Example Output (pattern hello, --engine bitnfa)

static const uint16_t regex_trans[10][256] = {
    /* position 0 */ { ['h'] = 0x0006u },
    /* position 1 */ { 0 },
    /* position 2 */ { ['e'] = 0x0018u },
    ...
};

bool regex_match(const char *input, size_t len) {
    uint16_t state = 0x0001u;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)input[i];
        uint16_t next = 0;
        if (state & 0x0001u) next |= regex_trans[0][b];
        if (state & 0x0004u) next |= regex_trans[2][b];
        if (state & 0x0010u) next |= regex_trans[4][b];
        if (state & 0x0040u) next |= regex_trans[6][b];
        if (state & 0x0100u) next |= regex_trans[8][b];
        state = next;
    }
    return (state & 0x0200u) != 0;
}
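
The bit-parallel update can be modelled in a few lines of Python. The sketch below hand-builds a three-position NFA for the pattern a*ab, chosen because it needs two positions active at once. The position layout is an assumption made for illustration and is simpler than the Thompson construction the generator uses, which is why the hello example above has 10 positions.

```python
NUM_POSITIONS = 3
START_MASK = 0b001   # position 0 active before any input
ACCEPT_MASK = 0b100  # position 2 means the whole pattern was consumed


def make_trans():
    """trans[pos][byte] is the bitmask of positions reachable from pos on byte."""
    trans = [[0] * 256 for _ in range(NUM_POSITIONS)]
    # Position 0 on 'a': stay in the a* loop (bit 0) or take the final 'a' (bit 1).
    trans[0][ord("a")] = 0b011
    # Position 1 on 'b': finish the match (bit 2).
    trans[1][ord("b")] = 0b100
    return trans


TRANS = make_trans()


def bitnfa_match(data: bytes) -> bool:
    state = START_MASK
    for byte in data:
        nxt = 0
        # The generated C unrolls this loop into one 'if' per position.
        for pos in range(NUM_POSITIONS):
            if state & (1 << pos):
                nxt |= TRANS[pos][byte]
        state = nxt
    return (state & ACCEPT_MASK) != 0


print(bitnfa_match(b"ab"), bitnfa_match(b"aaab"))  # True True
print(bitnfa_match(b"a"), bitnfa_match(b"aba"))    # False False
```

After reading a, the state is 0b011: the simulation tracks both alternatives at once instead of backtracking.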

Testing

Tests are parameterised from re2_compat_results.json, which contains 2 500+ patterns extracted from the PCRE2 test suite and validated against Google re2.

# Run all tests (parallel)
pytest -n auto -q

# Run linter
ruff check src/ tests/

Test Strategy

  1. Generate C code from the regex pattern.
  2. Compile with gcc -O2.
  3. Execute the binary with each test subject as argv[1].
  4. Compare the exit code against the expected match/no-match result.
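
Steps 2 to 4 can be sketched with the standard subprocess module. This is not the project's test code: the helper names are hypothetical, and the source file is assumed to have been generated with --emit-main already. The exit-code convention (0 = match, 1 = no match, 2 = error) is taken from the CLI reference.

```python
import subprocess


def compile_c(source_path: str, binary_path: str) -> None:
    """Step 2: compile the generated C file with gcc -O2."""
    subprocess.run(["gcc", "-O2", "-o", binary_path, source_path], check=True)


def classify_exit(code: int) -> bool:
    """Translate the matcher's exit code into a match / no-match result."""
    if code == 0:
        return True
    if code == 1:
        return False
    raise RuntimeError(f"matcher reported an error (exit code {code})")


def run_matcher(binary_path: str, subject: str) -> bool:
    """Step 3: run the binary with the test subject as argv[1]."""
    completed = subprocess.run([binary_path, subject])
    return classify_exit(completed.returncode)


def check(binary_path: str, subject: str, expected: bool) -> bool:
    """Step 4: compare the observed result with the expected one."""
    return run_matcher(binary_path, subject) == expected
```

With a matcher.c produced as in the CLI section, compile_c("matcher.c", "./matcher") followed by check("./matcher", "hello42", True) would exercise one test subject.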

Development

# Clone
git clone --recurse-submodules https://github.com/emmtrix/emx-regex-cgen.git
cd emx-regex-cgen

# Install
pip install emx-regex-cgen

# Or, for local development
pip install -e ".[dev]"

# Test
pytest -n auto -q

# Lint
ruff check src/ tests/

License

MIT — Copyright (c) 2026 emmtrix Technologies GmbH

Maintained by

emmtrix Technologies GmbH
