A simple tokenizer operating on enums with a decent amount of configuration

Crossandra

Crossandra is a fast and simple tokenization library for Python operating on enums and regular expressions, with a decent amount of configuration.

Installation

Crossandra is available on PyPI and can be installed with pip, or any other Python package manager:

$ pip install crossandra

(Some systems may require you to use pip3, python -m pip, or py -m pip instead.)

License

Crossandra is licensed under the MIT License.

Reference

Crossandra

Crossandra(
    token_source: type[Enum] = Empty,
    *,
    ignore_whitespace: bool = False,
    ignored_characters: str = "",
    suppress_unknown: bool = False,
    rules: list[Rule | RuleGroup] | None = None
)
  • token_source: an enum containing all possible tokens (defaults to an empty enum)
  • ignore_whitespace: whether spaces, tabs, newlines etc. should be ignored
  • ignored_characters: characters to skip during tokenization
  • suppress_unknown: whether tokenization should continue past unknown tokens instead of raising an error
  • rules: a list of additional rules to use

The enum takes priority over the rule list.
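
To make the priority rule concrete, here is a minimal standard-library sketch of a tokenizer loop that tries enum values before regex patterns. It is not Crossandra's implementation: sketch_tokenize, its patterns argument, and the longest-value-first ordering are assumptions made for illustration only.

```python
import re
from enum import Enum


class Op(Enum):
    ADD = "+"
    INC = "++"


def sketch_tokenize(source, token_enum, patterns, *,
                    ignore_whitespace=False, suppress_unknown=False):
    """Hypothetical helper: enum values are tried before regex patterns."""
    # Assumption: longer enum values are tried first, so "++" wins over "+".
    members = sorted(token_enum, key=lambda m: len(m.value), reverse=True)
    compiled = [re.compile(p) for p in patterns]
    tokens, i = [], 0
    while i < len(source):
        if ignore_whitespace and source[i].isspace():
            i += 1
            continue
        # 1. The enum takes priority.
        member = next((m for m in members if source.startswith(m.value, i)), None)
        if member is not None:
            tokens.append(member)
            i += len(member.value)
            continue
        # 2. Otherwise fall back to the patterns, in order.
        match = None
        for pattern in compiled:
            match = pattern.match(source, i)
            if match:
                break
        if match is not None and match.end() > i:
            tokens.append(match.group())
            i = match.end()
            continue
        # 3. Nothing matched: raise, or skip when suppress_unknown is set.
        if not suppress_unknown:
            raise ValueError(f"unknown token {source[i]!r} at index {i}")
        i += 1
    return tokens


# The pattern r"\++" could also match "++", but the enum member is returned.
print(sketch_tokenize("++ + 12", Op, [r"\++", r"\d+"], ignore_whitespace=True))
```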


When all tokens are of length 1 and there are no additional rules, Crossandra will use a simpler tokenization method (the so-called Fast Mode).

Example: Tokenizing noisy Brainfuck code (tested on MacBook Air M1 (256/16))

# Setup
from random import choices
from string import punctuation

program = "".join(choices(punctuation, k=...))

k        Default    Fast Mode   Speedup
10       0.00004s   0.00002s    2x
100      0.00016s   0.00003s    5.3x
1000     0.0015s    0.00013s    11.5x
10000    0.014s     0.0009s     15.6x
100000   0.29s      0.009s      32.2x
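
Why can single-character tokens be handled faster? One plausible reason is that each input character can be resolved with a single table lookup, with no regex matching or multi-character lookahead. The sketch below shows that idea using only the standard library. It is an assumption about the general technique, not Crossandra's actual Fast Mode code, and fast_tokenize is a hypothetical name.

```python
from enum import Enum


class Brainfuck(Enum):
    ADD = "+"
    SUB = "-"
    LEFT = "<"
    RIGHT = ">"
    READ = ","
    WRITE = "."
    BEGIN_LOOP = "["
    END_LOOP = "]"


def fast_tokenize(source, token_enum, *, suppress_unknown=False):
    """Hypothetical helper: one dict lookup per input character."""
    table = {member.value: member for member in token_enum}
    # This shortcut is only valid when every token is exactly one character.
    if any(len(value) != 1 for value in table):
        raise ValueError("all tokens must have length 1")
    tokens = []
    for char in source:
        member = table.get(char)
        if member is not None:
            tokens.append(member)
        elif not suppress_unknown:
            raise ValueError(f"unknown token {char!r}")
    return tokens


print(fast_tokenize("cat program: ,[.,]", Brainfuck, suppress_unknown=True))
```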

Rule

Rule[T](
    pattern: str,
    converter: Callable[[str], T] | bool = True,
    flags: RegexFlag | int = 0
)

Used for defining custom rules. pattern is the regex pattern to match; flags optionally supplies regex flags for it.
When converter is a callable, it's used on the matched substring.
When converter is True, it will directly return the matched substring.
When converter is False, it will not include the matched substring in the token list.
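
The three converter behaviours can be mimicked with the standard re module. This is only a sketch of the semantics described above, not how Rule is implemented: apply_rule is a hypothetical helper, and unlike a real tokenizer it scans the whole text with re.finditer instead of matching at the current position.

```python
import re


def apply_rule(pattern, converter, text, flags=0):
    """Hypothetical helper: tokens a single rule would contribute for `text`."""
    tokens = []
    for match in re.finditer(pattern, text, flags):
        matched = match.group()
        if converter is True:
            tokens.append(matched)             # keep the matched substring as is
        elif converter is False:
            continue                           # match is consumed but not emitted
        else:
            tokens.append(converter(matched))  # callable: convert the substring
    return tokens


print(apply_rule(r"\d+", int, "a 12 b 7"))    # converter is a callable
print(apply_rule(r"\d+", True, "a 12 b 7"))   # converter is True
print(apply_rule(r"\d+", False, "a 12 b 7"))  # converter is False
```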

RuleGroup

RuleGroup(rules: tuple[Rule[Any], ...])

Used for storing multiple Rules in one object. Can be constructed by ORing two or more Rules.
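
Building a group with the | operator is typically done by overloading __or__. The sketch below shows one way this could work with two stand-in classes. MiniRule and MiniRuleGroup are hypothetical names, and the flattening of nested groups is an assumption; neither is taken from Crossandra's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniRule:
    """Hypothetical stand-in for Rule; holds only a pattern."""
    pattern: str

    def __or__(self, other):
        if isinstance(other, MiniRule):
            return MiniRuleGroup((self, other))
        if isinstance(other, MiniRuleGroup):
            return MiniRuleGroup((self, *other.rules))
        return NotImplemented


@dataclass(frozen=True)
class MiniRuleGroup:
    """Hypothetical stand-in for RuleGroup; stores rules in a flat tuple."""
    rules: tuple[MiniRule, ...]

    def __or__(self, other):
        if isinstance(other, MiniRule):
            return MiniRuleGroup((*self.rules, other))
        if isinstance(other, MiniRuleGroup):
            return MiniRuleGroup((*self.rules, *other.rules))
        return NotImplemented


# ORing three rules yields a single flat group, in left-to-right order.
group = MiniRule(r"\d+") | MiniRule(r"[a-z]+") | MiniRule(r"\s+")
print([rule.pattern for rule in group.rules])
```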

common

The common submodule is a collection of commonly used patterns.

Rules:

  • CHAR (e.g. 'h')
  • LETTER (e.g. m)
  • WORD (e.g. ball)
  • SINGLE_QUOTED_STRING (e.g. 'nice fish')
  • DOUBLE_QUOTED_STRING (e.g. "hello there")
  • C_NAME (e.g. crossandra_rocks)
  • NEWLINE (\n; \r\n is converted to \n before tokenization)
  • DIGIT (e.g. 7)
  • HEXDIGIT (e.g. c)
  • DECIMAL (e.g. 3.14)
  • INT (e.g. 2137)
  • SIGNED_INT (e.g. -1)
  • FLOAT (e.g. 1e3)
  • SIGNED_FLOAT (e.g. +4.3)

RuleGroups:

  • STRING (SINGLE_QUOTED_STRING | DOUBLE_QUOTED_STRING)
  • NUMBER (INT | FLOAT)
  • SIGNED_NUMBER (SIGNED_INT | SIGNED_FLOAT)

Examples

from enum import Enum
from crossandra import Crossandra

class Brainfuck(Enum):
    ADD = "+"
    SUB = "-"
    LEFT = "<"
    RIGHT = ">"
    READ = ","
    WRITE = "."
    BEGIN_LOOP = "["
    END_LOOP = "]"

bf = Crossandra(Brainfuck, suppress_unknown=True)
print(*bf.tokenize("cat program: ,[.,]"), sep="\n")
# Brainfuck.READ
# Brainfuck.BEGIN_LOOP
# Brainfuck.WRITE
# Brainfuck.READ
# Brainfuck.END_LOOP

from crossandra import Crossandra, Rule, common

def hex2rgb(hex_color: str) -> tuple[int, int, int]:
    r, g, b = (int(hex_color[i:i+2], 16) for i in range(1, 6, 2))
    return r, g, b

t = Crossandra(
    ignore_whitespace=True,
    rules=[
        Rule(r"#[0-9a-fA-F]+", hex2rgb),
        common.WORD
    ]
)

text = "My favorite color is #facade"
print(t.tokenize(text))
# ['My', 'favorite', 'color', 'is', (250, 202, 222)]

# Supporting Samarium's numbers and arithmetic operators
from enum import Enum
from crossandra import Crossandra, Rule

def sm_int(string: str) -> int:
    return int(string.replace("/", "1").replace("\\", "0"), 2)

class Op(Enum):
    ADD = "+"
    SUB = "-"
    MUL = "++"
    DIV = "--"
    POW = "+++"
    MOD = "---"

sm = Crossandra(
    Op,
    ignore_whitespace=True,
    rules=[Rule(r"(?:\\|/)+", sm_int)]
)

print(*sm.tokenize(r"//\ ++ /\\/ --- /\/\/ - ///"))
# 6 Op.MUL 9 Op.MOD 21 Op.SUB 7
