lexical analysis, tokenisers
Project description
An assortment of lexical and tokenisation functions useful for writing recursive descent parsers, of which I have several.
Generally the get_* functions accept a source string and an offset (often optional, default 0) and return a token and the new offset, raising ValueError on failed tokenisation.
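For illustration, a minimal sketch of that convention, chaining two of the getters listed below; importing them from cs.lex is an assumption about the module name:

    # Sketch of the get_* convention: pass (source, offset), get back (token, new_offset).
    from cs.lex import get_identifier, get_white   # module name assumed

    s = "  spam_count = 3"
    _, offset = get_white(s, 0)               # skip the leading whitespace
    name, offset = get_identifier(s, offset)  # collect the identifier
    print(name, offset)                       # expected: spam_count 12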
as_lines(chunks, partials=None): parse text chunks, yield complete individual lines
get_chars(s, offset, gochars): collect adjacent characters from gochars
get_decimal(s, offset): collect decimal characters (0-9, string.digits)
get_delimited(s, offset, delim): collect text up to the first occurrence of the character delim
get_envvar(s, offset=0, environ=None, default=None, specials=None): parse an environment variable reference such as $foo
get_identifier(s, offset=0, alpha=ascii_letters, number=digits, extras='_'): parse an identifier
get_nonwhite(s, offset=0): collect nonwhitespace characters
get_other_chars(s, offset=0, stopchars=None): collect adjacent characters not from stopchars
get_qstr(s, offset=0, q='"', environ=None, default=None, env_specials=None): collect a quoted string, honouring slosh escapes and optionally expanding environment variable references (sketch after this list)
get_sloshed_text(s, delim, offset=0, slosh='\\', mapper=slosh_mapper, specials=None): collect some slosh escaped text with optional special tokens (such as '$' introducing '$foo')
get_tokens(s, offset, getters): collect a sequence of tokens specified in getters (composition sketch after this list)
match_tokens(s, offset, getters): wrapper for get_tokens which catches ValueError and returns None instead
get_uc_identifier(s, offset=0, number=digits, extras='_'): collect an UPPERCASE identifier
get_white(s, offset=0): collect whitespace characters
isUC_(s): test if a string looks like an upper case identifier
htmlify(s, nbsp=False): transcribe text in HTML-safe form, using &lt; for "<", etc
htmlquote(s): transcribe text as HTML quoted string suitable for HTML tag attribute values
jsquote(s): transcribe text as JSON quoted string; essentially like htmlquote without its htmlify step
parseUC_sAttr(attr): parse FOO or FOOs (or FOOes) and return (FOO, is_plural)
slosh_mapper(c, charmap=SLOSH_CHARMAP): return a string to replace c; the default charmap matches Python slosh escapes
texthexify(bs, shiftin='[', shiftout=']', whitelist=None): a function like binascii.hexlify but also supporting embedded "printable text" subsequences for compactness and human readability in the result; the initial use case was transcription of binary data with frequent text, specifically directory entry data (round-trip sketch after this list)
untexthexify(s, shiftin='[', shiftout=']'): the inverse of texthexify()
unctrl(s, tabsize=8): transcribe text removing control characters
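Getters compose: get_tokens() runs a sequence of them over the same string. A hedged sketch, assuming each getter may be a plain callable taking (s, offset) and returning (token, new_offset), and again assuming the cs.lex module name:

    from functools import partial
    from cs.lex import get_chars, get_decimal, get_identifier, get_tokens, get_white

    s = "width = 640"
    tokens, offset = get_tokens(s, 0, (
        get_identifier,                   # "width"
        get_white,                        # " "
        partial(get_chars, gochars="="),  # "="
        get_white,                        # " "
        get_decimal,                      # "640"
    ))
    print(tokens, offset)   # expected tokens: 'width', ' ', '=', ' ', '640'; offset 11

match_tokens() is used the same way when a failed parse should yield None rather than raise ValueError.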
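A similar sketch for get_qstr(), assuming the offset addresses the opening quote and the return follows the (token, new_offset) convention above:

    from cs.lex import get_qstr   # module name assumed

    s = '"hello\\tworld" and more'
    text, offset = get_qstr(s, 0)
    print(repr(text))    # expected: 'hello\tworld' -- the \t slosh escape becomes a tab
    print(s[offset:])    # expected: ' and more'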
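And a round-trip sketch for texthexify()/untexthexify(); how much of the readable run gets bracketed depends on the default whitelist, but untexthexify() is documented as the inverse:

    from cs.lex import texthexify, untexthexify   # module name assumed

    data = b'\x00\x01DIRENTname\xff\xfe'
    transcribed = texthexify(data)
    print(transcribed)                         # hex runs, with readable text between [ and ]
    assert untexthexify(transcribed) == data   # round-trips back to the original bytes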