
lexical analysis, tokenisers

Project description

An assortment of lexical and tokenisation functions useful for writing recursive descent parsers, of which I have several.

Generally the get_* functions accept a source string and an offset (often optional, default 0) and return a token and the new offset, raising ValueError on failed tokenisation.
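
For example, two getters can be chained by feeding each returned offset to the next call. A minimal sketch; the exact values shown are illustrative, assuming only the (token, new-offset) convention described above:

    from cs.lex import get_identifier, get_white

    s = 'spam   eggs'
    name, offset = get_identifier(s)            # ('spam', 4)
    ws, offset = get_white(s, offset)           # ('   ', 7)
    name2, offset = get_identifier(s, offset)   # ('eggs', 11)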

  • as_lines(chunks, partials=None): parse text chunks, yield complete individual lines

  • get_chars(s, offset, gochars): collect adjacent characters from gochars

  • get_delimited(s, offset, delim): collect text up to the first occurrence of the character delim

  • get_envvar(s, offset=0, environ=None, default=None, specials=None): parse an environment variable reference such as $foo

  • get_identifier(s, offset=0, alpha=ascii_letters, number=digits, extras='_'): parse an identifier

  • get_nonwhite(s, offset=0): collect nonwhitespace characters

  • get_other_chars(s, offset=0, stopchars=None): collect adjacent characters not from stopchars

  • get_qstr(s, offset=0, q='"', environ=None, default=None, env_specials=None): collect a quoted string, honouring slosh escapes and optionally expanding environment variable references

  • get_sloshed_text(s, delim, offset=0, slosh='\\', mapper=slosh_mapper, specials=None): collect some slosh-escaped text with optional special tokens (such as '$' introducing '$foo')

  • get_tokens(s, offset, getters): collect a sequence of tokens specified in getters (see the composition sketch after this list)

  • match_tokens(s, offset, getters): wrapper for get_tokens which catches ValueError and returns None instead

  • get_uc_identifier(s, offset=0, number=digits, extras='_'): collect an UPPERCASE identifier

  • get_white(s, offset=0): collect whitespace characters

  • isUC_(s): test if a string looks like an upper case identifier

  • htmlify(s, nbsp=False): transcribe text in HTML-safe form, using &lt; for "<", etc

  • htmlquote(s): transcribe text as HTML quoted string suitable for HTML tag attribute values

  • jsquote(s): transcribe text as JSON quoted string; essentially like htmlquote without its htmlify step

  • parseUC_sAttr(attr): parse FOO or FOOs (or FOOes) and return (FOO, is_plural)

  • slosh_mapper(c, charmap=SLOSH_CHARMAP): return a string to replace c; the default charmap matches Python slosh escapes

  • texthexify(bs, shiftin='[', shiftout=']', whitelist=None): a function like binascii.hexlify but also supporting embedded "printable text" subsequences for compactness and human readability in the result; the initial use case was transcription of binary data with frequent text, specifically directory entry data (see the round-trip sketch after this list)

  • untexthexify(s, shiftin='[', shiftout=']'): the inverse of texthexify()

  • unctrl(s, tabsize=8): transcribe text removing control characters

  • unrfc2047(s): decode RFC2047-encoded text as found in mail message headers
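
As an illustration of composing getters, a hedged sketch of get_tokens/match_tokens. It assumes, per the descriptions above, that each getter follows the (s, offset) -> (token, new_offset) convention, that get_tokens returns the collected tokens plus the final offset, and that match_tokens simply returns None where get_tokens would raise ValueError:

    from cs.lex import get_identifier, get_qstr, get_white, match_tokens

    # Expect: identifier, whitespace, double-quoted string.
    getters = (get_identifier, get_white, get_qstr)
    m = match_tokens('greeting "hello\\nworld"', 0, getters)
    if m is not None:
        tokens, offset = m
        # tokens like ['greeting', ' ', 'hello\nworld'] (the slosh
        # escape \n decoded), with offset at the end of the input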
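
And a round-trip sketch for texthexify()/untexthexify(). The encoded form shown is only indicative, since which runs are kept as readable text depends on the default whitelist:

    from cs.lex import texthexify, untexthexify

    data = b'\x00\x01README\x00\x02'
    encoded = texthexify(data)
    # something like '0001[README]0002': hex digits, with the printable
    # run shifted into [ ... ] using the default shiftin/shiftout marks
    assert untexthexify(encoded) == data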

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cs.lex-20150120.tar.gz (8.1 kB)


File details

Details for the file cs.lex-20150120.tar.gz.

File metadata

  • Download URL: cs.lex-20150120.tar.gz
  • Upload date:
  • Size: 8.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for cs.lex-20150120.tar.gz

  Algorithm     Hash digest
  SHA256        84efcc0725cf99fe953148c49c7d0e8c7e283c244f0ea039f6decfac54c98235
  MD5           b8ed5aede3eb8b527acd72417413ef31
  BLAKE2b-256   b25f83512d8dcd5e5d92bca9a048f2749863fa0f3f2964b4e0c18de761b72f47
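
To check a downloaded file against the SHA256 digest above, something like the following (standard library only; the filename is the source distribution listed above):

    import hashlib

    expected = '84efcc0725cf99fe953148c49c7d0e8c7e283c244f0ea039f6decfac54c98235'
    with open('cs.lex-20150120.tar.gz', 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    assert digest == expected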

