lexical analysis, tokenisers
Project description
Lexical analysis functions, tokenisers.
An arbitrary assortment of lexical and tokenisation functions useful for writing recursive descent parsers, of which I have several.
Generally the get_* functions accept a source string and an offset (often optional, default 0) and return a token and the new offset, raising ValueError on failed tokenisation.
Function as_lines(chunks, partials=None)
Generator yielding complete lines assembled from the arbitrary pieces of text in the iterable chunks.
After completion, any remaining newline-free text is left in the partials list; it is unavailable to the caller unless the list is presupplied.
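The behaviour above can be sketched as a short generator; this is an illustrative reimplementation of the documented semantics, not the module's own code:

```python
def as_lines(chunks, partials=None):
    # Yield complete lines (each ending in a newline) assembled from
    # the arbitrary text pieces in the iterable `chunks`. Any trailing
    # newline-free text is left in `partials`.
    if partials is None:
        partials = []
    for chunk in chunks:
        while True:
            nl_pos = chunk.find('\n')
            if nl_pos < 0:
                break
            # A newline completes a line: join any buffered partials.
            partials.append(chunk[:nl_pos + 1])
            yield ''.join(partials)
            partials[:] = []
            chunk = chunk[nl_pos + 1:]
        if chunk:
            partials.append(chunk)
```

Presupplying the partials list lets the caller recover any trailing incomplete line after the generator finishes.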
Function get_chars(s, offset, gochars)
Scan the string s for characters in gochars starting at offset.
Return (match, new_offset).
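A minimal sketch of the documented behaviour:

```python
def get_chars(s, offset, gochars):
    # Advance while the current character is one of gochars.
    start = offset
    while offset < len(s) and s[offset] in gochars:
        offset += 1
    return s[start:offset], offset
```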
Function get_decimal(s, offset=0)
Scan the string s for decimal characters starting at offset (default 0).
Return (dec_string, new_offset).
Function get_delimited(s, offset, delim)
Collect text from the string s from position offset up to the first occurrence of the delimiter delim; return the text excluding the delimiter and the offset just past the delimiter.
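A sketch following the module's convention of raising ValueError on a failed tokenisation:

```python
def get_delimited(s, offset, delim):
    # Locate the delimiter; a missing delimiter is a failed parse.
    pos = s.find(delim, offset)
    if pos < 0:
        raise ValueError(
            "delimiter %r not found at or after offset %d" % (delim, offset))
    # Text before the delimiter, offset just past the delimiter.
    return s[offset:pos], pos + len(delim)
```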
Function get_dotted_identifier(s, offset=0, **kw)
Scan the string s for a dotted identifier (by default an ASCII letter or underscore followed by letters, digits or underscores) with optional trailing dot and another dotted identifier, starting at offset (default 0).
Return (match, new_offset).
The empty string and an unchanged offset are returned if there is no leading letter/underscore.
Function get_envvar(s, offset=0, environ=None, default=None, specials=None)
Parse a simple environment variable reference such as $varname, or $x where "x" is a special character.
Parameters:
s: the string containing the variable reference.
offset: the starting point of the reference, default 0.
environ: the environment mapping, default os.environ.
default: default value for missing environment variables; if None (the default), a ValueError is raised.
specials: a mapping of special single-character variable names.
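A sketch of the documented parse; note that the handling of specials here (mapping a character directly to its replacement text) is an assumption for illustration, not confirmed by the documentation:

```python
import os

def get_envvar(s, offset=0, environ=None, default=None, specials=None):
    # Parse $varname or $x at offset, returning (value, new_offset).
    if environ is None:
        environ = os.environ
    if offset >= len(s) or s[offset] != '$':
        raise ValueError("no '$' at offset %d" % offset)
    offset += 1
    # Assumption: specials maps a single character to replacement text.
    if specials is not None and offset < len(s) and s[offset] in specials:
        return specials[s[offset]], offset + 1
    # Gather the variable name: letters, digits and underscores.
    start = offset
    while offset < len(s) and (s[offset].isalnum() or s[offset] == '_'):
        offset += 1
    varname = s[start:offset]
    if not varname:
        raise ValueError("no variable name at offset %d" % offset)
    value = environ.get(varname, default)
    if value is None:
        raise ValueError("unknown environment variable: $" + varname)
    return value, offset
```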
Function get_hexadecimal(s, offset=0)
Scan the string s for hexadecimal characters starting at offset (default 0).
Return (hex_string, new_offset).
Function get_identifier(s, offset=0, alpha='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', number='0123456789', extras='_')
Scan the string s for an identifier (by default an ASCII letter or underscore followed by letters, digits or underscores) starting at offset (default 0).
Return (match, new_offset).
The empty string and an unchanged offset are returned if there is no leading letter/underscore.
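The documented scan can be sketched as follows; an illustrative reimplementation, not the library's code:

```python
import string

def get_identifier(s, offset=0, alpha=string.ascii_letters,
                   number=string.digits, extras='_'):
    # No leading letter/underscore: return the empty string and the
    # offset unchanged, as documented.
    if offset >= len(s) or s[offset] not in alpha + extras:
        return '', offset
    start = offset
    offset += 1
    # Subsequent characters may also be digits.
    while offset < len(s) and s[offset] in alpha + number + extras:
        offset += 1
    return s[start:offset], offset
```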
Function get_nonwhite(s, offset=0)
Scan the string s for characters not in string.whitespace starting at offset (default 0).
Return (match, new_offset).
Function get_other_chars(s, offset=0, stopchars=None)
Scan the string s for characters not in stopchars starting at offset (default 0).
Return (match, new_offset).
Function get_qstr(s, offset=0, q='"', environ=None, default=None, env_specials=None)
Get quoted text with slosh escapes and optional environment substitution.
Parameters:
s: the string containing the quoted text.
offset: the starting point, default 0.
q: the quote character, default '"'. If q is None, do not expect the string to be delimited by quote marks.
environ: if not None, also parse and expand $envvar references.
default: passed to get_envvar.
Function get_sloshed_text(s, delim, offset=0, slosh='\\', mapper=slosh_mapper, specials=None)
Collect slosh escaped text from the string s from position offset (default 0) and return the decoded unicode string and the offset of the completed parse.
Parameters:
delim: end of string delimiter, such as a single or double quote.
offset: starting offset within s, default 0.
slosh: escape character, default a slosh ('\\').
mapper: a mapping function which accepts a single character and returns a replacement string or None; this is used to replace things such as '\t' or '\n'. The default is the slosh_mapper function, whose default mapping is SLOSH_CHARMAP.
specials: a mapping of other special character sequences and parse functions for gathering them up. When one of the special character sequences is found in the string, the parse function is called to parse at that point. The parse functions accept s and the offset of the special character; they return the decoded string and the offset past the parse.
The escape character slosh introduces an encoding of some replacement text whose value depends on the following character. If the following character is:
- the escape character slosh, insert the escape character.
- the string delimiter delim, insert the delimiter.
- the character 'x', insert the character with code from the following 2 hexadecimal digits.
- the character 'u', insert the character with code from the following 4 hexadecimal digits.
- the character 'U', insert the character with code from the following 8 hexadecimal digits.
- a character from the keys of mapper, insert the corresponding replacement string.
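The escape rules above can be demonstrated with a small decoder for the text following the slosh; decode_escape is a hypothetical helper for illustration, not part of the module:

```python
# Default single-character escapes, as for slosh_mapper.
SLOSH_CHARMAP = {
    'a': '\a', 'b': '\b', 'f': '\f',
    'n': '\n', 'r': '\r', 't': '\t', 'v': '\v',
}

def decode_escape(s, offset, delim='"', slosh='\\'):
    # s[offset] is the character immediately after the slosh.
    c = s[offset]
    if c == slosh or c == delim:
        return c, offset + 1                    # literal slosh or delimiter
    if c == 'x':                                # \xHH: 2 hex digits
        return chr(int(s[offset + 1:offset + 3], 16)), offset + 3
    if c == 'u':                                # \uHHHH: 4 hex digits
        return chr(int(s[offset + 1:offset + 5], 16)), offset + 5
    if c == 'U':                                # \UHHHHHHHH: 8 hex digits
        return chr(int(s[offset + 1:offset + 9], 16)), offset + 9
    if c in SLOSH_CHARMAP:
        return SLOSH_CHARMAP[c], offset + 1
    raise ValueError("unrecognised escape %r at offset %d" % (c, offset))
```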
Function get_tokens(s, offset, getters)
Parse the string s from position offset using the supplied tokeniser functions getters; return the list of tokens matched and the final offset.
Parameters:
s: the string to parse.
offset: the starting position for the parse.
getters: an iterable of tokeniser specifications.
Each tokeniser specification is one of:
- a callable expecting (s, offset) and returning (token, new_offset);
- a literal string, to be matched exactly;
- a tuple or list with values (func, args, kwargs): call func(s, offset, *args, **kwargs);
- an object with a .match method, such as a compiled regular expression: call getter.match(s, offset), expecting a match object whose .end() method returns the offset of the end of the match.
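The four specification forms can be sketched as a dispatch loop; this is an illustrative reimplementation of the documented behaviour, not the library's code:

```python
import re

def get_tokens(s, offset, getters):
    # Dispatch on the four documented tokeniser specification forms.
    tokens = []
    for getter in getters:
        if isinstance(getter, str):
            # Literal string: must be present exactly at offset.
            if not s.startswith(getter, offset):
                raise ValueError("%r not found at offset %d" % (getter, offset))
            token, offset = getter, offset + len(getter)
        elif isinstance(getter, (tuple, list)):
            # (func, args, kwargs) specification.
            func, args, kwargs = getter
            token, offset = func(s, offset, *args, **kwargs)
        elif callable(getter):
            token, offset = getter(s, offset)
        else:
            # Regex-like object with a .match method.
            m = getter.match(s, offset)
            if m is None:
                raise ValueError("no match at offset %d" % offset)
            token, offset = m, m.end()
        tokens.append(token)
    return tokens, offset
```

A failed getter raises ValueError, keeping the whole parse within the module's tokenisation convention.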
Function get_uc_identifier(s, offset=0, number='0123456789', extras='_')
Scan the string s for an identifier as for get_identifier(), but require the letters to be uppercase.
Function get_white(s, offset=0)
Scan the string s for characters in string.whitespace starting at offset (default 0).
Return (match, new_offset).
Function htmlify(s, nbsp=False)
Convert a string for safe transcription in HTML.
Parameters:
s: the string.
nbsp: if true, replace spaces with "&nbsp;" to prevent word folding; default False.
Function is_dotted_identifier(s, offset=0, **kw)
Test if the string s is a dotted identifier from position offset onward.
Function is_identifier(s, offset=0, **kw)
Test if the string s is an identifier from position offset onward.
Function isUC_(s)
Check that a string matches ^[A-Z][A-Z_0-9]*$.
Function jsquote(s)
Quote a string for use in JavaScript.
Function lastlinelen(s)
The length of text after the last newline in a string. Initially used by cs.hier to compute effective text width.
Function match_tokens(s, offset, getters)
Wrapper for get_tokens which catches ValueError exceptions and returns (None, offset).
Function parseUC_sAttr(attr)
Take an attribute name and return (key, isplural). FOO returns (FOO, False). FOOs or FOOes returns (FOO, True). Otherwise return (None, False).
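The documented mapping corresponds to a small regular-expression helper; a sketch only, with _UC_ATTR_RE a hypothetical name:

```python
import re

# An uppercase name, optionally followed by a lowercase plural suffix.
_UC_ATTR_RE = re.compile(r'([A-Z][A-Z_0-9]*)(e?s)?$')

def parseUC_sAttr(attr):
    # FOO -> ('FOO', False); FOOs or FOOes -> ('FOO', True);
    # anything else -> (None, False).
    m = _UC_ATTR_RE.match(attr)
    if not m:
        return None, False
    return m.group(1), m.group(2) is not None
```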
Function phpquote(s)
Quote a string for use in PHP code.
Function skipwhite(s, offset=0)
Convenience routine for skipping past whitespace; returns offset of next nonwhitespace character.
Function slosh_mapper(c, charmap={'a': '\x07', 'b': '\x08', 'f': '\x0c', 'n': '\n', 'r': '\r', 't': '\t', 'v': '\x0b'})
Return a string to replace backslash-c, or None.
Function stripped_dedent(s)
Slightly smarter dedent.
Strip the supplied string s, pull off the leading line, dedent the rest, then put back the leading line.
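The steps above can be sketched with textwrap.dedent; an illustrative version, not the module's code:

```python
from textwrap import dedent

def stripped_dedent(s):
    # Strip the string, detach the first line, dedent the remainder,
    # then reattach the first line.
    s = s.strip()
    line1, _, rest = s.partition('\n')
    if not rest:
        return line1
    return line1 + '\n' + dedent(rest)
```

This handles text whose first line has no leading whitespace (such as a docstring opening at the quotes): plain textwrap.dedent would then see an empty common prefix and leave the remaining lines indented.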
Function texthexify(bs, shiftin='[', shiftout=']', whitelist=None)
Transcribe the bytes bs to text.
whitelist: a bytes or string object indicating byte values which may be represented directly in text; string objects are converted to bytes.
texthexify() output strings may be freely concatenated and decoded with untexthexify().
Function unctrl(s, tabsize=8)
Return the string s with TABs expanded and control characters replaced with printable representations.
Function untexthexify(s, shiftin='[', shiftout=']')
Decode a textual representation of binary data into binary data.
Outside the shiftin/shiftout markers the binary data are represented as hexadecimal. Within the markers the bytes have the values of the ordinals of the characters.
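A round-trip sketch of the documented encoding, assuming the whitelist is supplied as a bytes object (the real functions also accept strings); an illustration, not the library's implementation:

```python
import binascii

def texthexify(bs, shiftin='[', shiftout=']', whitelist=b''):
    # Whitelisted bytes pass through literally inside the markers;
    # all other bytes are transcribed as hexadecimal.
    chunks, literal, hexrun = [], b'', b''
    for byte in bs:
        if byte in whitelist:
            if hexrun:
                chunks.append(binascii.hexlify(hexrun).decode('ascii'))
                hexrun = b''
            literal += bytes([byte])
        else:
            if literal:
                chunks.append(shiftin + literal.decode('ascii') + shiftout)
                literal = b''
            hexrun += bytes([byte])
    if hexrun:
        chunks.append(binascii.hexlify(hexrun).decode('ascii'))
    if literal:
        chunks.append(shiftin + literal.decode('ascii') + shiftout)
    return ''.join(chunks)

def untexthexify(s, shiftin='[', shiftout=']'):
    # Hex digits outside the markers, literal byte values inside.
    out = b''
    while s:
        pos = s.find(shiftin)
        if pos < 0:
            out += binascii.unhexlify(s)
            break
        out += binascii.unhexlify(s[:pos])
        end = s.index(shiftout, pos + len(shiftin))
        out += s[pos + len(shiftin):end].encode('ascii')
        s = s[end + len(shiftout):]
    return out
```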
File details
Details for the file cs.lex-20180810.tar.gz.
File metadata
- Download URL: cs.lex-20180810.tar.gz
- Upload date:
- Size: 9.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.23.0 CPython/3.6.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5feb4b40ec82987c02b6d4543885f5c0d9d26cb5e37e9327879c8c152c0386cb
MD5 | 2011039a6649e7c2d22fdf48bba793ea
BLAKE2b-256 | a6d0b46b824f86b6e962807711b9afb4fe37dfc9c317d8eb6d3648ff82de89a4