Python Object Notation
Project description
Python Object Notation (PON) fills the need for simple, human usable and maintainable, configuration data file format for Python.
A basic example of a PON file:
dict( version = [1, 2, 3], # Major, Minor and Micro version number email = "avdn@europython.org", score = 8.7, restrictions = None, published = True, install_requires=dict( # nested structure any=['requests', 'beautifulsoup'], py26=['ordereddict'], ), oddkeys = { "2": 2, "s p a c e d": "out"} )
PON:
is readable and maintainable through familiarity for all Python programmers
allows comments
is editable by humans without breaking things every other time
is efficiently and securely loadable
Has several basic types: string, numbers (including simple math on the numbers), True, False, None, date(), datetime (in UTC, no timezone info),
basic structures: dict (dict() / {}) and list ([]),
extended structures: set (set() / {} (the latter only on 2.7+) and tuples (( ))
basic and extended structures which can be nested (within its own type as well as in the other types) with the following restrictions:
top level must be a dict
no syntax for self references
{ } can only having strings as keys
children of the extended structures (set, tuple) cannot be referenced for substitutions
has substitutions/interpolations on string values, with the substitute value, being the str() result of a possible complex value
can be embedded in a Python file as it is valid Python code, and provides routines to extract such an embedded configuration.
expects UTF-8 based input
has support for multi-line strings through normal triple-quoting, as well as for dedent-ing those multi-line strings
has a read-only parser with a small enough footprint that you do not require a package to be installed before you can use it (e.g. ~100 line parser in several setup.py files in packages on PyPI)
with a few restrictions, supports round-tripping without the loss of comments and without the loss of carefully ordered keys
Output is PEP8 compliant, except for allowing (but not requiring) spaces around the = following a dict() key
Why not YAML/XML/INI-CFG/JSON
YAML
YAML fulfils all of the requirements of the PON except for the string substitution on values. And it can fulfil more complex tasks, like handling (self) references and complex dict/mapping keys. However the basics are easily readable, not all of its syntactical variations might be familiar.
The main problem is that it requires an external library (ruamel.yaml for YAML1.2 (with round-trip preservation of comments), PyYAML for 1.1 (with guaranteed loss of comments)) and e.g. cannot be used within a setup.py.
XML
An XML parser ships with Python, so using that from any program and from setup.py is possible. XML combines the inefficiency of ASCII with the illegibility of binary. ‘nuff said.
INI
INI/CFG files are familiar to a lot of people, but unfortunately there are all kinds of variations and extensions. Python’s built-in configparser (ConfigParser in < Python 3.0), provides only limited structural information. It is essentially a dict, with the sections as keys for values that are a dict (the key = value pairs). It supports multi-line strings in a traditional way (indented continuation lines), comments and has interpolation in two different forms.
JSON
JSON is a format familiar to many for data exchange. However several things make it unusableq for configuration files that are manipulated by humans:
the inability to handle comments in most implementations (including Python’s json module), although the inventor specified these should be ignored
humans generating an invalid file on every other edit, because commas are separators, and are not allowed before list/dict terminal ] resp. }
JSON requires even simple dict keys (strings without spaces) to be obfuscated by quotes
Python has most of the batteries for PON included
In Python you can already do:
config = dict( version = [1, 2, 3], # Major, Minor and Micro version number email = "avdn@europython.org", score = 8.7, restrictions = None, published = True, install_requires=dict( # nested structure any=['requests', 'beautifulsoup',], py26=['ordereddict'] ), )
This is valid Python code, and the embedding requirement is implicitly taken care of. The problem is you don’t always want to evaluation/include/import this as code, because of security implications. You need to be able to parse this and reject invalid or dangerous constructs.
(Also note that you can delete the whole line containing py26= in the above without breaking things, although you end up with a comma before the closing parenthesis.)
Parsing out PON (almost) of the box
The function literal_eval from the ast module can parse a more JSON like variation of the previous config. E.g. the contents of the following file:
{ "version": [1, 2, 3], # Major, Minor and Micro version number "email": "avdn@europython.org", "score": 8.7, "restrictions": None, "published": True, "install_requires": { # nested structure "any": ['requests', 'beautifulsoup',], "py26": ['ordereddict'] }, }
using the following:
python -c 'import ast; ast.literal_eval(open("input2.pon").read())
The above can also be relatively easily parsed from a larger (Python source) file by looking for the assignment to a known variable, config = { and the corresponding } (usually at the same indentation level).
This is almost JSON, but to be able to include JSON in Python, as well as being able to parse that, you need to change it some more:
true = True null = None config = { "version": [1, 2, 3], "email": "avdn@europython.org", "score": 8.7, "restrictions": null, "published": true, "install_requires": { "any": ["requests", "beautifulsoup",], "py26": ["ordereddict"] } }
You have to define null and true for the Python parser to accept this. For most JSON parsers you also have to remove the comments, and consistently use double quotes for strings. And above all you have to remove trailing commas, which is most often forgotten when deleting whole key-value lines at the end of a dictionary/mapping in JSON (resulting in non-running programs unless you use ruamel.yaml/PyYAML to load your JSON files).
Actually, the above is not valid JSON (did you see the trailing comma in the list on the any line?). These problems don’t make JSON a bad format. It is fine for information interchange between programs. JSON files should just never be edited, and preferably not even have to be read, by humans.
A replacement for literal_eval
ast.literal_eval cannot deal with dict(), so using that you cannot have keys that are strings without quotes. It also throws a useless generic ValueError, when it is fed invalid strings, making it difficult to provide meaningful feedback to the human editor of the (invalid) configuration data. And finally it happily tries and fails to do its job when you feed it nonsense data like a float.
ast.literal_eval is a good example how you can make a minimal evaluator around the ast facilities. A small adaptation can handle the extras like dict, date and datetime. Thus allowing non-quoted simple keys, while disallowing non string keys for {}, forcing a toplevel dictionary. The code, including adaptations for 2.6 and later support (2.6’s cannot handle {} type sets):
import sys # NOQA import platform # NOQA import datetime # NOQA from textwrap import dedent # NOQA from _ast import * # NOQA if sys.version_info < (3, ): string_type = basestring else: string_type = str if sys.version_info < (3, 4): class Bytes(): pass class NameConstant: pass if sys.version_info < (2, 7) or platform.python_implementation() == 'Jython': class Set(): pass def loads(node_or_string, dict_typ=dict, return_ast=False, file_name=None): """ Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. """ if sys.version_info < (3, 4): _safe_names = {'None': None, 'True': True, 'False': False} if isinstance(node_or_string, string_type): node_or_string = compile( node_or_string, '<string>' if file_name is None else file_name, 'eval', PyCF_ONLY_AST) if isinstance(node_or_string, Expression): node_or_string = node_or_string.body else: raise TypeError("only string or AST nodes supported") def _convert(node, expect_string=False): if isinstance(node, (Str, Bytes)): return node.s if expect_string: pass elif isinstance(node, Num): return node.n elif isinstance(node, Tuple): return tuple(map(_convert, node.elts)) elif isinstance(node, List): return list(map(_convert, node.elts)) elif isinstance(node, Set): return set(map(_convert, node.elts)) elif isinstance(node, Dict): return dict_typ((_convert(k, expect_string=True), _convert(v)) for k, v in zip(node.keys, node.values)) elif isinstance(node, NameConstant): return node.value elif sys.version_info < (3, 4) and isinstance(node, Name): if node.id in _safe_names: return _safe_names[node.id] elif isinstance(node, UnaryOp) and \ isinstance(node.op, (UAdd, USub)) and \ isinstance(node.operand, (Num, UnaryOp, BinOp)): # NOQA operand = _convert(node.operand) if isinstance(node.op, UAdd): return + operand else: return - operand elif isinstance(node, BinOp) and \ isinstance(node.op, (Add, Sub, Mult)) and \ isinstance(node.right, (Num, UnaryOp, BinOp)) and \ isinstance(node.left, (Num, UnaryOp, BinOp)): # NOQA left = _convert(node.left) right = _convert(node.right) if isinstance(node.op, Add): return left + right elif isinstance(node.op, Mult): return left * right else: return left - right elif isinstance(node, Call): func_id = getattr(node.func, 'id', None) if func_id == 'dict': return dict_typ((k.arg, _convert(k.value)) for k in node.keywords) elif func_id == 'set': return set(_convert(node.args[0])) elif func_id == 'date': return datetime.date(*[_convert(k) for k in node.args]) elif func_id == 'datetime': return datetime.datetime(*[_convert(k) for k in node.args]) elif func_id == 'dedent': return dedent(*[_convert(k) for k in node.args]) elif isinstance(node, Name): return node.s err = SyntaxError('malformed node or string: ' + repr(node)) err.filename = '<string>' err.lineno = node.lineno err.offset = node.col_offset err.text = repr(node) err.node = node raise err res = _convert(node_or_string) if not isinstance(res, dict_typ): raise SyntaxError("Top level must be dict not " + repr(type(res))) if return_ast: return res, node_or_string return res
The above 109 lines of Python is the actual code, that loads the full PON from an iterable.
This code can be further reduced if you only need to support later Python versions, and if you know your input is restricted (no math, no set/tuples/{}, no datetime, etc)
SyntaxError
The ast.literal_eval gives you a generic ValueError without any indication of what might be wrong nor where things are wrong. From the SyntaxError that is raised on erroreous input by loads() you can retrieve useful line information:
error_str = u""" dict( a= u"α", b= False, c= date(2015, 9, 12), d= 1.37, ) """ try: loads(error_str) except SyntaxError as e: context = 2 from_line = e.lineno - (context + 1) to_line = e.lineno + (context - 1) w = len(str(to_line)) for index, line in enumerate(error_str.splitlines(True)): if from_line <= index <= to_line: print(u"{:{}}: {}".format(index, w, line), end=u'') if index == e.lineno - 1: print(u"{:{}} {}^--- {}".format( u' ', w, u' ' * e.offset, e.node))
giving you:
2: a= u"α", 3: b= False, 4: c= date(2015, 9, 12), ^--- <_ast.Call object at 0x7f1598d20950> 5: d= 1.37, 6: )
(the PON parser as indicated above extends ast.literal_eval with date() and doesn’t throw an eror on that input)
Motivation
The development of the literal_eval extension/replacement was motivated by cleaning providing version and other information from the __init__.py of a package to its setup.py file, thereby minimising the clutter of extra configuration files in the base directory (it is bad enough with setup.py, dist and tox.ini as non hidden files/directories.
A version number can be easily parsed from an __init__.py file. But allowing for more complex and complete configuration data allows setup.py to be the same for all of my projects.
Using the pon package
The pon package provides the the main parser loads(), the utility functions get(), store() and extract() and the PON class (for which the utilities are shortcuts).
get() and store()
If you have a configuration:
dict( a = dict( b = 24, c = [1, 3.14, {'d': 'klm'}], }
loaded into a variable config, you can access the value klm in the normal Python way by using config['a']['c'][2]['d']. PON also provides the function get() with which you can access the same value using get(config, 'a.c.2.d').
Based on the nested structure of config the “2” in that sequence is converted to an index. As indicated before, integers as dict keys, are not allowed, keys have to strings.
Complementary there is the store() function (set() being a reserved word in Python) that takes as third parameter a value, to set or overwrite an existing one: store(config, 'a.c.2.d', 'xyz')
Substitution with get()
Substitution (called interpolation in ConfigParser/configparser) is done by accessing a value of your configuration with with get(), and providing the extra keyword expand. Substitution is done recursively on the expanded value. You can provide the config object itself to expand:
val = get(config, 'some.path', expand=config)
and since this is such a common use case, you can specify expand=True instead of actually passing in config twice.
The syntax for substitution is the usual Python, "{key}".format(key=value), string formatting but the key can be a dotted sequence valid for get():
import pon config = pon.loads("""\ dict( a = dict( image = "http://{domain}/images", alt = "europython.eu", dd = (2011, 10, 2) # this is a tuple ), domain = 'python.{tld.organisations}', datestr = 'date{a.dd}', tld = {"organisations": "org", "commmercian": "com"} ) """) for key in ['a.image', 'datestr']: print(key, '->', pon.get(config, key, expand=True))
giving:
a.image -> http://python.org/images datestr -> date(2011, 10, 2)
The recursion for this is restricted to 10 levels.
The separator (by default “.”) can be set on the PON class. Since “:” is special in format strings, that character cannot be used as separator.
RoundTripping PON
With some restrictions it is possible to round trip PON, while preserving comments, in the smae way ruamel.yaml can for YAML:
you will not lose any data
on the first round-trip your formatting might change
a second round-trip will result in the same output as the first round-trip
In order to facilitate round-tripping extra information needs to be kept that is not available in the normal dict you get from the loading of your PON data structure into Python. This extra processing can be done up-front, after which the original configuration data is no longer necessary in text form, but wastes time during loading in case the round-tripping is never needed. It can be extracted on demand, but in that case the original textual data needs to be available. This time vs storage trade of is currently done at load time, and only when using the PON class (and not when using utility function loads()).
If you create a PON object from input (a file, a string or a list of strings) using PON(input) the resulting object will have information about the dict keys and list elements and has comment information associated with these keys.
The primary purpose for round-tripping is updating existing information in the configuration: updating one of the tuple values for key “version”, adding a dependency package to the list necessary for py26. If a whole new configuration file, including comments, needs to be generated, this can generally be done more easily by using (or starting from) a text template than to try and procedurally built the structure.
Changes on first round trip
Partly due to the pprint code on which the dumper is based, partly due to arbitrary decisions on what kind of formatting info is preserved and partly depending on your input, the following happens on the first round-trip:
the last element on a multi-line dict/list gets a trailing comma
comments that cannot be associated with an dict key or list element on the same line get moved to the next key/element (or the end of the main dict if no following elements)
strings are single quoted unless they contain a single quote (and no double quotes)
indent levels are a at 4 spaces
space around the equal sign between dict keys and values is removed
sets and tuples elements cannot be associated with comments, hence comments wander if they are not on the same line as a dict-key/list-element
sets and tuples are dumped on a single line, any dicts and lines underneath them are currently inaccessible for get() and therefore keys/elements for such dicts/lists cannot be associated with comments
extra lines with white space are silently dropped
add/subtract/multiply is not preserved
the datetime repr() drops trailing milliseconds if and seconds if they equal zero.
No data or comments get lost, unless you manipulate dict keys and/or list length. And if the output from a dump is taken as source there should be no further “wandering”. The following input:
try: from cStringIO import StringIO as _StringIO except ImportError: from io import StringIO as _StringIO from pon import PON input = """ dict( pckgs = dict( any=['package1', 'package2'], py26=['another package', 'and one with a long name', 'and on a new line'] # where do you go? ), ) """ out1 = _StringIO() p1 = PON(input) p1.dump(out1) print(out1.getvalue()) out2 = _StringIO() p2 = PON(out1.getvalue()) p2.dump(out2) print('roundtrip 1: {0}, roundtrip 2: {1}'.format( input == out1.getvalue(), out1.getvalue() == out2.getvalue()))
gives the following output:
dict( pckgs=dict( any=['package1', 'package2'], py26=['another package', 'and one with a long name', 'and on a new line', # where do you go? ], ), ) roundtrip 1: False, roundtrip 2: True
Further improvements
The set and tuple elements could be indexed, and then comments could be associated with their elements, and multi-line dumping would be better preserved.
The dump() could take parameters about indentation depth and on string quoting information.
The dump() output should be PEP8 compliant in principle. But IMO the removal of spaces around the = in a multi-line keyword argument assignment for dict() doesn’t make thing more readable. A parameter to select one or the other would be useful:
dict( a='1234324', b=['xyz', 'klm'], )
is less easy to read than:
dict( a = '1234324', b = ['xyz', 'klm'], )
Keeping the # of comments on multiple consecutive lines aligned, even if a value was changed before dumping and has become longer.
Showcase
The following program contains most (if not all) of the facilities and round-trips:
from io import StringIO as _StringIO from pon import PON configs = u'''\ # example config # should contain all types and facilities dict( s='abc', # single line string # multiline string mls="""one two three""", mls_dedent=dedent(""" abc def """), ghi={'A': 1, 'B': 2}, klm=['Airbus 370', 'Fokker 100'], opq=set([2, 3, 5, 7, 9]), rst=(0, 1, 1, 2, 3, 5, 8, 13), # Fibonacci m={u'π': 3.14}, anniversary=date(2011, 10, 2), dts=datetime(1919, 12, 1, 13, 45, 4), milisec=datetime(1922, 10, 19, 17, 55, 23, 321), six=2 + 4, secs_per_day=24 * 60 * 60, two=-2 - -4, # if you want to extend, do it here ) # and it's over ''' out = _StringIO() p = PON(configs) p.dump(out) conf_adjust_for_calc = configs # calculations are not preserved, they don't round trip, so adjust here for x, y in (('2 + 4', '6'), ('24 * 60 * 60', "{}".format(24 * 60 * 60)), ('-2 - -4', '2')): conf_adjust_for_calc = conf_adjust_for_calc.replace(x, y) outl = out.getvalue().splitlines(True) orgl = conf_adjust_for_calc.splitlines(True) if outl == orgl: print('roundtrip 1: equal') else: import difflib diff = difflib.unified_diff(orgl, outl, 'input', 'output') for line in diff: print(line, end='')
with output:
--- input +++ output @@ -3,9 +3,7 @@ dict( s='abc', # single line string # multiline string - mls="""one - two - three""", + mls='one\n two\n three', mls_dedent=dedent(""" abc def
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Comments
Dumping the loaded PON structure as text, assuming some formatting criteria, is relatively easy, if we could just ignore the fact that comments are important for future readers of the dumped information.
The Python built-in compile() function generating the AST information from which the object holding the configuration information is extracted, throws away the comments. So the comment information has to be re-associated with the object, and in addition to determining what comment belongs where, this requires that the elements in the object tree can be extended (requiring more complex objects that behave like dicts/lists, but have extra slots for comment information, a method which is also used in ruamel.yaml), or that a shadow structure is kept in the same form as the configuration object.
PON follows a hybrid by requiring the dictionaries to have key insertion ordering (the ordering of the keys in the source configuration data) as well as keeping a shadow structure. The shadow structure is extracted from the AST tree (used for generating/checking the configuration loaded) with tokenization information (which provides the comments).
Comments are associated with dictionary keys or list elements as far as these are “on their own line”. A full line comment belongs, or consecutive comments belong, to the next key/element if it is on a line of its own after the previous key/element. An end-of-line comment follows a key/element at the end of a line (and there can be only one). Additionally track is kept of comments before the initial, top level, dict(, after the final key (there is no following one to hook it up to) and after the dict closing ) token.
An example of a heavily commented PON file:
If you change dictionary keys, comments associated with these will generally get lost. So do comments associated with key/element that get deleted.