A Python-based stream editor for json documents
Project description
Python-based stream editor for json files. It is a simple setup that effectively works as a json-parsing awk, similar to jq, but allowing in-place editing and output of json documents as well, and using Python as the working language. It supports colorized output.
Motivation
This program exists for fairly minor convenience, and mostly for my own use. Whenever I need to quickly edit some json, I find myself opening a Python REPL, writing a bunch of obvious code to load the json, working on it a little, and then dumping it back out to the relevant file. Likewise, whenever I need to inspect json from a web page, I either curl it to a file and then do the same, use requests or something similar to pull it directly into a Python REPL so I can properly inspect it, or pipe it through Python’s json.tool and less.
This is meant to supplant those use cases entirely for my own uses. If you find it inconvenient to repeatedly undergo the busy work associated with working with or inspecting json data, and especially if you are most familiar and comfortable with the stream-editing way of doing things or spending time in a REPL, this tool might make things a little more convenient for you. It is also useful for inspecting and converting between formats, such as between msgpack and json.
Why not jq?
jq is a really great tool for a lot of what you would use this for. I wrote this because jq doesn’t provide the user with a REPL to mangle data, and because Python is a much more powerful and flexible language for the modification process, especially if you want to access the filesystem or other I/O.
jq is a powerful program with a lot of development behind it, active maintenance, and its own filter language. If that’s what you want, use that. If you want a simple tool for loading json and working on it with Python in either a stream or REPL fashion, this is probably a better fit.
Installation
pip3 install --user pyjawk
Note that pyjawk is Python 3 only.
Use
Display the help text with -h.
In all evaluated python, data represents the parsed input data.
The program takes its input either through stdin or from the file given with -i, and writes its output either to stdout or to the file given with -o. -f arguments pass in script files, which are run first. -e arguments are similar to -f, but are given as Python source text on the command line and run after the script files. A positional parameter, if present, is evaluated as a Python expression, and its result replaces the data object. The data object is then serialized and output. -c enables compact output and may be specified multiple times for some output formats.
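The flow described above can be sketched in plain Python. This is only an illustration for the json case, not pyjawk's actual code: run_pipeline and its parameter names are hypothetical, and the order in which the positional expression runs relative to the -e statements is an assumption.

```python
import json


def run_pipeline(text, scripts=(), statements=(), expression=None):
    """Hypothetical helper: parse input, run code against `data`, serialize."""
    # The parsed input is exposed to all evaluated code as `data`.
    env = {"data": json.loads(text)}

    # -f style: source text of script files, run first.
    for source in scripts:
        exec(source, env)

    # -e style: source text from the command line, run after the scripts.
    for source in statements:
        exec(source, env)

    # Positional expression: its value replaces `data`.
    # (Running it after the -e statements is an assumption of this sketch.)
    if expression is not None:
        env["data"] = eval(expression, env)

    # Finally, serialize whatever `data` now refers to.
    return json.dumps(env["data"])


print(run_pipeline('{"a": 1}', statements=["data['b'] = 2"]))
# {"a": 1, "b": 2}
```

Scripts may rebind data as well as mutate it, which is why the sketch reads the final value back out of the shared environment instead of keeping its own reference.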
-I and -O may be used to set the input and output formats, respectively.
-n and -N disable input and output respectively.
If -r / --repl is specified, instead of writing output after processing, the function to write to the output is registered in the environment as write, the arguments structure is registered as args, and a ptpython REPL is started up with the same environment.
Multiple command line tools are available, but they all only set the default input and output formats.
Formats
json
Available as the command line tool pyjawk
Supports 3 levels of compactness.
Outputs trailing newline except on highest compaction.
Supports colorized output.
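One plausible reading of the three json compactness levels, sketched with the standard json module. The function name dump_json, the indent width of 2, and the exact meaning of each level are assumptions; only the trailing-newline rule comes from the list above.

```python
import json


def dump_json(obj, compact=0):
    """Hypothetical mapping of a -c count onto json.dumps options."""
    if compact == 0:
        # Default: indented, multi-line output (indent width is an assumption).
        return json.dumps(obj, indent=2) + "\n"
    if compact == 1:
        # One -c: a single line with the default ", " and ": " separators.
        return json.dumps(obj) + "\n"
    # Highest compaction: no optional whitespace and no trailing newline.
    return json.dumps(obj, separators=(",", ":"))


print(repr(dump_json({"a": [1, 2]}, compact=1)))
# '{"a": [1, 2]}\n'
print(repr(dump_json({"a": [1, 2]}, compact=2)))
# '{"a":[1,2]}'
```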
yaml
Available as the command line tool pyyawk
Supports 3 levels of compactness.
Outputs trailing newline.
Supports colorized output.
xml
Available as the command line tool pyxawk
Parses into an xml.etree.ElementTree.Element object and dumps as xml text. Uses xml.etree.ElementTree.tostring to dump and, if uncompacted, uses xml.dom.minidom to prettify.
Supports 2 levels of compactness.
Outputs trailing newline.
Supports colorized output.
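The parse-and-dump path named above can be tried with the standard library alone. This is a sketch of that combination, not pyjawk's own code: dump_xml and its compact parameter are hypothetical names, and the trailing newline on the compact form is an assumption based on the list above.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def dump_xml(element, compact=0):
    """Hypothetical dumper: tostring, then minidom prettifying if uncompacted."""
    raw = ET.tostring(element, encoding="unicode")
    if compact:
        return raw + "\n"
    # toprettyxml adds the XML declaration, indentation, and a final newline.
    return minidom.parseString(raw).toprettyxml()


# Parse into an Element, then move <bar> from <foo> to <baz>.
root = ET.fromstring("<root><foo><bar>first</bar></foo><baz /></root>")
foo, baz = list(root)
bar = list(foo)[0]
baz.append(bar)
foo.remove(bar)

print(dump_xml(root, compact=1), end="")
print(dump_xml(root), end="")
```

Because data is an ordinary Element, everything in the xml.etree.ElementTree API (find, iter, attrib, and so on) is available in scripts and expressions.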
python
Available as the command line tool pypawk
Uses eval to pull in objects, and either pprint or repr to dump, depending on compactness.
Supports 3 levels of compactness.
Outputs trailing newline.
Supports colorized output.
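A rough picture of the python format, using only eval, pprint, and repr as named above. The helper names are hypothetical, and the list above does not say which compactness level switches from pprint to repr, so the sketch simply treats any nonzero level as repr.

```python
from pprint import pformat


def load_python(text):
    """Hypothetical loader: evaluate a Python literal or expression."""
    # eval runs arbitrary code, so this is only suitable for trusted input.
    return eval(text)


def dump_python(obj, compact=0):
    """Hypothetical dumper: pprint-style by default, repr when compacted."""
    body = repr(obj) if compact else pformat(obj)
    return body + "\n"


data = load_python("{'foo': ('bar', 1), 'baz': {1, 2, 3}}")
print(dump_python(data, compact=1), end="")
# {'foo': ('bar', 1), 'baz': {1, 2, 3}}
```

Unlike json, this format round-trips tuples, sets, and other values that have a Python literal form.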
msgpack
Available as the command line tool pymawk
string
Available as the command line tool pysawk
Simply reads input into a string and outputs data as a string, using str on it before dumping.
Outputs trailing newline except when compaction is requested.
bytes
Available as the command line tool pybawk
Simply reads input into bytes and outputs data as bytes.
The data object
The data object is much like the ones normally produced by the underlying libraries, except that all dictionaries for json, yaml, and msgpack are replaced with a subclass of dict that maintains order and gives attribute access to the dictionary storage. All of the following are equivalent:
data["foo"]["bar"][0]["baz"] = "spam"
data.foo.bar[0].baz = "spam"
data.foo.bar[0]["baz"] = "spam"
data["foo"]["bar"][0].baz = "spam"
Note that constructing a dict explicitly will not automatically construct this subclass, so attribute access will not work on it. To get attribute access on new dictionaries, import pyjawk.attrdict.AttrDict and construct it directly:
data.foo = {
"bar": [{"baz": "spam"}],
}
# This will not work.
data.foo.bar[0].baz = "alot"
# The dicts are still ordinary dicts, so this will work
data.foo["bar"][0]["baz"] = "alot"
from pyjawk.attrdict import AttrDict as d
data.foo = d(
bar=[d(baz="spam")],
)
# Now this will work
data.foo.bar[0].baz = "alot"
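To make the idea concrete, here is a minimal sketch of a dict subclass with attribute access, wired into json.loads. It is not the code of pyjawk.attrdict.AttrDict; the class name AttrDictSketch is hypothetical, and the real class may differ in details.

```python
import json


class AttrDictSketch(dict):
    """Hypothetical stand-in: a dict whose keys can also be used as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name) from None


# object_pairs_hook makes every decoded json object an AttrDictSketch.
data = json.loads(
    '{"foo": {"bar": [{"baz": "spam"}]}}',
    object_pairs_hook=AttrDictSketch,
)
data.foo.bar[0].baz = "alot"
print(data["foo"]["bar"][0]["baz"])
# alot
```

One limitation of this sketch: reading a key that shares its name with a dict method, such as data.items, returns the method, because __getattr__ only runs when normal lookup fails. The examples in this document read data.items as a key, so the real class evidently resolves keys first; with the sketch you would use data["items"] instead.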
Details on IO and arguments in the REPL
In the REPL, the program’s own argument namespace is available as args. Changing some of them is obvious (such as args.output, which is just a string, or args.no_input which is just a boolean), and some others are perhaps non-obvious (args.compact is an integer specifying the number of times it was present). Some of the arguments don’t make any sense to work with (such as args.input and args.input_format, because those are already finished by the time the REPL starts up).
The REPL does not write the output by default. To write the output with the REPL, the write() function must be called explicitly.
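The shape of such a namespace can be reproduced with argparse. Whether pyjawk itself uses argparse, the long option spellings, and the defaults are all assumptions here; only the attribute names compact, output, and no_input come from the text above.

```python
import argparse

# Hypothetical parser that yields the attribute names mentioned above.
parser = argparse.ArgumentParser(prog="sketch")
# action="count" makes args.compact the number of times -c was given.
parser.add_argument("-c", "--compact", action="count", default=0)
parser.add_argument("-o", "--output", default="-")
parser.add_argument("-n", "--no-input", action="store_true")

args = parser.parse_args(["-c", "-c", "-o", "out.json"])
print(args.compact, args.output, args.no_input)
# 2 out.json False

# In a REPL these are plain attributes, so they can be reassigned before writing.
args.compact = 0
args.output = "other.json"
```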
When you wish to use the REPL, stdin and stdout must be attached to a terminal. This means that you need to be taking input from a file, not a pipe, and the program may not be piped to anything else. This is necessary because ptpython needs stdin to be communicated with and stdout to communicate back to the user. If you wish to pipe something into pyjawk for REPL use, you’ll have to use a fifo, a temp file, or a process substitution as follows:
# With process substitution
pyjawk -ri <(curl 'https://httpbin.org/get?foo=bar&spam=spam')
# With a temp file
curl 'https://httpbin.org/get?foo=bar&spam=spam' > curltemp.json
pyjawk -ri curltemp.json
# With a fifo
mkfifo curl.fifo
curl 'https://httpbin.org/get?foo=bar&spam=spam' > curl.fifo &
pyjawk -ri curl.fifo
In every case, the data in the repl is:
>>> from pprint import pprint
>>> pprint(data)
{'args': {'foo': 'bar', 'spam': 'spam'},
'headers': {'Accept': '*/*',
'Host': 'httpbin.org',
'User-Agent': 'curl/7.65.0'},
'origin': '73.169.51.67, 73.169.51.67',
'url': 'https://httpbin.org/get?foo=bar&spam=spam'}
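The terminal requirement described in this section amounts to a check like the following. This is a sketch of the general technique, not pyjawk's code, and repl_allowed is a hypothetical name.

```python
import io
import sys


def repl_allowed(stdin=None, stdout=None):
    """Hypothetical check: an interactive REPL needs both ends on a terminal."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # isatty() is False for pipes, regular files, and in-memory streams.
    return stdin.isatty() and stdout.isatty()


# In-memory streams behave like a pipe for this purpose.
print(repl_allowed(io.StringIO(), io.StringIO()))
# False
```

Reading the input with -i from a fifo, temp file, or process substitution leaves stdin attached to the terminal, which is why the three workarounds above all work.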
Examples
Dumping some data to paste.ee
$ echo '{"a": "1", "b": null, "c": true, "d": false, "e": 7, "f": 8.5, "g": {"h": [1, 2, 3]}}' \
| pyjawk '{"sections": [{"contents": str(data)}]}' \
| curl -H 'Content-Type: application/json' -H 'X-Auth-Token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' -XPOST --data-binary '@-' https://api.paste.ee/v1/pastes
{"id":"umXKr","link":"https:\/\/paste.ee\/p\/umXKr","success":true}
With this, you can also send arbitrary string data, and extract the link from the response if you like:
$ echo this is some test data \
| pyjawk -Istring '{"sections": [{"contents": data}]}' \
| curl -H 'Content-Type: application/json' -H 'X-Auth-Token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' -XPOST --data-binary '@-' https://api.paste.ee/v1/pastes \
| pyjawk -Ostring 'data.link'
https://paste.ee/p/iomJR
Converting data between formats
$ echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}' \
| pyjawk -Oyaml
baz:
- spam
- Spam
- SPAM?: SPAM!
foo: bar
Selecting a part of a data-structure with evals
$ echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}' \
| pyjawk -c 'data.baz[2]'
{"SPAM?": "SPAM!"}
Extracting a value as a string
$ echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}' \
| pyjawk -Ostring 'data.baz[1]'
Spam
Easily embedding string data from stdin into a json structure
$ echo 'this is a test string' \
| pyjawk -Istring -Ojson -c '{"foo": data}'
{"foo": "this is a test string\n"}
Relocating an xml child
$ echo '<root><foo><bar>first</bar></foo><baz /></root>' \
| pyxawk -e 'foo = list(data)[0]; bar = list(foo)[0]; baz = list(data)[1]; baz.append(bar); foo.remove(bar)'
<?xml version="1.0" ?>
<root>
<foo/>
<baz>
<bar>first</bar>
</baz>
</root>
The -e can also be specified separately:
$ echo '<root><foo><bar>first</bar></foo><baz /></root>' \
| pyxawk -e 'foo = list(data)[0]' -e 'bar = list(foo)[0]' -e 'baz = list(data)[1]' -e 'baz.append(bar)' -e 'foo.remove(bar)'
Or just as a script file (here relocate.py, whose contents follow the command):
$ echo '<root><foo><bar>first</bar></foo><baz /></root>' \
| pyxawk -f relocate.py
foo = list(data)[0]
bar = list(foo)[0]
baz = list(data)[1]
baz.append(bar)
foo.remove(bar)
Exploring a structure in a REPL
$ pyjawk -i<(echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}') -r
>>> data
{'foo': 'bar', 'baz': ['spam', 'Spam', {'SPAM?': 'SPAM!'}]}
>>> write()
{
"foo": "bar",
"baz": [
"spam",
"Spam",
{
"SPAM?": "SPAM!"
}
]
}
>>> data = data.baz
>>> write()
[
"spam",
"Spam",
{
"SPAM?": "SPAM!"
}
]
Fixing RetroArch Playlists
Suppose you have an issue with the way that RetroArch generates its playlist files for the PlayStation (by default, it searches for .cue files, but not .bin), and you have the following files in /tmp/Roms/psx, all Sony PlayStation games:
Alpha.bin Alpha.cue Bravo.bin Charlie.bin Delta.bin Delta.cue
You might end up with a playlist file like this:
{
"version": "1.2",
"default_core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"default_core_name": "Sony - PlayStation (PCSX ReARMed)",
"label_display_mode": 0,
"right_thumbnail_mode": 0,
"left_thumbnail_mode": 0,
"items": [
{
"path": "/tmp/Roms/psx/Alpha.cue",
"label": "Alpha",
"core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"core_name": "Sony - PlayStation (PCSX ReARMed)",
"crc32": "00000000|crc",
"db_name": "Sony - PlayStation.lpl"
},
{
"path": "/tmp/Roms/psx/Delta.cue",
"label": "Delta",
"core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"core_name": "Sony - PlayStation (PCSX ReARMed)",
"crc32": "00000000|crc",
"db_name": "Sony - PlayStation.lpl"
}
]
}
If you want the playlist to contain just the .bin files, you can scan the directory for them and rewrite the json in place with this tool:
$ pyjawk -i 'Sony - PlayStation.lpl' -o 'Sony - PlayStation.lpl' -e 'from pathlib import Path' -e 'data.items = [{"path": str(path), "label": path.stem, "core_path": data.default_core_path, "core_name": data.default_core_name, "crc32": "00000000|crc", "db_name": "Sony - PlayStation.lpl"} for path in (Path("/tmp") / "Roms" / "psx").iterdir() if path.suffix == ".bin"]'
This produces the output:
{
"version": "1.2",
"default_core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"default_core_name": "Sony - PlayStation (PCSX ReARMed)",
"label_display_mode": 0,
"right_thumbnail_mode": 0,
"left_thumbnail_mode": 0,
"items": [
{
"path": "/tmp/Roms/psx/Delta.bin",
"label": "Delta",
"core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"core_name": "Sony - PlayStation (PCSX ReARMed)",
"crc32": "00000000|crc",
"db_name": "Sony - PlayStation.lpl"
},
{
"path": "/tmp/Roms/psx/Charlie.bin",
"label": "Charlie",
"core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"core_name": "Sony - PlayStation (PCSX ReARMed)",
"crc32": "00000000|crc",
"db_name": "Sony - PlayStation.lpl"
},
{
"path": "/tmp/Roms/psx/Bravo.bin",
"label": "Bravo",
"core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"core_name": "Sony - PlayStation (PCSX ReARMed)",
"crc32": "00000000|crc",
"db_name": "Sony - PlayStation.lpl"
},
{
"path": "/tmp/Roms/psx/Alpha.bin",
"label": "Alpha",
"core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
"core_name": "Sony - PlayStation (PCSX ReARMed)",
"crc32": "00000000|crc",
"db_name": "Sony - PlayStation.lpl"
}
]
}
That might look heavy up-front, but you can rewrite it as a script file with simpler structure:
from pathlib import Path
data.items = []
for path in (Path('/tmp') / 'Roms' / 'psx').iterdir():
    if path.suffix == '.bin':
        data.items.append({
            "path": str(path),
            "label": path.stem,
            "core_path": data.default_core_path,
            "core_name": data.default_core_name,
            "crc32": "00000000|crc",
            "db_name": "Sony - PlayStation.lpl",
        })
and run it with pyjawk like so:
pyjawk -i 'Sony - PlayStation.lpl' -o 'Sony - PlayStation.lpl' -f script.py
Or instead load it into a repl to work on it in real time with this:
pyjawk -i 'Sony - PlayStation.lpl' -o 'Sony - PlayStation.lpl' -r
>>> from pathlib import Path
>>> data.items = []
>>> for path in (Path('/tmp') / 'Roms' / 'psx').iterdir():
...     if path.suffix == '.bin':
...         data.items.append({
...             "path": str(path),
...             "label": path.stem,
...             "core_path": data.default_core_path,
...             "core_name": data.default_core_name,
...             "crc32": "00000000|crc",
...             "db_name": "Sony - PlayStation.lpl",
...         })
>>> write()
>>> exit()
Just make sure you call write() in the repl, or nothing will be written.
Plans
I don’t plan to add too much to this, as I want it to be useful but also as lean and manageable as it possibly can be. Things like HTTP input and output are best left to other programs that can do it better, like curl, especially given that this program can operate in a streamable fashion.
This program needs some regression tests set up.