A Python-based stream editor for json documents

Project description

Python-based stream editor for json files. It is a simple setup that effectively works as a json-parsing awk, similar to jq, but allowing in-place editing and output of json documents as well, and using Python as the working language. It supports colorized output.

Motivation

This program exists for fairly minor convenience, and mostly for my own use. Whenever I need to quickly edit some json, I find myself opening a Python REPL, writing a bunch of obvious code to load the json, working on it a little bit, and then dumping it back out to the relevant file. Likewise, whenever I need to inspect json from a web page, I either curl it to a file and then do the same, use requests or something similar to pull it directly into a Python REPL so I can properly inspect it, or pipe it through Python’s json.tool and less.

This is meant to replace those workflows entirely for my own use. If you find it inconvenient to repeat the busy work associated with editing or inspecting json data, and especially if you are most familiar and comfortable with the stream-editing way of doing things or with spending time in a REPL, this tool might make things a little more convenient for you. It is also useful for inspecting data and converting between formats, such as between msgpack and json.

Why not jq?

jq is a really great tool for a lot of what you would use this for. I wrote this because jq doesn’t provide the user with a REPL to mangle data, and because Python is a much more powerful and flexible language for the modification process, especially if you want to access the filesystem or other I/O.

jq is a powerful program with a lot of development behind it, active maintainers, and its own filter language. If that’s what you want, use that. If you want a simple tool for loading json and working on it with Python in either a stream or REPL fashion, this is probably a better fit.

Installation

pip3 install --user pyjawk

Note that pyjawk is Python 3 only.

Use

Display the help text with -h.

In all evaluated python, data represents the parsed input data.

The program takes its input either from stdin or from the file named by a -i argument, and writes its output either to stdout or to the file named by -o. -f arguments pass in script files, which are run first. -e arguments are similar to -f, but are given as Python source text and are run afterward. A positional parameter, if present, is evaluated as a Python expression and its result replaces the data object. The data object is then serialized and written to the output. -c enables compact output and may be specified multiple times for some output formats.
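The flow described above can be sketched in plain Python. This is only an illustration, not pyjawk's actual code: the function name process and its parameters are hypothetical, only json is handled, and the relative order of the -e statements and the positional expression is an assumption.

```python
import json


def process(text, scripts=(), statements=(), expression=None, compact=False):
    """Hypothetical stand-in for the load / run / serialize flow (json only)."""
    env = {"data": json.loads(text)}
    # Script-file sources (-f) run first, then inline statements (-e).
    for source in (*scripts, *statements):
        exec(source, env)
    # Assumption: the positional expression is evaluated last and its
    # result replaces the data object.
    if expression is not None:
        env["data"] = eval(expression, env)
    if compact:
        return json.dumps(env["data"])
    return json.dumps(env["data"], indent=2) + "\n"


result = process(
    '{"foo": "bar", "baz": ["spam", "Spam"]}',
    statements=['data["baz"].append("SPAM")'],
    expression='data["baz"]',
    compact=True,
)
print(result)
```

Every piece of evaluated code sees the same environment, which is why a statement can mutate data and a later expression can select part of it.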

-I and -O may be used to set the input and output formats, respectively.

-n and -N disable input and output respectively.

If -r / --repl is specified, instead of writing output after processing, the function to write to the output is registered in the environment as write, the arguments structure is registered as args, and a ptpython REPL is started up with the same environment.

Multiple command line tools are available, but they all only set the default input and output formats.

Formats

  • json
    • Available as the command line tool pyjawk
    • Supports 3 levels of compactness.
    • Outputs trailing newline except on highest compaction.
    • Supports colorized output.
  • yaml
    • Available as the command line tool pyyawk
    • Supports 3 levels of compactness.
    • Outputs trailing newline.
    • Supports colorized output.
  • xml
    • Available as the command line tool pyxawk
    • Parses into an xml.etree.ElementTree.Element object and dumps as xml text. Uses xml.etree.ElementTree.tostring to dump and, if uncompacted, xml.dom.minidom to prettify.
    • Supports 2 levels of compactness.
    • Outputs trailing newline.
    • Supports colorized output.
  • python
    • Available as the command line tool pypawk
    • Uses eval to pull in objects, and either pprint or repr to dump, depending on compactness.
    • Supports 3 levels of compactness.
    • Outputs trailing newline.
    • Supports colorized output.
  • msgpack
    • Available as the command line tool pymawk
  • string
    • Available as the command line tool pysawk
    • Simply reads input into a string and outputs data as a string, using str on it before dumping.
    • Outputs trailing newline except when compaction is requested.
  • bytes
    • Available as the command line tool pybawk
    • Simply reads input into bytes and outputs data as bytes.
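One way to picture how input and output formats combine is a table of loader and dumper pairs, with conversion being "load with one, dump with the other". The sketch below is an assumption about the general shape, not pyjawk's implementation. The FORMATS table and the convert function are hypothetical names, and only formats that the standard library can handle are included. Note that the python loader uses eval, as described above, so it should only be given trusted input.

```python
import json
from pprint import pformat

# Hypothetical registry: format name -> (loader, dumper).
FORMATS = {
    "json": (json.loads, lambda obj: json.dumps(obj, indent=2) + "\n"),
    "python": (eval, lambda obj: pformat(obj) + "\n"),
    "string": (str, lambda obj: str(obj) + "\n"),
}


def convert(text, input_format, output_format):
    """Parse text with one format's loader and dump it with another's dumper."""
    load, _ = FORMATS[input_format]
    _, dump = FORMATS[output_format]
    return dump(load(text))


as_python = convert('{"a": [1, 2]}', "json", "python")
as_json = convert("[1, 2]", "python", "json")
print(as_python, as_json, sep="")
```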

The data object

The data object is much like the ones that the underlying libraries normally produce, except that for json, yaml, and msgpack all dictionaries are replaced with a subclass of dict that maintains order and gives attribute access to the dictionary storage. All of the following are equivalent:

data["foo"]["bar"][0]["baz"] = "spam"
data.foo.bar[0].baz = "spam"
data.foo.bar[0]["baz"] = "spam"
data["foo"]["bar"][0].baz = "spam"

Note that constructing a dict explicitly will not automatically construct this subclass. You can do that by importing pyjawk.attrdict.AttrDict:

data.foo = {
  "bar": [{"baz": "spam"}],
}
# This will not work.
data.foo.bar[0].baz = "alot"

# The dicts are still ordinary dicts, so this will work
data.foo["bar"][0]["baz"] = "alot"

from pyjawk.attrdict import AttrDict as d

data.foo = d(
  bar=[d(baz="spam")],
)
# Now this will work
data.foo.bar[0].baz = "alot"
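As a rough idea of how such a dict subclass can work, here is a minimal sketch. It is not pyjawk's actual AttrDict: the class name SketchAttrDict is made up, and hooking it in through json's object_pairs_hook is an assumption about one possible way to get every parsed dictionary to use it. Plain dict already keeps insertion order on Python 3.7 and later.

```python
import json


class SketchAttrDict(dict):
    """Illustrative dict subclass; not pyjawk's actual AttrDict."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real dict
        # attributes and methods are found first.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment stores into the dictionary itself.
        self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name) from None


# Every json object is built from its key/value pairs by the hook.
data = json.loads(
    '{"foo": {"bar": [{"baz": "eggs"}]}}',
    object_pairs_hook=SketchAttrDict,
)
data.foo.bar[0].baz = "spam"
print(data["foo"]["bar"][0]["baz"])
```

In this sketch, reading an attribute whose name matches a dict method returns the method, so such keys are only readable through subscript access. How pyjawk itself handles that case is not something the sketch tries to reproduce.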

Details on IO and arguments in the REPL

In the REPL, the program’s own argument namespace is available as args. Changing some of them is obvious (such as args.output, which is just a string, or args.no_input which is just a boolean), and some others are perhaps non-obvious (args.compact is an integer specifying the number of times it was present). Some of the arguments don’t make any sense to work with (such as args.input and args.input_format, because those are already finished by the time the REPL starts up).

The REPL does not write the output by default. To write the output with the REPL, the write() function must be called explicitly.
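A sketch of how a write function can be registered in the same environment as data is shown below. It is hypothetical: make_write and the env dictionary are illustrative names, json stands in for whichever output format is selected, and an in-memory stream stands in for the real output. The point is that write looks up data at call time, so rebinding data in the REPL changes what gets written.

```python
import io
import json


def make_write(env, stream):
    """Return a zero-argument function that dumps env["data"] to stream."""
    def write():
        # Look up "data" when called, not when the function is created.
        stream.write(json.dumps(env["data"], indent=2) + "\n")
    return write


out = io.StringIO()
env = {"data": {"foo": "bar", "baz": ["spam"]}}
env["write"] = make_write(env, out)

# Equivalent of typing `data = data["baz"]` and then `write()` in the REPL.
env["data"] = env["data"]["baz"]
env["write"]()
print(out.getvalue(), end="")
```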

To use the REPL, stdin and stdout must both be attached to a terminal. This means that input must come from a file rather than a pipe, and the program’s output may not be piped to anything else. This is necessary because ptpython reads the user’s input from stdin and writes its responses to stdout. If you wish to pipe something into pyjawk for REPL use, you’ll have to use a fifo, a temp file, or a process substitution as follows:

# With process substitution
pyjawk -ri <(curl 'https://httpbin.org/get?foo=bar&spam=spam')

# With a temp file
curl 'https://httpbin.org/get?foo=bar&spam=spam' > curltemp.json
pyjawk -ri curltemp.json

# With a fifo
mkfifo curl.fifo
curl 'https://httpbin.org/get?foo=bar&spam=spam' > curl.fifo &
pyjawk -ri curl.fifo

In every case, the data in the REPL is:

>>> from pprint import pprint
>>> pprint(data)
{'args': {'foo': 'bar', 'spam': 'spam'},
 'headers': {'Accept': '*/*',
             'Host': 'httpbin.org',
             'User-Agent': 'curl/7.65.0'},
 'origin': '73.169.51.67, 73.169.51.67',
 'url': 'https://httpbin.org/get?foo=bar&spam=spam'}
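The terminal requirement described above can be checked from Python with isatty(). The helper below is a hypothetical sketch of such a check, not something taken from pyjawk. In-memory streams stand in for a pipe, since they are not terminals either.

```python
import io
import sys


def can_start_repl(stdin=None, stdout=None):
    """Return True only if both streams are attached to a terminal."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return stdin.isatty() and stdout.isatty()


# An in-memory stream behaves like piped input here: it is not a terminal.
piped = can_start_repl(io.StringIO("{}"), io.StringIO())
print(piped)
```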

Examples

Dumping some data to paste.ee

$ echo '{"a": "1", "b": null, "c": true, "d": false, "e": 7, "f": 8.5, "g": {"h": [1, 2, 3]}}' \
| pyjawk '{"sections": [{"contents": str(data)}]}' \
| curl -H 'Content-Type: application/json' -H 'X-Auth-Token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' -XPOST --data-binary '@-' https://api.paste.ee/v1/pastes

{"id":"umXKr","link":"https:\/\/paste.ee\/p\/umXKr","success":true}

You can post arbitrary string data the same way, and also extract the link from the response if you like:

$ echo this is some test data \
| pyjawk -Istring '{"sections": [{"contents": data}]}' \
| curl -H 'Content-Type: application/json' -H 'X-Auth-Token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' -XPOST --data-binary '@-' https://api.paste.ee/v1/pastes \
| pyjawk -Ostring 'data.link'

https://paste.ee/p/iomJR

Converting data between formats

$ echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}' \
| pyjawk -Oyaml

baz:
- spam
- Spam
- SPAM?: SPAM!
foo: bar

Selecting a part of a data-structure with evals

$ echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}' \
| pyjawk -c 'data.baz[2]'

{"SPAM?": "SPAM!"}

Extracting a value as a string

$ echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}' \
| pyjawk -Ostring 'data.baz[1]'

Spam

Easily embedding string data from stdin into a json structure

$ echo 'this is a test string' \
| pyjawk -Istring -Ojson -c '{"foo": data}'

{"foo": "this is a test string\n"}

Relocating an xml child

$ echo '<root><foo><bar>first</bar></foo><baz /></root>' \
| pyxawk -e 'foo = list(data)[0]; bar = list(foo)[0]; baz = list(data)[1]; baz.append(bar); foo.remove(bar)'

<?xml version="1.0" ?>
<root>
  <foo/>
  <baz>
    <bar>first</bar>
  </baz>
</root>

The -e can also be specified separately:

$ echo '<root><foo><bar>first</bar></foo><baz /></root>' \
| pyxawk -e 'foo = list(data)[0]' -e 'bar = list(foo)[0]' -e 'baz = list(data)[1]' -e 'baz.append(bar)' -e 'foo.remove(bar)'

Or just as a script file, here named relocate.py:

foo = list(data)[0]
bar = list(foo)[0]
baz = list(data)[1]
baz.append(bar)
foo.remove(bar)

which is then run with -f:

$ echo '<root><foo><bar>first</bar></foo><baz /></root>' \
| pyxawk -f relocate.py
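For comparison, the same relocation can be written as a standalone Python script using only the standard library modules named in the Formats section. This is a sketch of the equivalent manual steps, not of pyxawk's internals. The variable names compact and pretty are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

root = ET.fromstring("<root><foo><bar>first</bar></foo><baz /></root>")

# Same steps as the pyxawk example, with root in place of data.
foo = list(root)[0]
bar = list(foo)[0]
baz = list(root)[1]
baz.append(bar)
foo.remove(bar)

# Dump compactly with ElementTree, then prettify with minidom.
compact = ET.tostring(root, encoding="unicode")
pretty = minidom.parseString(compact).toprettyxml(indent="  ")
print(pretty)
```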

Exploring a structure in a REPL

$ pyjawk -i<(echo '{"foo": "bar", "baz": ["spam", "Spam", {"SPAM?": "SPAM!"}]}') -r
>>> data
{'foo': 'bar', 'baz': ['spam', 'Spam', {'SPAM?': 'SPAM!'}]}

>>> write()
{
  "foo": "bar",
  "baz": [
    "spam",
    "Spam",
    {
      "SPAM?": "SPAM!"
    }
  ]
}

>>> data = data.baz

>>> write()
[
  "spam",
  "Spam",
  {
    "SPAM?": "SPAM!"
  }
]

Fixing RetroArch Playlists

Suppose you have an issue with the way RetroArch generates its playlist files for the PlayStation (by default, it searches for .cue files, but not .bin), and /tmp/Roms/psx contains the following, all Sony PlayStation games:

Alpha.bin
Alpha.cue
Bravo.bin
Charlie.bin
Delta.bin
Delta.cue

You might end up with a playlist file like this:

{
  "version": "1.2",
  "default_core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
  "default_core_name": "Sony - PlayStation (PCSX ReARMed)",
  "label_display_mode": 0,
  "right_thumbnail_mode": 0,
  "left_thumbnail_mode": 0,
  "items": [
    {
      "path": "/tmp/Roms/psx/Alpha.cue",
      "label": "Alpha",
      "core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
      "core_name": "Sony - PlayStation (PCSX ReARMed)",
      "crc32": "00000000|crc",
      "db_name": "Sony - PlayStation.lpl"
    },
    {
      "path": "/tmp/Roms/psx/Delta.cue",
      "label": "Delta",
      "core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
      "core_name": "Sony - PlayStation (PCSX ReARMed)",
      "crc32": "00000000|crc",
      "db_name": "Sony - PlayStation.lpl"
    }
  ]
}

If you want the file to just have the bins, you can easily scan the directory for these files and modify the json using this tool with this:

$ pyjawk -i 'Sony - PlayStation.lpl' -o 'Sony - PlayStation.lpl' -e 'from pathlib import Path' -e 'data.items = [{"path": str(path), "label": path.stem, "core_path": data.default_core_path, "core_name": data.default_core_name, "crc32": "00000000|crc", "db_name": "Sony - PlayStation.lpl"} for path in (Path("/tmp") / "Roms" / "psx").iterdir() if path.suffix == ".bin"]'

This produces the output:

{
  "version": "1.2",
  "default_core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
  "default_core_name": "Sony - PlayStation (PCSX ReARMed)",
  "label_display_mode": 0,
  "right_thumbnail_mode": 0,
  "left_thumbnail_mode": 0,
  "items": [
    {
      "path": "/tmp/Roms/psx/Delta.bin",
      "label": "Delta",
      "core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
      "core_name": "Sony - PlayStation (PCSX ReARMed)",
      "crc32": "00000000|crc",
      "db_name": "Sony - PlayStation.lpl"
    },
    {
      "path": "/tmp/Roms/psx/Charlie.bin",
      "label": "Charlie",
      "core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
      "core_name": "Sony - PlayStation (PCSX ReARMed)",
      "crc32": "00000000|crc",
      "db_name": "Sony - PlayStation.lpl"
    },
    {
      "path": "/tmp/Roms/psx/Bravo.bin",
      "label": "Bravo",
      "core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
      "core_name": "Sony - PlayStation (PCSX ReARMed)",
      "crc32": "00000000|crc",
      "db_name": "Sony - PlayStation.lpl"
    },
    {
      "path": "/tmp/Roms/psx/Alpha.bin",
      "label": "Alpha",
      "core_path": "/tmp/retroarch/cores/pcsx_rearmed_libretro.so",
      "core_name": "Sony - PlayStation (PCSX ReARMed)",
      "crc32": "00000000|crc",
      "db_name": "Sony - PlayStation.lpl"
    }
  ]
}

That might look heavy up-front, but you can rewrite it as a script file with simpler structure:

from pathlib import Path

data.items = []

for path in (Path('/tmp') / 'Roms' / 'psx').iterdir():
    if path.suffix == '.bin':
        data.items.append({
            "path": str(path),
            "label": path.stem,
            "core_path": data.default_core_path,
            "core_name": data.default_core_name,
            "crc32": "00000000|crc",
            "db_name": "Sony - PlayStation.lpl",
        })

and run it with pyjawk like so:

pyjawk -i 'Sony - PlayStation.lpl' -o 'Sony - PlayStation.lpl' -f script.py
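The entries in the output above appear in whatever order iterdir() happens to yield them. If a stable order matters, the list-building step could be pulled into a small helper that sorts the paths first. The function below is a hypothetical sketch: the name bin_items and its parameters are not part of pyjawk, and the demonstration runs against a throwaway directory with placeholder core values rather than /tmp/Roms/psx.

```python
import tempfile
from pathlib import Path


def bin_items(rom_dir, core_path, core_name, db_name):
    """Build one playlist entry per .bin file, sorted by file name."""
    return [
        {
            "path": str(path),
            "label": path.stem,
            "core_path": core_path,
            "core_name": core_name,
            "crc32": "00000000|crc",
            "db_name": db_name,
        }
        for path in sorted(Path(rom_dir).iterdir())
        if path.suffix == ".bin"
    ]


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Delta.bin", "Alpha.cue", "Alpha.bin", "Bravo.bin"):
        (Path(tmp) / name).touch()
    items = bin_items(tmp, "core.so", "Example Core", "Example.lpl")

labels = [item["label"] for item in items]
print(labels)
```

In a script file, the result of such a helper would be assigned to the playlist's items key, with the core path and name taken from the data object as in the script above.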

Or instead load it into a REPL to work on it in real time with this:

pyjawk -i 'Sony - PlayStation.lpl' -o 'Sony - PlayStation.lpl' -r
>>> from pathlib import Path

>>> data.items = []

>>> for path in (Path('/tmp') / 'Roms' / 'psx').iterdir():
...     if path.suffix == '.bin':
...         data.items.append({
...             "path": str(path),
...             "label": path.stem,
...             "core_path": data.default_core_path,
...             "core_name": data.default_core_name,
...             "crc32": "00000000|crc",
...             "db_name": "Sony - PlayStation.lpl",
...             })

>>> write()

>>> exit()

Just make sure you call write() in the REPL, or nothing will be written.

Plans

I don’t plan to add too much to this, as I want it to be useful but also as lean and manageable as it possibly can be. Things like HTTP input and output are best left to other programs that can do it better, like curl, especially given that this program can operate in a streamable fashion.

This program needs some regression tests set up.

Download files

Files for pyjawk, version 1.2.0
pyjawk-1.2.0-py3-none-any.whl (20.3 kB), file type Wheel, Python version py3
