Skip to main content

Simple query language to extract tables from JSON.

Project description

json_tabulator

A simple query language for extracting tables from JSON-like objects.

Working with tabular data is much easier than working with nested documents. json-tables helps to extract tables from JSON-like objects in a simple, declarative manner. All further processing is left to the many powerful tools that exist for working with tables, such as Spark or Pandas.

Installation

Install from pypi:

pip install json_tabulator

Quickstart

The json_tabulator module provides tools to extract a JSON document into a set of related tables. Let's start with a simple document

data = {
    'id': 'doc-1',
    'table': [
        {'id': 1, 'name': 'row-1'},
        {'id': 2, 'name': 'row-2'}
    ]
}

The document consists of a document-level value id as well as a nested sub-table table. We want to extract it into a single table, with the global value folded into the table.

To do this, we write a query that defines the conversion into a table like this:

from json_tabulator import tabulate

query = tabulate({
    'document_id': 'id',
    'row_id': 'table.*.id',
    'row_name': 'table.*.name'
})

rows = query.get_rows(data)

This returns an iterator of rows, where each row is a dict {<column_name>: <value>}:

>>> list(rows)
[
    {'document_id': 'doc-1', 'row_id': 1, 'row_name': 'row-1'},
    {'document_id': 'doc-1', 'row_id': 2, 'row_name': 'row-2'}
]

Path Syntax

The syntax for path expressions is similar to JSON Path. A path consists of an optional root element '$' followed by zero or more segments separated by '.'.

Dict key

Can be any string. Key values can be quoted with single or double quotes. Within quoted strings, the quote character must be doubled to escape it. For example, "say ""hello""" -> say hello.

Keys must be quoted if they

  • contain any of the characters *$@.'", or if the
  • contain only digits (these cases would be interpreted as array indices otherwise)

Array index

Array indices are entered as numbers without quotes, e.g. 123. Mostly useful for debugging, usually arrays are iterated over when tabulating data.

Wildcard *

An asterisk * is interpreted as a wildcard. Iterates over dict values or array items. Note that wildcards must be entered explicitly, there is no implicit iteration over arrays.

@key and @path directives.

The @key and @path directives are used to get information about the current key (dict key or array index) or the full path, respectively.

Both must follow after a wildcard, and must be the last segment in the path. For example *.@key is valid, but a.@key and *.@key.b are not.

The output of @path can be used as a primary key within the scope of the parsed object.

Data Extraction

Query Semantics

Values for all attributes in a query are combined into individual rows. Attributes from different parts of the document are combined by "joining" on the lowest common ancestor.

For this reason, all wildcards for all attributes must lie on a common path. Violating this condition would lead to implicit cross joins and the associated data blow-up.

For example, the paths $.a.* and $.b.* cannot be combined because the wildcards are not on the same path. On the other hand, $.a and $.b.*.c can be combined.

Queries are analysed and compiled independent of the data to be queried

If you think you need to get a combination of attributes that is not allowed, think again. If you still think so just run multiple queries and do the join afterwards.

Related Projects

  • jsontable has the same purpose but is not maintained.
  • jsonpath-ng and other jsonpath implementation are much more flexible but more cumbersome to extract nested tables.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

json_tabulator-0.3.0.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

json_tabulator-0.3.0-py3-none-any.whl (7.2 kB view details)

Uploaded Python 3

File details

Details for the file json_tabulator-0.3.0.tar.gz.

File metadata

  • Download URL: json_tabulator-0.3.0.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.9.7 Darwin/22.5.0

File hashes

Hashes for json_tabulator-0.3.0.tar.gz
Algorithm Hash digest
SHA256 c76e210d4d3e0e1b951f5987476d646b8ce24a6164133f559b898b473a5d21d8
MD5 a5a52b0d8799734ebe1a6227b1126480
BLAKE2b-256 3e2e56c6146216913e1b01fd07bb7fa185729fb1e8ca637103990479e324508a

See more details on using hashes here.

File details

Details for the file json_tabulator-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: json_tabulator-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.9.7 Darwin/22.5.0

File hashes

Hashes for json_tabulator-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ffcfdd6d799fe290a306fe4aa24c3da2c28327cc88e11a3d27ceeb770e50d4d1
MD5 a88704f378d036e9fcbf87dba27a39f7
BLAKE2b-256 8d1f3ad9f291bb58f2efdaa635b55b71b1c7b2668251b28269c04ff379cf6427

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page