Streaming newline delimited JSON I/O

Examples

Read and write files with a single JSON object on every line. See the sample-data directory for valid input examples.

One dictionary per line:

from pprint import pprint
import newlinejson

# 'w+' creates outfile.json if it does not exist and still allows reading it back
with open('sample-data/dictionaries.json') as i_f, open('outfile.json', 'w+') as o_f:
    writer = newlinejson.Writer(o_f)
    for line in newlinejson.Reader(i_f):
        writer.write(line)
    o_f.seek(0)
    pprint(newlinejson.load(o_f))
[{'field2': 'l1f2', 'field3': 'l1f3', 'field1': 'l1f1'},
 {'field2': 'l2f2', 'field3': 'l3f3', 'field1': 'l2f1'},
 {'field2': 'l3f2', 'field3': 'l3f3', 'field1': 'l3f1'},
 {'field2': 'l4f2', 'field3': 'l4f3', 'field1': 'l4f1'},
 {'field2': 'l5f2', 'field3': 'l5f3', 'field1': 'l5f1'}]

One list per line:

import newlinejson

with open('sample-data/lists-no-header.json') as f:
    for line in newlinejson.Reader(f):
        print(line)
['l1f2', 'l1f3', 'l1f1']
['l2f2', 'l3f3', 'l2f1']
['l3f2', 'l3f3', 'l3f1']
['l4f2', 'l4f3', 'l4f1']
['l5f2', 'l5f3', 'l5f1']

Mixed content:

import newlinejson

with open('sample-data/mixed-content.json') as f:
    for line in newlinejson.Reader(f):
        print(line)
{'field2': 'l1f2', 'field3': 'l1f3', 'field1': 'l1f1'}
['l1f2', 'l1f3', 'l1f1']
{'field2': 'l2f2', 'field3': 'l3f3', 'field1': 'l2f1'}
['l2f2', 'l3f3', 'l2f1']
{'field2': 'l3f2', 'field3': 'l3f3', 'field1': 'l3f1'}
['l3f2', 'l3f3', 'l3f1']
{'field2': 'l4f2', 'field3': 'l4f3', 'field1': 'l4f1'}
['l4f2', 'l4f3', 'l4f1']
{'field2': 'l5f2', 'field3': 'l5f3', 'field1': 'l5f1'}
['l5f2', 'l5f3', 'l5f1']

The standard JSON functions load/s() and dump/s() are still available but should only be used on small files, because they work on the whole file at once: load/s() return a list of JSON objects, one per line, and dump/s() take the same format as input.
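To illustrate why the distinction matters, here is a minimal sketch that uses only Python's built-in json module, not NewlineJSON itself. The helper names iter_records() and load_all() are hypothetical and exist only for this illustration; they are not part of the NewlineJSON API.

```python
import io
import json


def iter_records(stream):
    """Yield one decoded JSON value per non-blank line (streaming)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_all(stream):
    """Decode every line up front and return a list (fine for small files)."""
    return list(iter_records(stream))


# Two lines of newline delimited JSON: one dictionary and one list.
text = '{"field1": "l1f1"}\n["l1f2", "l1f3"]\n'

# Streaming holds only one record in memory at a time.
streamed = [type(rec).__name__ for rec in iter_records(io.StringIO(text))]

# Loading everything builds the full list before returning.
everything = load_all(io.StringIO(text))
```

The streaming version keeps memory use roughly constant regardless of file size, while the list version grows with the number of lines.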

Load from a file:

from pprint import pprint
import newlinejson

with open('sample-data/dictionaries.json') as f:
    pprint(newlinejson.load(f))
[{'field2': 'l1f2', 'field3': 'l1f3', 'field1': 'l1f1'},
 {'field2': 'l2f2', 'field3': 'l3f3', 'field1': 'l2f1'},
 {'field2': 'l3f2', 'field3': 'l3f3', 'field1': 'l3f1'},
 {'field2': 'l4f2', 'field3': 'l4f3', 'field1': 'l4f1'},
 {'field2': 'l5f2', 'field3': 'l5f3', 'field1': 'l5f1'}]

Load from a string:

from pprint import pprint
import newlinejson

with open('sample-data/dictionaries.json') as f:
    pprint(newlinejson.loads(f.read()))
[{'field2': 'l1f2', 'field3': 'l1f3', 'field1': 'l1f1'},
 {'field2': 'l2f2', 'field3': 'l3f3', 'field1': 'l2f1'},
 {'field2': 'l3f2', 'field3': 'l3f3', 'field1': 'l3f1'},
 {'field2': 'l4f2', 'field3': 'l4f3', 'field1': 'l4f1'},
 {'field2': 'l5f2', 'field3': 'l5f3', 'field1': 'l5f1'}]

Dump to a file or a string:

from pprint import pprint
import newlinejson

lines = [
    {'field2': 'l1f2', 'field3': 'l1f3', 'field1': 'l1f1'},
    {'field2': 'l2f2', 'field3': 'l3f3', 'field1': 'l2f1'},
    {'field2': 'l3f2', 'field3': 'l3f3', 'field1': 'l3f1'},
    {'field2': 'l4f2', 'field3': 'l4f3', 'field1': 'l4f1'},
    {'field2': 'l5f2', 'field3': 'l5f3', 'field1': 'l5f1'}
]

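# 'w+' creates output.json if it does not exist and still allows reading it back
with open('output.json', 'w+') as f:
    newlinejson.dump(lines, f)
    f.seek(0)
    # Decode the text that was just written to confirm the round trip
    pprint(newlinejson.loads(f.read()))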
[{'field2': 'l1f2', 'field3': 'l1f3', 'field1': 'l1f1'},
 {'field2': 'l2f2', 'field3': 'l3f3', 'field1': 'l2f1'},
 {'field2': 'l3f2', 'field3': 'l3f3', 'field1': 'l3f1'},
 {'field2': 'l4f2', 'field3': 'l4f3', 'field1': 'l4f1'},
 {'field2': 'l5f2', 'field3': 'l5f3', 'field1': 'l5f1'}]

Dependencies

NewlineJSON has no dependencies, but if Python's built-in json library is too slow it can be used in conjunction with a third-party library like ujson or simplejson. When available, all unit tests are run against json, ujson, simplejson, yajl, and jsonlib2. The internal JSON library can be specified like so:

import newlinejson
import ujson

newlinejson.JSON = ujson
with open('sample-data/dictionaries.json') as f:
    reader = newlinejson.Reader(f)
    print(reader.json_lib.__name__)
ujson

The library can also be specified for load/s(), dump/s(), Reader, and Writer via a json_lib keyword argument:

from pprint import pprint
import newlinejson
import ujson

with open('sample-data/dictionaries.json') as f:
    reader = newlinejson.Reader(f, json_lib=ujson)
    print(reader.json_lib.__name__)
ujson

with open('sample-data/dictionaries.json') as f:
    pprint(newlinejson.load(f, json_lib=ujson))
[{'field1': 'l1f1', 'field2': 'l1f2', 'field3': 'l1f3'},
 {'field1': 'l2f1', 'field2': 'l2f2', 'field3': 'l2f3'},
 {'field1': 'l3f1', 'field2': 'l3f2', 'field3': 'l3f3'},
 {'field1': 'l4f1', 'field2': 'l4f2', 'field3': 'l4f3'},
 {'field1': 'l5f1', 'field2': 'l5f2', 'field3': 'l5f3'}]
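
Since a faster library may not be installed everywhere, one possible pattern is to pick the first importable module from a preference list and fall back to the built-in json. The sketch below is an assumption about how a caller might do this; first_available() is a hypothetical helper and not part of NewlineJSON.

```python
import importlib


def first_available(names=("ujson", "simplejson", "json")):
    """Return the first importable module from names, in order of preference."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: %r" % (names,))


# Falls back to the standard library when no third-party library is installed.
json_lib = first_available()
print(json_lib.__name__)
```

The resulting module could then be supplied through the json_lib keyword argument described above.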

Installing

Via pip:

$ pip install newlinejson

From master:

$ git clone https://github.com/geowurster/NewlineJSON.git
$ cd NewlineJSON
$ python setup.py install

Developing

Install:

$ pip install virtualenv
$ git clone https://github.com/geowurster/NewlineJSON
$ cd NewlineJSON
$ virtualenv venv
$ source venv/bin/activate
$ pip install -e .
$ nosetests --with-coverage

Profiling

The profiling script attempts to profile against json, jsonlib2, simplejson, ujson, and yajl. By default it uses a small-ish file from sample-data, but any newline delimited JSON file can be given as the first argument.

$ ./utils/profile.py

Profiling json ...
  Start time: 23:25:47
  End time: 23:25:49
  Elapsed secs: 1.654891
  Num rows: 10000

Profiling jsonlib2 ...
  Start time: 23:25:49
  End time: 23:25:52
  Elapsed secs: 2.780862
  Num rows: 10000

Profiling simplejson ...
  Start time: 23:25:52
  End time: 23:25:55
  Elapsed secs: 2.905002
  Num rows: 10000

Profiling ujson ...
  Start time: 23:25:55
  End time: 23:25:56
  Elapsed secs: 0.927346
  Num rows: 10000

Profiling yajl ...
  Start time: 23:25:56
  End time: 23:25:58
  Elapsed secs: 2.620200
  Num rows: 10000
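
For readers who want a rough comparison without the repository checkout, the sketch below times line-by-line decoding with whichever libraries happen to be installed. It is not the contents of utils/profile.py; profile_decode() is a hypothetical helper, the sample rows are generated in memory, and the default library list is shortened to modules known to expose loads().

```python
import importlib
import json
import time


def profile_decode(lines, names=("json", "simplejson", "ujson")):
    """Time decoding every line with each importable library.

    Returns a dict mapping library name to (elapsed_seconds, num_rows).
    Libraries that are not installed are skipped.
    """
    results = {}
    for name in names:
        try:
            lib = importlib.import_module(name)
        except ImportError:
            continue
        start = time.perf_counter()
        rows = 0
        for line in lines:
            lib.loads(line)
            rows += 1
        results[name] = (time.perf_counter() - start, rows)
    return results


# Generate 10000 small JSON lines in memory instead of reading sample-data.
sample = [json.dumps({"field1": i, "field2": str(i)}) for i in range(10000)]

results = profile_decode(sample)
for name, (elapsed, rows) in results.items():
    print("Profiling %s ..." % name)
    print("  Elapsed secs: %f" % elapsed)
    print("  Num rows: %d" % rows)
```

Timings from a sketch like this depend heavily on the machine and the shape of the data, so they are only useful for relative comparison.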
