Library to parse JSON files iteratively without loading the whole file into memory

JSON Lineage

Introduction

JSON Lineage is a tool that converts JSON to JSONL (JSON Lines) format and iteratively parses JSON that contains a list of objects.
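
To make the difference between the two formats concrete, here is a small illustration that uses only Python's built-in json module. It is not how JSON Lineage performs the conversion; it only shows what a JSON list of objects looks like once each object sits on its own line.

```python
import json

# A small JSON document whose top level is a list of objects.
json_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

# JSONL stores one JSON value per line instead of one enclosing list.
records = json.loads(json_text)
jsonl_text = "\n".join(json.dumps(record) for record in records)

print(jsonl_text)

# Each line can be parsed on its own, without reading the lines around it.
parsed_back = [json.loads(line) for line in jsonl_text.splitlines()]
```

Because every line is a complete JSON value, a JSONL file can be processed line by line, which is what makes the format convenient for large data sets.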

The underlying program is written in Rust and is built to feed one JSON object at a time to the parser. This allows for the parsing of very large JSON files that would otherwise not fit into memory. In addition to saving memory, this program is capable of parsing JSON files faster than the built-in Python JSON parser as the file size increases.

Additionally, this project contains adapters for easy integration into other programming languages. Currently, there is only a Python adapter, but more are planned.

Adapters

Python

The Python adapter is a wrapper around the underlying Rust program. It allows for easy integration into Python programs and is designed to feel similar to Python's built-in json module.

Why not Just Use Python's json Library?

Given that Python already has a built-in JSON parser, you may be wondering why you would want to use this library. The answer is, well, it depends.

If you are parsing a small JSON file, then you probably don't want to use this library.

Python's JSON library is written in C and is very fast. However, as it loads the entire JSON file into memory, it is not suitable for parsing very large JSON files. This is where JSON Lineage comes in.

JSON Lineage is designed to parse very large JSON files that would otherwise not fit into memory. It does this by parsing the JSON file one object at a time.
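
The idea of parsing one object at a time can be sketched in pure Python with json.JSONDecoder.raw_decode. This is only an illustration of the general technique, not the Rust implementation used by JSON Lineage. The function name iter_array_items and the chunk_size parameter are made up for this example.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_array_items(stream: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    buffer = ""
    started = False  # True once the opening "[" has been consumed

    while True:
        chunk = stream.read(chunk_size)
        at_eof = chunk == ""
        buffer += chunk

        # Decode as many complete elements as the buffer currently holds.
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            if not started:
                if buffer[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buffer, started = buffer[1:], True
                continue
            if buffer[0] == ",":
                buffer = buffer[1:]  # lenient: repeated commas are not rejected
                continue
            if buffer[0] == "]":
                return
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # element is incomplete; read another chunk
            rest = buffer[end:].lstrip()
            # Only accept the element once its "," or "]" delimiter is visible,
            # so a number cut off at a chunk boundary is not yielded early.
            if not rest or rest[0] not in ",]":
                break
            yield item
            buffer = rest

        if at_eof:
            raise ValueError("input ended before the array was closed")


sample = io.StringIO('[{"id": 1}, {"id": 2}, {"id": 3}]')
items = list(iter_array_items(sample, chunk_size=8))
```

Only the current chunk and the element being decoded are held in memory, so memory use depends on the size of the largest element rather than on the size of the file.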

Functionality

The following functionality is provided:

  • load - Generates an iterator that returns each object in a JSON file.
  • aload - Generates an asynchronous iterator that returns each object in a JSON file.

A CLI is also provided for easy conversion of JSON files to JSONL files. For information on how to use the CLI, run: python -m json_lineage --help.
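
If you prefer to do the conversion from Python rather than through the CLI, a minimal sketch could look like the following. The helper write_jsonl is hypothetical and not part of json_lineage; it accepts any iterable of JSON-serializable objects, and the demo uses a plain list so it runs without the library installed.

```python
import json
import os
import tempfile
from typing import Any, Iterable


def write_jsonl(objects: Iterable[Any], out_path: str) -> int:
    """Write each object as one JSON line; return the number of lines written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for obj in objects:
            out.write(json.dumps(obj, ensure_ascii=False))
            out.write("\n")
            count += 1
    return count


# Demo with an in-memory list; any iterable of serializable objects works.
demo_path = os.path.join(tempfile.mkdtemp(), "out.jsonl")
written = write_jsonl([{"id": 1}, {"id": 2}], demo_path)
```

Passing the iterator returned by load("path/to/file.json") in place of the list should work the same way, assuming it yields JSON-serializable Python objects.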

Benchmarks

The following graphs compare the speed and memory usage of Python's JSON library vs JSON Lineage.

The benchmarks show that up to a file size of 500MB, the speed difference is negligible. However, already at this point, Python requires almost 2GB of memory to parse the JSON file, while JSON Lineage only requires 1.5GB.

As the file size continues to grow, Python's JSON library continues to be faster, but its memory usage grows linearly. JSON Lineage, on the other hand, continues to use the same amount of memory.

Benchmark of difference in time as file size grows

Benchmark of difference in memory as file size grows
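
To get a rough feel for the memory side of this comparison on your own data, you can measure the peak memory of json.load with the standard tracemalloc module. This is a sketch and not the method used to produce the benchmarks above. The helper peak_memory_of_json_load is hypothetical, and tracemalloc only sees allocations made by the Python interpreter itself.

```python
import json
import os
import tempfile
import tracemalloc


def peak_memory_of_json_load(path: str) -> int:
    """Return the peak number of bytes Python allocated while running json.load."""
    tracemalloc.start()
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Build a small synthetic file: a list of 10,000 objects.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump([{"id": i, "value": "x" * 20} for i in range(10_000)], f)

peak_bytes = peak_memory_of_json_load(sample_path)
print(f"file size: {os.path.getsize(sample_path)} bytes, peak: {peak_bytes} bytes")
```

Repeating the measurement with files of increasing size should show the peak growing along with the file, since json.load builds the whole document in memory before returning.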

Installation

pip install json-lineage

Usage

Iterating over a JSON file

from json_lineage import load

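# load returns an iterator over the objects in the file.
jsonl_iter = load("path/to/file.json")

for obj in jsonl_iter:
    do_something(obj)  # placeholder for your own processing function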

Iterating over a JSON file asynchronously

import asyncio
from random import randint
from json_lineage import aload

jsonl_iter = aload("path/to/file.json")


async def do_something(i):
    await asyncio.sleep(randint(1, 2))
    print(i)


async def main():
    tasks = []
    async for i in jsonl_iter:
        tasks.append(asyncio.create_task(do_something(i)))
    
    await asyncio.gather(*tasks)


asyncio.run(main())
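
The example above creates one task per object, which can add up for a very large file. One possible way to cap the number of tasks in flight is an asyncio.Semaphore, sketched below. The helper process_bounded is hypothetical and not part of json_lineage; the demo feeds it a stand-in async generator so it runs without the library, and it does not handle exceptions raised by the worker.

```python
import asyncio
from typing import Any, AsyncIterable, Awaitable, Callable


async def process_bounded(
    items: AsyncIterable[Any],
    worker: Callable[[Any], Awaitable[None]],
    limit: int = 10,
) -> None:
    """Run worker on every item with at most `limit` tasks in flight."""
    semaphore = asyncio.Semaphore(limit)
    pending = set()

    async def run(item: Any) -> None:
        try:
            await worker(item)
        finally:
            semaphore.release()  # free a slot for the next item

    async for item in items:
        await semaphore.acquire()  # waits here while `limit` tasks are running
        task = asyncio.create_task(run(item))
        pending.add(task)
        task.add_done_callback(pending.discard)

    # Wait for whatever is still running after the input is exhausted.
    if pending:
        await asyncio.gather(*pending)


async def fake_items(n: int):
    """Stand-in for an async iterator of parsed objects."""
    for i in range(n):
        yield {"id": i}


seen = []


async def record(item: Any) -> None:
    await asyncio.sleep(0)
    seen.append(item["id"])


asyncio.run(process_bounded(fake_items(5), record, limit=2))
```

With this pattern the iterator is only advanced when a slot is free, so the number of objects held by pending tasks stays bounded by limit.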

Poorly Formatted JSON

When parsing a JSON file, the program assumes that the file is well formatted. If it is not, you can pass messy=True to either load or aload:

from json_lineage import load

jsonl_iter = load("path/to/file.json", messy=True)


for obj in jsonl_iter:
    do_something(obj)

The output is the same as without the option, but the file is parsed differently. With this option the program is slower, but it can parse JSON files that are not well formatted.

If you are using the CLI, then you can use the --messy flag to achieve the same result.

Under the Hood

The underlying program is written in Rust. The full documentation for the underlying program can be found here.
