A simple JSONL reader that caches line byte positions for fast loading.

fast-jsonl

A very simple library for reading large JSONL files.

This library uses a line-byte cache that must be calculated once per file and can then be reused across threads and runtimes until the JSONL file is changed.
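
To make the idea of a line-byte cache concrete, here is a minimal standard-library sketch. It is not fast-jsonl's implementation; `build_line_offsets` and `read_record` are hypothetical names used only for illustration. The file is scanned once to record the byte offset at which each line starts, after which any record can be read with a single seek.

```python
import json
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and return the byte offset where each non-empty line starts."""
    offsets = []
    with open(path, "rb") as f:
        position = 0
        for raw_line in f:
            if raw_line.strip():
                offsets.append(position)
            position += len(raw_line)
    return offsets


def read_record(path, offsets, index):
    """Seek directly to one line and parse only that line."""
    with open(path, "rb") as f:
        f.seek(offsets[index])
        return json.loads(f.readline())


# Build a small demonstration file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"id": i}) + "\n")
    demo_path = tmp.name

offsets = build_line_offsets(demo_path)    # the one-time scan
third = read_record(demo_path, offsets, 2)  # random access without re-reading the file
os.remove(demo_path)
```

Because the offsets depend only on the file's bytes, they can be saved to disk and shared between processes, which is why the cache stays valid until the JSONL file changes.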

Why?

fast-jsonl is intended for data science and machine learning workflows that use large JSONL files that cannot be practically loaded into memory (especially when using multiprocessing).

In most JSONL reading scenarios, fast-jsonl should work well. Some examples of use-cases for fast-jsonl:

  • Random access to one or multiple JSONL files (e.g., PyTorch DataLoaders),
  • Slices of or index-based access to one or multiple JSONL files (e.g., exploratory data analysis), or
  • Combining multiple data files into a single data object.

When not to use fast-jsonl for reading?

  • If your workflow uses only a single small JSONL file, then just loading the data into memory and using it directly should be sufficient.
  • If your workflow uses a single large JSONL file, and you will only read data sequentially starting from the first line, we recommend using the jsonlines or orjsonl libraries.
  • If you are working in a production environment where data files are often modified, it would probably be better to invest in a traditional database.

Quickstart

Install

pip install fast-jsonl

Using fast-jsonl

import fast_jsonl as fj

path = "path_to_file.jsonl"
reader = fj.Reader(path)

print(reader[0])  # print a specific line

for line in reader:  # iterate through lines
    print(line)

print(reader[10:20])  # slice the data

fast-jsonl can also read from multiple JSONL files:

import fast_jsonl as fj

paths = ["path_to_file_0.jsonl", "path_to_file_1.jsonl"]
reader = fj.MultiReader(paths)
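
The README does not say how MultiReader numbers lines across files. As an illustration only, the sketch below assumes that indices simply run through the files in the order given, and shows how a single global index can be mapped to a file and a line within it. `locate` and the line counts are hypothetical and not part of fast-jsonl.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, lengths):
    """Map a global line index to (file_number, line_within_file).

    `lengths` holds the number of lines in each file, in order.
    """
    ends = list(accumulate(lengths))  # cumulative line counts
    total = ends[-1] if ends else 0
    if not 0 <= global_index < total:
        raise IndexError("line index out of range")
    # The first file whose cumulative count exceeds the index holds the line.
    file_number = bisect_right(ends, global_index)
    start = ends[file_number - 1] if file_number else 0
    return file_number, global_index - start


lengths = [3, 2]  # hypothetical: 3 lines in the first file, 2 in the second
last_of_first = locate(2, lengths)
first_of_second = locate(3, lengths)
```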

If the target JSONL file has changed, make sure to generate a new cache file by passing force_cache=True:

import fast_jsonl as fj

reader = fj.Reader("<path-to-changed-file.jsonl>", force_cache=True)

Parameters

Cache file path

By default, a cache is created the first time Reader is called on a specific file and is saved at <user-home>/.local/share/fj_cache/<modified-name>/<hash>.cache.json, where <user-home> is the user's home directory, <modified-name> is the target file's path with all directory separators replaced with "--", and <hash> is a hash of the target file. Hashes are used to allow semi-readable cache names (via the modified name) while avoiding potential collisions introduced by replacing the path separator with "--".
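
The layout described above can be sketched as follows. This is not the library's code: `sketch_cache_path` is a hypothetical helper, and both the hash algorithm (SHA-256) and what is hashed (the path string) are assumptions made for illustration. It does show why a hash is needed, since two different paths can share the same modified name.

```python
import hashlib
from pathlib import Path


def sketch_cache_path(target_path, home=None):
    """Build a cache path following the documented layout (illustrative only)."""
    home = Path(home) if home is not None else Path.home()
    # as_posix() gives "/" separators on every platform.
    target = Path(target_path).as_posix()
    modified_name = target.replace("/", "--")
    # ASSUMPTION: SHA-256 of the path string stands in for the library's hash.
    digest = hashlib.sha256(target.encode("utf-8")).hexdigest()
    return home / ".local" / "share" / "fj_cache" / modified_name / f"{digest}.cache.json"


# Two distinct paths that collapse to the same modified name.
path_a = sketch_cache_path("a--b/c.jsonl", home="/home/user")
path_b = sketch_cache_path("a/b--c.jsonl", home="/home/user")
same_directory = path_a.parent.name == path_b.parent.name
different_files = path_a.name != path_b.name
```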

A path for the cache file can be specified by passing cache_path to the reader:

import fast_jsonl as fj

path = "path_to_file.jsonl"
reader = fj.Reader(path, cache_path="path-to-cache")

fast-jsonl uses the extension .cache.json for default paths when the cache path is not given, but you are free to specify any extension.

Re-generating a cache

When initializing, the reader first checks if there is already a valid cache file at the expected cache file path (either the default path or an explicit user-passed path).

If a valid cache file exists, the reader will use this file and not generate a new cache file. However, you can override this behavior in one of several ways.

  • Pass force_cache=True to fast_jsonl.Reader():
    • Force the reader to generate a new cache file regardless of whether or not one exists.
  • Pass check_cache_time=True to fast_jsonl.Reader():
    • The reader checks file modification timestamps to see if the data file was modified after caching. If it was, a new cache file is generated.
    • Note that this approach only verifies modification times and does not check if content was actually changed.
  • Pass check_cache_hash=True to fast_jsonl.Reader():
    • The reader checks the file content hash and compares it to the hash saved during initial caching. Different hashes will trigger a re-cache.
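
The difference between the timestamp check and the hash check can be illustrated with the standard library alone. This is a sketch of the general technique, not fast-jsonl's implementation; the helper names and the choice of SHA-256 are assumptions.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_by_time(data_path, cache_path):
    """True if the data file was modified after the cache file."""
    return os.path.getmtime(data_path) > os.path.getmtime(cache_path)


def stale_by_hash(data_path, saved_hash):
    """True if the data file's content differs from what was hashed at caching time."""
    return file_sha256(data_path) != saved_hash


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "data.jsonl")
    cache_path = os.path.join(tmp_dir, "data.cache.json")
    with open(data_path, "w") as f:
        f.write('{"id": 0}\n')
    with open(cache_path, "w") as f:
        f.write("{}")
    saved_hash = file_sha256(data_path)

    # Give the data file a newer timestamp without changing its content.
    cache_mtime = os.path.getmtime(cache_path)
    os.utime(data_path, (cache_mtime + 60, cache_mtime + 60))

    time_says_stale = stale_by_time(data_path, cache_path)
    hash_says_stale = stale_by_hash(data_path, saved_hash)
```

Here the timestamp check reports a stale cache even though the content is unchanged, while the hash check does not. The trade-off is that hashing must read the whole file, whereas the timestamp check only needs file metadata.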

Caches can also be re-generated after reader initialization:

reader.recache()

By default, the reader will then re-generate a cache file given the paths passed during reader initialization. If you want to save the new cache in a new location, simply pass a cache_path argument to reader.recache().

TODO

  • Add tests for pre-caching CLI.
  • Add benchmark code and a benchmarks section to the readme.
  • Add multi-threaded caching.
  • Add support for faster JSON backends (e.g., orjson).
  • Change slicing to use serial loading to avoid redundant byte seek calls.
  • Allow multi-threaded slicing for faster slice loading.
  • Change slicing to cut off at 0 and len(reader)-1 so that out-of-bounds slices behave like built-in lists.
