O(1) resume for large JSONL streams via byte-offset indexing

jsonl-resumable

Index JSONL files for instant random access and resumable iteration.

Requires Python 3.10+. License: MIT.

The problem

You have a 10GB JSONL file. Your processing script crashes at line 25 million. To resume, you have to iterate through all 25 million lines you already processed just to get back to where you were:

with open("huge.jsonl") as f:
    for i, line in enumerate(f):
        if i < 25_000_000:
            continue  # still reads and discards every skipped line
        process(line)

This library builds a byte-offset index of your file so you can seek directly to any line:

from jsonl_resumable import JsonlIndex

index = JsonlIndex("huge.jsonl")
for event in index.iter_json_from(25_000_000):  # instant
    process(event)

Install

pip install jsonl-resumable

Basic usage

from jsonl_resumable import JsonlIndex

# First run builds the index (takes a while for big files)
# Subsequent runs load it from disk
index = JsonlIndex("events.jsonl")

# Jump to any line
event = index.read_json(1_000_000)

# Iterate from a specific line
for event in index.iter_json_from(last_processed):
    process(event)

# If the file grew, update the index (only scans new bytes)
index.update()

Useful for data pipelines, log analysis, ML training data—anywhere you're dealing with large JSONL files and don't want to start over every time something fails.

API

index = JsonlIndex("data.jsonl")

# Read a single line (parsed or raw)
index.read_json(1000)        # returns dict or list
index.read_line(1000)        # returns raw string
index[1000]                  # same as read_line

# Iterate starting from line N
index.iter_json_from(5000)   # yields parsed JSON
index.iter_from(5000)        # yields raw strings

# When the file grows
index.update()               # indexes new lines, returns count added

# Properties
index.total_lines
index.file_size

Constructor options:

JsonlIndex(
    "data.jsonl",
    checkpoint_interval=100,  # trade memory for speed (lower = more memory)
    index_path="custom.idx",  # custom index file location
    auto_save=True,           # save index to disk after build/update
)

You can also call rebuild() to force a full re-index, or save() to persist manually.
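
The checkpoint_interval comment describes a memory-versus-speed trade-off. One common way to get that trade-off is a sparse index: store an offset only every N lines, seek to the nearest stored offset, then read forward. The sketch below illustrates that general technique; it is not taken from this library's source, and build_checkpoints and read_line_sparse are hypothetical names.

```python
def build_checkpoints(path: str, interval: int) -> list[int]:
    """Record the byte offset of lines 0, interval, 2*interval, ..."""
    checkpoints = []
    position = 0
    with open(path, "rb") as f:  # binary mode keeps offsets in bytes
        for line_no, raw in enumerate(f):
            if line_no % interval == 0:
                checkpoints.append(position)
            position += len(raw)
    return checkpoints


def read_line_sparse(path: str, checkpoints: list[int], interval: int, line_no: int) -> str:
    """Seek to the checkpoint at or before line_no, then skip forward."""
    with open(path, "rb") as f:
        f.seek(checkpoints[line_no // interval])
        for _ in range(line_no % interval):
            f.readline()  # at most interval - 1 lines are skipped
        return f.readline().decode("utf-8").rstrip("\n")
```

In this sketch a smaller interval stores more offsets (more memory) and skips fewer lines per lookup (faster reads), which matches the direction of the comment above.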

Incremental updates

If you're appending to your JSONL file over time, you don't need to rebuild the whole index:

index = JsonlIndex("events.jsonl")
print(index.total_lines)  # 1000

# ... later, after appending more data ...

new_count = index.update()
print(new_count)          # 50
print(index.total_lines)  # 1050

update() picks up where the index left off and only scans the new bytes.
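
To make "only scans the new bytes" concrete, here is a minimal sketch of an incremental scan. It assumes the index remembers how many bytes it has already covered; extend_offsets is a hypothetical helper, not part of the library's API.

```python
def extend_offsets(path: str, offsets: list[int], indexed_bytes: int) -> tuple[int, int]:
    """Index lines that start at or after indexed_bytes.

    Appends to `offsets` in place and returns (lines_added, new_indexed_bytes).
    """
    added = 0
    position = indexed_bytes
    with open(path, "rb") as f:
        f.seek(indexed_bytes)  # skip everything that is already indexed
        for raw in f:
            if not raw.endswith(b"\n"):
                # Assumption: an unterminated last line is still being
                # written, so leave it for the next call.
                break
            offsets.append(position)
            position += len(raw)
            added += 1
    return added, position
```

Each call costs time proportional to the appended bytes, not to the whole file.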

How it works

The library scans your file once and records the byte offset of each line. These offsets get saved to {filename}.idx. When you want line N, it just does file.seek(offset) instead of reading through the whole file.
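
The idea can be sketched in a few lines of plain Python. This is an illustration of byte-offset indexing in general, not the library's actual implementation; build_offsets and read_json_at are hypothetical names.

```python
import json
from pathlib import Path


def build_offsets(path: Path) -> list[int]:
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    position = 0
    with path.open("rb") as f:  # binary mode so offsets are true byte positions
        for raw in f:
            offsets.append(position)
            position += len(raw)
    return offsets


def read_json_at(path: Path, offsets: list[int], line_no: int):
    """Seek straight to line `line_no` (0-based) and parse it."""
    with path.open("rb") as f:
        f.seek(offsets[line_no])
        return json.loads(f.readline())
```

Building the offsets is a single pass over the file; after that, each lookup is one seek plus one readline, regardless of how far into the file the line sits.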

If the file's size or modification time changes, it detects that and rebuilds automatically.
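
A size-and-mtime check like the one described can be sketched as follows. The FileStamp record and both function names are assumptions made for illustration; the library's on-disk format may store this differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStamp:
    """What was true of the data file when the index was last saved."""
    size: int
    mtime_ns: int


def take_stamp(path: str) -> FileStamp:
    st = os.stat(path)
    return FileStamp(size=st.st_size, mtime_ns=st.st_mtime_ns)


def is_stale(saved: FileStamp, path: str) -> bool:
    """True if the file no longer matches the stamp saved with the index."""
    return take_stamp(path) != saved
```

A check like this is cheap because it needs only one stat call and never reads the file's contents.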

Examples

Checkpointing for crash recovery:

from pathlib import Path
from jsonl_resumable import JsonlIndex

checkpoint = Path("progress.txt")
index = JsonlIndex("events.jsonl")

start = int(checkpoint.read_text()) if checkpoint.exists() else 0

for i, event in enumerate(index.iter_json_from(start), start=start):
    process(event)
    if (i + 1) % 1000 == 0:
        # Store the next unprocessed line so a resume does not repeat line i
        checkpoint.write_text(str(i + 1))
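
If the process can die while the checkpoint file is being written, the file may be left empty or truncated. A small sketch of one way to guard against that, using a temporary file and os.replace; save_checkpoint and load_checkpoint are hypothetical helpers, not part of the library.

```python
import os
from pathlib import Path


def save_checkpoint(path: Path, next_line: int) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(str(next_line))
    os.replace(tmp, path)  # the rename is atomic on POSIX


def load_checkpoint(path: Path) -> int:
    """Return the saved line number, or 0 if there is no usable checkpoint."""
    try:
        return int(path.read_text())
    except (FileNotFoundError, ValueError):
        return 0
```

With these helpers, start = load_checkpoint(checkpoint) replaces the exists() check, and the loop calls save_checkpoint(checkpoint, i + 1) to record the next line to process.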

Random sampling:

import random
from jsonl_resumable import JsonlIndex

index = JsonlIndex("training_data.jsonl")
# random.sample raises ValueError if k exceeds the population size
sample_ids = random.sample(range(index.total_lines), k=min(1000, index.total_lines))
samples = [index.read_json(i) for i in sample_ids]

Tail (last N lines):

index = JsonlIndex("logs.jsonl")
for line in index.iter_from(max(0, index.total_lines - 100)):  # clamp for files under 100 lines
    print(line)

FAQ

How big is the index file?

About 15 bytes per line, so a 10-million-line file produces an index of roughly 150 MB.

What if the file gets modified (not just appended)?

The library compares file size and mtime. If something changed, it rebuilds. You can also call rebuild() explicitly.

Is it thread-safe?

Reads are fine from multiple threads. Don't call update() or rebuild() concurrently.
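
If several threads might trigger an update, one option is to serialize those calls behind a lock. The wrapper below is a sketch under that assumption; SerializedWriter is a hypothetical name, and it only relies on the wrapped object having update() and rebuild() methods.

```python
import threading


class SerializedWriter:
    """Ensure only one update() or rebuild() runs at a time."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def update(self):
        with self._lock:
            return self._index.update()

    def rebuild(self):
        with self._lock:
            return self._index.rebuild()
```

This only stops writers from overlapping with each other. Whether reads are safe while an update is in progress is not stated here, so treat that case with care.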

Why not linecache?

linecache loads the entire file into memory. This library keeps only byte offsets and reads lines from disk on demand, so memory usage does not depend on how large the lines themselves are.

License

MIT
