O(1) resume for large JSONL streams via byte-offset indexing

Project description

jsonl-resumable

Skip millions of lines in milliseconds.

Python 3.10+ | License: MIT


Why?

You have a 10GB JSONL file. Your script crashes at line 25 million. Now what?

# Without jsonl-resumable: wait 10 minutes to skip processed lines
with open("huge.jsonl") as f:
    for i, line in enumerate(f):
        if i < 25_000_000:
            continue  # 😴
        process(line)

# With jsonl-resumable: resume instantly
from jsonl_resumable import JsonlIndex

index = JsonlIndex("huge.jsonl")
for event in index.iter_json_from(25_000_000):  # ⚡ <1ms
    process(event)

Install

pip install jsonl-resumable

Quick Start

from jsonl_resumable import JsonlIndex

# First run: builds index (~20s for 1GB file)
# Next runs: loads from disk instantly
index = JsonlIndex("events.jsonl")

# Jump to any line in O(1)
event = index.read_json(1_000_000)

# Resume from any point
for event in index.iter_json_from(last_processed):
    process(event)

# File grew? Update index incrementally
index.update()  # Only indexes new lines

That's it. Three methods cover 90% of use cases.


Who is this for?

You're building...    Example
LLM data pipelines    Processing OpenAI fine-tuning datasets
ETL jobs              Resumable data transformations
Log analyzers         Jumping to specific timestamps
ML training           Random sampling from large datasets

Common thread: Large JSONL files where restarting from scratch is expensive.


API

Core Methods

index = JsonlIndex("data.jsonl")

# Read single line
index.read_json(1000)        # → dict/list (parsed)
index.read_line(1000)        # → str (raw)
index[1000]                  # → str (shorthand)

# Iterate from line N
index.iter_json_from(5000)   # → Iterator[dict|list]
index.iter_from(5000)        # → Iterator[str]

# After appending to file
index.update()               # → int (new lines indexed)

# Metadata
index.total_lines            # → int
index.file_size              # → int (bytes)

Options

JsonlIndex(
    "data.jsonl",
    checkpoint_interval=100,  # Memory vs speed tradeoff
    index_path="custom.idx",  # Where to save index
    auto_save=True,           # Persist after build/update
)

Maintenance

index.rebuild()   # Force full re-index
index.save()      # Manual persist

Incremental Updates

When your JSONL file grows (append-only), don't rebuild the entire index:

index = JsonlIndex("events.jsonl")
print(index.total_lines)  # 1000

# ... your app appends 50 new events ...

new_count = index.update()
print(f"Indexed {new_count} new lines")  # "Indexed 50 new lines"
print(index.total_lines)  # 1050

update() seeks to where the old index ended and only processes new bytes.
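
As a rough illustration of that idea, here is a standard-library sketch of indexing only the appended bytes. It is not the library's implementation: `extend_offsets` and `indexed_bytes` are hypothetical names, and the sketch assumes the file is append-only and that the previously indexed region ends on a newline.

```python
def extend_offsets(path, offsets, indexed_bytes):
    """Index only the bytes appended after `indexed_bytes`.

    Appends the start offset of each new complete line to `offsets` and
    returns (number of new lines, new indexed byte count).
    """
    new_lines = 0
    with open(path, "rb") as f:  # binary mode: offsets are byte positions
        f.seek(indexed_bytes)    # skip everything that is already indexed
        pos = indexed_bytes
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # a writer is mid-line; pick it up on the next call
            offsets.append(pos)
            pos += len(raw)
            new_lines += 1
    return new_lines, pos
```

Stopping at an unterminated final line is a design choice in this sketch: it avoids indexing a record that another process is still writing.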


How It Works

  1. Build: Scan file once, record byte offset of each line
  2. Persist: Save offsets to {filename}.idx (JSON format)
  3. Seek: Use file.seek(offset) to jump directly to any line
  4. Detect changes: Compare file size + mtime, rebuild if needed
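
The four steps above can be sketched with the standard library alone. This is a minimal illustration of byte-offset indexing, not the library's actual code: the function names (`build_offsets`, `save_index`, `is_stale`, `read_line_at`) and the layout of the saved index are assumptions made for the example.

```python
import json
import os


def build_offsets(path):
    """Step 1: scan the file once, recording the byte offset of each line."""
    offsets = []
    with open(path, "rb") as f:  # binary mode so offsets are true byte positions
        pos = 0
        for raw in f:
            offsets.append(pos)
            pos += len(raw)
    return offsets


def save_index(path, offsets, index_path):
    """Step 2: persist offsets with the file's size and mtime (assumed layout)."""
    st = os.stat(path)
    payload = {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "offsets": offsets}
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def read_line_at(path, offsets, line_no):
    """Step 3: seek straight to the line's offset and parse it as JSON."""
    with open(path, "rb") as f:
        f.seek(offsets[line_no])
        return json.loads(f.readline())


def is_stale(path, index_path):
    """Step 4: report whether size or mtime differ from the saved values."""
    with open(index_path, encoding="utf-8") as f:
        saved = json.load(f)
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns) != (saved["size"], saved["mtime_ns"])
```

Opening the file in binary mode matters here: text-mode positions are not reliable byte offsets once lines contain multi-byte UTF-8 characters.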

Real-World Patterns

Crash-Resilient Processing

from pathlib import Path
from jsonl_resumable import JsonlIndex

checkpoint = Path("progress.txt")
index = JsonlIndex("events.jsonl")

# Resume from last checkpoint
start = int(checkpoint.read_text()) if checkpoint.exists() else 0

for i, event in enumerate(index.iter_json_from(start), start=start):
    process(event)
    if (i + 1) % 1000 == 0:
        # Save the next unprocessed line so a resume does not repeat line i
        checkpoint.write_text(str(i + 1))
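
One caveat with the pattern above: `Path.write_text` truncates the file before writing, so a crash at the wrong moment can leave an empty or partial checkpoint. A common mitigation is to write to a temporary file and rename it into place. The helpers below (`save_checkpoint`, `load_checkpoint`) are hypothetical and not part of jsonl-resumable; treat them as a sketch.

```python
import os
from pathlib import Path


def save_checkpoint(path: Path, next_line: int) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(str(next_line))
    # os.replace overwrites the destination in a single rename step,
    # so readers see either the old checkpoint or the new one.
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> int:
    """Return the saved line number, or 0 if no checkpoint exists yet."""
    return int(path.read_text()) if path.exists() else 0
```

In the loop, `save_checkpoint(checkpoint, i + 1)` would stand in for `checkpoint.write_text(...)`, and `load_checkpoint(checkpoint)` for the `start = ...` line.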

Random Sampling

import random
from jsonl_resumable import JsonlIndex

index = JsonlIndex("training_data.jsonl")
sample_ids = random.sample(range(index.total_lines), k=1000)
samples = [index.read_json(i) for i in sample_ids]

Tail (Last N Lines)

index = JsonlIndex("logs.jsonl")
start = max(0, index.total_lines - 100)  # clamp for files shorter than 100 lines
for line in index.iter_from(start):
    print(line)

Parallel Chunk Processing

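from concurrent.futures import ProcessPoolExecutor
from itertools import islice
from jsonl_resumable import JsonlIndex

def process_range(args):
    path, start, end = args
    index = JsonlIndex(path)  # each worker loads the saved index from disk
    # islice stops after (end - start) lines instead of reading to end of file
    events = islice(index.iter_json_from(start), end - start)
    return [transform(e) for e in events]  # transform() is your own function

if __name__ == "__main__":  # required where workers are spawned (Windows, macOS)
    index = JsonlIndex("huge.jsonl")
    n_workers = 4
    total = index.total_lines
    chunk = total // n_workers

    # The last worker also takes the remainder left by integer division
    ranges = [
        ("huge.jsonl", i * chunk, total if i == n_workers - 1 else (i + 1) * chunk)
        for i in range(n_workers)
    ]
    with ProcessPoolExecutor(n_workers) as ex:
        results = ex.map(process_range, ranges)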

FAQ

Q: What's JSONL? JSON Lines: a text format with one valid JSON value (usually an object) per line. Used by OpenAI, Hugging Face, and many ML pipelines.

Q: How big is the index file? Roughly 15 bytes per line. A 10M line file → ~150MB index.

Q: What if the file is modified (not just appended)? Call rebuild(). Or just create a new JsonlIndex — it auto-detects changes via file size/mtime.

Q: Thread-safe? Read operations are safe. Don't call update() or rebuild() from multiple threads.

Q: Why not just use linecache? linecache loads the entire file into memory. This library uses byte offsets — constant memory regardless of file size.
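
On the thread-safety answer above: if several threads might trigger `update()` or `rebuild()`, one option is to funnel those calls through a single lock. The wrapper below is a hypothetical helper, not part of jsonl-resumable; it only assumes the wrapped object exposes `update()` and `rebuild()` as listed in the API section.

```python
import threading


class SerializedIndex:
    """Let only one thread at a time call update() or rebuild()."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def update(self):
        with self._lock:
            return self._index.update()

    def rebuild(self):
        with self._lock:
            return self._index.rebuild()
```

This only serializes writers. Whether reads stay safe while an update is in progress depends on the library's internals, which this sketch does not address.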


License

MIT

Download files

Source Distribution

jsonl_resumable-0.1.0.tar.gz (12.2 kB)

Uploaded Source

Built Distribution

jsonl_resumable-0.1.0-py3-none-any.whl (9.7 kB)

Uploaded Python 3

File details

Details for the file jsonl_resumable-0.1.0.tar.gz.

File metadata

  • Download URL: jsonl_resumable-0.1.0.tar.gz
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for jsonl_resumable-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4c2331429f137b28657d5ad8fad0ddbf7e93301327f0b789f7c67b2a528fa45c
MD5 6754a09456816250679525ca9779d645
BLAKE2b-256 8fc7de579775fa2c3a3f3dd28d1aa3f618080f8d2fb5ead7452f6236712e1588

File details

Details for the file jsonl_resumable-0.1.0-py3-none-any.whl.

File hashes

Hashes for jsonl_resumable-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 728a1d6e46df3b5f83bd5c434a5d09fd1cad1bd42806ac78bbcc8f1a6fa3d19d
MD5 44b2e4676c37074949018a5f0262c60a
BLAKE2b-256 0e177603c53c5b2e8999bcbba1c21c2d50ef7be381816b6598f1c377b637a93d
