O(1) resume for large JSONL streams via byte-offset indexing

jsonl-resumable

Skip millions of lines in milliseconds.

Why?

You have a 10GB JSONL file. Your script crashes at line 25 million. Now what?

# Without jsonl-resumable: wait 10 minutes to skip processed lines
for i, line in enumerate(open("huge.jsonl")):
    if i < 25_000_000:
        continue  # 😴
    process(line)

# With jsonl-resumable: resume instantly
from jsonl_resumable import JsonlIndex

index = JsonlIndex("huge.jsonl")
for event in index.iter_json_from(25_000_000):  # ⚡ <1ms
    process(event)

Install

pip install jsonl-resumable

Quick Start

from jsonl_resumable import JsonlIndex

# First run: builds index (~20s for 1GB file)
# Next runs: loads from disk instantly
index = JsonlIndex("events.jsonl")

# Jump to any line in O(1)
event = index.read_json(1_000_000)

# Resume from any point
for event in index.iter_json_from(last_processed):
    process(event)

# File grew? Update index incrementally
index.update()  # Only indexes new lines

That's it. Three methods cover 90% of use cases.

Who is this for?

You're building...	Example
LLM data pipelines	Processing OpenAI fine-tuning datasets
ETL jobs	Resumable data transformations
Log analyzers	Jumping to specific timestamps
ML training	Random sampling from large datasets

Common thread: Large JSONL files where restarting from scratch is expensive.

API

Core Methods

index = JsonlIndex("data.jsonl")

# Read single line
index.read_json(1000)        # → dict/list (parsed)
index.read_line(1000)        # → str (raw)
index[1000]                  # → str (shorthand)

# Iterate from line N
index.iter_json_from(5000)   # → Iterator[dict|list]
index.iter_from(5000)        # → Iterator[str]

# After appending to file
index.update()               # → int (new lines indexed)

# Metadata
index.total_lines            # → int
index.file_size              # → int (bytes)

Options

JsonlIndex(
    "data.jsonl",
    checkpoint_interval=100,  # Memory vs speed tradeoff
    index_path="custom.idx",  # Where to save index
    auto_save=True,           # Persist after build/update
)
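
The README does not say exactly what checkpoint_interval controls. One common design that fits the "memory vs speed" comment is a sparse index: store a byte offset only every N lines, seek to the nearest stored offset, and skip forward the remaining lines. The sketch below illustrates that idea only. The helper names build_sparse and read_line_sparse are hypothetical and are not part of jsonl-resumable, and the library may work differently.

```python
import io


def build_sparse(f, interval):
    """Record the byte offset of every `interval`-th line (0, interval, 2*interval, ...)."""
    checkpoints = []
    pos = 0
    for line_no, raw in enumerate(f):
        if line_no % interval == 0:
            checkpoints.append(pos)
        pos += len(raw)
    return checkpoints


def read_line_sparse(f, checkpoints, interval, line_no):
    """Seek to the nearest checkpoint at or before line_no, then skip forward."""
    f.seek(checkpoints[line_no // interval])
    for _ in range(line_no % interval):
        f.readline()  # at most interval - 1 lines are skipped
    return f.readline()


# In-memory stand-in for a JSONL file opened in binary mode
data = io.BytesIO(b"".join(b'{"id": %d}\n' % i for i in range(10)))
checkpoints = build_sparse(data, interval=4)
line7 = read_line_sparse(data, checkpoints, 4, 7)
```

Under this assumption, a larger interval means fewer stored offsets (less memory, smaller index file) at the cost of reading up to interval - 1 extra lines per lookup.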

Maintenance

index.rebuild()   # Force full re-index
index.save()      # Manual persist

Incremental Updates

When your JSONL file grows (append-only), don't rebuild the entire index:

index = JsonlIndex("events.jsonl")
print(index.total_lines)  # 1000

# ... your app appends 50 new events ...

new_count = index.update()
print(f"Indexed {new_count} new lines")  # "Indexed 50 new lines"
print(index.total_lines)  # 1050

update() seeks to where the old index ended and only processes new bytes.
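
To make the idea concrete, here is a minimal sketch of an incremental scan, assuming the index remembers the byte position where the last scan stopped. scan_from is a hypothetical helper, not the library's internal code, and the choice to skip a trailing line without a newline is an assumption of this sketch.

```python
import os
import tempfile


def scan_from(path, start_byte):
    """Return (offsets of complete lines at or after start_byte, new end position)."""
    new_offsets = []
    with open(path, "rb") as f:
        f.seek(start_byte)  # skip everything that is already indexed
        pos = start_byte
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next time
            new_offsets.append(pos)
            pos += len(raw)
    return new_offsets, pos


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    with open(path, "wb") as f:
        f.write(b'{"n": 0}\n{"n": 1}\n')

    offsets, indexed_end = scan_from(path, 0)  # initial build

    with open(path, "ab") as f:  # the app appends two events
        f.write(b'{"n": 2}\n{"n": 3}\n')

    added, indexed_end = scan_from(path, indexed_end)  # incremental update
    offsets.extend(added)
```

The second call reads only the 18 appended bytes, which is why the cost of an update depends on how much was appended rather than on total file size.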

How It Works

Build: Scan file once, record byte offset of each line
Persist: Save offsets to {filename}.idx (JSON format)
Seek: Use file.seek(offset) to jump directly to any line
Detect changes: Compare file size + mtime, rebuild if needed
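
The four steps above can be sketched with the standard library alone. This is an illustration of the technique, not the library's actual code: the function names and the layout of the saved JSON (size, mtime_ns, offsets) are assumptions made for the example.

```python
import json
import os
import tempfile


def build_offsets(path):
    """Build: scan the file once and return the starting byte offset of every line."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:  # binary mode so offsets are true byte positions
        for raw in f:
            offsets.append(pos)
            pos += len(raw)
    return offsets


def save_index(path, offsets, idx_path):
    """Persist: store offsets plus the size and mtime used for change detection."""
    st = os.stat(path)
    meta = {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "offsets": offsets}
    with open(idx_path, "w", encoding="utf-8") as f:
        json.dump(meta, f)


def is_stale(path, idx_path):
    """Detect changes: True when the file no longer matches the saved size/mtime."""
    with open(idx_path, encoding="utf-8") as f:
        meta = json.load(f)
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns) != (meta["size"], meta["mtime_ns"])


def read_json_at(path, offsets, line_no):
    """Seek: jump straight to the requested line and parse it."""
    with open(path, "rb") as f:
        f.seek(offsets[line_no])
        return json.loads(f.readline())


# Small self-check on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "demo.jsonl")
    idx = os.path.join(tmp, "demo.jsonl.idx")
    with open(data, "w", encoding="utf-8", newline="\n") as f:
        for i in range(5):
            f.write(json.dumps({"id": i}) + "\n")

    offsets = build_offsets(data)
    save_index(data, offsets, idx)
    third = read_json_at(data, offsets, 3)
    stale_before = is_stale(data, idx)

    with open(data, "a", encoding="utf-8", newline="\n") as f:
        f.write(json.dumps({"id": 5}) + "\n")
    stale_after = is_stale(data, idx)
```

Opening the file in binary mode matters here: text-mode positions are not guaranteed to be plain byte counts, so offsets recorded in text mode cannot be trusted for seek().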

Real-World Patterns

Crash-Resilient Processing

from pathlib import Path
from jsonl_resumable import JsonlIndex

checkpoint = Path("progress.txt")
index = JsonlIndex("events.jsonl")

# Resume from last checkpoint
start = int(checkpoint.read_text()) if checkpoint.exists() else 0

for i, event in enumerate(index.iter_json_from(start), start=start):
    process(event)
    # Save the next unprocessed line so a restart does not repeat event i
    if (i + 1) % 1000 == 0:
        checkpoint.write_text(str(i + 1))
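
A crash in the middle of writing the checkpoint file itself can leave it empty or truncated. A common remedy, sketched here with the standard library only, is to write to a temporary file and swap it into place with os.replace. The helper names write_checkpoint and read_checkpoint are hypothetical and not part of jsonl-resumable.

```python
import os
import tempfile
from pathlib import Path


def write_checkpoint(path: Path, line_no: int) -> None:
    """Write the checkpoint to a temp file, then atomically swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(str(line_no))
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def read_checkpoint(path: Path) -> int:
    """Return the saved line number, or 0 when no checkpoint exists yet."""
    return int(path.read_text()) if path.exists() else 0


with tempfile.TemporaryDirectory() as d:
    cp = Path(d) / "progress.txt"
    first = read_checkpoint(cp)
    write_checkpoint(cp, 1000)
    write_checkpoint(cp, 2000)
    latest = read_checkpoint(cp)
```

With this pattern a reader sees either the old checkpoint or the new one, never a half-written file.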

Random Sampling

import random
from jsonl_resumable import JsonlIndex

index = JsonlIndex("training_data.jsonl")
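# Cap k so random.sample() does not raise on files with fewer than 1000 lines
sample_ids = random.sample(range(index.total_lines), k=min(1000, index.total_lines))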
samples = [index.read_json(i) for i in sample_ids]

Tail (Last N Lines)

index = JsonlIndex("logs.jsonl")
for line in index.iter_from(max(0, index.total_lines - 100)):  # clamp for short files
    print(line)

Parallel Chunk Processing

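from concurrent.futures import ProcessPoolExecutor
from itertools import islice
from jsonl_resumable import JsonlIndex

def process_range(args):
    path, start, end = args
    index = JsonlIndex(path)  # each worker loads the saved index
    # Stop after (end - start) records instead of reading to the end of the file
    events = islice(index.iter_json_from(start), end - start)
    return [transform(e) for e in events]  # transform() is your own function

if __name__ == "__main__":  # needed on platforms that spawn worker processes
    index = JsonlIndex("huge.jsonl")
    n_workers = 4
    total = index.total_lines
    # Ceiling division so the trailing lines are not dropped
    chunk = max(1, -(-total // n_workers))
    ranges = [("huge.jsonl", s, min(s + chunk, total))
              for s in range(0, total, chunk)]

    with ProcessPoolExecutor(n_workers) as ex:
        results = list(ex.map(process_range, ranges))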

FAQ

Q: What's JSONL? JSON Lines: a text format in which each line is a valid JSON value, usually an object. Used by OpenAI, Hugging Face, and many ML pipelines.

Q: How big is the index file? Roughly 15 bytes per line. A 10M line file → ~150MB index.

Q: What if the file is modified (not just appended)? Call rebuild(). Or just create a new JsonlIndex — it auto-detects changes via file size/mtime.

Q: Thread-safe? Read operations are safe. Don't call update() or rebuild() from multiple threads.

Q: Why not just use linecache? linecache reads the entire file into memory. This library keeps only byte offsets and reads lines from disk on demand, so the file contents never need to fit in memory.

License

MIT

Release history

Version	Released
0.5.0	Feb 1, 2026
0.4.0	Feb 1, 2026
0.3.0	Feb 1, 2026
0.2.0	Feb 1, 2026
0.1.0 (this version)	Jan 31, 2026

Download files

Source Distribution

jsonl_resumable-0.1.0.tar.gz (12.2 kB)
Upload date: Jan 31, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

Hashes for jsonl_resumable-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4c2331429f137b28657d5ad8fad0ddbf7e93301327f0b789f7c67b2a528fa45c`
MD5	`6754a09456816250679525ca9779d645`
BLAKE2b-256	`8fc7de579775fa2c3a3f3dd28d1aa3f618080f8d2fb5ead7452f6236712e1588`

Built Distribution

jsonl_resumable-0.1.0-py3-none-any.whl (9.7 kB)
Upload date: Jan 31, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

Hashes for jsonl_resumable-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`728a1d6e46df3b5f83bd5c434a5d09fd1cad1bd42806ac78bbcc8f1a6fa3d19d`
MD5	`44b2e4676c37074949018a5f0262c60a`
BLAKE2b-256	`0e177603c53c5b2e8999bcbba1c21c2d50ef7be381816b6598f1c377b637a93d`