O(1) resume for large JSONL streams via byte-offset indexing
jsonl-resumable
Skip millions of lines in milliseconds.
Why?
You have a 10GB JSONL file. Your script crashes at line 25 million. Now what?
# Without jsonl-resumable: wait 10 minutes to skip processed lines
for i, line in enumerate(open("huge.jsonl")):
    if i < 25_000_000:
        continue  # 😴
    process(line)

# With jsonl-resumable: resume instantly
from jsonl_resumable import JsonlIndex

index = JsonlIndex("huge.jsonl")
for event in index.iter_json_from(25_000_000):  # ⚡ <1ms
    process(event)
Install
pip install jsonl-resumable
Quick Start
from jsonl_resumable import JsonlIndex
# First run: builds index (~20s for 1GB file)
# Next runs: loads from disk instantly
index = JsonlIndex("events.jsonl")
# Jump to any line in O(1)
event = index.read_json(1_000_000)
# Resume from any point
for event in index.iter_json_from(last_processed):
    process(event)
# File grew? Update index incrementally
index.update() # Only indexes new lines
That's it. Three methods cover 90% of use cases.
Who is this for?
| You're building... | Example |
|---|---|
| LLM data pipelines | Processing OpenAI fine-tuning datasets |
| ETL jobs | Resumable data transformations |
| Log analyzers | Jumping to specific timestamps |
| ML training | Random sampling from large datasets |
Common thread: Large JSONL files where restarting from scratch is expensive.
API
Core Methods
index = JsonlIndex("data.jsonl")
# Read single line
index.read_json(1000) # → dict/list (parsed)
index.read_line(1000) # → str (raw)
index[1000] # → str (shorthand)
# Iterate from line N
index.iter_json_from(5000) # → Iterator[dict|list]
index.iter_from(5000) # → Iterator[str]
# After appending to file
index.update() # → int (new lines indexed)
# Metadata
index.total_lines # → int
index.file_size # → int (bytes)
Options
JsonlIndex(
    "data.jsonl",
    checkpoint_interval=100,  # Memory vs speed tradeoff
    index_path="custom.idx",  # Where to save index
    auto_save=True,           # Persist after build/update
)
Maintenance
index.rebuild() # Force full re-index
index.save() # Manual persist
Incremental Updates
When your JSONL file grows (append-only), don't rebuild the entire index:
index = JsonlIndex("events.jsonl")
print(index.total_lines) # 1000
# ... your app appends 50 new events ...
new_count = index.update()
print(f"Indexed {new_count} new lines") # "Indexed 50 new lines"
print(index.total_lines) # 1050
update() seeks to where the old index ended and only processes new bytes.
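To make the idea concrete, here is a minimal standalone sketch of incremental indexing. It is not the library's implementation: `extend_offsets` and its arguments are hypothetical names, and skipping a trailing line that has no newline yet is an assumption about how a half-written record might be handled.

```python
import os
import tempfile

def extend_offsets(path, offsets, indexed_bytes):
    """Index only the bytes appended after `indexed_bytes`.

    Appends new line offsets to `offsets` in place and returns
    (number of new lines, new indexed_bytes).
    """
    added = 0
    with open(path, "rb") as f:  # binary mode keeps offsets exact
        f.seek(indexed_bytes)    # skip everything already indexed
        pos = indexed_bytes
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # assumption: a partial line is left for the next call
            offsets.append(pos)
            pos += len(raw)
            added += 1
    return added, pos

# Demo: index two lines, append one more, index only the new one
path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
with open(path, "wb") as f:
    f.write(b'{"n": 0}\n{"n": 1}\n')

offsets = []
first, end = extend_offsets(path, offsets, 0)

with open(path, "ab") as f:
    f.write(b'{"n": 2}\n')

second, end = extend_offsets(path, offsets, end)
print(first, second, offsets)
```

The second call starts reading at the saved end position, so its cost depends on the appended bytes rather than on the whole file.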
How It Works
- Build: Scan file once, record byte offset of each line
- Persist: Save offsets to {filename}.idx (JSON format)
- Seek: Use file.seek(offset) to jump directly to any line
- Detect changes: Compare file size + mtime, rebuild if needed
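The build, persist, and seek steps can be sketched with the standard library alone. This is an illustration of the technique, not the library's code: `build_offsets`, `read_line_at`, and the `{"offsets": [...]}` file layout are assumptions made for the example.

```python
import json
import os
import tempfile

def build_offsets(path):
    """Scan the file once and return the starting byte offset of every line."""
    offsets = []
    with open(path, "rb") as f:  # binary mode so offsets are exact byte positions
        pos = 0
        for raw in f:
            offsets.append(pos)
            pos += len(raw)
    return offsets

def read_line_at(path, offsets, n):
    """Seek straight to line n and return it as text, without its newline."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode("utf-8").rstrip("\n")

# Build: write a small demo file and index it
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "demo.jsonl")
with open(data_path, "w", encoding="utf-8", newline="\n") as f:
    for i in range(5):
        f.write(json.dumps({"id": i, "text": "é" * i}) + "\n")

offsets = build_offsets(data_path)

# Persist: save the offsets as JSON next to the data file (layout is illustrative)
idx_path = data_path + ".idx"
with open(idx_path, "w", encoding="utf-8") as f:
    json.dump({"offsets": offsets}, f)

# Seek: reload the offsets and jump directly to line 3
with open(idx_path, encoding="utf-8") as f:
    loaded = json.load(f)["offsets"]

record = json.loads(read_line_at(data_path, loaded, 3))
print(record)
```

Opening the data file in binary mode matters here: text-mode positions are not guaranteed to be plain byte counts, so offsets recorded that way would not be safe to reuse with `seek`.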
Real-World Patterns
Crash-Resilient Processing
from pathlib import Path
from jsonl_resumable import JsonlIndex
checkpoint = Path("progress.txt")
index = JsonlIndex("events.jsonl")
# Resume from the last checkpoint (the stored value is the next line to process)
start = int(checkpoint.read_text()) if checkpoint.exists() else 0
for i, event in enumerate(index.iter_json_from(start), start=start):
    process(event)
    if (i + 1) % 1000 == 0:
        checkpoint.write_text(str(i + 1))  # line i is done; resume at i + 1
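One caveat with a plain `write_text` checkpoint: if the process dies while the file is being rewritten, the checkpoint itself could end up empty or truncated. A common way around that is to write to a temporary file and rename it into place. The sketch below assumes that approach; `save_checkpoint` and `load_checkpoint` are hypothetical helpers, not part of jsonl-resumable.

```python
import os
import tempfile
from pathlib import Path

def save_checkpoint(path, next_line):
    """Write the checkpoint to a temp file, then swap it into place."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(str(next_line))
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)     # atomic rename on POSIX (same filesystem)

def load_checkpoint(path):
    """Return the saved line number, or 0 when no checkpoint exists yet."""
    path = Path(path)
    return int(path.read_text(encoding="utf-8")) if path.exists() else 0

# Demo in a temporary directory
ckpt = Path(tempfile.mkdtemp()) / "progress.txt"
before = load_checkpoint(ckpt)
save_checkpoint(ckpt, 1001)
after = load_checkpoint(ckpt)
print(before, after)
```

Readers then see either the old checkpoint or the new one, never a half-written file.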
Random Sampling
import random
from jsonl_resumable import JsonlIndex
index = JsonlIndex("training_data.jsonl")
sample_ids = random.sample(range(index.total_lines), k=1000)
samples = [index.read_json(i) for i in sample_ids]
Tail (Last N Lines)
index = JsonlIndex("logs.jsonl")
start = max(0, index.total_lines - 100)  # guard against files shorter than 100 lines
for line in index.iter_from(start):
    print(line)
Parallel Chunk Processing
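from concurrent.futures import ProcessPoolExecutor
from itertools import islice
from jsonl_resumable import JsonlIndex

def process_range(args):
    path, start, end = args
    index = JsonlIndex(path)  # each worker opens its own index
    # Take exactly (end - start) records, beginning at line `start`
    records = islice(index.iter_json_from(start), end - start)
    return [transform(e) for e in records]  # transform() is your own function

if __name__ == "__main__":  # required where worker processes are spawned
    index = JsonlIndex("huge.jsonl")
    n_workers = 4
    chunk = index.total_lines // n_workers
    # The last range runs to total_lines so leftover lines are not dropped
    ranges = [
        ("huge.jsonl", i * chunk,
         index.total_lines if i == n_workers - 1 else (i + 1) * chunk)
        for i in range(n_workers)
    ]
    with ProcessPoolExecutor(n_workers) as ex:
        results = list(ex.map(process_range, ranges))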
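When total_lines is not a multiple of the worker count, integer division leaves a remainder that has to go somewhere. A small helper can spread it across the first few workers. `split_ranges` is a hypothetical name used for illustration, not something the library provides.

```python
def split_ranges(total, n_workers):
    """Split range(total) into n_workers contiguous half-open (start, end) ranges."""
    base, extra = divmod(total, n_workers)
    ranges = []
    start = 0
    for w in range(n_workers):
        # The first `extra` workers each take one additional line
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

ranges = split_ranges(10, 4)
print(ranges)
```

Every line lands in exactly one range, and no range is more than one line longer than another.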
FAQ
Q: What's JSONL?
JSON Lines: a text format where each line is one valid JSON value, usually an object. Used by OpenAI, Hugging Face, and many ML pipelines.
Q: How big is the index file?
Roughly 15 bytes per line. A 10M-line file → ~150MB index.
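For a quick estimate on your own data, the arithmetic is a single multiplication. The helper below is a hypothetical convenience that takes the ~15 bytes per line figure at face value; real sizes will vary with offset magnitudes and index settings.

```python
def estimate_index_bytes(n_lines, bytes_per_line=15):
    """Rough index size, using the ~15 bytes/line figure quoted above."""
    return n_lines * bytes_per_line

size_mb = estimate_index_bytes(10_000_000) / 1_000_000
print(f"{size_mb:.0f} MB")
```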
Q: What if the file is modified (not just appended)?
Call rebuild(). Or just create a new JsonlIndex — it auto-detects changes via file size/mtime.
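A size/mtime check of this kind can be sketched as follows. It shows the general technique only: `file_signature` and `classify_change` are hypothetical names, and the library's actual decision rules may differ.

```python
import os
import tempfile

def file_signature(path):
    """Return (size in bytes, mtime in nanoseconds) for the file."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns

def classify_change(saved, current):
    """Compare a saved signature with the current one.

    Returns "unchanged", "grown", or "rebuild".
    """
    if saved == current:
        return "unchanged"
    saved_size, _ = saved
    current_size, _ = current
    if current_size > saved_size:
        return "grown"   # possibly append-only: an incremental update may suffice
    return "rebuild"     # shrunk, or same size with a new mtime: offsets may be stale

# Demo: record a signature, append a line, compare
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "wb") as f:
    f.write(b'{"a": 1}\n')
saved = file_signature(path)

with open(path, "ab") as f:
    f.write(b'{"a": 2}\n')
status = classify_change(saved, file_signature(path))
print(status)
```

A larger file does not prove that earlier bytes are untouched, so size and mtime are only a cheap heuristic. When in doubt, a full rebuild is the safe choice.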
Q: Thread-safe?
Read operations are safe. Don't call update() or rebuild() from multiple threads.
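If several threads might trigger an update, one option is to funnel those calls through a lock. The wrapper below is a sketch under that assumption; `SerializedUpdater` and `FakeIndex` are hypothetical, and the fake class merely stands in for a real index so the example runs on its own.

```python
import threading

class SerializedUpdater:
    """Wrap any object with an update() method so only one thread runs it at a time."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def update(self):
        with self._lock:  # concurrent callers wait here
            return self._index.update()

class FakeIndex:
    """Stand-in for demonstration only; not the library's class."""

    def __init__(self):
        self.calls = 0

    def update(self):
        self.calls += 1
        return 0

fake = FakeIndex()
safe = SerializedUpdater(fake)
threads = [threading.Thread(target=safe.update) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(fake.calls)
```

This only serializes writers. Whether reads may safely overlap with a running update is not something this sketch addresses.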
Q: Why not just use linecache?
linecache reads the entire file into memory. This library stores only byte offsets and seeks to the requested line, so the file's contents never have to fit in memory.
License
MIT