JsonLineTypes

Dict/List backed by JSONL files for large datasets - memory efficient alternatives to dict/list

For the Chinese version, see README_CN.md

Overview

JsonLineTypes provides JLFDict and JLFList classes that mimic the behavior of Python's built-in dict and list types, but store data in JSON Lines (JSONL) files on disk instead of in memory. This is particularly useful when working with large datasets that don't fit in RAM.

Features

JLFDict

  • ✅ Drop-in replacement for dict
  • ✅ O(1) lookup, insert, update, delete
  • ✅ Supports: keys(), values(), items(), iteration
  • ✅ Supports: get(), pop(), clear(), update()
  • ✅ Index persistence for fast startup
  • ✅ Compact operation to clean up deleted records

JLFList

  • ✅ Drop-in replacement for list (except insert(); see Limitations)
  • ✅ O(1) append and index access; extend() is O(1) per item added
  • ✅ Supports: negative indices, iteration
  • ✅ Supports: pop(), reverse(), clear()
  • ✅ Index persistence for fast startup
  • ✅ Compact operation to clean up deleted records

Installation

pip install jsonlinetypes

Or install from source:

git clone https://github.com/yourusername/jsonlinetypes.git
cd jsonlinetypes
pip install .

Quick Start

JLFDict Usage

from jsonlinetypes import JLFDict

# Create a dict-like object backed by a JSONL file
d = JLFDict("data.jsonl", "id")

# Add data (just like dict)
d[1] = {"id": 1, "name": "Alice"}
d[2] = {"id": 2, "name": "Bob"}

# Access data
print(d[1])  # Output: {'id': 1, 'name': 'Alice'}

# Iterate (just like dict)
for key, value in d.items():
    print(key, value)

# Update existing key
d[1] = {"id": 1, "name": "Alice2"}

# Delete a key
del d[2]

# Pop a key
value = d.pop(1, None)  # returns None if key 1 is missing

# Batch update
d.update({3: {"id": 3, "name": "Charlie"}})

# Compact to remove deleted records
d.compact()

JLFList Usage

from jsonlinetypes import JLFList

# Create a list-like object backed by a JSONL file
lst = JLFList("items.jsonl")

# Add items (just like list)
lst.append({"name": "Alice"})

# Batch add
lst.extend([{"name": "Bob"}, {"name": "Charlie"}])

# Access by index
print(lst[0])    # Output: {'name': 'Alice'}
print(lst[-1])   # Output: {'name': 'Charlie'}

# Iterate (just like list)
for item in lst:
    print(item)

# Update by index
lst[0] = {"name": "Alice2"}

# Delete by index
del lst[1]

# Pop last item
item = lst.pop()

# Reverse in place
lst.reverse()

# Compact to remove deleted records
lst.compact()

How It Works

Storage Format

Data is stored in JSON Lines format (one JSON object per line, separated by newline):

{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 1, "name": "Alice2"}
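
As a rough illustration of how such a file can be read back, the sketch below replays the three lines above and keeps the last record seen for each id. It uses only the standard library. The helper name replay is hypothetical and not part of JsonLineTypes, and the sketch ignores deletion markers, which the real library also has to handle.

```python
import json


def replay(lines, key_field):
    """Rebuild a key -> record mapping; later lines override earlier ones."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        result[record[key_field]] = record
    return result


lines = [
    '{"id": 1, "name": "Alice"}',
    '{"id": 2, "name": "Bob"}',
    '{"id": 1, "name": "Alice2"}',
]
state = replay(lines, "id")
print(state[1]["name"])  # Alice2
```

The third line wins for id 1 because it comes last in the file, which is what makes an append-only update possible.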

Indexing

An index file (*.jsonl.idx) is maintained using pickle for O(1) lookups:

  • JLFDict: Maps keys to file offsets
  • JLFList: Maps indices to file offsets
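
To make the idea of an offset index concrete, here is a minimal sketch using only the standard library. The function names build_offset_index and read_at are hypothetical, and the layout of the real *.jsonl.idx file is not documented here, so treat the pickled dict as an assumption rather than the library's actual format.

```python
import json
import os
import pickle
import tempfile


def build_offset_index(path, key_field):
    """Scan a JSONL file once and map each key to the byte offset of its latest line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            if line.strip():
                index[json.loads(line)[key_field]] = offset
    return index


def read_at(path, offset):
    """Seek straight to a stored offset and decode the single line found there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.readline())


data_path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(data_path, "w", encoding="utf-8") as f:
    f.write('{"id": 1, "name": "Alice"}\n')
    f.write('{"id": 2, "name": "Bob"}\n')
    f.write('{"id": 1, "name": "Alice2"}\n')

index = build_offset_index(data_path, "id")

# Persist the index with pickle so a later run can skip the full scan.
# (Assumed format: a plain dict; the real .idx layout may differ.)
idx_path = data_path + ".idx"
with open(idx_path, "wb") as f:
    pickle.dump(index, f)
with open(idx_path, "rb") as f:
    loaded = pickle.load(f)

print(read_at(data_path, loaded[1]))  # {'id': 1, 'name': 'Alice2'}
```

A lookup costs one dict access plus one seek and one line read, independent of file size, which is where the O(1) figure comes from.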

Delete/Update Strategy

When you delete or update a record:

  1. A deletion marker is appended to the file
  2. For updates, a new record is also appended
  3. Index is updated accordingly
  4. Call compact() to clean up (optional, recommended periodically)
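
The steps above can be sketched with a toy append-only store. This is not the JsonLineTypes implementation: the class AppendOnlyStore, its method names, and in particular the marker format {"_deleted": true, ...} are assumptions made for illustration.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Toy append-only key/value store; NOT the JsonLineTypes implementation."""

    def __init__(self, path, key_field):
        self.path = path
        self.key_field = key_field
        self.index = {}  # key -> byte offset of the live record
        open(path, "ab").close()  # make sure the file exists

    def _append(self, obj):
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(json.dumps(obj).encode("utf-8") + b"\n")
        return offset

    def _marker(self, key):
        # Assumed marker format; the real library's marker may differ.
        return {"_deleted": True, self.key_field: key}

    def set(self, key, record):
        if key in self.index:
            self._append(self._marker(key))  # step 1: deletion marker
        self.index[key] = self._append(record)  # steps 2-3: new record, index

    def delete(self, key):
        if key not in self.index:
            raise KeyError(key)
        self._append(self._marker(key))
        del self.index[key]

    def get(self, key):
        with open(self.path, "rb") as f:
            f.seek(self.index[key])
            return json.loads(f.readline())

    def compact(self):
        """Copy only live records to a new file, then swap it into place."""
        tmp_path = self.path + ".tmp"
        new_index = {}
        with open(self.path, "rb") as src, open(tmp_path, "wb") as dst:
            for key, offset in self.index.items():
                src.seek(offset)
                new_index[key] = dst.tell()
                dst.write(src.readline())
        os.replace(tmp_path, self.path)
        self.index = new_index


path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
store = AppendOnlyStore(path, "id")
store.set(1, {"id": 1, "name": "Alice"})
store.set(2, {"id": 2, "name": "Bob"})
store.set(1, {"id": 1, "name": "Alice2"})  # update: marker + new record
store.delete(2)                             # delete: marker only

size_before = os.path.getsize(path)
store.compact()
size_after = os.path.getsize(path)
print(store.get(1), size_after < size_before)
```

Before compaction the file holds five lines (two stale records, two markers, one live record); afterwards only the live record remains.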

Performance

Benchmark Results

(Tested on a typical dataset of 1,000 records)

Operation | Time
Insert 1000 items | < 1s
Read 1000 items | < 0.1s
Update 1000 items | < 1s
Delete 1000 items | < 1s
Compact | < 0.5s

Memory Usage

  • Only index is kept in memory
  • ~100 bytes per record (regardless of data size)
  • 10 million records ≈ 1GB RAM
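
The last figure follows directly from the per-record estimate. A small sketch of the arithmetic, where the function name estimated_index_bytes is hypothetical and the 100-byte figure is the approximation quoted above rather than a measured value:

```python
BYTES_PER_RECORD = 100  # approximate index cost per record, as quoted above


def estimated_index_bytes(num_records, bytes_per_record=BYTES_PER_RECORD):
    """Rough in-memory index size for a given number of records."""
    return num_records * bytes_per_record


estimate = estimated_index_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB")  # 1.0 GB
```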

Thread Safety Performance

Version | Performance Overhead | Use Case
JLFDict (unsafe) | 0% | Single-threaded applications
ThreadSafeJLFDict (single op) | 3-8% | Multi-threaded, few operations
ThreadSafeJLFDict (batch) | 0-3% | Multi-threaded, batch operations

Key finding: the thread-safe version adds little overhead (2.8% on average) because disk I/O, not locking, is the main bottleneck.

For detailed performance analysis, see PERFORMANCE.md.

Comparison with dict/list

Feature | dict/list | JLFDict/JLFList
Memory | All in RAM | Index only
Max Size | Limited by RAM | Limited by disk
Read Speed | Faster | Fast (seek + read)
Write Speed | Faster | Fast (append + index)
Persistence | Manual | Automatic
Compact | N/A | Recommended periodically
Thread Safety | N/A | Thread-safe by default (see Thread Safety)

API Reference

JLFDict

Constructor

JLFDict(file_path, key_field, auto_save_index=True)
  • file_path: Path to JSONL file
  • key_field: Field name to use as key
  • auto_save_index: Automatically save index on changes

Methods

  • __getitem__(key): Get value by key
  • __setitem__(key, value): Set value by key
  • __delitem__(key): Delete key
  • get(key, default=None): Get with default
  • keys(): Get keys view
  • values(): Get values view
  • items(): Get items view
  • pop(key, default): Pop and return value
  • update(other): Update with dict/iterable
  • clear(): Clear all data
  • compact(): Clean up deleted records

JLFList

Constructor

JLFList(file_path, auto_save_index=True)
  • file_path: Path to JSONL file
  • auto_save_index: Automatically save index on changes

Methods

  • __getitem__(index): Get value by index
  • __setitem__(index, value): Set value by index
  • __delitem__(index): Delete by index
  • append(value): Append value
  • extend(values): Extend with iterable
  • pop(index=-1): Pop and return the value at index (the last item by default)
  • reverse(): Reverse in place
  • clear(): Clear all data
  • compact(): Clean up deleted records

Best Practices

  1. Use compact() periodically - Call after many delete/update operations
  2. Choose appropriate keys - Use unique, stable keys for JLFDict
  3. Batch operations - Use update()/extend() for better performance
  4. Monitor file size - Large files still benefit from compaction
  5. Backup before compact - Compact rewrites the file
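
For the last point, a backup can be as simple as copying the file before calling compact(). The sketch below uses only the standard library and a throwaway file. The helper name backup_file and the .bak suffix are hypothetical, and whether the .idx file should be copied as well is an assumption to check against your setup.

```python
import os
import shutil
import tempfile


def backup_file(path, suffix=".bak"):
    """Copy path to path + suffix (metadata included) and return the backup path."""
    backup_path = path + suffix
    shutil.copy2(path, backup_path)
    return backup_path


# Demo on a throwaway file; with JsonLineTypes you would back up the
# .jsonl file (and presumably its .idx file) before calling compact().
demo_path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('{"id": 1, "name": "Alice"}\n')

backup_path = backup_file(demo_path)
print(os.path.basename(backup_path))  # data.jsonl.bak
```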

Limitations

  • No insert() for JLFList (not supported in append-only format)
  • Modify operations append to file (call compact() periodically)
  • Requires disk I/O (slower than in-memory dict/list)

Requirements

  • Python 3.8+
  • No external dependencies!

Testing

Run tests with pytest:

# Install dev dependencies
pip install pytest

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_jlf_dict.py

Thread Safety

JLFDict and JLFList are now thread-safe by default.

All operations are automatically protected with a reentrant lock (RLock), making them safe to use in multi-threaded environments without any additional locking.

Basic Usage

from jsonlinetypes import JLFDict, JLFList

# Thread-safe dict
d = JLFDict("data.jsonl", "id")

# All operations are automatically thread-safe
d["key1"] = {"id": "key1", "value": "v1"}
value = d["key1"]

# Thread-safe list
lst = JLFList("items.jsonl")
lst.append({"name": "Alice"})

Batch Operations (Optimized)

For better performance with multiple consecutive operations, use the context manager to lock only once:

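# Locks only once for all operations inside
# (d and lst are the objects created under Basic Usage)
with d:
    d["key1"] = {"id": "key1", "value": "v1"}
    d["key2"] = {"id": "key2", "value": "v2"}
    for i in range(100):
        d[f"key{i}"] = {"id": f"key{i}", "value": i}

with lst:
    for i in range(100):
        lst.append({"name": f"Person{i}"})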
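
Why a reentrant lock suits this pattern can be seen in a toy in-memory wrapper. The class LockedDict below is hypothetical and is not the library's code. It only shows how a with block can hold an RLock for a whole batch while each individual operation re-acquires the same lock without deadlocking.

```python
import threading


class LockedDict:
    """Toy dict wrapper showing the RLock pattern; not the library's code."""

    def __init__(self):
        self._data = {}
        self._lock = threading.RLock()

    def __setitem__(self, key, value):
        with self._lock:  # re-entrant: fine even if the caller already holds it
            self._data[key] = value

    def __getitem__(self, key):
        with self._lock:
            return self._data[key]

    def __len__(self):
        with self._lock:
            return len(self._data)

    def __enter__(self):
        self._lock.acquire()  # hold the lock across a whole batch
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # do not swallow exceptions


shared = LockedDict()


def worker(worker_id):
    with shared:  # one outer acquisition for the whole batch
        for i in range(100):
            shared[f"key_{worker_id}_{i}"] = i


threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # 500
```

With a plain threading.Lock, the nested acquisition inside __setitem__ would block forever; RLock lets the owning thread acquire the lock again, so the inner acquisitions never contend with other threads.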

Multi-threaded Example

import threading
from jsonlinetypes import JLFDict

d = JLFDict("data.jsonl", "id")

def worker(worker_id, num_items):
    for i in range(num_items):
        d[f"key_{worker_id}_{i}"] = {"id": f"key_{worker_id}_{i}", "value": i}

# Multiple threads can safely access the same JLFDict
threads = []
for i in range(5):
    t = threading.Thread(target=worker, args=(i, 100))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print(f"Total items: {len(d)}")  # 500 items, no data loss

Performance

JLFDict and JLFList have minimal performance overhead for thread safety:

  • Single operation: ~3-8% overhead (disk I/O is the main bottleneck)
  • Batch operations: ~0-3% overhead when using context manager
  • Single-threaded use: Still safe, with minimal overhead

For detailed performance analysis, see PERFORMANCE.md.

Run tests: python test_thread_safety.py

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Author

Your Name - your.email@example.com

Acknowledgments

  • Inspired by JSON Lines (jsonlines.org)
  • Built with Python's collections.abc for duck typing

Changelog

v0.1.0 (2024)

  • Initial release
  • JLFDict with full dict-like interface
  • JLFList with full list-like interface
  • Index persistence
  • Compact operation
  • Comprehensive test coverage
