Dict/List backed by JSONL files for large datasets - memory efficient alternatives to dict/list
JsonLineTypes
For Chinese documentation, see README_CN.md
Overview
JsonLineTypes provides JLFDict and JLFList classes that mimic the behavior of Python's built-in dict and list types, but store data in JSON Lines (JSONL) files on disk instead of memory. This is particularly useful when working with large datasets that don't fit in RAM.
Features
JLFDict
- ✅ Drop-in replacement for dict
- ✅ O(1) lookup, insert, update, delete
- ✅ Supports: keys(), values(), items(), iteration
- ✅ Supports: get(), pop(), clear(), update()
- ✅ Index persistence for fast startup
- ✅ Compact operation to clean up deleted records
JLFList
- ✅ Drop-in replacement for list
- ✅ O(1) append, extend, index access
- ✅ Supports: negative indices, iteration
- ✅ Supports: pop(), reverse(), clear()
- ✅ Index persistence for fast startup
- ✅ Compact operation to clean up deleted records
Installation
pip install jsonlinetypes
Or install from source:
git clone https://github.com/yourusername/jsonlinetypes.git
cd jsonlinetypes
pip install .
Quick Start
JLFDict Usage
from jsonlinetypes import JLFDict
# Create a dict-like object backed by a JSONL file
d = JLFDict("data.jsonl", "id")
# Add data (just like dict)
d[1] = {"id": 1, "name": "Alice"}
d[2] = {"id": 2, "name": "Bob"}
# Access data
print(d[1]) # Output: {'id': 1, 'name': 'Alice'}
# Iterate (just like dict)
for key, value in d.items():
    print(key, value)
# Update existing key
d[1] = {"id": 1, "name": "Alice2"}
# Delete a key
del d[2]
# Pop a key
value = d.pop(1, None) # Returns None if the key is missing
# Batch update
d.update({3: {"id": 3, "name": "Charlie"}})
# Compact to remove deleted records
d.compact()
JLFList Usage
from jsonlinetypes import JLFList
# Create a list-like object backed by a JSONL file
lst = JLFList("items.jsonl")
# Add items (just like list)
lst.append({"name": "Alice"})
# Batch add
lst.extend([{"name": "Bob"}, {"name": "Charlie"}])
# Access by index
print(lst[0]) # Output: {'name': 'Alice'}
print(lst[-1]) # Output: {'name': 'Charlie'}
# Iterate (just like list)
for item in lst:
    print(item)
# Update by index
lst[0] = {"name": "Alice2"}
# Delete by index
del lst[1]
# Pop last item
item = lst.pop()
# Reverse in place
lst.reverse()
# Compact to remove deleted records
lst.compact()
How It Works
Storage Format
Data is stored in JSON Lines format (one JSON object per line, separated by newline):
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 1, "name": "Alice2"}
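As an illustration of the format only, here is a minimal sketch that writes and reads such a file with the standard library. The helper names `append_record` and `read_records` are made up for this example and are not part of jsonlinetypes.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Yield every record in file order, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
append_record(path, {"id": 1, "name": "Alice"})
append_record(path, {"id": 2, "name": "Bob"})
append_record(path, {"id": 1, "name": "Alice2"})  # newer version of id 1

records = list(read_records(path))
print(len(records), records[-1])
```

Because writes only ever append, an older version of a record stays in the file until it is cleaned up.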
Indexing
An index file (*.jsonl.idx) is maintained using pickle for O(1) lookups:
- JLFDict: Maps keys to file offsets
- JLFList: Maps indices to file offsets
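The idea of mapping keys to file offsets can be sketched as follows. This is not the library's implementation: `build_offset_index` and `read_at` are hypothetical helpers, and persisting the index with pickle is left out.

```python
import json
import os
import tempfile


def build_offset_index(path, key_field):
    """Scan the file once; map each key to the byte offset of its latest line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            if line.strip():
                index[json.loads(line)[key_field]] = offset
    return index


def read_at(path, offset):
    """Seek straight to a known offset and decode the single line there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.readline())


path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "wb") as f:
    f.write(b'{"id": 1, "name": "Alice"}\n')
    f.write(b'{"id": 2, "name": "Bob"}\n')
    f.write(b'{"id": 1, "name": "Alice2"}\n')

index = build_offset_index(path, "id")
latest = read_at(path, index[1])
print(latest)
```

A lookup then costs one dictionary access plus one seek and one line read, independent of how many records the file holds.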
Delete/Update Strategy
When you delete or update a record:
- A deletion marker is appended to the file
- For updates, a new record is also appended
- Index is updated accordingly
- Call compact() to clean up (optional, recommended periodically)
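The strategy above can be sketched as an append-only log that is replayed and then rewritten. The `_deleted` marker field is an assumption made for this example; the actual on-disk marker used by jsonlinetypes is not documented here, and the index is not modeled.

```python
import json
import os
import tempfile

TOMBSTONE = "_deleted"  # hypothetical marker field, not the library's real format


def append_line(path, obj):
    """Append one JSON object as a new line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(obj) + "\n")


def replay(path, key_field):
    """Replay the log in order; later lines override earlier ones."""
    live = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            obj = json.loads(line)
            if obj.get(TOMBSTONE):
                live.pop(obj[key_field], None)
            else:
                live[obj[key_field]] = obj
    return live


def compact(path, key_field):
    """Rewrite the file so that it holds only the live records."""
    live = replay(path, key_field)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for obj in live.values():
            f.write(json.dumps(obj) + "\n")
    os.replace(tmp_path, path)  # swap the rewritten file into place


path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
append_line(path, {"id": 1, "name": "Alice"})
append_line(path, {"id": 2, "name": "Bob"})
append_line(path, {"id": 1, TOMBSTONE: True})    # update: marker ...
append_line(path, {"id": 1, "name": "Alice2"})   # ... then the new version
append_line(path, {"id": 2, TOMBSTONE: True})    # delete: marker only

with open(path, encoding="utf-8") as f:
    lines_before = sum(1 for _ in f)
compact(path, "id")
with open(path, encoding="utf-8") as f:
    lines_after = sum(1 for _ in f)
print(lines_before, lines_after)
```

The file grows with every update and delete, which is why periodic compaction keeps disk usage in check.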
Performance
Benchmark Results
(Tested on a typical dataset of 1,000 records)
| Operation | Time |
|---|---|
| Insert 1000 items | < 1s |
| Read 1000 items | < 0.1s |
| Update 1000 items | < 1s |
| Delete 1000 items | < 1s |
| Compact | < 0.5s |
Memory Usage
- Only index is kept in memory
- ~100 bytes per record (regardless of data size)
- 10 million records ≈ 1GB RAM
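The estimate follows directly from the per-record figure. A quick check, treating 100 bytes per record as a rough average rather than a measured constant:

```python
BYTES_PER_RECORD = 100  # approximate index cost per record, taken from the list above


def estimate_index_bytes(num_records, bytes_per_record=BYTES_PER_RECORD):
    """Rough in-memory index size for a given number of records."""
    return num_records * bytes_per_record


index_bytes = estimate_index_bytes(10_000_000)
print(f"{index_bytes / 1e9:.1f} GB")
```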
Thread Safety Performance
| Version | Performance Overhead | Use Case |
|---|---|---|
| JLFDict (unsafe) | 0% | Single-threaded applications |
| ThreadSafeJLFDict (single op) | 3-8% | Multi-threaded, few operations |
| ThreadSafeJLFDict (batch) | 0-3% | Multi-threaded, batch operations |
Key finding: the thread-safe version adds little overhead (2.8% on average) because disk I/O is the main bottleneck.
For detailed performance analysis, see PERFORMANCE.md.
Comparison with dict/list
| Feature | dict/list | JLFDict/JLFList |
|---|---|---|
| Memory | All in RAM | Index only |
| Max Size | Limited by RAM | Limited by disk |
| Read Speed | Faster | Fast (seek + read) |
| Write Speed | Faster | Fast (append + index) |
| Persistence | Manual | Automatic |
| Compact | N/A | Required periodically |
| Thread Safety | N/A | Thread-safe by default (RLock) |
API Reference
JLFDict
Constructor
JLFDict(file_path, key_field, auto_save_index=True)
- file_path: Path to JSONL file
- key_field: Field name to use as key
- auto_save_index: Automatically save index on changes
Methods
- __getitem__(key): Get value by key
- __setitem__(key, value): Set value by key
- __delitem__(key): Delete key
- get(key, default=None): Get with default
- keys(): Get keys view
- values(): Get values view
- items(): Get items view
- pop(key, default): Pop and return value
- update(other): Update with dict/iterable
- clear(): Clear all data
- compact(): Clean up deleted records
JLFList
Constructor
JLFList(file_path, auto_save_index=True)
- file_path: Path to JSONL file
- auto_save_index: Automatically save index on changes
Methods
- __getitem__(index): Get value by index
- __setitem__(index, value): Set value by index
- __delitem__(index): Delete by index
- append(value): Append value
- extend(values): Extend with iterable
- pop(index): Pop and return value
- reverse(): Reverse in place
- clear(): Clear all data
- compact(): Clean up deleted records
Best Practices
- Use compact() periodically - Call after many delete/update operations
- Choose appropriate keys - Use unique, stable keys for JLFDict
- Batch operations - Use update()/extend() for better performance
- Monitor file size - Large files still benefit from compaction
- Backup before compact - Compact rewrites the file
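One way to follow the backup advice is to copy the data file, and its index file if present, before compacting. The helper `backup_before_compact` below is a hypothetical sketch that assumes the index lives next to the data file with an `.idx` suffix, matching the `*.jsonl.idx` naming described under Indexing.

```python
import os
import shutil
import tempfile


def backup_before_compact(data_path, suffix=".bak"):
    """Copy the data file (and its .idx file, if present) next to the original."""
    copies = []
    for src in (data_path, data_path + ".idx"):
        if os.path.exists(src):
            dst = src + suffix
            shutil.copy2(src, dst)  # copies contents and file metadata
            copies.append(dst)
    return copies


# Demonstration on a throwaway file. With the library, the call to
# d.compact() would come right after the backup succeeds.
data_path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(data_path, "w", encoding="utf-8") as f:
    f.write('{"id": 1, "name": "Alice"}\n')

copies = backup_before_compact(data_path)
print(copies)
```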
Limitations
- No insert() for JLFList (not supported in append-only format)
- Modify operations append to file (call compact() periodically)
- Requires disk I/O (slower than in-memory dict/list)
Requirements
- Python 3.8+
- No external dependencies!
Testing
Run tests with pytest:
# Install dev dependencies
pip install pytest
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_jlf_dict.py
Thread Safety
JLFDict and JLFList are now thread-safe by default.
All operations are automatically protected with a reentrant lock (RLock), making them safe to use in multi-threaded environments without any additional locking.
Basic Usage
from jsonlinetypes import JLFDict, JLFList
# Thread-safe dict
d = JLFDict("data.jsonl", "id")
# All operations are automatically thread-safe
d["key1"] = {"id": "key1", "value": "v1"}
value = d["key1"]
# Thread-safe list
lst = JLFList("items.jsonl")
lst.append({"name": "Alice"})
Batch Operations (Optimized)
For better performance with multiple consecutive operations, use the context manager to lock only once:
# Locks only once for all operations inside
with d:
    d["key1"] = {"id": "key1", "value": "v1"}
    d["key2"] = {"id": "key2", "value": "v2"}
    for i in range(100):
        d[f"key{i}"] = {"id": f"key{i}", "value": i}
with lst:
    for i in range(100):
        lst.append({"name": f"Person{i}"})
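To show why holding the lock in a `with` block and then calling locked methods inside it does not deadlock, here is a toy in-memory sketch of the reentrant-lock pattern. `LockedStore` is a hypothetical class written for this example; it is not how JLFDict is implemented internally.

```python
import threading


class LockedStore:
    """Toy in-memory store showing the RLock pattern (hypothetical class)."""

    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # never swallow exceptions

    def __setitem__(self, key, value):
        # An RLock may be re-acquired by the thread that already holds it,
        # so this works both inside and outside a "with store:" block.
        with self._lock:
            self._data[key] = value

    def __len__(self):
        with self._lock:
            return len(self._data)


store = LockedStore()


def worker(worker_id):
    with store:  # hold the lock across the whole batch
        for i in range(100):
            store[f"key_{worker_id}_{i}"] = i


threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = len(store)
print(total)
```

With a plain `threading.Lock`, the nested acquire inside `__setitem__` would block forever, which is why a reentrant lock suits this design.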
Multi-threaded Example
import threading
from jsonlinetypes import JLFDict
d = JLFDict("data.jsonl", "id")
def worker(worker_id, num_items):
    for i in range(num_items):
        d[f"key_{worker_id}_{i}"] = {"id": f"key_{worker_id}_{i}", "value": i}
# Multiple threads can safely access the same JLFDict
threads = []
for i in range(5):
    t = threading.Thread(target=worker, args=(i, 100))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(f"Total items: {len(d)}") # 500 items, no data loss
Performance
JLFDict and JLFList have minimal performance overhead for thread safety:
- Single operation: ~3-8% overhead (disk I/O is the main bottleneck)
- Batch operations: ~0-3% overhead when using context manager
- Single-threaded use: Still safe, with minimal overhead
For detailed performance analysis, see PERFORMANCE.md.
Run tests: python test_thread_safety.py
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Author
Your Name - your.email@example.com
Acknowledgments
- Inspired by JSON Lines (jsonlines.org)
- Built with Python's collections.abc for duck typing
See Also
- COMPARISON.md - Compare with similar libraries (shelve, tinydb, pandas, etc.)
- USABILITY.md - Usability comparison and ease of use analysis
- INDEX_RECOVERY.md - Index corruption recovery and data restoration
- THREAD_SAFETY.md - Thread safety guide
- PERFORMANCE.md - Performance comparison and benchmarks
- memory_demo.py - Run memory usage demonstration
- usability_demo.py - Run usability comparison demonstration
- benchmark_safety.py - Run performance benchmarks
- jsonlines - JSON Lines specification
- pandas - For data analysis (memory-efficient modes)
Changelog
v0.1.0 (2024)
- Initial release
- JLFDict with full dict-like interface
- JLFList with full list-like interface
- Index persistence
- Compact operation
- Comprehensive test coverage