High-performance File System-backed Python Collections (FSList, FSDict, FSSet)
Project description
fscollections
High-performance, file-backed Python collections (FSList, FSDict, FSSet) that seamlessly mimic the Python standard library interfaces (MutableSequence, MutableMapping, MutableSet) and raise identical exception patterns. By persisting records sequentially in the file system and managing a compact memory-mapped/indexed pointer cache, fscollections scales your data structures far beyond system RAM limits.
THIS PROJECT HAS BEEN VIBE-CODED, USE AT YOUR OWN RISK!
1. Key Production-Grade Features
- Standard Library Parity: Perfect API compatibility with
list,dict, andset, throwing standard exceptions (IndexError,KeyError,TypeErrorfor non-string dictionary keys). - Temporal & Self-Cleaning Collections: Allows initializing temporary, in-memory-like persistent structures without a context manager by omitting the
base_pathparameter, automatically deleting backing storage files upon garbage collection (__del__) or explicit.destroy()invocation. - Ultra-Low Memory Footprint: Uses flat 64-bit pointer arrays (
array('Q')) and slot-mapping indirection to eliminate Python object wrapper overhead (reducing RAM usage by 3.5x). - Deeply Nested Collections: Supports dynamic nested structures (
FSList,FSDict,FSSet) through Virtual Pointer Indirection, achieving $O(1)$ RAM utilization and localized thread-safe writes. - Non-Blocking Native Asynchronous Engine: Asynchronous counterparts (
AsyncFSList,AsyncFSDict,AsyncFSSet) utilizeaiofilesand cooperative reentrant asynchronous locking to execute file I/O operations concurrently without blocking the event loop. - Offset-Stable Pluggable Cache Layer: Incorporates LRU, FIFO, and LFU in-memory caches utilizing immutable physical record offsets as cache keys, making the cache immune to list index shifts and dictionary updates.
- Predicate Pushdown Scans: Enables querying collections with a flat $O(1)$ memory footprint by sequentially matching record deserialization predicates directly on disk.
- Exposed System & Storage Telemetry: A detailed
.metricsattribute providing dynamic fragmentation ratios, wasted space calculations, physical file sizes, and clean-shutdown detection on demand with zero additional disk writes. - Coalesced Disk Reads: Intelligently identifies physically contiguous runs of records during range slicing in
FSList, merging up to 95% of individual seeks into single high-throughput reads. - Swap-and-Pop Auto-Compaction: Offers instant $O(1)$ space reclamation on deletion for fixed-size records, bypassing expensive file rewrites.
- Coordinated Layered Concurrency Control: Reentrant thread locking (
threading.RLock) coupled with UNIX-level advisory locking (fcntl.flock) ensures multi-thread and multi-process safety, with a graceful thread-lock fallback on Windows. - Crash-Resilient Durability: Features self-healing indexes using an end-of-file Clean-Close marker (
0xAA) to automatically reconstruct metadata directly from raw data payloads on startup if an unclean close is detected. - Strict Type Safety: Homogeneous primitive collections enforce strict type validation at runtime to prevent silent serialization bugs, with no disk schema file clutter.
2. Architectural Design & DDD Breakdown
The codebase follows a rigorous Domain-Driven Design (DDD) directory structure. Core business rules, data models, and serialization strategies are fully isolated from storage mechanisms and low-level file I/O operations.
fscollections/
│
├── domain/ # Domain Layer (Core Business Rules & Interfaces)
│ ├── __init__.py # Package domain-level exports
│ ├── types.py # Registry, Primitive & Struct Serializers
│ ├── interfaces.py # Storage and index interfaces (IRecordStore, IIndexStore)
│ ├── collections.py # FSList, FSDict, FSSet implementing abstract base classes (ABCs)
│ └── async_collections.py # Asynchronous collection decorator proxies (AsyncFSList, etc.)
│
├── infrastructure/ # Infrastructure Layer (I/O, File Storage, Locking, Caching)
│ ├── __init__.py
│ ├── record_store.py # Low-level sequential binary data store (.data.fscollection)
│ ├── index_store.py # Flat binary index store (.index.fscollection) with clean-close byte
│ ├── async_record_store.py # Non-blocking async record store wrapper
│ ├── async_index_store.py # Non-blocking async index store wrapper
│ ├── cache.py # Pluggable memory caching system (LRU, FIFO, LFU)
│ └── locks.py # Process and thread coordinated advisory locking utility
│
├── application/ # Application/Wiring Layer
│ ├── __init__.py
│ ├── factory.py # Dependency Injection Container for sync collection creation
│ └── async_factory.py # Dependency Injection Container for async collection creation
│
└── tests/ # Complete Test Suite
├── __init__.py
├── unit/ # 100% Mock-based Unit Tests (mocking immediate lower level)
└── integration/ # Physical file-system Integration Tests
Decoupled Interfaces & Dependency Injection
Collections rely entirely on abstract interfaces defined in domain/interfaces.py. The application/factory.py and application/async_factory.py act as wiring containers that instantiate the physical infrastructure stores (RecordStore, IndexStore) and inject them into the collections, allowing seamless mock testing of the collections without running expensive file-system writes.
3. Serialization and File Formats
Each collection is defined by a base path (e.g., /path/to/my_collection), creating up to two specialized files:
A. Index File (.index.fscollection)
Required for fast lookups in FSDict and FSSet.
- Index Entry Layout:
+------------------------+---------------------------+-----------------------+---------------------+ | Key Length L (4 bytes) | UTF-8 String Key (L bytes) | DataOffset (8 bytes) | DataSize (8 bytes) | | Big-Endian uint32 '>I' | Raw UTF-8 bytes | Big-Endian uint64 '>Q'| Big-Endian uint64 '>Q'| +------------------------+---------------------------+-----------------------+---------------------+ - Clean-Close Marker: The final byte of a cleanly closed index file is written as
0xAA. If a crash occurs and this byte is absent, the library triggers self-healing verification on load.
B. Data File (.data.fscollection)
Sequentially appends serialized headers and payloads.
- Header Structure (10 bytes, fixed size):
Type_ID(1 byte)Deleted_Flag(1 byte:0x00= Active,0x01= Deleted)Payload_Size(8 bytes, Big-Endian uint64>Q)
- Payload Structure:
[Raw Serialized Payload Bytes]- For
FSDictandFSSetdisaster recovery, the payload contains the key prefixed inside the payload:[4 bytes: Key Length K][K bytes: UTF-8 Key][Serialized Value Bytes].
4. Performance & Structural Complexity
Let $N$ be the number of items currently in the collection, $M$ be the size of active pointers in RAM, and $S$ be the size of the record payload.
| Collection Type | Operation | Standard Library (In-Memory) | fscollections (File-Backed) |
Disk I/O & Execution Details |
|---|---|---|---|---|
| List / FSList | list[idx] |
$O(1)$ | $O(1)$ | 1 Random Seek + Read from .data |
list[idx] = val |
$O(1)$ | $O(1)$ | 1 Append-Write + 1 Seek/Write to delete old record | |
list.append(val) |
$O(1)$ | $O(1)$ | 1 Append-Write to .data |
|
list.insert(idx, val) |
$O(N)$ | $O(N)$ pointers | 1 Append-Write + pointer shift in RAM | |
list.pop(idx) |
$O(N)$ | $O(N)$ pointers | 1 Random Seek + Read + 1 Seek/Write to mark deleted | |
len(list) |
$O(1)$ | $O(1)$ | None (Read from in-memory pointer array) | |
| Dict / FSDict | dict[key] |
$O(1)$ avg | $O(1)$ avg | 1 Random Seek + Read from .data |
dict[key] = val |
$O(1)$ avg | $O(1)$ avg | 1 Append-Write + 1 Seek/Write to delete old record | |
key in dict |
$O(1)$ avg | $O(1)$ avg | None (Lookup via in-memory index hash map) | |
del dict[key] |
$O(1)$ avg | $O(1)$ avg | 1 Seek/Write to mark record deleted | |
len(dict) |
$O(1)$ avg | $O(1)$ | None (Read from in-memory index size) | |
| Set / FSSet | set.add(val) |
$O(1)$ avg | $O(1)$ avg | None if already exists; 1 Append-Write if new |
val in set |
$O(1)$ avg | $O(1)$ avg | None (Lookup via in-memory serialized-item hash map) | |
set.remove(val) |
$O(1)$ avg | $O(1)$ avg | 1 Seek/Write to mark record deleted | |
len(set) |
$O(1)$ avg | $O(1)$ | None (Read from in-memory index size) |
5. Detailed Feature Coverage & Examples
The following functional examples illustrate the usage of all advanced capabilities implemented within the library.
A. Temporal & Self-Cleaning Collections
When initializing a collection without an explicit base_path, fscollections automatically configures a unique temporal database inside the OS standard /tmp directory. Backup storage files are deleted automatically when the collection variable goes out of scope and is garbage-collected by the Python runtime. Alternatively, storage can be reclaimed instantly via .destroy(). Subsequent operations on closed or destroyed collections proactively raise a clean ValueError.
from fscollections import fslist
# Allocates temporal storage files under /tmp/fscollection_...
temp_list = fslist(klass='string')
temp_list.append('hello_temporal')
# Reclaim space and delete backing files instantly
temp_list.destroy()
try:
# Post-destruction operations raise a standard exception
val = temp_list[0]
except ValueError as error:
print(f'Captured expected failure: {error}')
# Output: Captured expected failure: I/O operation on closed file.
B. Extended Serialization Layer
fscollections runs purely homogeneous collections with high-fidelity native data types, compile-time resolution, and runtime type-safety.
The supported types can be declared via type objects (e.g. int, str, bool, float, bytes, dict, list, tuple) or explicit string type identifiers:
| Declared Type (Object) | Resolved Type (String) | Internal Encoding Format |
|---|---|---|
| — | 'int32' |
Big-Endian 32-bit integer |
int |
'int64' |
Big-Endian 64-bit integer |
| — | 'int128' |
Big-Endian 128-bit integer |
str |
'string' |
Length-prefixed UTF-8 string |
bool |
'bool' |
1-byte boolean representation |
float |
'float' |
Big-Endian double-precision float |
bytes |
'bytes' |
Length-prefixed raw bytes buffer |
dict, list, tuple |
'json' |
UTF-8 encoded compact JSON |
from fscollections import fsdict
# Enforces double-precision float value mappings
float_dict = fsdict('my_floats', klass=float)
float_dict['pi'] = 3.14159
try:
# Enforces absolute run-time type safety
float_dict['pi'] = 'three point one four'
except TypeError as error:
print(f'Linter type block: {error}')
# Output: Linter type block: Expected float for float, got str
C. Deeply Nested Collections (Virtual Pointer Indirection)
To completely bypass monolithic RAM scaling and write-amplification bottlenecks, nested collection instances (FSList, FSDict, FSSet) are serialized inside the parent collection as a lightweight reference URI string: 'ref://fscollection:<collection_type>?path=<base_path>'. Children are lazily materialized on-the-fly when requested, and any subsequent child mutations write atomically directly to the child's files.
from fscollections import fsdict, fslist
# Create parent heterogeneous collection (klass=None)
with fsdict('parent_store') as parent_dict:
# Create independent child collection
child_list = fslist('child_store')
child_list.append('initial_item')
# Store child inside parent: serializes as reference URI
parent_dict['nested_list'] = child_list
# Mutation writes directly and atomically to child's files
parent_dict['nested_list'].append('mutated_item')
# Lazily reloaded from storage
with fsdict('parent_store') as parent_dict:
# Child collection materializes on-the-fly
retrieved_list = parent_dict['nested_list']
print(retrieved_list[1]) # Output: mutated_item
D. Native Asynchronous Engine
For non-blocking asyncio services, async_fslist(), async_fsdict(), and async_fsset() provide native asynchronous APIs. All disk activities run asynchronously via aiofiles and reentrant asynchronous locking, ensuring high-throughput operation concurrency.
import asyncio
from fscollections import async_fsdict, async_fslist
async def run_async_pipeline() -> None:
async with async_fsdict('async_parent') as parent_dict:
child_list = async_fslist('async_child')
await child_list.append('async_payload')
await parent_dict.__setitem__('child_list', child_list)
retrieved_list = await parent_dict['child_list']
print(await retrieved_list[0]) # Output: async_payload
# Query async metrics
stats = parent_dict.metrics
print(f"Total Bytes: {stats['total_disk_usage_bytes']}")
asyncio.run(run_async_pipeline())
E. Offset-Stable Cache Layer
To eliminate disk read latency, collections support a pluggable cache (cache_type set to 'lru', 'fifo', or 'lfu', configured via cache_size). Pointers are cached using the immutable physical record offset in the .data file as the cache key, which is entirely immune to list insertions, dictionary reorderings, or index shifts.
from fscollections import fsdict
# Enables stable LFU cache tracking
cached_dict = fsdict('cached_db', klass='string', cache_type='lfu', cache_size=256)
cached_dict['user_id'] = 'user_1234'
# First read: seeks and reads physical file, caches at immutable offset
user_val = cached_dict['user_id']
# Second read: 100% cache hit, zero file-system I/O overhead!
user_val = cached_dict['user_id']
F. Predicate Pushdown Scans
Instead of loading entire databases into system memory to execute filters, collections support memory-efficient pushdown scan loops. Elements are deserialized and matched sequentially directly on disk, maintaining an $O(1)$ memory footing regardless of database scale.
- FSDict: The predicate takes
(key, val)and yields(key, val)tuples. - FSList / FSSet: The predicate takes
(val)and yields(val)elements.
import asyncio
from fscollections import fsdict, async_fsdict
# Synchronous pushdown scan
sync_dict = fsdict('search_db', klass='int64')
sync_dict['alpha'] = 50
sync_dict['beta'] = 150
# Finds elements on disk, keeping RAM consumption flat
sync_results = list(sync_dict.scan(lambda key, val: val > 100))
print(sync_results) # Output: [('beta', 150)]
# Asynchronous pushdown scan
async def run_async_scan() -> None:
async with async_fsdict('async_search_db', klass='int64') as async_dict:
await async_dict.__setitem__('gamma', 200)
# Async pushdown scan yields elements lazily using AsyncIterator
async for key, val in async_dict.scan(lambda key, val: val > 150):
print(f'Async Match: {key} -> {val}') # Output: Async Match: gamma -> 200
asyncio.run(run_async_scan())
G. Threshold-Driven Compaction
Deletions and updates leave physical fragments ("holes") in the sequential storage. Exposing a compaction_threshold float ratio (e.g. 0.25 for 25% fragmentation) automates background in-place storage compaction whenever fragmentation metrics cross the configured limit. Manual compaction can also be invoked at any time via .compact().
from fscollections import fsdict
# Triggers sequential in-place compaction when fragmentation ratio reaches 20%
compacting_dict = fsdict('metrics_db', klass='string', compaction_threshold=0.2)
compacting_dict['role'] = 'developer'
compacting_dict['role'] = 'senior_developer' # Leaves 1 dead record slot
# Fragmentation ratio checks can trigger automatic or explicit compaction
compacting_dict.compact() # Forces immediate storage cleanup and physical truncation
H. Swap-and-Pop Auto-Compaction
For homogeneous, fixed-size primitive arrays, setting auto_compact=True enables instant $O(1)$ physical space reclamation during deletions. Instead of marking the slot as dead or triggering sequential rewrites, it moves the last active physical record directly into the deleted record's offset and truncates the file tail.
from fscollections import fslist
# Fixed-size list structure with pop-and-swap auto compaction
fast_list = fslist('fast_list', klass='int64', auto_compact=True)
fast_list.append(100)
fast_list.append(200)
fast_list.append(300)
# Pop element 1: copies '300' into element 1's offset and truncates file in O(1)
fast_list.pop(1)
print(fast_list[1]) # Output: 300
I. Contiguous Read Coalescing
For sequential range operations in FSList (e.g., slicing my_list[100:1000]), the underlying record store coalesces physically contiguous storage segments. Multiple targets are read sequentially in a single contiguous disk system call, yielding up to a 95% reduction in OS read seek context switches.
from fscollections import fslist
# Range slicing executes automatic read coalescing
sequential_list = fslist('range_list', klass='int32')
for value in range(1000):
sequential_list.append(value)
# Triggers a single coalesced continuous disk read instead of 100 separate seeks!
subset = sequential_list[100:200]
print(len(subset)) # Output: 100
6. Premium Examples & Applications
The library features six premium, production-grade applications located in the examples/ directory. Each example is thoroughly styled, static type checked, and rigorously verified:
- Binary Decision Diagram (BDD): Implements a canonical reduced ordered binary decision diagram (ROBDD) reduction engine, logic SAT solvers, custom DIMACS CNF parsers, and Graphviz DOT graph exports backed by
fsdict. - NoSQL Document Database: Utilizes Virtual Pointer Indirection (
ref://fscollection...) to orchestrate complex documents, nested indexes, dynamic references, and lock-free collection materializations. - Persistent Job Queue: Builds a resilient Multi-Producer Multi-Consumer (MPMC) worker task queue featuring concurrent process-safe dequeues, status tracking, worker lease locking, and self-cleaning temporal structures.
- Log / Vector Search Engine: Demonstrates the performance of Predicate Pushdown Scans (
fsdict.scan) by executing textual lookups and vector cosine similarity queries with a flat $O(1)$ memory profile. - Persistent LFU Cache: Develops a size-bounded, high-throughput Least Frequently Used (LFU) key-value cache engine with stable tie-breaking and immediate telemetry metrics.
- TODO Manager CLI: A command-line utility showcasing atomic sync additions, positional task reordering, state updates, and clean Temporal collections.
To inspect the examples in detail or review the suite execution details, refer to the examples/README.md walkthrough.
7. Verification and Test Execution
The library is backed by an extensive mock-based unit and physical filesystem integration test suite, fully conforming to strict PEP 484 type hints, formatting, and linting rules.
# Activate virtual environment
source .venv/bin/activate
# Format code
black fscollections
# Run linter checks
ruff check fscollections
# Run strict static type checks
mypy fscollections
# Execute all tests
python -m unittest discover -s fscollections/tests -v
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fscollections-0.0.1.tar.gz.
File metadata
- Download URL: fscollections-0.0.1.tar.gz
- Upload date:
- Size: 62.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c7e7507dd47f4697df02f26fc958772f7c0f39f333abe60b0d32d1a3b93dab3f
|
|
| MD5 |
1ce12094614b958b0bb042959b489c1c
|
|
| BLAKE2b-256 |
e23cb0edd55af2ed0d037a80a9ccc7684b66b0d463e992c526c1f9371f2feac1
|
Provenance
The following attestation bundles were made for fscollections-0.0.1.tar.gz:
Publisher:
publish_on_pypi.yml on diegojromerolopez/fscollections
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
fscollections-0.0.1.tar.gz -
Subject digest:
c7e7507dd47f4697df02f26fc958772f7c0f39f333abe60b0d32d1a3b93dab3f - Sigstore transparency entry: 1624705514
- Sigstore integration time:
-
Permalink:
diegojromerolopez/fscollections@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff -
Branch / Tag:
refs/tags/0.0.1 - Owner: https://github.com/diegojromerolopez
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish_on_pypi.yml@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff -
Trigger Event:
release
-
Statement type:
File details
Details for the file fscollections-0.0.1-py3-none-any.whl.
File metadata
- Download URL: fscollections-0.0.1-py3-none-any.whl
- Upload date:
- Size: 75.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d8b0366652204ab3a8e8e060e8481e11bf79831d97b588b0653a1165a14e2ed6
|
|
| MD5 |
5408450666e2c431992421a1d353bbb9
|
|
| BLAKE2b-256 |
fa8358fa884b5e75f1e27ecaa08a4382d59ade9eb106a9008674f254a7875ee6
|
Provenance
The following attestation bundles were made for fscollections-0.0.1-py3-none-any.whl:
Publisher:
publish_on_pypi.yml on diegojromerolopez/fscollections
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
fscollections-0.0.1-py3-none-any.whl -
Subject digest:
d8b0366652204ab3a8e8e060e8481e11bf79831d97b588b0653a1165a14e2ed6 - Sigstore transparency entry: 1624705521
- Sigstore integration time:
-
Permalink:
diegojromerolopez/fscollections@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff -
Branch / Tag:
refs/tags/0.0.1 - Owner: https://github.com/diegojromerolopez
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish_on_pypi.yml@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff -
Trigger Event:
release
-
Statement type: