Skip to main content

High-performance File System-backed Python Collections (FSList, FSDict, FSSet)

Project description

fscollections

Python Version Code style: black Linter: ruff Typing: mypy License: MIT

High-performance, file-backed Python collections (FSList, FSDict, FSSet) that seamlessly mimic the Python standard library interfaces (MutableSequence, MutableMapping, MutableSet) and raise identical exception patterns. By persisting records sequentially in the file system and managing a compact memory-mapped/indexed pointer cache, fscollections scales your data structures far beyond system RAM limits.

THIS PROJECT HAS BEEN VIBE-CODED, USE AT YOUR OWN RISK!


1. Key Production-Grade Features

  • Standard Library Parity: Perfect API compatibility with list, dict, and set, throwing standard exceptions (IndexError, KeyError, TypeError for non-string dictionary keys).
  • Temporal & Self-Cleaning Collections: Allows initializing temporary, in-memory-like persistent structures without a context manager by omitting the base_path parameter, automatically deleting backing storage files upon garbage collection (__del__) or explicit .destroy() invocation.
  • Ultra-Low Memory Footprint: Uses flat 64-bit pointer arrays (array('Q')) and slot-mapping indirection to eliminate Python object wrapper overhead (reducing RAM usage by 3.5x).
  • Deeply Nested Collections: Supports dynamic nested structures (FSList, FSDict, FSSet) through Virtual Pointer Indirection, achieving $O(1)$ RAM utilization and localized thread-safe writes.
  • Non-Blocking Native Asynchronous Engine: Asynchronous counterparts (AsyncFSList, AsyncFSDict, AsyncFSSet) utilize aiofiles and cooperative reentrant asynchronous locking to execute file I/O operations concurrently without blocking the event loop.
  • Offset-Stable Pluggable Cache Layer: Incorporates LRU, FIFO, and LFU in-memory caches utilizing immutable physical record offsets as cache keys, making the cache immune to list index shifts and dictionary updates.
  • Predicate Pushdown Scans: Enables querying collections with a flat $O(1)$ memory footprint by sequentially matching record deserialization predicates directly on disk.
  • Exposed System & Storage Telemetry: A detailed .metrics attribute providing dynamic fragmentation ratios, wasted space calculations, physical file sizes, and clean-shutdown detection on demand with zero additional disk writes.
  • Coalesced Disk Reads: Intelligently identifies physically contiguous runs of records during range slicing in FSList, merging up to 95% of individual seeks into single high-throughput reads.
  • Swap-and-Pop Auto-Compaction: Offers instant $O(1)$ space reclamation on deletion for fixed-size records, bypassing expensive file rewrites.
  • Coordinated Layered Concurrency Control: Reentrant thread locking (threading.RLock) coupled with UNIX-level advisory locking (fcntl.flock) ensures multi-thread and multi-process safety, with a graceful thread-lock fallback on Windows.
  • Crash-Resilient Durability: Features self-healing indexes using an end-of-file Clean-Close marker (0xAA) to automatically reconstruct metadata directly from raw data payloads on startup if an unclean close is detected.
  • Strict Type Safety: Homogeneous primitive collections enforce strict type validation at runtime to prevent silent serialization bugs, with no disk schema file clutter.

2. Architectural Design & DDD Breakdown

The codebase follows a rigorous Domain-Driven Design (DDD) directory structure. Core business rules, data models, and serialization strategies are fully isolated from storage mechanisms and low-level file I/O operations.

fscollections/
│
├── domain/                      # Domain Layer (Core Business Rules & Interfaces)
│   ├── __init__.py              # Package domain-level exports
│   ├── types.py                 # Registry, Primitive & Struct Serializers
│   ├── interfaces.py            # Storage and index interfaces (IRecordStore, IIndexStore)
│   ├── collections.py           # FSList, FSDict, FSSet implementing abstract base classes (ABCs)
│   └── async_collections.py     # Asynchronous collection decorator proxies (AsyncFSList, etc.)
│
├── infrastructure/              # Infrastructure Layer (I/O, File Storage, Locking, Caching)
│   ├── __init__.py
│   ├── record_store.py          # Low-level sequential binary data store (.data.fscollection)
│   ├── index_store.py           # Flat binary index store (.index.fscollection) with clean-close byte
│   ├── async_record_store.py    # Non-blocking async record store wrapper
│   ├── async_index_store.py     # Non-blocking async index store wrapper
│   ├── cache.py                 # Pluggable memory caching system (LRU, FIFO, LFU)
│   └── locks.py                 # Process and thread coordinated advisory locking utility
│
├── application/                 # Application/Wiring Layer
│   ├── __init__.py
│   ├── factory.py               # Dependency Injection Container for sync collection creation
│   └── async_factory.py         # Dependency Injection Container for async collection creation
│
└── tests/                       # Complete Test Suite
    ├── __init__.py
    ├── unit/                    # 100% Mock-based Unit Tests (mocking immediate lower level)
    └── integration/             # Physical file-system Integration Tests

Decoupled Interfaces & Dependency Injection

Collections rely entirely on abstract interfaces defined in domain/interfaces.py. The application/factory.py and application/async_factory.py act as wiring containers that instantiate the physical infrastructure stores (RecordStore, IndexStore) and inject them into the collections, allowing seamless mock testing of the collections without running expensive file-system writes.


3. Serialization and File Formats

Each collection is defined by a base path (e.g., /path/to/my_collection), creating up to two specialized files:

A. Index File (.index.fscollection)

Required for fast lookups in FSDict and FSSet.

  • Index Entry Layout:
    +------------------------+---------------------------+-----------------------+---------------------+
    | Key Length L (4 bytes) | UTF-8 String Key (L bytes) | DataOffset (8 bytes)  | DataSize (8 bytes)  |
    | Big-Endian uint32 '>I' | Raw UTF-8 bytes           | Big-Endian uint64 '>Q'| Big-Endian uint64 '>Q'|
    +------------------------+---------------------------+-----------------------+---------------------+
    
  • Clean-Close Marker: The final byte of a cleanly closed index file is written as 0xAA. If a crash occurs and this byte is absent, the library triggers self-healing verification on load.

B. Data File (.data.fscollection)

Sequentially appends serialized headers and payloads.

  • Header Structure (10 bytes, fixed size):
    • Type_ID (1 byte)
    • Deleted_Flag (1 byte: 0x00 = Active, 0x01 = Deleted)
    • Payload_Size (8 bytes, Big-Endian uint64 >Q)
  • Payload Structure:
    • [Raw Serialized Payload Bytes]
    • For FSDict and FSSet disaster recovery, the payload contains the key prefixed inside the payload: [4 bytes: Key Length K][K bytes: UTF-8 Key][Serialized Value Bytes].

4. Performance & Structural Complexity

Let $N$ be the number of items currently in the collection, $M$ be the size of active pointers in RAM, and $S$ be the size of the record payload.

Collection Type Operation Standard Library (In-Memory) fscollections (File-Backed) Disk I/O & Execution Details
List / FSList list[idx] $O(1)$ $O(1)$ 1 Random Seek + Read from .data
list[idx] = val $O(1)$ $O(1)$ 1 Append-Write + 1 Seek/Write to delete old record
list.append(val) $O(1)$ $O(1)$ 1 Append-Write to .data
list.insert(idx, val) $O(N)$ $O(N)$ pointers 1 Append-Write + pointer shift in RAM
list.pop(idx) $O(N)$ $O(N)$ pointers 1 Random Seek + Read + 1 Seek/Write to mark deleted
len(list) $O(1)$ $O(1)$ None (Read from in-memory pointer array)
Dict / FSDict dict[key] $O(1)$ avg $O(1)$ avg 1 Random Seek + Read from .data
dict[key] = val $O(1)$ avg $O(1)$ avg 1 Append-Write + 1 Seek/Write to delete old record
key in dict $O(1)$ avg $O(1)$ avg None (Lookup via in-memory index hash map)
del dict[key] $O(1)$ avg $O(1)$ avg 1 Seek/Write to mark record deleted
len(dict) $O(1)$ avg $O(1)$ None (Read from in-memory index size)
Set / FSSet set.add(val) $O(1)$ avg $O(1)$ avg None if already exists; 1 Append-Write if new
val in set $O(1)$ avg $O(1)$ avg None (Lookup via in-memory serialized-item hash map)
set.remove(val) $O(1)$ avg $O(1)$ avg 1 Seek/Write to mark record deleted
len(set) $O(1)$ avg $O(1)$ None (Read from in-memory index size)

5. Detailed Feature Coverage & Examples

The following functional examples illustrate the usage of all advanced capabilities implemented within the library.

A. Temporal & Self-Cleaning Collections

When initializing a collection without an explicit base_path, fscollections automatically configures a unique temporal database inside the OS standard /tmp directory. Backup storage files are deleted automatically when the collection variable goes out of scope and is garbage-collected by the Python runtime. Alternatively, storage can be reclaimed instantly via .destroy(). Subsequent operations on closed or destroyed collections proactively raise a clean ValueError.

from fscollections import fslist

# Allocates temporal storage files under /tmp/fscollection_...
temp_list = fslist(klass='string')
temp_list.append('hello_temporal')

# Reclaim space and delete backing files instantly
temp_list.destroy()

try:
    # Post-destruction operations raise a standard exception
    val = temp_list[0]
except ValueError as error:
    print(f'Captured expected failure: {error}')
    # Output: Captured expected failure: I/O operation on closed file.

B. Extended Serialization Layer

fscollections runs purely homogeneous collections with high-fidelity native data types, compile-time resolution, and runtime type-safety. The supported types can be declared via type objects (e.g. int, str, bool, float, bytes, dict, list, tuple) or explicit string type identifiers:

Declared Type (Object) Resolved Type (String) Internal Encoding Format
'int32' Big-Endian 32-bit integer
int 'int64' Big-Endian 64-bit integer
'int128' Big-Endian 128-bit integer
str 'string' Length-prefixed UTF-8 string
bool 'bool' 1-byte boolean representation
float 'float' Big-Endian double-precision float
bytes 'bytes' Length-prefixed raw bytes buffer
dict, list, tuple 'json' UTF-8 encoded compact JSON
from fscollections import fsdict

# Enforces double-precision float value mappings
float_dict = fsdict('my_floats', klass=float)
float_dict['pi'] = 3.14159

try:
    # Enforces absolute run-time type safety
    float_dict['pi'] = 'three point one four'
except TypeError as error:
    print(f'Linter type block: {error}')
    # Output: Linter type block: Expected float for float, got str

C. Deeply Nested Collections (Virtual Pointer Indirection)

To completely bypass monolithic RAM scaling and write-amplification bottlenecks, nested collection instances (FSList, FSDict, FSSet) are serialized inside the parent collection as a lightweight reference URI string: 'ref://fscollection:<collection_type>?path=<base_path>'. Children are lazily materialized on-the-fly when requested, and any subsequent child mutations write atomically directly to the child's files.

from fscollections import fsdict, fslist

# Create parent heterogeneous collection (klass=None)
with fsdict('parent_store') as parent_dict:
    # Create independent child collection
    child_list = fslist('child_store')
    child_list.append('initial_item')
    
    # Store child inside parent: serializes as reference URI
    parent_dict['nested_list'] = child_list
    
    # Mutation writes directly and atomically to child's files
    parent_dict['nested_list'].append('mutated_item')

# Lazily reloaded from storage
with fsdict('parent_store') as parent_dict:
    # Child collection materializes on-the-fly
    retrieved_list = parent_dict['nested_list']
    print(retrieved_list[1])  # Output: mutated_item

D. Native Asynchronous Engine

For non-blocking asyncio services, async_fslist(), async_fsdict(), and async_fsset() provide native asynchronous APIs. All disk activities run asynchronously via aiofiles and reentrant asynchronous locking, ensuring high-throughput operation concurrency.

import asyncio
from fscollections import async_fsdict, async_fslist

async def run_async_pipeline() -> None:
    async with async_fsdict('async_parent') as parent_dict:
        child_list = async_fslist('async_child')
        await child_list.append('async_payload')
        
        await parent_dict.__setitem__('child_list', child_list)
        
        retrieved_list = await parent_dict['child_list']
        print(await retrieved_list[0])  # Output: async_payload
        
        # Query async metrics
        stats = parent_dict.metrics
        print(f"Total Bytes: {stats['total_disk_usage_bytes']}")

asyncio.run(run_async_pipeline())

E. Offset-Stable Cache Layer

To eliminate disk read latency, collections support a pluggable cache (cache_type set to 'lru', 'fifo', or 'lfu', configured via cache_size). Pointers are cached using the immutable physical record offset in the .data file as the cache key, which is entirely immune to list insertions, dictionary reorderings, or index shifts.

from fscollections import fsdict

# Enables stable LFU cache tracking
cached_dict = fsdict('cached_db', klass='string', cache_type='lfu', cache_size=256)
cached_dict['user_id'] = 'user_1234'

# First read: seeks and reads physical file, caches at immutable offset
user_val = cached_dict['user_id']

# Second read: 100% cache hit, zero file-system I/O overhead!
user_val = cached_dict['user_id']

F. Predicate Pushdown Scans

Instead of loading entire databases into system memory to execute filters, collections support memory-efficient pushdown scan loops. Elements are deserialized and matched sequentially directly on disk, maintaining an $O(1)$ memory footing regardless of database scale.

  • FSDict: The predicate takes (key, val) and yields (key, val) tuples.
  • FSList / FSSet: The predicate takes (val) and yields (val) elements.
import asyncio
from fscollections import fsdict, async_fsdict

# Synchronous pushdown scan
sync_dict = fsdict('search_db', klass='int64')
sync_dict['alpha'] = 50
sync_dict['beta'] = 150

# Finds elements on disk, keeping RAM consumption flat
sync_results = list(sync_dict.scan(lambda key, val: val > 100))
print(sync_results)  # Output: [('beta', 150)]


# Asynchronous pushdown scan
async def run_async_scan() -> None:
    async with async_fsdict('async_search_db', klass='int64') as async_dict:
        await async_dict.__setitem__('gamma', 200)
        
        # Async pushdown scan yields elements lazily using AsyncIterator
        async for key, val in async_dict.scan(lambda key, val: val > 150):
            print(f'Async Match: {key} -> {val}')  # Output: Async Match: gamma -> 200

asyncio.run(run_async_scan())

G. Threshold-Driven Compaction

Deletions and updates leave physical fragments ("holes") in the sequential storage. Exposing a compaction_threshold float ratio (e.g. 0.25 for 25% fragmentation) automates background in-place storage compaction whenever fragmentation metrics cross the configured limit. Manual compaction can also be invoked at any time via .compact().

from fscollections import fsdict

# Triggers sequential in-place compaction when fragmentation ratio reaches 20%
compacting_dict = fsdict('metrics_db', klass='string', compaction_threshold=0.2)
compacting_dict['role'] = 'developer'
compacting_dict['role'] = 'senior_developer'  # Leaves 1 dead record slot

# Fragmentation ratio checks can trigger automatic or explicit compaction
compacting_dict.compact()  # Forces immediate storage cleanup and physical truncation

H. Swap-and-Pop Auto-Compaction

For homogeneous, fixed-size primitive arrays, setting auto_compact=True enables instant $O(1)$ physical space reclamation during deletions. Instead of marking the slot as dead or triggering sequential rewrites, it moves the last active physical record directly into the deleted record's offset and truncates the file tail.

from fscollections import fslist

# Fixed-size list structure with pop-and-swap auto compaction
fast_list = fslist('fast_list', klass='int64', auto_compact=True)
fast_list.append(100)
fast_list.append(200)
fast_list.append(300)

# Pop element 1: copies '300' into element 1's offset and truncates file in O(1)
fast_list.pop(1)
print(fast_list[1])  # Output: 300

I. Contiguous Read Coalescing

For sequential range operations in FSList (e.g., slicing my_list[100:1000]), the underlying record store coalesces physically contiguous storage segments. Multiple targets are read sequentially in a single contiguous disk system call, yielding up to a 95% reduction in OS read seek context switches.

from fscollections import fslist

# Range slicing executes automatic read coalescing
sequential_list = fslist('range_list', klass='int32')
for value in range(1000):
    sequential_list.append(value)

# Triggers a single coalesced continuous disk read instead of 100 separate seeks!
subset = sequential_list[100:200]
print(len(subset))  # Output: 100

6. Premium Examples & Applications

The library features six premium, production-grade applications located in the examples/ directory. Each example is thoroughly styled, static type checked, and rigorously verified:

  1. Binary Decision Diagram (BDD): Implements a canonical reduced ordered binary decision diagram (ROBDD) reduction engine, logic SAT solvers, custom DIMACS CNF parsers, and Graphviz DOT graph exports backed by fsdict.
  2. NoSQL Document Database: Utilizes Virtual Pointer Indirection (ref://fscollection...) to orchestrate complex documents, nested indexes, dynamic references, and lock-free collection materializations.
  3. Persistent Job Queue: Builds a resilient Multi-Producer Multi-Consumer (MPMC) worker task queue featuring concurrent process-safe dequeues, status tracking, worker lease locking, and self-cleaning temporal structures.
  4. Log / Vector Search Engine: Demonstrates the performance of Predicate Pushdown Scans (fsdict.scan) by executing textual lookups and vector cosine similarity queries with a flat $O(1)$ memory profile.
  5. Persistent LFU Cache: Develops a size-bounded, high-throughput Least Frequently Used (LFU) key-value cache engine with stable tie-breaking and immediate telemetry metrics.
  6. TODO Manager CLI: A command-line utility showcasing atomic sync additions, positional task reordering, state updates, and clean Temporal collections.

To inspect the examples in detail or review the suite execution details, refer to the examples/README.md walkthrough.


7. Verification and Test Execution

The library is backed by an extensive mock-based unit and physical filesystem integration test suite, fully conforming to strict PEP 484 type hints, formatting, and linting rules.

# Activate virtual environment
source .venv/bin/activate

# Format code
black fscollections

# Run linter checks
ruff check fscollections

# Run strict static type checks
mypy fscollections

# Execute all tests
python -m unittest discover -s fscollections/tests -v

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fscollections-0.0.1.tar.gz (62.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fscollections-0.0.1-py3-none-any.whl (75.7 kB view details)

Uploaded Python 3

File details

Details for the file fscollections-0.0.1.tar.gz.

File metadata

  • Download URL: fscollections-0.0.1.tar.gz
  • Upload date:
  • Size: 62.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fscollections-0.0.1.tar.gz
Algorithm Hash digest
SHA256 c7e7507dd47f4697df02f26fc958772f7c0f39f333abe60b0d32d1a3b93dab3f
MD5 1ce12094614b958b0bb042959b489c1c
BLAKE2b-256 e23cb0edd55af2ed0d037a80a9ccc7684b66b0d463e992c526c1f9371f2feac1

See more details on using hashes here.

Provenance

The following attestation bundles were made for fscollections-0.0.1.tar.gz:

Publisher: publish_on_pypi.yml on diegojromerolopez/fscollections

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fscollections-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: fscollections-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 75.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fscollections-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d8b0366652204ab3a8e8e060e8481e11bf79831d97b588b0653a1165a14e2ed6
MD5 5408450666e2c431992421a1d353bbb9
BLAKE2b-256 fa8358fa884b5e75f1e27ecaa08a4382d59ade9eb106a9008674f254a7875ee6

See more details on using hashes here.

Provenance

The following attestation bundles were made for fscollections-0.0.1-py3-none-any.whl:

Publisher: publish_on_pypi.yml on diegojromerolopez/fscollections

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page