High-performance File System-backed Python Collections (FSList, FSDict, FSSet)

These details have not been verified by PyPI

Project description

fscollections

High-performance, file-backed Python collections (FSList, FSDict, FSSet) that seamlessly mimic the Python standard library interfaces (MutableSequence, MutableMapping, MutableSet) and raise identical exception patterns. By persisting records sequentially in the file system and managing a compact memory-mapped/indexed pointer cache, fscollections scales your data structures far beyond system RAM limits.

THIS PROJECT HAS BEEN VIBE-CODED, USE AT YOUR OWN RISK!

1. Key Production-Grade Features

Standard Library Parity: Perfect API compatibility with list, dict, and set, throwing standard exceptions (IndexError, KeyError, TypeError for non-string dictionary keys).
Temporal & Self-Cleaning Collections: Allows initializing temporary, in-memory-like persistent structures without a context manager by omitting the base_path parameter, automatically deleting backing storage files upon garbage collection (__del__) or explicit .destroy() invocation.
Ultra-Low Memory Footprint: Uses flat 64-bit pointer arrays (array('Q')) and slot-mapping indirection to eliminate Python object wrapper overhead (reducing RAM usage by 3.5x).
Deeply Nested Collections: Supports dynamic nested structures (FSList, FSDict, FSSet) through Virtual Pointer Indirection, achieving $O(1)$ RAM utilization and localized thread-safe writes.
Non-Blocking Native Asynchronous Engine: Asynchronous counterparts (AsyncFSList, AsyncFSDict, AsyncFSSet) utilize aiofiles and cooperative reentrant asynchronous locking to execute file I/O operations concurrently without blocking the event loop.
Offset-Stable Pluggable Cache Layer: Incorporates LRU, FIFO, and LFU in-memory caches utilizing immutable physical record offsets as cache keys, making the cache immune to list index shifts and dictionary updates.
Predicate Pushdown Scans: Enables querying collections with a flat $O(1)$ memory footprint by sequentially matching record deserialization predicates directly on disk.
Exposed System & Storage Telemetry: A detailed .metrics attribute providing dynamic fragmentation ratios, wasted space calculations, physical file sizes, and clean-shutdown detection on demand with zero additional disk writes.
Coalesced Disk Reads: Intelligently identifies physically contiguous runs of records during range slicing in FSList, merging up to 95% of individual seeks into single high-throughput reads.
Swap-and-Pop Auto-Compaction: Offers instant $O(1)$ space reclamation on deletion for fixed-size records, bypassing expensive file rewrites.
Coordinated Layered Concurrency Control: Reentrant thread locking (threading.RLock) coupled with UNIX-level advisory locking (fcntl.flock) ensures multi-thread and multi-process safety, with a graceful thread-lock fallback on Windows.
Crash-Resilient Durability: Features self-healing indexes using an end-of-file Clean-Close marker (0xAA) to automatically reconstruct metadata directly from raw data payloads on startup if an unclean close is detected.
Strict Type Safety: Homogeneous primitive collections enforce strict type validation at runtime to prevent silent serialization bugs, with no disk schema file clutter.

2. Architectural Design & DDD Breakdown

The codebase follows a rigorous Domain-Driven Design (DDD) directory structure. Core business rules, data models, and serialization strategies are fully isolated from storage mechanisms and low-level file I/O operations.

fscollections/
│
├── domain/                      # Domain Layer (Core Business Rules & Interfaces)
│   ├── __init__.py              # Package domain-level exports
│   ├── types.py                 # Registry, Primitive & Struct Serializers
│   ├── interfaces.py            # Storage and index interfaces (IRecordStore, IIndexStore)
│   ├── collections.py           # FSList, FSDict, FSSet implementing abstract base classes (ABCs)
│   └── async_collections.py     # Asynchronous collection decorator proxies (AsyncFSList, etc.)
│
├── infrastructure/              # Infrastructure Layer (I/O, File Storage, Locking, Caching)
│   ├── __init__.py
│   ├── record_store.py          # Low-level sequential binary data store (.data.fscollection)
│   ├── index_store.py           # Flat binary index store (.index.fscollection) with clean-close byte
│   ├── async_record_store.py    # Non-blocking async record store wrapper
│   ├── async_index_store.py     # Non-blocking async index store wrapper
│   ├── cache.py                 # Pluggable memory caching system (LRU, FIFO, LFU)
│   └── locks.py                 # Process and thread coordinated advisory locking utility
│
├── application/                 # Application/Wiring Layer
│   ├── __init__.py
│   ├── factory.py               # Dependency Injection Container for sync collection creation
│   └── async_factory.py         # Dependency Injection Container for async collection creation
│
└── tests/                       # Complete Test Suite
    ├── __init__.py
    ├── unit/                    # 100% Mock-based Unit Tests (mocking immediate lower level)
    └── integration/             # Physical file-system Integration Tests

Decoupled Interfaces & Dependency Injection

Collections rely entirely on abstract interfaces defined in domain/interfaces.py. The application/factory.py and application/async_factory.py act as wiring containers that instantiate the physical infrastructure stores (RecordStore, IndexStore) and inject them into the collections, allowing seamless mock testing of the collections without running expensive file-system writes.

3. Serialization and File Formats

Each collection is defined by a base path (e.g., /path/to/my_collection), creating up to two specialized files:

A. Index File (`.index.fscollection`)

Required for fast lookups in FSDict and FSSet.

Index Entry Layout:

+------------------------+---------------------------+-----------------------+---------------------+
| Key Length L (4 bytes) | UTF-8 String Key (L bytes) | DataOffset (8 bytes)  | DataSize (8 bytes)  |
| Big-Endian uint32 '>I' | Raw UTF-8 bytes           | Big-Endian uint64 '>Q'| Big-Endian uint64 '>Q'|
+------------------------+---------------------------+-----------------------+---------------------+

Clean-Close Marker: The final byte of a cleanly closed index file is written as 0xAA. If a crash occurs and this byte is absent, the library triggers self-healing verification on load.

B. Data File (`.data.fscollection`)

Sequentially appends serialized headers and payloads.

Header Structure (10 bytes, fixed size):
- Type_ID (1 byte)
- Deleted_Flag (1 byte: 0x00 = Active, 0x01 = Deleted)
- Payload_Size (8 bytes, Big-Endian uint64 >Q)
Payload Structure:
- [Raw Serialized Payload Bytes]
- For FSDict and FSSet disaster recovery, the payload contains the key prefixed inside the payload: [4 bytes: Key Length K][K bytes: UTF-8 Key][Serialized Value Bytes].

4. Performance & Structural Complexity

Let $N$ be the number of items currently in the collection, $M$ be the size of active pointers in RAM, and $S$ be the size of the record payload.

Collection Type	Operation	Standard Library (In-Memory)	`fscollections` (File-Backed)	Disk I/O & Execution Details
List / FSList	`list[idx]`	$O(1)$	$O(1)$	1 Random Seek + Read from `.data`
	`list[idx] = val`	$O(1)$	$O(1)$	1 Append-Write + 1 Seek/Write to delete old record
	`list.append(val)`	$O(1)$	$O(1)$	1 Append-Write to `.data`
	`list.insert(idx, val)`	$O(N)$	$O(N)$ pointers	1 Append-Write + pointer shift in RAM
	`list.pop(idx)`	$O(N)$	$O(N)$ pointers	1 Random Seek + Read + 1 Seek/Write to mark deleted
	`len(list)`	$O(1)$	$O(1)$	None (Read from in-memory pointer array)
Dict / FSDict	`dict[key]`	$O(1)$ avg	$O(1)$ avg	1 Random Seek + Read from `.data`
	`dict[key] = val`	$O(1)$ avg	$O(1)$ avg	1 Append-Write + 1 Seek/Write to delete old record
	`key in dict`	$O(1)$ avg	$O(1)$ avg	None (Lookup via in-memory index hash map)
	`del dict[key]`	$O(1)$ avg	$O(1)$ avg	1 Seek/Write to mark record deleted
	`len(dict)`	$O(1)$ avg	$O(1)$	None (Read from in-memory index size)
Set / FSSet	`set.add(val)`	$O(1)$ avg	$O(1)$ avg	None if already exists; 1 Append-Write if new
	`val in set`	$O(1)$ avg	$O(1)$ avg	None (Lookup via in-memory serialized-item hash map)
	`set.remove(val)`	$O(1)$ avg	$O(1)$ avg	1 Seek/Write to mark record deleted
	`len(set)`	$O(1)$ avg	$O(1)$	None (Read from in-memory index size)

5. Detailed Feature Coverage & Examples

The following functional examples illustrate the usage of all advanced capabilities implemented within the library.

A. Temporal & Self-Cleaning Collections

When initializing a collection without an explicit base_path, fscollections automatically configures a unique temporal database inside the OS standard /tmp directory. Backup storage files are deleted automatically when the collection variable goes out of scope and is garbage-collected by the Python runtime. Alternatively, storage can be reclaimed instantly via .destroy(). Subsequent operations on closed or destroyed collections proactively raise a clean ValueError.

from fscollections import fslist

# Allocates temporal storage files under /tmp/fscollection_...
temp_list = fslist(klass='string')
temp_list.append('hello_temporal')

# Reclaim space and delete backing files instantly
temp_list.destroy()

try:
    # Post-destruction operations raise a standard exception
    val = temp_list[0]
except ValueError as error:
    print(f'Captured expected failure: {error}')
    # Output: Captured expected failure: I/O operation on closed file.

B. Extended Serialization Layer

fscollections runs purely homogeneous collections with high-fidelity native data types, compile-time resolution, and runtime type-safety. The supported types can be declared via type objects (e.g. int, str, bool, float, bytes, dict, list, tuple) or explicit string type identifiers:

Declared Type (Object)	Resolved Type (String)	Internal Encoding Format
—	`'int32'`	Big-Endian 32-bit integer
`int`	`'int64'`	Big-Endian 64-bit integer
—	`'int128'`	Big-Endian 128-bit integer
`str`	`'string'`	Length-prefixed UTF-8 string
`bool`	`'bool'`	1-byte boolean representation
`float`	`'float'`	Big-Endian double-precision float
`bytes`	`'bytes'`	Length-prefixed raw bytes buffer
`dict`, `list`, `tuple`	`'json'`	UTF-8 encoded compact JSON

from fscollections import fsdict

# Enforces double-precision float value mappings
float_dict = fsdict('my_floats', klass=float)
float_dict['pi'] = 3.14159

try:
    # Enforces absolute run-time type safety
    float_dict['pi'] = 'three point one four'
except TypeError as error:
    print(f'Linter type block: {error}')
    # Output: Linter type block: Expected float for float, got str

C. Deeply Nested Collections (Virtual Pointer Indirection)

To completely bypass monolithic RAM scaling and write-amplification bottlenecks, nested collection instances (FSList, FSDict, FSSet) are serialized inside the parent collection as a lightweight reference URI string: 'ref://fscollection:<collection_type>?path=<base_path>'. Children are lazily materialized on-the-fly when requested, and any subsequent child mutations write atomically directly to the child's files.

from fscollections import fsdict, fslist

# Create parent heterogeneous collection (klass=None)
with fsdict('parent_store') as parent_dict:
    # Create independent child collection
    child_list = fslist('child_store')
    child_list.append('initial_item')
    
    # Store child inside parent: serializes as reference URI
    parent_dict['nested_list'] = child_list
    
    # Mutation writes directly and atomically to child's files
    parent_dict['nested_list'].append('mutated_item')

# Lazily reloaded from storage
with fsdict('parent_store') as parent_dict:
    # Child collection materializes on-the-fly
    retrieved_list = parent_dict['nested_list']
    print(retrieved_list[1])  # Output: mutated_item

D. Native Asynchronous Engine

For non-blocking asyncio services, async_fslist(), async_fsdict(), and async_fsset() provide native asynchronous APIs. All disk activities run asynchronously via aiofiles and reentrant asynchronous locking, ensuring high-throughput operation concurrency.

import asyncio
from fscollections import async_fsdict, async_fslist

async def run_async_pipeline() -> None:
    async with async_fsdict('async_parent') as parent_dict:
        child_list = async_fslist('async_child')
        await child_list.append('async_payload')
        
        await parent_dict.__setitem__('child_list', child_list)
        
        retrieved_list = await parent_dict['child_list']
        print(await retrieved_list[0])  # Output: async_payload
        
        # Query async metrics
        stats = parent_dict.metrics
        print(f"Total Bytes: {stats['total_disk_usage_bytes']}")

asyncio.run(run_async_pipeline())

E. Offset-Stable Cache Layer

To eliminate disk read latency, collections support a pluggable cache (cache_type set to 'lru', 'fifo', or 'lfu', configured via cache_size). Pointers are cached using the immutable physical record offset in the .data file as the cache key, which is entirely immune to list insertions, dictionary reorderings, or index shifts.

from fscollections import fsdict

# Enables stable LFU cache tracking
cached_dict = fsdict('cached_db', klass='string', cache_type='lfu', cache_size=256)
cached_dict['user_id'] = 'user_1234'

# First read: seeks and reads physical file, caches at immutable offset
user_val = cached_dict['user_id']

# Second read: 100% cache hit, zero file-system I/O overhead!
user_val = cached_dict['user_id']

F. Predicate Pushdown Scans

Instead of loading entire databases into system memory to execute filters, collections support memory-efficient pushdown scan loops. Elements are deserialized and matched sequentially directly on disk, maintaining an $O(1)$ memory footing regardless of database scale.

FSDict: The predicate takes (key, val) and yields (key, val) tuples.
FSList / FSSet: The predicate takes (val) and yields (val) elements.

import asyncio
from fscollections import fsdict, async_fsdict

# Synchronous pushdown scan
sync_dict = fsdict('search_db', klass='int64')
sync_dict['alpha'] = 50
sync_dict['beta'] = 150

# Finds elements on disk, keeping RAM consumption flat
sync_results = list(sync_dict.scan(lambda key, val: val > 100))
print(sync_results)  # Output: [('beta', 150)]


# Asynchronous pushdown scan
async def run_async_scan() -> None:
    async with async_fsdict('async_search_db', klass='int64') as async_dict:
        await async_dict.__setitem__('gamma', 200)
        
        # Async pushdown scan yields elements lazily using AsyncIterator
        async for key, val in async_dict.scan(lambda key, val: val > 150):
            print(f'Async Match: {key} -> {val}')  # Output: Async Match: gamma -> 200

asyncio.run(run_async_scan())

G. Threshold-Driven Compaction

Deletions and updates leave physical fragments ("holes") in the sequential storage. Exposing a compaction_threshold float ratio (e.g. 0.25 for 25% fragmentation) automates background in-place storage compaction whenever fragmentation metrics cross the configured limit. Manual compaction can also be invoked at any time via .compact().

from fscollections import fsdict

# Triggers sequential in-place compaction when fragmentation ratio reaches 20%
compacting_dict = fsdict('metrics_db', klass='string', compaction_threshold=0.2)
compacting_dict['role'] = 'developer'
compacting_dict['role'] = 'senior_developer'  # Leaves 1 dead record slot

# Fragmentation ratio checks can trigger automatic or explicit compaction
compacting_dict.compact()  # Forces immediate storage cleanup and physical truncation

H. Swap-and-Pop Auto-Compaction

For homogeneous, fixed-size primitive arrays, setting auto_compact=True enables instant $O(1)$ physical space reclamation during deletions. Instead of marking the slot as dead or triggering sequential rewrites, it moves the last active physical record directly into the deleted record's offset and truncates the file tail.

from fscollections import fslist

# Fixed-size list structure with pop-and-swap auto compaction
fast_list = fslist('fast_list', klass='int64', auto_compact=True)
fast_list.append(100)
fast_list.append(200)
fast_list.append(300)

# Pop element 1: copies '300' into element 1's offset and truncates file in O(1)
fast_list.pop(1)
print(fast_list[1])  # Output: 300

I. Contiguous Read Coalescing

For sequential range operations in FSList (e.g., slicing my_list[100:1000]), the underlying record store coalesces physically contiguous storage segments. Multiple targets are read sequentially in a single contiguous disk system call, yielding up to a 95% reduction in OS read seek context switches.

from fscollections import fslist

# Range slicing executes automatic read coalescing
sequential_list = fslist('range_list', klass='int32')
for value in range(1000):
    sequential_list.append(value)

# Triggers a single coalesced continuous disk read instead of 100 separate seeks!
subset = sequential_list[100:200]
print(len(subset))  # Output: 100

6. Premium Examples & Applications

The library features six premium, production-grade applications located in the examples/ directory. Each example is thoroughly styled, static type checked, and rigorously verified:

Binary Decision Diagram (BDD): Implements a canonical reduced ordered binary decision diagram (ROBDD) reduction engine, logic SAT solvers, custom DIMACS CNF parsers, and Graphviz DOT graph exports backed by fsdict.
NoSQL Document Database: Utilizes Virtual Pointer Indirection (ref://fscollection...) to orchestrate complex documents, nested indexes, dynamic references, and lock-free collection materializations.
Persistent Job Queue: Builds a resilient Multi-Producer Multi-Consumer (MPMC) worker task queue featuring concurrent process-safe dequeues, status tracking, worker lease locking, and self-cleaning temporal structures.
Log / Vector Search Engine: Demonstrates the performance of Predicate Pushdown Scans (fsdict.scan) by executing textual lookups and vector cosine similarity queries with a flat $O(1)$ memory profile.
Persistent LFU Cache: Develops a size-bounded, high-throughput Least Frequently Used (LFU) key-value cache engine with stable tie-breaking and immediate telemetry metrics.
TODO Manager CLI: A command-line utility showcasing atomic sync additions, positional task reordering, state updates, and clean Temporal collections.

To inspect the examples in detail or review the suite execution details, refer to the examples/README.md walkthrough.

7. Verification and Test Execution

The library is backed by an extensive mock-based unit and physical filesystem integration test suite, fully conforming to strict PEP 484 type hints, formatting, and linting rules.

# Activate virtual environment
source .venv/bin/activate

# Format code
black fscollections

# Run linter checks
ruff check fscollections

# Run strict static type checks
mypy fscollections

# Execute all tests
python -m unittest discover -s fscollections/tests -v

Project details

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Release history Release notifications | RSS feed

This version

0.0.1

May 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fscollections-0.0.1.tar.gz (62.8 kB view details)

Uploaded May 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fscollections-0.0.1-py3-none-any.whl (75.7 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file fscollections-0.0.1.tar.gz.

File metadata

Download URL: fscollections-0.0.1.tar.gz
Upload date: May 24, 2026
Size: 62.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fscollections-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c7e7507dd47f4697df02f26fc958772f7c0f39f333abe60b0d32d1a3b93dab3f`
MD5	`1ce12094614b958b0bb042959b489c1c`
BLAKE2b-256	`e23cb0edd55af2ed0d037a80a9ccc7684b66b0d463e992c526c1f9371f2feac1`

See more details on using hashes here.

Provenance

The following attestation bundles were made for fscollections-0.0.1.tar.gz:

Publisher: publish_on_pypi.yml on diegojromerolopez/fscollections

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fscollections-0.0.1.tar.gz
- Subject digest: c7e7507dd47f4697df02f26fc958772f7c0f39f333abe60b0d32d1a3b93dab3f
- Sigstore transparency entry: 1624705514
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: diegojromerolopez/fscollections@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff
- Branch / Tag: refs/tags/0.0.1
- Owner: https://github.com/diegojromerolopez
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_on_pypi.yml@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff
- Trigger Event: release

File details

Details for the file fscollections-0.0.1-py3-none-any.whl.

File metadata

Download URL: fscollections-0.0.1-py3-none-any.whl
Upload date: May 24, 2026
Size: 75.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fscollections-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d8b0366652204ab3a8e8e060e8481e11bf79831d97b588b0653a1165a14e2ed6`
MD5	`5408450666e2c431992421a1d353bbb9`
BLAKE2b-256	`fa8358fa884b5e75f1e27ecaa08a4382d59ade9eb106a9008674f254a7875ee6`

See more details on using hashes here.

Provenance

The following attestation bundles were made for fscollections-0.0.1-py3-none-any.whl:

Publisher: publish_on_pypi.yml on diegojromerolopez/fscollections

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fscollections-0.0.1-py3-none-any.whl
- Subject digest: d8b0366652204ab3a8e8e060e8481e11bf79831d97b588b0653a1165a14e2ed6
- Sigstore transparency entry: 1624705521
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: diegojromerolopez/fscollections@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff
- Branch / Tag: refs/tags/0.0.1
- Owner: https://github.com/diegojromerolopez
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_on_pypi.yml@efea50f7c1c8e0c3628ea38a3eadf408e0e60cff
- Trigger Event: release

fscollections 0.0.1

Navigation

Verified details

Maintainers

Meta

Unverified details

Meta

Classifiers

Project description

fscollections

1. Key Production-Grade Features

2. Architectural Design & DDD Breakdown

Decoupled Interfaces & Dependency Injection

3. Serialization and File Formats

A. Index File (.index.fscollection)

B. Data File (.data.fscollection)

4. Performance & Structural Complexity

5. Detailed Feature Coverage & Examples

A. Temporal & Self-Cleaning Collections

B. Extended Serialization Layer

C. Deeply Nested Collections (Virtual Pointer Indirection)

D. Native Asynchronous Engine

E. Offset-Stable Cache Layer

F. Predicate Pushdown Scans

G. Threshold-Driven Compaction

H. Swap-and-Pop Auto-Compaction

I. Contiguous Read Coalescing

6. Premium Examples & Applications

7. Verification and Test Execution

Project details

Verified details

Maintainers

Meta

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

A. Index File (`.index.fscollection`)

B. Data File (`.data.fscollection`)