Sparse Virtual File System Cache implemented in C++.

Sparse Virtual File

Introduction

Sometimes you don’t need the whole file. Sometimes you don’t want the whole file, especially if it is huge and on some remote server. But you might know which parts of the file you want, and svfsc can help you store them locally so that it looks as if you have access to the complete file while holding just the pieces of interest.

svfsc is targeted at reading very large binary files such as TIFF, RP66V1, HDF5 where the structure is well known. For example you might want to parse a TIFF file for its metadata or for a particular image tile or strip which is a tiny fraction of the file itself.

svfsc implements a Sparse Virtual File, a specialised in-memory cache where a particular file might not be available but parts of it can be obtained without reading the whole file. A Sparse Virtual File (SVF) is represented internally as a map of blocks of data keyed by their file offsets. Any write to an SVF will coalesce these blocks where possible. Cache punting was added in version 0.2.0 (see History); unless it is used, an SVF always accumulates data. A Sparse Virtual File System (SVFS) is an extension of this to provide a key/value store where the key is a file ID and the value is a Sparse Virtual File.
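
To make the "map of blocks that coalesce on write" idea concrete, here is a minimal pure-Python sketch. It is not the svfsc implementation: the class name ToySVF and its blocks attribute are hypothetical, and letting new data win where a write overlaps existing data is an assumption of this sketch, not a statement about svfsc's behaviour.

```python
class ToySVF:
    """Illustrative model of a Sparse Virtual File (hypothetical, not svfsc)."""

    def __init__(self):
        # file offset -> bytes; blocks never overlap and never touch each other.
        self.blocks = {}

    def write(self, fpos, data):
        new_start, new_end = fpos, fpos + len(data)
        buf = bytearray(data)
        for start in sorted(self.blocks):
            block = self.blocks[start]
            end = start + len(block)
            if end < new_start or start > new_end:
                continue  # Neither overlapping nor adjacent: leave it alone.
            if start < new_start:
                # Keep the part of the old block that precedes the new data.
                buf[0:0] = block[: new_start - start]
                new_start = start
            if end > new_end:
                # Keep the part of the old block that follows the new data.
                buf.extend(block[len(block) - (end - new_end):])
                new_end = end
            del self.blocks[start]
        self.blocks[new_start] = bytes(buf)


svf = ToySVF()
svf.write(14, b'ABCDEF')
svf.write(8, b'123456')   # Adjacent to the first block, so the two coalesce.
svf.write(30, b'Z')       # Not adjacent, so it stays a separate block.
print(svf.blocks)
```

After the three writes the model holds two blocks: one of twelve bytes at offset 8 and one of a single byte at offset 30.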

svfsc is written in C++ with a Python interface. It is thread safe in both domains.

An SVF might be used like this:

  • The user requests some data (for example TIFF metadata) from a remote file using a Parser that knows the TIFF structure.

  • The Parser consults the SVF. If the SVF has the data then the Parser parses it and gives the results to the user.

  • If the SVF does not have the data then the Parser consults the SVF for what data is needed, then issues the appropriate GET request(s) to the remote server.

  • That data is used to update the SVF, then the Parser can use it and give the results to the user.
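
The steps above can be sketched as a small read-through helper. This is an illustration only: ByteMapCache is a hypothetical stand-in so that the example runs without svfsc installed, REMOTE and fetch() simulate the remote server, and read_through() is a hypothetical helper name. The method names has_data(), need(), write() and read() mirror the svfsc calls shown under Example Python Usage.

```python
class ByteMapCache:
    """Hypothetical stand-in exposing has_data/need/write/read (not svfsc)."""

    def __init__(self):
        self._bytes = {}  # file position -> single byte value

    def write(self, fpos, data):
        for i, value in enumerate(data):
            self._bytes[fpos + i] = value

    def has_data(self, fpos, length):
        return all(p in self._bytes for p in range(fpos, fpos + length))

    def need(self, fpos, length):
        # Collect runs of missing positions as (file_position, read_length).
        gaps, run_start = [], None
        for p in range(fpos, fpos + length):
            if p not in self._bytes:
                if run_start is None:
                    run_start = p
            elif run_start is not None:
                gaps.append((run_start, p - run_start))
                run_start = None
        if run_start is not None:
            gaps.append((run_start, fpos + length - run_start))
        return tuple(gaps)

    def read(self, fpos, length):
        return bytes([self._bytes[p] for p in range(fpos, fpos + length)])


def read_through(cache, fetch, fpos, length):
    """Serve from the cache, fetching only the missing pieces first."""
    if not cache.has_data(fpos, length):
        for read_position, read_length in cache.need(fpos, length):
            cache.write(read_position, fetch(read_position, read_length))
    return cache.read(fpos, length)


REMOTE = bytes(range(256))  # Simulated remote file.
requests = []               # Records every simulated GET.


def fetch(fpos, length):
    requests.append((fpos, length))
    return REMOTE[fpos:fpos + length]


cache = ByteMapCache()
cache.write(14, REMOTE[14:20])           # Six bytes already cached.
data = read_through(cache, fetch, 8, 24)
print(requests)                           # Only the two gaps were fetched.
```

A second call for the same range is served entirely from the cache and issues no further requests.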

Here is a conceptual example of an SVF running on a local file system containing data from a single file.

            CLIENT SIDE           |             LOCAL FILE SYSTEM
                                  .
/------\      /--------\          |              /-------------\
| User | <--> | Parser | <-- read(fpos, len) --> | File System |
\------/      \--------/          |              \-------------/
                   |              .
                   |              |
               /-------\          .
               |  SVF  |          |
               \-------/          .

Here is a conceptual example of an SVFS running with a remote file system.

            CLIENT SIDE           |             SERVER SIDE
                                  .
/------\      /--------\          |             /--------\
| User | <--> | Parser | <-- GET(fpos, len) --> | Server |
\------/      \--------/          |             \--------/
                   |              .                  |
                   |              |                  |
               /-------\          .           /-------------\
               |  SVF  |          |           | File System |
               \-------/          .           \-------------/

Example Python Usage

Installation

Install from pypi:

$ pip install svfsc

Using a Single SVF

This shows the basic functionality: write(), read() and need():

import svfsc

# Construct a Sparse Virtual File
svf = svfsc.cSVF('Some file ID')

# Write six bytes at file position 14
svf.write(14, b'ABCDEF')

# Read from it
svf.read(16, 2) # Returns b'CD'

# What do I have to do to read 24 bytes from file position 8?
# This returns a tuple of pairs ((file_position, read_length), ...)
svf.need(8, 24) # Returns ((8, 6), (20, 12))
# Go and get the data from those file positions and write it to
# the SVF then you can read directly from the SVF.
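
To see where the pairs returned by need() come from, here is a minimal pure-Python sketch of the gap calculation. It is not svfsc's code: missing_ranges() is a hypothetical helper that takes the cached blocks as (file_position, length) pairs, assumed not to overlap, and returns the parts of the requested range that they do not cover.

```python
def missing_ranges(blocks, fpos, length):
    """Return ((file_position, read_length), ...) not covered by blocks."""
    gaps = []
    cursor, end = fpos, fpos + length
    for start, size in sorted(blocks):
        stop = start + size
        if stop <= cursor:
            continue          # Block lies wholly before the range of interest.
        if start >= end:
            break             # Block (and all later ones) lie after the range.
        if start > cursor:
            gaps.append((cursor, start - cursor))
        cursor = max(cursor, stop)
    if cursor < end:
        gaps.append((cursor, end - cursor))
    return tuple(gaps)


# One cached block of six bytes at file position 14, as in the example above.
print(missing_ranges([(14, 6)], 8, 24))
```

The request covers positions 8 up to 32 and the cached block covers 14 up to 20, so the gaps are six bytes at position 8 and twelve bytes at position 20.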

The basic operation is to check whether the SVF has the data and, if not, to get the missing data and write it to the SVF. Then read directly:

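# fetch_bytes() is a hypothetical placeholder for however you obtain the data,
# for example a GET request with a Range header to a remote file.
# It must return a bytes object of read_length bytes starting at read_position.
file_position, length = 8, 24
if not svf.has_data(file_position, length):
    for read_position, read_length in svf.need(file_position, length):
        data = fetch_bytes(read_position, read_length)
        svf.write(read_position, data)
# Now read directly
svf.read(file_position, length)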

A Sparse Virtual File System

The example above uses a single Sparse Virtual File, but you can also create a Sparse Virtual File System. This is a key/value store where the key is some string and the value is an SVF:

import svfsc

svfs = svfsc.cSVFS()

# Insert an empty SVF with a corresponding ID
ID = 'abc'
svfs.insert(ID)

# Write six bytes to that SVF at file position 14
svfs.write(ID, 14, b'ABCDEF')

# Read from the SVF
svfs.read(ID, 16, 2) # Returns b'CD'

# What do I have to do to read 24 bytes from file position 8
# from that SVF?
svfs.need(ID, 8, 24) # Returns ((8, 6), (20, 12))
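
Conceptually, the key/value arrangement can be modelled as a dictionary from file ID to per-file data. The following is a minimal pure-Python sketch and not svfsc itself: ToySVFS is a hypothetical class, the per-byte dictionary stands in for a real SVF, and the single coarse lock is only an assumption about one way such a store could be made safe to share between threads.

```python
import threading


class ToySVFS:
    """Illustrative key/value store: file ID -> sparse byte map (not svfsc)."""

    def __init__(self):
        self._files = {}               # file ID -> {file position: byte value}
        self._lock = threading.Lock()  # Assumption: one lock for the whole store.

    def insert(self, file_id):
        # This sketch leaves an existing entry untouched; how svfsc treats a
        # duplicate ID is not modelled here.
        with self._lock:
            self._files.setdefault(file_id, {})

    def write(self, file_id, fpos, data):
        with self._lock:
            svf = self._files[file_id]
            for i, value in enumerate(data):
                svf[fpos + i] = value

    def read(self, file_id, fpos, length):
        with self._lock:
            svf = self._files[file_id]
            return bytes([svf[p] for p in range(fpos, fpos + length)])


svfs = ToySVFS()
svfs.insert('abc')
svfs.insert('xyz')
svfs.write('abc', 14, b'ABCDEF')
print(svfs.read('abc', 16, 2))
```

Each ID has its own independent data, so writing to 'abc' leaves 'xyz' empty.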

Example C++ Usage

svfsc is written in C++ so can be used directly:

#include <iostream>

#include "svf.h"

// File modification time of 1672574430.0 (2023-01-01 12:00:30)
SVFS::SparseVirtualFile svf("Some file ID", 1672574430.0);

// Write six chars at file position 14
svf.write(14, "ABCDEF", 6);

// Read from it
char read_buffer[2];
svf.read(16, 2, read_buffer);
// read_buffer now contains "CD"

// What do I have to do to read 24 bytes from file position 8?
// This returns a std::vector<std::pair<size_t, size_t>>
// as ((file_position, read_length), ...)
auto need = svf.need(8, 24);

// The following prints ((8, 6), (20, 12),)
std::cout << "(";
for (auto &val: need) {
    std::cout << "(" << val.first << ", " << val.second << "),";
}
std::cout << ")" << std::endl;

Documentation

Build the documentation from the docs directory or find it on readthedocs: https://svfsc.readthedocs.io/

Acknowledgments

Many thanks to my employer Paige.ai for allowing me to release this as FOSS.

History

0.4.0 (2024-02-11)

  • Add counters for blocks/bytes erased and blocks/bytes punted and then their associated APIs.

  • Use the SVFS_SVF_METHOD_SIZE_T_REGISTER macro in CPython to simplify registering CPython methods.

  • Fix builds on Linux, mainly compiler flags:

    • Move to -std=c++17 to exploit [[nodiscard]].

    • Better alignment of compiler flags between CMakeLists.txt and setup.py.

  • Other minor fixes.

  • Because of the extensive use of this in various projects this version 0.4 is moved to production status: Development Status :: 5 - Production/Stable

0.3.0 (2024-01-06)

  • Add need_many().

  • Fix bug in lru_punt().

  • Development Status :: 4 - Beta

0.2.2 (2023-12-28)

  • Minor fixes.

  • Development Status :: 4 - Beta

0.2.1 (2023-12-27)

  • Include stub file.

  • Development Status :: 4 - Beta

0.2.0 (2023-12-24)

  • Add cache punting.

  • Make C docstrings type parsable (good for Sphinx) and add a script that can create a mypy stub file.

  • Development Status :: 4 - Beta

0.1.2 (2023-10-03)

  • First release.

  • Development Status :: 3 - Alpha
