Sparse Virtual File System Cache implemented in C++.

Sparse Virtual File

Introduction

Sometimes you don’t need the whole file. Sometimes you don’t want the whole file, especially if it is huge and on some remote server. But you might know which parts of the file you want, and svfsc can help you store them locally so that it looks as if you have access to the complete file, but with just the pieces of interest.

svfsc is targeted at reading very large binary files such as TIFF, RP66V1 or HDF5 where the structure is well known. For example, you might want to parse a TIFF file for its metadata, or for a particular image tile or strip that is a tiny fraction of the file itself.

svfsc implements a Sparse Virtual File, a specialised in-memory cache where a particular file might not be available but parts of it can be obtained without reading the whole file. A Sparse Virtual File (SVF) is represented internally as a map of blocks of data with the key being their file offsets. Any write to an SVF will coalesce these blocks where possible. There is no cache punting strategy implemented so an SVF always accumulates data. A Sparse Virtual File System (SVFS) is an extension of this to provide a key/value store where the key is a file ID and the value a Sparse Virtual File.
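
The block map and the coalescing of writes can be pictured with a short pure-Python sketch. This is only an illustration of the idea, not the svfsc implementation: the class name `SketchSVF` is hypothetical, and letting new data overwrite overlapping old data is an assumption made to keep the sketch short.

```python
class SketchSVF:
    """Illustrative only: a map of file offset -> bytes with coalescing writes."""

    def __init__(self):
        self.blocks = {}  # key: file offset, value: bytes held at that offset

    def write(self, fpos, data):
        start, end = fpos, fpos + len(data)
        # Find every existing block that overlaps or abuts [start, end).
        touching = [
            (offset, block) for offset, block in self.blocks.items()
            if offset <= end and offset + len(block) >= start
        ]
        new_start = min([start] + [offset for offset, _ in touching])
        new_end = max([end] + [offset + len(block) for offset, block in touching])
        merged = bytearray(new_end - new_start)
        # Copy the old blocks into the merged block and drop them from the map.
        for offset, block in touching:
            merged[offset - new_start:offset - new_start + len(block)] = block
            del self.blocks[offset]
        # Assumption: where old and new data overlap, the new data wins.
        merged[start - new_start:end - new_start] = data
        self.blocks[new_start] = bytes(merged)


svf = SketchSVF()
svf.write(14, b'ABCDEF')
svf.write(20, b'GH')        # Abuts the first block, so the two are coalesced.
print(svf.blocks)           # {14: b'ABCDEFGH'}
svf.write(30, b'XY')        # Leaves a gap, so this stays a separate block.
print(svf.blocks)           # {14: b'ABCDEFGH', 30: b'XY'}
svf.write(22, b'12345678')  # Fills the gap, so everything becomes one block.
print(svf.blocks)           # {14: b'ABCDEFGH12345678XY'}
```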

svfsc is written in C++ with a Python interface. It is thread safe in both domains.

An SVF might be used like this:

  • The user requests some data (for example TIFF metadata) from a remote file using a Parser that knows the TIFF structure.

  • The Parser consults the SVF. If the SVF has the data then the Parser parses it and gives the results to the user.

  • If the SVF does not have the data then the Parser consults the SVF for what data is needed, then issues the appropriate GET request(s) to the remote server.

  • That data is used to update the SVF, then the Parser can use it and give the results to the user.

Here is a conceptual example of an SVF running on a local file system containing data from a single file.

            CLIENT SIDE           |             LOCAL FILE SYSTEM
                                  .
/------\      /--------\          |              /-------------\
| User | <--> | Parser | <-- read(fpos, len) --> | File System |
\------/      \--------/          |              \-------------/
                   |              .
                   |              |
               /-------\          .
               |  SVF  |          |
               \-------/          .

Here is a conceptual example of an SVFS running with a remote file system.

            CLIENT SIDE           |             SERVER SIDE
                                  .
/------\      /--------\          |             /--------\
| User | <--> | Parser | <-- GET(fpos, len) --> | Server |
\------/      \--------/          |             \--------/
                   |              .                  |
                   |              |                  |
               /-------\          .           /-------------\
               |  SVF  |          |           | File System |
               \-------/          .           \-------------/

Example Python Usage

Installation

Install from PyPI:

$ pip install svfsc

Using a Single SVF

This shows the basic functionality: write(), read() and need():

import svfsc

# Construct a Sparse Virtual File
svf = svfsc.cSVF('Some file ID')

# Write six bytes at file position 14
svf.write(14, b'ABCDEF')

# Read from it
svf.read(16, 2) # Returns b'CD'

# What do I have to do to read 24 bytes from file position 8?
# This returns a tuple of pairs ((file_position, read_length), ...)
svf.need(8, 24) # Returns ((8, 6), (20, 12))
# Go and get the data from those file positions and write it to
# the SVF then you can read directly from the SVF.
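
The arithmetic behind need() can be worked through by hand. With six bytes held at positions 14 to 19, a request for 24 bytes from position 8 covers positions 8 to 31, so the missing pieces are positions 8 to 13 (6 bytes) and positions 20 to 31 (12 bytes). The function below is a minimal pure-Python sketch of that gap calculation; `sketch_need` and its `blocks` argument are hypothetical names for illustration and are not part of the svfsc API.

```python
def sketch_need(blocks, fpos, length):
    """Return the (position, length) gaps in [fpos, fpos + length).

    blocks is an iterable of (offset, length) pairs describing the data
    already held; the pairs are assumed not to overlap each other.
    """
    gaps = []
    cursor, end = fpos, fpos + length
    for offset, size in sorted(blocks):
        block_end = offset + size
        if block_end <= cursor:
            continue  # Block lies wholly before what is still needed.
        if offset >= end:
            break     # Block lies wholly after the requested range.
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = block_end
    if cursor < end:
        gaps.append((cursor, end - cursor))
    return tuple(gaps)


# Six bytes are held at file position 14, as in the example above.
print(sketch_need([(14, 6)], 8, 24))  # ((8, 6), (20, 12))
print(sketch_need([(14, 6)], 16, 2))  # () as the data is already there.
```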

The basic operation is to check if the SVF has data, if not then get it and write that data to the SVF. Then read directly:

if not svf.has_data(file_position, length):
    for read_position, read_length in svf.need(file_position, length):
        # Somehow get the data as a bytes object at (read_position, read_length).
        # This could be a GET request to a remote file.
        # fetch() is a hypothetical placeholder for that step; it is not part of svfsc.
        data = fetch(read_position, read_length)
        svf.write(read_position, data)
# Now read directly
svf.read(file_position, length)
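
If the remote file is served over HTTP, each (read_position, read_length) pair could be turned into a Range request. The sketch below only builds the request and does not send it. The helper names `range_header` and `build_range_request` and the URL are assumptions for illustration, and it presumes that the server honours byte-range requests.

```python
import urllib.request


def range_header(read_position, read_length):
    """Return an HTTP Range header value for read_length bytes at read_position."""
    if read_length <= 0:
        raise ValueError('read_length must be positive')
    # HTTP byte ranges are inclusive at both ends.
    last_byte = read_position + read_length - 1
    return f'bytes={read_position}-{last_byte}'


def build_range_request(url, read_position, read_length):
    """Build (but do not send) a GET request for one missing piece of the file."""
    headers = {'Range': range_header(read_position, read_length)}
    return urllib.request.Request(url, headers=headers)


# Hypothetical URL, used only to show the shape of the request.
url = 'https://example.com/some_large_file.tiff'
for position, length in ((8, 6), (20, 12)):
    request = build_range_request(url, position, length)
    print(request.get_header('Range'))
# bytes=8-13
# bytes=20-31
```

The bytes returned for each request would then be written to the SVF at the matching read_position.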

A Sparse Virtual File System

The example above uses a single Sparse Virtual File, but you can also create a Sparse Virtual File System. This is a key/value store where the key is some string and the value is an SVF:

import svfsc

svfs = svfsc.cSVFS()

# Insert an empty SVF with a corresponding ID
ID = 'abc'
svfs.insert(ID)

# Write six bytes to that SVF at file position 14
svfs.write(ID, 14, b'ABCDEF')

# Read from the SVF
svfs.read(ID, 16, 2) # Returns b'CD'

# What do I have to do to read 24 bytes from file position 8
# from that SVF?
svfs.need(ID, 8, 24) # Returns ((8, 6), (20, 12))

Example C++ Usage

svfsc is written in C++ so can be used directly:

#include <iostream>

#include "svf.h"

// File modification time of 1672574430.0 (2023-01-01 12:00:30)
SVFS::SparseVirtualFile svf("Some file ID", 1672574430.0);

// Write six chars at file position 14
svf.write(14, "ABCDEF", 6);

// Read from it
char read_buffer[2];
svf.read(16, 2, read_buffer);
// read_buffer now contains "CD"

// What do I have to do to read 24 bytes from file position 8?
// This returns a std::vector<std::pair<size_t, size_t>>
// as ((file_position, read_length), ...)
auto need = svf.need(8, 24);

// The following prints ((8, 6), (20, 12),)
std::cout << "(";
for (auto &val: need) {
    std::cout << "(" << val.first << ", " << val.second << "),";
}
std::cout << ")" << std::endl;

Documentation

Build the documentation from the docs directory or find it on readthedocs: https://svfsc.readthedocs.io/

Acknowledgments

Many thanks to my employer Paige.ai for allowing me to release this as FOSS.

History

0.2.0 (2023-12-24)

  • Add cache punting.

  • Make C docstrings type parsable (good for Sphinx) and add a script that can create a mypy stub file.

  • Development Status :: 4 - Beta

0.1.2 (2023-10-03)

  • First release. Development Status :: 3 - Alpha


