Sparse Virtual File System Cache implemented in C++.

Sparse Virtual File

Introduction

Sometimes you don’t need the whole file. Sometimes you don’t want the whole file, especially if it is huge and on some remote server. But you might know which parts of the file you want, and svfsc can help you store them locally so that it looks as if you have access to the complete file while holding just the pieces of interest.

svfsc is targeted at reading very large binary files such as TIFF, RP66V1 or HDF5, where the structure is well known. For example, you might want to parse a TIFF file for its metadata, or for a particular image tile or strip, which is a tiny fraction of the file itself.

svfsc implements a Sparse Virtual File, a specialised in-memory cache where a particular file might not be available but parts of it can be obtained without reading the whole file. A Sparse Virtual File (SVF) is represented internally as a map of blocks of data with the key being their file offsets. Any write to an SVF will coalesce these blocks where possible. There is no cache punting strategy implemented so an SVF always accumulates data. A Sparse Virtual File System (SVFS) is an extension of this to provide a key/value store where the key is a file ID and the value a Sparse Virtual File.
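To make the "map of blocks keyed by file offset" idea concrete, here is a minimal pure-Python sketch. It is not the svfsc implementation: the class name ToySVF and all of its internals are assumptions made for illustration, and it only models the coalescing write and the gap calculation described above.

```python
from typing import Dict, List, Tuple


class ToySVF:
    """Toy model of a Sparse Virtual File: {file_offset: block_bytes}."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write(self, fpos: int, data: bytes) -> None:
        """Insert data, merging it with any overlapping or adjacent blocks."""
        start, end = fpos, fpos + len(data)
        merged = bytearray(data)
        for b_pos in sorted(self.blocks):
            b_end = b_pos + len(self.blocks[b_pos])
            if b_end < start or b_pos > end:
                continue  # Neither overlaps nor abuts the new data.
            block = self.blocks.pop(b_pos)
            if b_pos < start:
                # Keep the part of the old block that precedes the new data.
                merged = bytearray(block[:start - b_pos]) + merged
                start = b_pos
            if b_end > end:
                # Keep the part of the old block that follows the new data.
                merged += block[len(block) - (b_end - end):]
                end = b_end
        self.blocks[start] = bytes(merged)

    def need(self, fpos: int, length: int) -> List[Tuple[int, int]]:
        """Return the (file_position, length) gaps in [fpos, fpos + length)."""
        gaps: List[Tuple[int, int]] = []
        cursor, end = fpos, fpos + length
        for b_pos in sorted(self.blocks):
            b_end = b_pos + len(self.blocks[b_pos])
            if b_end <= cursor:
                continue  # Block lies wholly before the region of interest.
            if b_pos >= end:
                break  # Block lies wholly after the region of interest.
            if b_pos > cursor:
                gaps.append((cursor, b_pos - cursor))
            cursor = max(cursor, b_end)
        if cursor < end:
            gaps.append((cursor, end - cursor))
        return gaps

    def read(self, fpos: int, length: int) -> bytes:
        """Read from a single block, or raise if the range is not held."""
        for b_pos, block in self.blocks.items():
            if b_pos <= fpos and fpos + length <= b_pos + len(block):
                return block[fpos - b_pos:fpos - b_pos + length]
        raise ValueError('Requested range is not wholly within one block.')


svf = ToySVF()
svf.write(14, b'ABCDEF')
svf.write(20, b'GH')  # Abuts the first block, so the two coalesce.
svf.write(30, b'XY')  # Not adjacent to anything, so it stays separate.
blocks_after = dict(svf.blocks)  # {14: b'ABCDEFGH', 30: b'XY'}
gaps = svf.need(8, 24)  # [(8, 6), (22, 8)]
```

Because writes always coalesce, the blocks never overlap or touch, which is why a read can be satisfied from exactly one block in this sketch.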

svfsc is written in C++ with a Python interface. It is thread safe in both domains.

An SVF might be used like this:

  • The user requests some data (for example TIFF metadata) from a remote file using a Parser that knows the TIFF structure.

  • The Parser consults the SVF. If the SVF has the data then the Parser parses it and gives the results to the user.

  • If the SVF does not have the data then the Parser consults the SVF for what data is needed, then issues the appropriate GET request(s) to the remote server.

  • That data is used to update the SVF, then the Parser can use it and give the results to the user.
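The steps above can be sketched end to end without svfsc or a network. Everything here is an assumption made for illustration: REMOTE_FILE and remote_get() stand in for the server, ByteCache is a deliberately naive byte-per-offset stand-in for an SVF, and parser_read() plays the role of the Parser.

```python
from typing import Dict, List, Tuple

REMOTE_FILE = bytes(range(256)) * 4  # Stand-in for a large remote file.
get_requests: List[Tuple[int, int]] = []  # Log of simulated GET requests.


def remote_get(fpos: int, length: int) -> bytes:
    """Hypothetical stand-in for a ranged GET to the server."""
    get_requests.append((fpos, length))
    return REMOTE_FILE[fpos:fpos + length]


class ByteCache:
    """Toy byte-per-offset cache standing in for an SVF (not svfsc itself)."""

    def __init__(self) -> None:
        self._bytes: Dict[int, int] = {}

    def need(self, fpos: int, length: int) -> List[Tuple[int, int]]:
        """Return the contiguous (position, length) runs that are missing."""
        runs: List[Tuple[int, int]] = []
        for pos in range(fpos, fpos + length):
            if pos in self._bytes:
                continue
            if runs and runs[-1][0] + runs[-1][1] == pos:
                # Extend the current run of missing bytes.
                runs[-1] = (runs[-1][0], runs[-1][1] + 1)
            else:
                runs.append((pos, 1))
        return runs

    def write(self, fpos: int, data: bytes) -> None:
        for i, value in enumerate(data):
            self._bytes[fpos + i] = value

    def read(self, fpos: int, length: int) -> bytes:
        return bytes(self._bytes[pos] for pos in range(fpos, fpos + length))


def parser_read(cache: ByteCache, fpos: int, length: int) -> bytes:
    """Consult the cache, fetch only what is missing, then read locally."""
    for need_pos, need_len in cache.need(fpos, length):
        cache.write(need_pos, remote_get(need_pos, need_len))
    return cache.read(fpos, length)


cache = ByteCache()
first = parser_read(cache, 14, 6)   # Cache is empty: one GET for (14, 6).
second = parser_read(cache, 8, 24)  # Only the gaps (8, 6) and (20, 12) are fetched.
third = parser_read(cache, 10, 10)  # Fully cached: no GET at all.
```

After these three reads get_requests holds (14, 6), (8, 6) and (20, 12), so no byte has been fetched from the "server" more than once.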

Here is a conceptual example of an SVF running on a local file system containing data from a single file.

            CLIENT SIDE           |             LOCAL FILE SYSTEM
                                  .
/------\      /--------\          |              /-------------\
| User | <--> | Parser | <-- read(fpos, len) --> | File System |
\------/      \--------/          |              \-------------/
                   |              .
                   |              |
               /-------\          .
               |  SVF  |          |
               \-------/          .

Here is a conceptual example of an SVFS running with a remote file system.

            CLIENT SIDE           |             SERVER SIDE
                                  .
/------\      /--------\          |             /--------\
| User | <--> | Parser | <-- GET(fpos, len) --> | Server |
\------/      \--------/          |             \--------/
                   |              .                  |
                   |              |                  |
               /-------\          .           /-------------\
               |  SVF  |          |           | File System |
               \-------/          .           \-------------/

Example Python Usage

Installation

Install from PyPI:

$ pip install svfsc

Using a Single SVF

This shows the basic functionality: write(), read() and need():

import svfsc

# Construct a Sparse Virtual File
svf = svfsc.cSVF('Some file ID')

# Write six bytes at file position 14
svf.write(14, b'ABCDEF')

# Read from it
svf.read(16, 2) # Returns b'CD'

# What do I have to do to read 24 bytes from file position 8?
# This returns a tuple of pairs ((file_position, read_length), ...)
svf.need(8, 24) # Returns ((8, 6), (20, 12))
# Go and get the data from those file positions and write it to
# the SVF then you can read directly from the SVF.

The basic operation is to check if the SVF has the data. If not, get the missing data and write it to the SVF. Then read directly from the SVF:

if not svf.has_data(file_position, length):
    for read_position, read_length in svf.need(file_position, length):
        # Somehow get the data as a bytes object at (read_position, read_length).
        # This could be a GET request to a remote file.
        # get_data() is a hypothetical function that you supply.
        data = get_data(read_position, read_length)
        svf.write(read_position, data)
# Now read directly
svf.read(file_position, length)
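If the remote file is served over HTTP and the server honours range requests (an assumption about your server), each (file_position, read_length) pair from need() maps onto a Range header. The helper below is a hypothetical sketch and is not part of svfsc. Note that HTTP byte ranges are inclusive at both ends.

```python
from typing import Dict, Iterable, List, Tuple


def range_headers(need: Iterable[Tuple[int, int]]) -> List[Dict[str, str]]:
    """Convert (file_position, read_length) pairs to HTTP Range headers.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    Zero length pairs are skipped as there is nothing to fetch.
    """
    return [
        {'Range': f'bytes={fpos}-{fpos + length - 1}'}
        for fpos, length in need
        if length > 0
    ]


# Reading 24 bytes from position 8 when only bytes 14 to 19 are cached
# leaves the gaps (8, 6) and (20, 12).
headers = range_headers(((8, 6), (20, 12)))
# [{'Range': 'bytes=8-13'}, {'Range': 'bytes=20-31'}]
```

Each dictionary can be passed as the headers of a GET request by whichever HTTP client you use, and the response body written to the SVF at the corresponding file position.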

A Sparse Virtual File System

The example above uses a single Sparse Virtual File, but you can also create a Sparse Virtual File System. This is a key/value store where the key is some string and the value is an SVF:

import svfsc

svfs = svfsc.cSVFS()

# Insert an empty SVF with a corresponding ID
ID = 'abc'
svfs.insert(ID)

# Write six bytes to that SVF at file position 14
svfs.write(ID, 14, b'ABCDEF')

# Read from the SVF
svfs.read(ID, 16, 2) # Returns b'CD'

# What do I have to do to read 24 bytes from file position 8
# from that SVF?
svfs.need(ID, 8, 24) # Returns ((8, 6), (20, 12))

Example C++ Usage

svfsc is written in C++ so it can be used directly from C++ code:

#include <iostream>
#include "svf.h"

// File modification time of 1672574430.0 (2023-01-01 12:00:30)
SVFS::SparseVirtualFile svf("Some file ID", 1672574430.0);

// Write six chars at file position 14
svf.write(14, "ABCDEF", 6);

// Read from it
char read_buffer[2];
svf.read(16, 2, read_buffer);
// read_buffer now contains "CD"

// What do I have to do to read 24 bytes from file position 8?
// This returns a std::vector<std::pair<size_t, size_t>>
// as ((file_position, read_length), ...)
auto need = svf.need(8, 24);

// The following prints ((8, 6), (20, 12),)
std::cout << "(";
for (auto &val: need) {
    std::cout << "(" << val.first << ", " << val.second << "),";
}
std::cout << ")" << std::endl;

Documentation

Build the documentation from the docs directory or find it on readthedocs: https://svfsc.readthedocs.io/

Acknowledgments

Many thanks to my employer Paige.ai for allowing me to release this as FOSS.

History

0.2.2 (2023-12-28)

  • Minor fixes.

  • Development Status :: 4 - Beta

0.2.1 (2023-12-27)

  • Include stub file.

  • Development Status :: 4 - Beta

0.2.0 (2023-12-24)

  • Add cache punting.

  • Make C docstrings type parsable (good for Sphinx) and add a script that can create a mypy stub file.

  • Development Status :: 4 - Beta

0.1.2 (2023-10-03)

  • First release. Development Status :: 3 - Alpha
