Read-only FUSE filesystem views over VCF Zarr data

biofuse

Read-only views of VCF Zarr (VCZ) data in standard bioinformatics file formats via a FUSE filesystem. Currently supported views:

  • PLINK 1.9 binary (.bed / .bim / .fam) — via mount-plink.
  • Oxford BGEN (.bgen / .sample / .bgen.bgi) — via mount-bgen.

The streaming file (.bed / .bgen) is generated on demand using the matching vcztools encoder; the static sidecars are computed once at mount time.

Stability and correctness

A core design principle of biofuse is that the mount must never become unresponsive. All the work of decoding VCF Zarr and encoding it into PLINK or BGEN bytes is delegated to vcztools; biofuse itself does one thing — present that data as a correct, dependable read-only filesystem. Keeping the two responsibilities separate keeps the surface biofuse has to get exactly right small.

  • The filesystem stays responsive under load. Encoding runs off the filesystem's request-handling path, and every read and open is bounded by a timeout: a slow or stuck encode returns a normal I/O error (EIO / EAGAIN) rather than blocking. One wedged file handle cannot freeze the others, and unmount never hangs.
  • Failures are contained. An error inside the encoder surfaces to the caller as an I/O error, not a crash — the mount keeps serving every other file.
  • The view is read-only and immutable. Writes, truncation and appends are rejected with EROFS; the sidecars are computed once when the mount starts and served unchanged for its lifetime.
  • POSIX behaviour is tested. A dedicated filesystem test harness (fs_tests/) exercises syscall semantics (read / pread / lseek, stat, mmap, directory listing, write rejection), cross-checks the served bytes against a reference, and runs read-stress and liveness probes that confirm the mount stays responsive while the streaming file is saturated.
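
The read-only guarantee above can be spot-checked from a consumer script. The following is a minimal sketch, not part of biofuse: is_write_rejected is a hypothetical helper that only reports whether a file can be opened for appending, and it does not inspect the specific error code.

```shell
# Hypothetical helper (not part of biofuse): succeed when the file at $1
# cannot be opened for appending, as expected for files on a read-only mount.
# Caution: on a writable filesystem this creates $1 if it does not exist.
is_write_rejected() {
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : >> "$1" ) 2>/dev/null; then
        return 1    # open-for-append worked, so the file is writable
    fi
    return 0        # open failed (the text above says EROFS on a mount)
}

# Usage against a live mount (path taken from the mount-plink example):
#   is_write_rejected /tmp/plink-mnt/sample.bed && echo "read-only, as expected"
```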

Performance and access patterns

biofuse is optimised for linear, sequential reads — the access pattern used by the majority of downstream tools, which stream variants start-to-end. The streaming .bed / .bgen file is encoded on demand as the consumer reads forward, and bytes already produced are buffered, so reading straight through the file does no redundant work. The mounts are verified against plink1.9 and plink2 (--bfile, --freq, --missing, --hardy, …) for PLINK, and bgenix, qctool, REGENIE, SAIGE, BOLT-LMM and plink2 --bgen for BGEN.

Random and backward access still work, but are slower: seeking backwards or skipping far ahead can make biofuse re-encode from an earlier point in the file. The kernel page cache holds bytes that have already been served, so re-reading a region — and multi-pass tools that scan the file more than once (e.g. flashpca) — stays cheap once the data is warm.
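
To make the random-access case concrete, here is a sketch of how a consumer might jump straight to one variant in the .bed file. It assumes the standard PLINK 1 variant-major layout (3 magic bytes, then one block of ceil(n_samples / 4) bytes per variant); the helper names are hypothetical and not part of biofuse.

```shell
# Sketch assuming the standard PLINK 1 variant-major .bed layout:
# 3 magic bytes (0x6c 0x1b 0x01), then one block of ceil(n_samples / 4)
# bytes per variant. Helper names are hypothetical, not part of biofuse.

bed_block_size() {      # $1 = number of samples
    echo $(( ($1 + 3) / 4 ))
}

bed_variant_offset() {  # $1 = number of samples, $2 = 0-based variant index
    echo $(( 3 + $2 * (($1 + 3) / 4) ))
}

bed_read_variant() {    # $1 = .bed path, $2 = samples, $3 = variant index
    # bs=1 keeps skip/count in bytes; fine for one small block.
    dd if="$1" bs=1 skip="$(bed_variant_offset "$2" "$3")" \
        count="$(bed_block_size "$2")" 2>/dev/null
}

# Usage against a live mount; the sample count is the .fam line count:
#   n=$(wc -l < /tmp/plink-mnt/sample.fam)
#   bed_read_variant /tmp/plink-mnt/sample.bed "$n" 1000 | od -An -tx1
```

A seek like this is exactly the "skipping far ahead" pattern described above, so the first such read may be slow while later reads of the same region are served from the page cache.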

For BGEN, the .bgen payload uses zlib level 0 (stored, fixed-size variant blocks) together with the .bgen.bgi index, so a tool can fetch an individual variant by byte range without decompressing or re-encoding the rest of the file. Variant-targeted access (e.g. bgenix -v) is therefore efficient, as are whole-file scans.
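
The byte-range lookup can be sketched by hand. This example assumes the index follows the bgenix layout (a SQLite table named Variant with rsid, file_start_position and size_in_bytes columns) and that the sqlite3 command-line tool is installed. The helper names are hypothetical and not part of biofuse.

```shell
# Sketch assuming the bgenix index layout: a SQLite table named Variant with
# rsid, file_start_position and size_in_bytes columns. Helper names are
# hypothetical, and the sqlite3 command-line tool must be installed.

bgi_lookup() {      # $1 = .bgen.bgi path, $2 = rsid; prints "START SIZE"
    # The rsid is pasted into the SQL text, so pass trusted values only.
    sqlite3 -readonly -separator ' ' "$1" \
        "SELECT file_start_position, size_in_bytes FROM Variant WHERE rsid = '$2' LIMIT 1;"
}

read_range() {      # $1 = file, $2 = 0-based byte offset, $3 = byte count
    # tail -c +K starts output at byte K (1-based), hence the + 1.
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

fetch_variant_block() {   # $1 = .bgen path, $2 = .bgen.bgi path, $3 = rsid
    range=$(bgi_lookup "$2" "$3")
    [ -n "$range" ] || return 1    # rsid not present in the index
    read_range "$1" "${range% *}" "${range#* }"
}

# Usage against a live mount (rs123 is a placeholder identifier):
#   fetch_variant_block /tmp/bgen-mnt/sample.bgen \
#       /tmp/bgen-mnt/sample.bgen.bgi rs123 | wc -c
```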

The sidecar files (.bim / .fam / .sample / .bgen.bgi) are computed once when the mount starts, so reads of them are always fast regardless of access order. These can be suppressed individually where not needed (e.g., the .bgen.bgi can be large and is not needed for many workloads).

Because the streaming file is produced on demand, a read that stalls beyond an internal timeout surfaces as EIO rather than blocking indefinitely; in practice this only appears under pathological random-access load.
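
A consumer that wants to tolerate an occasional failed read could wrap a sequential copy in a bounded retry loop. This is only a sketch of one possible policy: copy_with_retry is a hypothetical helper, and both the attempt limit and the one-second pause are arbitrary choices.

```shell
# Hypothetical consumer-side wrapper: copy a mounted file to local disk,
# retrying a bounded number of times if the read fails (for example with EIO).
copy_with_retry() {   # $1 = source, $2 = destination, $3 = max attempts
    attempt=1
    while [ "$attempt" -le "$3" ]; do
        if cat "$1" > "$2" 2>/dev/null; then
            return 0
        fi
        echo "read failed (attempt $attempt of $3)" >&2
        attempt=$(( attempt + 1 ))
        # Pause only when another attempt will follow.
        [ "$attempt" -le "$3" ] && sleep 1
    done
    rm -f "$2"    # do not leave a partial copy behind
    return 1
}

# Usage against a live mount:
#   copy_with_retry /tmp/plink-mnt/sample.bed ./sample.bed 3
```

Each attempt restarts from the beginning of the file, which keeps the access pattern linear.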

Install

biofuse needs the libfuse 3 system headers at install time, because pyfuse3 builds from source:

sudo apt-get install -y fuse3 libfuse3-dev pkg-config

Then:

python -m pip install biofuse      # or: uv pip install biofuse
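
Before installing, it can help to confirm that the system prerequisites are present. The sketch below uses a hypothetical missing_tools helper that prints every named command not found on PATH.

```shell
# Hypothetical pre-flight check: print each named command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# fusermount3 comes from the fuse3 package; pkg-config is used while
# pyfuse3 builds from source.
#   missing_tools fusermount3 pkg-config python
# To check that the libfuse 3 headers are visible to the build:
#   pkg-config --exists fuse3 && echo "libfuse3 headers found"
```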

Remote and zipped stores

The vcz_url argument and the inherited --backend-storage / --storage-option options accept cloud, fsspec, and HTTP stores, plus .vcz.zip files. biofuse depends on bare vcztools; to mount cloud-backed stores install the matching vcztools extra, e.g. pip install 'vcztools[obstore]' or pip install 'vcztools[icechunk]'. See the vcztools documentation for the available storage backends.

Usage

mount-plink

biofuse mount-plink path/to/sample.vcz /mount/dir

Mounts a read-only directory at /mount/dir containing sample.bed, sample.bim, sample.fam. The mount runs in the foreground; press Ctrl-C to unmount.

Options:

  • --basename NAME — basename for the plink fileset (defaults to the VCZ stem).
  • --access-log PATH — record every read as a JSONL row to PATH (useful for characterising consumer access patterns).
  • The bcftools-view-style filter / backend / log options (-r/-R/-s/-S/-t/-T/-i/-e/-v/-V/-m/-M, --backend-storage, --storage-option, --log-level, --log-file) are inherited from vcztools view-plink. Run biofuse mount-plink --help or see vcztools view-plink --help for the full reference.
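
As a quick way to inspect an access log, the number of recorded reads is simply the number of rows. The sketch below assumes nothing about the fields inside each row, since the row schema is not described here; count_reads is a hypothetical helper.

```shell
# Hypothetical helper: count recorded reads, i.e. non-empty JSONL rows.
# No field names are assumed, because the row schema is not documented here.
count_reads() {   # $1 = access-log path
    grep -c . "$1"
}

# Usage, combining the two options listed above (placing the options before
# the positional arguments is an assumption):
#   biofuse mount-plink --basename cohort --access-log ./reads.jsonl \
#       ./sample.vcz /tmp/plink-mnt &
#   ... run the consumer tool, unmount, then:
#   count_reads ./reads.jsonl
```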

Example:

mkdir /tmp/plink-mnt
biofuse mount-plink ./sample.vcz /tmp/plink-mnt &
# The mount runs in the foreground, so it is backgrounded with `&`. It is
# not ready the instant the process starts — it first opens the VCZ and
# builds the sidecars — so wait for the mounted file to appear before
# running the consumer tool.
until [ -e /tmp/plink-mnt/sample.bed ]; do sleep 0.1; done
plink1.9 --bfile /tmp/plink-mnt/sample --freq --out ./out
fusermount3 -u /tmp/plink-mnt
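
The wait loop in the example above spins forever if the mount fails to come up. A more defensive variant, sketched here with a hypothetical wait_for_mount helper, gives up after a timeout or as soon as the backgrounded process has exited.

```shell
# Hypothetical helper: wait until a file exists, giving up after a timeout
# or as soon as the given process has exited (for example, the mount failed).
wait_for_mount() {   # $1 = file to wait for, $2 = PID, $3 = timeout (s)
    tries=$(( $3 * 10 ))
    while [ "$tries" -gt 0 ]; do
        [ -e "$1" ] && return 0
        kill -0 "$2" 2>/dev/null || return 1   # mount process is gone
        sleep 0.1
        tries=$(( tries - 1 ))
    done
    [ -e "$1" ]   # final check; its exit status is the function's result
}

# Usage, with $! holding the PID of the backgrounded mount:
#   biofuse mount-plink ./sample.vcz /tmp/plink-mnt &
#   wait_for_mount /tmp/plink-mnt/sample.bed "$!" 60 || {
#       echo "mount did not come up" >&2
#       exit 1
#   }
```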

mount-bgen

biofuse mount-bgen path/to/sample.vcz /mount/dir

Mounts a read-only directory at /mount/dir containing sample.bgen, sample.sample, sample.bgen.bgi. The .bgen payload uses zlib level 0 (stored, fixed-size variant blocks) so byte-range random access is O(1); downstream tools (bgenix, qctool, REGENIE, SAIGE, BOLT-LMM, plink2 --bgen) consume the mount unchanged. The .bgen.bgi SQLite sidecar and .sample are generated once at mount time.

Options mirror mount-plink: --basename, --access-log, and the shared bcftools-style filter / backend / log set inherited from vcztools view-bgen. Run biofuse mount-bgen --help or see vcztools view-bgen --help for the full reference.

Example:

mkdir /tmp/bgen-mnt
biofuse mount-bgen ./sample.vcz /tmp/bgen-mnt &
# Wait for the mount to come up before reading from it (see mount-plink above).
until [ -e /tmp/bgen-mnt/sample.bgen ]; do sleep 0.1; done
bgenix -g /tmp/bgen-mnt/sample.bgen -list
fusermount3 -u /tmp/bgen-mnt

Limitations: ploidy

  • Mixed ploidy is not supported by mount-bgen. The fixed-size BGEN encoder used for random-access serving requires uniform ploidy across every sample and variant in the view. Mounts whose region includes mixed-ploidy chromosomes (typically X, Y, MT) open successfully and serve .sample and .bgen.bgi, but the first .bgen read will fail with EIO. Workaround: restrict the view to autosomes at mount time (e.g. via the inherited -r / -R / -t / -T region filters), or use the one-shot vcztools view-bgen CLI for full-file conversions that include X / Y / MT — view-bgen uses the streaming variable-size encoder which handles mixed ploidy correctly.
  • Pure haploid VCZ is supported by mount-bgen (the encoder emits a uniform-haploid BGEN payload).
  • mount-plink is diploid-only. Pure haploid VCZ inputs (e.g. mitochondrial-only stores) are rejected by the underlying encoder with EIO on the first .bed read. Mixed-ploidy VCZ inputs serve successfully, but haploid samples are encoded as homozygous for the called allele — this matches the PLINK 1 BED format, which has no haploid representation.
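
The autosome-only workaround can be scripted. This sketch builds a comma-separated region list for chromosomes 1 to 22, which assumes a human dataset and that -r accepts a comma-separated list in the bcftools view style. The contig naming is also an assumption, and autosome_regions is a hypothetical helper.

```shell
# Hypothetical helper: print a comma-separated region list for autosomes
# 1..22. The contig naming (bare "1" or prefixed "chr1") is an assumption;
# match whatever names the VCZ store actually uses.
autosome_regions() {   # $1 = optional contig prefix, e.g. "chr"
    list=""
    i=1
    while [ "$i" -le 22 ]; do
        # ${list:+,} adds a comma before every entry except the first.
        list="${list}${list:+,}${1:-}${i}"
        i=$(( i + 1 ))
    done
    echo "$list"
}

# Usage:
#   biofuse mount-bgen -r "$(autosome_regions chr)" ./sample.vcz /tmp/bgen-mnt &
```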

Development

uv sync --group dev
uv run pytest                          # full suite
uv run pytest tests/test_encoder_ops.py  # one module
uv run prek install                    # install git pre-commit hook (one-off)
uv run --only-group=lint prek -c prek.toml run --all-files
