
Caching and synchronization for AIND metadata.


biodata-cache


biodata-cache is a set of one-line functions that handle the entire process of caching and retrieving data (and metadata) from AIND data assets.

In the background, the cache repackages data/metadata into dataframes and stores them on S3 in versioned folders (data-asset-cache/bdc-v{version}/), or in memory for testing. Each release writes to its own versioned folder, so older versions of the website remain accessible while new versions are deployed. A top-level data-asset-cache/cache_versions.json index lists all available version folders.
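The folder layout described above can be sketched as a pair of small helpers. This is only an illustration of the naming scheme: `version_folder` and `versions_index` are hypothetical names and are not part of the biodata-cache API.

```python
# Hypothetical helpers illustrating the versioned layout; not part of biodata-cache.
CACHE_ROOT = "data-asset-cache"  # top-level folder named in the text above


def version_folder(version: str) -> str:
    """Return the folder that a given release writes to, e.g. for '0.27.3'."""
    return f"{CACHE_ROOT}/bdc-v{version}/"


def versions_index() -> str:
    """Return the key of the top-level index listing all version folders."""
    return f"{CACHE_ROOT}/cache_versions.json"


print(version_folder("0.27.3"))
print(versions_index())
```

Because the version is part of the path, two releases never write to the same folder.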

Important: this package has not reached 1.0. It is changing quickly, and breaking changes still occur, although rarely. To reduce the impact on your code, the cache tables are versioned. This means you need to keep biodata-cache up to date to get the latest version of the tables, but it also means your code won't immediately break when I change the way the tables work.

Installation

Note: set the backend to S3. Otherwise biodata-cache will automatically re-cache the tables locally in memory, which can take a long time.

pip install biodata-cache
export BIODATA_CACHE_BACKEND='S3'

Usage

Set backend

export BIODATA_CACHE_BACKEND='S3'

Options are 'S3', 'MEMORY'.
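If you prefer to set the backend from Python rather than the shell, a minimal sketch might look like the following. `set_backend` is a hypothetical helper, not part of biodata-cache, and the sketch assumes the package reads `BIODATA_CACHE_BACKEND` from the process environment, so it should run before `biodata_cache` is imported.

```python
import os

# The two backend options listed above.
VALID_BACKENDS = {"S3", "MEMORY"}


def set_backend(name: str) -> str:
    """Validate the backend name and export it for the current process.

    Hypothetical helper: it only sets the environment variable.
    """
    if name not in VALID_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {sorted(VALID_BACKENDS)}"
        )
    os.environ["BIODATA_CACHE_BACKEND"] = name
    return name


# Use the in-memory backend, e.g. for a test run.
set_backend("MEMORY")
```

Validating the value up front catches a typo right away, before it can trigger the slow local re-cache described under Installation.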

Fetch data

from biodata_cache import unique_project_names

project_names = unique_project_names()

Cache tables

Use get_cache_registry() to see all available cache tables and their metadata (descriptions, S3 paths, columns, etc.) for the installed version:

from biodata_cache import get_cache_registry

registry = get_cache_registry()

Use get_cache_versions() to list all available version folders across all deployed releases:

from biodata_cache import get_cache_versions

versions = get_cache_versions()

The per-version cache_registry.json lives at s3://allen-data-views/data-asset-cache/bdc-v{version}/cache_registry.json. The top-level index s3://allen-data-views/data-asset-cache/cache_versions.json lists all available version folders as a JSON array.
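As an illustration of working with the index, the sketch below picks the newest folder from a JSON array. It assumes each entry looks like `bdc-v{version}` with a dotted numeric version. The exact entry format is not stated here, so treat it as an assumption. `latest_version_folder` is a hypothetical helper and the sample array is made up.

```python
import json


def latest_version_folder(index_json: str) -> str:
    """Return the entry with the highest version from a JSON array of folders.

    Assumes entries look like 'bdc-v0.27.3' (optionally with slashes around).
    """
    folders = json.loads(index_json)

    def version_key(folder: str) -> tuple[int, ...]:
        # Compare numerically so that 0.27.3 sorts above 0.9.0.
        version = folder.strip("/").removeprefix("bdc-v")
        return tuple(int(part) for part in version.split("."))

    return max(folders, key=version_key)


# Made-up sample of what the index might contain.
sample = '["bdc-v0.9.0", "bdc-v0.27.3", "bdc-v0.10.1"]'
print(latest_version_folder(sample))  # bdc-v0.27.3
```

Sorting on integer tuples avoids the string-comparison trap where "0.9.0" would rank above "0.27.3".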

Hive-partitioned tables use key=value directory segments, enabling DuckDB queries like:

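import duckdb

# Reading s3:// paths relies on DuckDB's httpfs extension
# (and on AWS credentials if the bucket is not public).
duckdb.query("""
    SELECT * FROM read_parquet(
        's3://allen-data-views/data-asset-cache/bdc-v0.27.3/qc/data.pqt',
        hive_partitioning=true,  -- expose key=value path segments as columns
        union_by_name=true       -- align columns by name across files
    )
""")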

The raw_to_derived function is not a table stored in S3. Instead, call it with an asset_name (or a list of asset names) and a modality; it returns the latest derived asset matching the requested pattern.

Custom cache table

The custom function lets you store and retrieve your own user-defined DataFrames in the cache by name. Storing a table requires credentials with write access to the active backend.

from biodata_cache import custom
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})
custom("my_data", df)

retrieved_df = custom("my_data")

Update all cache tables

We run a nightly capsule on Code Ocean with this code to update all cache tables (not the custom ones).

from biodata_cache.sync import update_all_tables
update_all_tables()
