Caching and synchronization for AIND metadata.

biodata-cache

biodata-cache is a set of one-line functions that handle the entire process of caching and retrieving data (and metadata) from AIND data assets.

In the background, the cache repackages data/metadata into dataframes and stores them on S3 in versioned folders (data-asset-cache/bdc-v{version}/), or in memory for testing. Each release writes to its own versioned folder, so older versions of the website remain accessible while new versions are deployed. A top-level data-asset-cache/cache_versions.json index lists all available version folders.
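As a rough illustration of the versioned layout, the sketch below builds the folder prefix and a table path for a given package version. The helpers `version_prefix` and `table_location` are hypothetical and not part of biodata-cache; the bucket name is the one used in the table locations listed in this README.

```python
# Hypothetical helpers for illustration only; they are not part of biodata-cache.
BUCKET_ROOT = "s3://allen-data-views/data-asset-cache"


def version_prefix(version: str) -> str:
    """Return the versioned folder for a given biodata-cache release."""
    return f"{BUCKET_ROOT}/bdc-v{version}/"


def table_location(version: str, table: str) -> str:
    """Return the parquet path of a non-partitioned table in that folder."""
    return f"{version_prefix(version)}{table}.pqt"


print(table_location("0.27.3", "asset_basics"))
```

Because each release has its own prefix, code pinned to one package version keeps reading the same folder even after newer folders are written.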

Important: this package is not yet at 1.0. It is changing fast and breaking changes still occur, although rarely. To reduce the impact on your code, the cache tables are versioned. This means you need to keep biodata-cache up to date if you want the latest version of the tables, but it also means your code won't break immediately when I change the way the tables work.

Installation

Note that you must set the backend to S3; otherwise biodata-cache will automatically re-cache the tables locally in memory, which can take a long time.

pip install biodata-cache
export BIODATA_CACHE_BACKEND='S3'

Usage

Set backend

export BIODATA_CACHE_BACKEND='S3'

Options are 'S3', 'MEMORY'.
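If you prefer to choose the backend from Python rather than the shell, a minimal sketch is to set the same environment variable before importing biodata_cache. This assumes the package reads `BIODATA_CACHE_BACKEND` from the process environment, as the shell example suggests; `set_backend` is a hypothetical helper, not a biodata-cache function.

```python
import os

# The two options documented in this README.
VALID_BACKENDS = {"S3", "MEMORY"}


def set_backend(name: str) -> None:
    """Set BIODATA_CACHE_BACKEND after checking it is a documented option."""
    if name not in VALID_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {sorted(VALID_BACKENDS)}"
        )
    # Assumption: biodata-cache reads this variable from the environment,
    # so set it before importing the package.
    os.environ["BIODATA_CACHE_BACKEND"] = name


set_backend("S3")
```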

Fetch data

from biodata_cache import unique_project_names

project_names = unique_project_names()

Cache tables

get_cache_registry returns the following information about all available cache tables. Paths are versioned: {version} is the installed biodata-cache package version (e.g. 0.27.3).

| Table | Description | Location | Type | Partitioned | Columns |
|---|---|---|---|---|---|
| unique_project_names | Unique project names across all assets | s3://allen-data-views/data-asset-cache/bdc-v{version}/unique_project_names.pqt | metadata | False | project_name |
| unique_subject_ids | Unique subject_ids across all assets | s3://allen-data-views/data-asset-cache/bdc-v{version}/unique_subject_ids.pqt | metadata | False | subject_id |
| unique_genotypes | Unique genotypes across all assets where subject.subject_details.genotype is present | s3://allen-data-views/data-asset-cache/bdc-v{version}/unique_genotypes.pqt | metadata | False | genotype |
| asset_basics | Commonly used asset metadata, one row per data asset | s3://allen-data-views/data-asset-cache/bdc-v{version}/asset_basics.pqt | metadata | False | _id, _last_modified, modalities, project_name, data_level, subject_id, acquisition_start_time, acquisition_end_time, code_ocean, process_date, genotype, age, acquisition_type, location, name, experimenters, experimenters_normalized, instrument_id, instrument_id_normalized, investigators, investigators_normalized |
| source_data | Mapping from derived asset names to their source raw asset names | s3://allen-data-views/data-asset-cache/bdc-v{version}/source_data.pqt | metadata | False | name, source_data, pipeline_name, processing_time |
| quality_control | Quality control table with one row per QC metric, partitioned by subject_id | s3://allen-data-views/data-asset-cache/bdc-v{version}/qc/ | asset | True (by subject_id) | name, stage, modality, value, status, asset_name |
| platform_qc | Tag-level QC statuses aggregated per platform, one row per asset/tag combination | s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_qc/ | platform | True (by platform) | asset_name, tag, status, timestamp, instrument_id_normalized, experimenters_normalized |
| platform_smartspim | SmartSPIM assets with processing status and neuroglancer links, one row per (asset, channel) | s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_smartspim.pqt | metadata | False | name, raw_name, processed, institution, processing_end_time, stitched_link, raw_link, channel, segmentation_link, quantification_link, alignment_link |
| platform_exaspim | ExaSPIM assets with processing status and neuroglancer links, one row per asset | s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_exaspim.pqt | metadata | False | name, raw_name, processed, raw_link, fused_link |
| platform_fib | Fiber photometry assets in long form, one row per asset/fiber/channel combination | s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_fib.pqt | metadata | False | asset_name, fiber, patch_cord, channel, intended_measurement, targeted_structure |
| metadata_upgrade | Metadata upgrade status for each asset across versions | s3://allen-data-views/data-asset-cache/bdc-v{version}/metadata_upgrade.pqt | metadata | False | _id, name, project_name, data_level, v2_id, upgrader_version, last_modified, status, upgrade_datetime |
| foraging_sessions | Foraging behavior sessions with key performance metrics, one row per session | s3://allen-data-views/data-asset-cache/bdc-v{version}/foraging_sessions.pqt | metadata | False | subject_id, session_date, session, nwb_suffix, rig, trainer, trainer_normalized, task, curriculum_name, curriculum_version, current_stage_actual, foraging_eff, foraging_eff_random_seed, finished_trials, finished_rate, total_trials, bias_naive |
| behavior_curriculum | Behavior assets with curriculum name and stage, one row per behavior asset | s3://allen-data-views/data-asset-cache/bdc-v{version}/behavior_curriculum.pqt | asset | False | asset_name, curriculum_name, stage_name, stage_node_id |
| time_to_qc | Time from processing completion to QC completion for derived assets | s3://allen-data-views/data-asset-cache/bdc-v{version}/time_to_qc.pqt | metadata | False | name, process_end_time, qc_time |

Hive-partitioned tables use key=value directory segments, enabling DuckDB queries like:

import duckdb
duckdb.query("""
    SELECT * FROM read_parquet(
        's3://allen-data-views/data-asset-cache/bdc-v0.27.3/qc/**',
        hive_partitioning=true,
        union_by_name=true
    )
""")
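Because the partition key appears in the path as subject_id=..., a filter on that column lets DuckDB skip the folders of other subjects. The sketch below only builds such a query string; `qc_query` is a hypothetical helper and the subject id is a made-up example value.

```python
def qc_query(version: str, subject_id: str) -> str:
    """Build a DuckDB query that reads only one subject's QC rows."""
    # Escape single quotes so the value is safe inside a SQL string literal.
    safe_id = subject_id.replace("'", "''")
    return (
        "SELECT * FROM read_parquet("
        f"'s3://allen-data-views/data-asset-cache/bdc-v{version}/qc/**', "
        "hive_partitioning=true, union_by_name=true) "
        f"WHERE subject_id = '{safe_id}'"
    )


# "123456" is a placeholder subject id, not a real asset.
print(qc_query("0.27.3", "123456"))
```

The resulting string can be passed to duckdb.query in the same way as the query above.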

The cache_registry.json registry lives at s3://allen-data-views/data-asset-cache/bdc-v{version}/cache_registry.json. The top-level s3://allen-data-views/data-asset-cache/cache_versions.json lists all available version folders as a JSON array.
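The sketch below shows one way to pick the newest folder from the contents of cache_versions.json. It assumes each array entry is a folder name of the form bdc-v{major}.{minor}.{patch}; the exact entry format is not documented here, and `latest_version_folder` is a hypothetical helper.

```python
import json
import re

# Assumption: each entry is a folder name such as "bdc-v0.27.3".
_VERSION_RE = re.compile(r"bdc-v(\d+)\.(\d+)\.(\d+)")


def latest_version_folder(versions_json: str) -> str:
    """Return the entry with the highest (major, minor, patch) version."""
    folders = json.loads(versions_json)
    parsed = []
    for folder in folders:
        match = _VERSION_RE.search(folder)
        if match:  # skip entries that do not follow the assumed pattern
            key = tuple(int(part) for part in match.groups())
            parsed.append((key, folder))
    if not parsed:
        raise ValueError("no version folders found")
    # Compare numerically so that 0.27.3 sorts after 0.9.0.
    return max(parsed)[1]


# Made-up file contents for illustration.
example = '["bdc-v0.9.0", "bdc-v0.27.3", "bdc-v0.10.1"]'
print(latest_version_folder(example))  # bdc-v0.27.3
```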

The raw_to_derived function is not backed by a table stored in S3. Instead, you call it with an asset_name (or a list of asset names) and a modality, and it returns the latest derived asset matching the requested pattern.
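To make "latest derived asset" concrete, here is a minimal sketch of the idea using rows shaped like the source_data table (name, source_data, pipeline_name, processing_time). It is not the implementation of raw_to_derived: it ignores the modality argument, assumes source_data holds a single raw asset name per row, and uses made-up asset names.

```python
from datetime import datetime

# Made-up rows that reuse the column names of the source_data table.
RAW = "ecephys_000000_2024-01-01_10-00-00"
rows = [
    {"name": f"{RAW}_sorted_2024-01-02_09-00-00", "source_data": RAW,
     "pipeline_name": "example-pipeline", "processing_time": "2024-01-02T09:00:00"},
    {"name": f"{RAW}_sorted_2024-03-05_12-30-00", "source_data": RAW,
     "pipeline_name": "example-pipeline", "processing_time": "2024-03-05T12:30:00"},
]


def latest_derived(rows, raw_name):
    """Return the most recently processed asset derived from raw_name, or None."""
    candidates = [row for row in rows if row["source_data"] == raw_name]
    if not candidates:
        return None
    # Compare parsed timestamps rather than raw strings.
    newest = max(
        candidates,
        key=lambda row: datetime.fromisoformat(row["processing_time"]),
    )
    return newest["name"]


print(latest_derived(rows, RAW))
```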

Custom cache table

The custom function allows you to store and retrieve your own user-defined DataFrames in the cache by name. This requires write authentication to the active backend.

from biodata_cache import custom
import pandas as pd

# Store a DataFrame in the cache under the name "my_data"
df = pd.DataFrame({"col": [1, 2, 3]})
custom("my_data", df)

# Retrieve the stored DataFrame by name
retrieved_df = custom("my_data")

Update all cache tables

We run a nightly capsule on Code Ocean with this code to update all cache tables (not the custom ones).

from biodata_cache.sync import update_all_tables
update_all_tables()
