Caching and synchronization for AIND metadata.

biodata-cache

biodata-cache is a set of one-line functions that handle the entire process of caching and retrieving data (and metadata) from AIND data assets.

In the background, the cache repackages data/metadata into dataframes and stores them on S3 in versioned folders (data-asset-cache/bdc-v{version}/), or in memory for testing. Each release writes to its own versioned folder, so older versions of the website remain accessible while new versions are deployed. A top-level data-asset-cache/cache_versions.json index lists all available version folders.

Important: this package has not reached 1.0. It is changing fast, and breaking changes still occur, although rarely. To reduce the impact on your code, the cache tables are versioned. This means you need to keep biodata-cache up to date if you want the latest version of the tables, but it also means your code won't break immediately when I change the way the tables work.

Installation

Set the backend to S3 after installing. Otherwise biodata-cache will automatically re-cache the tables locally in memory, which can take a LONG time.

pip install biodata-cache
export BIODATA_CACHE_BACKEND='S3'

Usage

Set backend

export BIODATA_CACHE_BACKEND='S3'

Options are 'S3', 'MEMORY'.
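The backend can also be selected from Python rather than the shell. The snippet below is a minimal sketch, assuming the package reads BIODATA_CACHE_BACKEND from the process environment at import time or on first use (the source does not say when), so the variable is set before any biodata_cache import.

```python
import os

# Assumption: biodata_cache reads this variable from the environment,
# so it must be set before the package is imported.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("BIODATA_CACHE_BACKEND", "S3")

backend = os.environ["BIODATA_CACHE_BACKEND"]

# The documented options are 'S3' and 'MEMORY'; fail early on anything else.
if backend not in {"S3", "MEMORY"}:
    raise ValueError(f"Unsupported BIODATA_CACHE_BACKEND: {backend!r}")

print(f"Using backend: {backend}")
```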

Fetch data

from biodata_cache import unique_project_names

project_names = unique_project_names()

Cache tables

get_cache_registry returns the following information about all available cache tables. Paths are versioned: {version} is the installed biodata-cache package version (e.g. 0.27.3).

Table	Description	Location	Type	Partitioned	Columns
`unique_project_names`	Unique project names across all assets	`s3://allen-data-views/data-asset-cache/bdc-v{version}/unique_project_names.pqt`	metadata	False	`project_name`
`unique_subject_ids`	Unique subject_ids across all assets	`s3://allen-data-views/data-asset-cache/bdc-v{version}/unique_subject_ids.pqt`	metadata	False	`subject_id`
`unique_genotypes`	Unique genotypes across all assets where `subject.subject_details.genotype` is present	`s3://allen-data-views/data-asset-cache/bdc-v{version}/unique_genotypes.pqt`	metadata	False	`genotype`
`asset_basics`	Commonly used asset metadata, one row per data asset	`s3://allen-data-views/data-asset-cache/bdc-v{version}/asset_basics.pqt`	metadata	False	`_id`, `_last_modified`, `modalities`, `project_name`, `data_level`, `subject_id`, `acquisition_start_time`, `acquisition_end_time`, `code_ocean`, `process_date`, `genotype`, `age`, `acquisition_type`, `location`, `name`, `experimenters`, `experimenters_normalized`, `instrument_id`, `instrument_id_normalized`, `investigators`, `investigators_normalized`
`source_data`	Mapping from derived asset names to their source raw asset names	`s3://allen-data-views/data-asset-cache/bdc-v{version}/source_data.pqt`	metadata	False	`name`, `source_data`, `pipeline_name`, `processing_time`
`quality_control`	Quality control table with one row per QC metric, partitioned by subject_id	`s3://allen-data-views/data-asset-cache/bdc-v{version}/qc/`	asset	True (by `subject_id`)	`name`, `stage`, `modality`, `value`, `status`, `asset_name`
`platform_qc`	Tag-level QC statuses aggregated per platform, one row per asset/tag combination	`s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_qc/`	platform	True (by `platform`)	`asset_name`, `tag`, `status`, `timestamp`, `instrument_id_normalized`, `experimenters_normalized`
`platform_smartspim`	SmartSPIM assets with processing status and neuroglancer links, one row per (asset, channel)	`s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_smartspim.pqt`	metadata	False	`name`, `raw_name`, `processed`, `institution`, `processing_end_time`, `stitched_link`, `raw_link`, `channel`, `segmentation_link`, `quantification_link`, `alignment_link`
`platform_exaspim`	ExaSPIM assets with processing status and neuroglancer links, one row per asset	`s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_exaspim.pqt`	metadata	False	`name`, `raw_name`, `processed`, `raw_link`, `fused_link`
`platform_fib`	Fiber photometry assets in long form, one row per asset/fiber/channel combination	`s3://allen-data-views/data-asset-cache/bdc-v{version}/platform_fib.pqt`	metadata	False	`asset_name`, `fiber`, `patch_cord`, `channel`, `intended_measurement`, `targeted_structure`
`metadata_upgrade`	Metadata upgrade status for each asset across versions	`s3://allen-data-views/data-asset-cache/bdc-v{version}/metadata_upgrade.pqt`	metadata	False	`_id`, `name`, `project_name`, `data_level`, `v2_id`, `upgrader_version`, `last_modified`, `status`, `upgrade_datetime`
`foraging_sessions`	Foraging behavior sessions with key performance metrics, one row per session	`s3://allen-data-views/data-asset-cache/bdc-v{version}/foraging_sessions.pqt`	metadata	False	`subject_id`, `session_date`, `session`, `nwb_suffix`, `rig`, `trainer`, `trainer_normalized`, `task`, `curriculum_name`, `curriculum_version`, `current_stage_actual`, `foraging_eff`, `foraging_eff_random_seed`, `finished_trials`, `finished_rate`, `total_trials`, `bias_naive`
`behavior_curriculum`	Behavior assets with curriculum name and stage, one row per behavior asset	`s3://allen-data-views/data-asset-cache/bdc-v{version}/behavior_curriculum.pqt`	asset	False	`asset_name`, `curriculum_name`, `stage_name`, `stage_node_id`
`time_to_qc`	Time from processing completion to QC completion for derived assets	`s3://allen-data-views/data-asset-cache/bdc-v{version}/time_to_qc.pqt`	metadata	False	`name`, `process_end_time`, `qc_time`

Hive-partitioned tables use key=value directory segments, enabling DuckDB queries like:

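import duckdb

# Read every partition of the quality_control table for one cache version.
# hive_partitioning=true exposes the key=value directory segment
# (here subject_id) as a regular column in the result.
duckdb.query("""
    SELECT * FROM read_parquet(
        's3://allen-data-views/data-asset-cache/bdc-v0.27.3/qc/**',
        hive_partitioning=true,
        union_by_name=true
    )
""")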
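To make the key=value layout concrete, here is a small standard-library sketch that extracts partition keys from an object path. The helper name hive_partitions and the example file name are hypothetical and are not part of biodata-cache; only the qc/ folder and the subject_id partition key come from the table above.

```python
from pathlib import PurePosixPath


def hive_partitions(path: str) -> dict[str, str]:
    """Return the key=value directory segments of a hive-partitioned path."""
    partitions = {}
    # Skip the last component, which is the file name rather than a directory.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical object key; the subject id and file name are made up.
example = "data-asset-cache/bdc-v0.27.3/qc/subject_id=123456/part-0.parquet"
print(hive_partitions(example))
```

Query engines such as DuckDB perform this parsing internally when hive_partitioning is enabled, which is why the partition key can be used in a WHERE clause.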

The cache_registry.json registry lives at s3://allen-data-views/data-asset-cache/bdc-v{version}/cache_registry.json. The top-level s3://allen-data-views/data-asset-cache/cache_versions.json lists all available version folders as a JSON array.
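The path layout described above can be sketched as a few string helpers. The function names are hypothetical and not part of the biodata-cache API; the bucket, folder pattern, and file names are taken from this page. table_location only covers the unpartitioned tables, which are stored as a single .pqt file.

```python
CACHE_ROOT = "s3://allen-data-views/data-asset-cache"

# Top-level index listing all available version folders.
VERSIONS_INDEX = f"{CACHE_ROOT}/cache_versions.json"


def version_folder(version: str) -> str:
    """Folder holding every table written by one package version."""
    return f"{CACHE_ROOT}/bdc-v{version}"


def registry_location(version: str) -> str:
    """Location of the cache registry for one package version."""
    return f"{version_folder(version)}/cache_registry.json"


def table_location(version: str, table: str) -> str:
    """Location of an unpartitioned table stored as a single .pqt file."""
    return f"{version_folder(version)}/{table}.pqt"


print(table_location("0.27.3", "asset_basics"))
print(registry_location("0.27.3"))
```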

raw_to_derived is a function rather than a table stored in S3. Pass it an asset_name (or a list of asset names) and a modality, and it returns the latest derived asset matching the requested pattern.
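The idea of "latest derived asset" can be illustrated with rows shaped like the source_data table. This is only a sketch and not the package's implementation: the helper latest_derived, the sample rows, the ISO timestamp format, and the assumption that source_data holds a single raw asset name are all hypothetical, and the modality argument is omitted because this page does not describe how it is matched.

```python
from datetime import datetime

# Hypothetical rows using three of the source_data table's columns.
rows = [
    {"name": "raw_A_processed_1", "source_data": "raw_A",
     "processing_time": "2026-01-02T00:00:00"},
    {"name": "raw_A_processed_2", "source_data": "raw_A",
     "processing_time": "2026-03-05T12:30:00"},
    {"name": "raw_B_processed_1", "source_data": "raw_B",
     "processing_time": "2026-02-01T08:00:00"},
]


def latest_derived(rows, raw_name):
    """Return the most recently processed derived asset name, or None."""
    matches = [row for row in rows if row["source_data"] == raw_name]
    if not matches:
        return None
    newest = max(
        matches,
        key=lambda row: datetime.fromisoformat(row["processing_time"]),
    )
    return newest["name"]


print(latest_derived(rows, "raw_A"))
```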

Custom cache table

The custom function lets you store your own DataFrames in the cache and retrieve them by name. Storing a table requires write access to the active backend.

from biodata_cache import custom
import pandas as pd

# Store a DataFrame in the cache under the name "my_data"
df = pd.DataFrame({"col": [1, 2, 3]})
custom("my_data", df)

# Retrieve it later by name
retrieved_df = custom("my_data")

Update all cache tables

We run a nightly capsule on Code Ocean with this code to update all cache tables (not the custom ones).

from biodata_cache.sync import update_all_tables
update_all_tables()

Release history

0.35.0 (Jun 26, 2026)
0.34.3 (Jun 23, 2026)
0.34.2 (Jun 23, 2026)
0.34.1 (Jun 22, 2026), this version
0.34.0 (Jun 19, 2026)
0.33.5 (Jun 17, 2026)
0.33.4 (Jun 17, 2026)
0.33.3 (Jun 17, 2026)
0.33.2 (Jun 11, 2026)
0.33.1 (Jun 11, 2026)
0.33.0 (Jun 11, 2026)
0.32.2 (Jun 11, 2026)
0.32.1 (Jun 11, 2026)
0.32.0 (Jun 11, 2026)
