
A library for working with Flywheel datasets


fw-dataset

This repository contains classes and functions for creating, managing, and serving Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from the Flywheel Data Model.


IMPORTANT: This Python package is under active development and should be considered unstable. It is provided as-is, without any guarantee of support or maintenance at this stage. Features may be incomplete, change without notice, or be removed in future versions. Use at your own risk for experimental or development purposes only.

Getting started

Installation

The fw-dataset package is built for Python 3.10 and above. Install it with pip:

pip install fw-dataset

or poetry:

poetry add fw-dataset

Usage

Rendering Datasets

See notebooks/quickstart_dataset_creation.ipynb for a walkthrough of using the DatasetBuilder to render a Flywheel dataset.

Accessing and Managing Datasets

See notebooks/quickstart_dataset_management.ipynb for a walkthrough of using the FWDatasetClient to access and query a Flywheel dataset.

Unassociated Datasets

If you have a valid dataset that is not associated with a Flywheel project, you can still use the FWDatasetClient to access it. To instantiate and query the dataset, provide the type, bucket, prefix, and credentials of the cloud storage or local filesystem that holds it.

from fw_dataset import FWDatasetClient

# There is no need to provide an API-Key or instantiate the dataset client

fs_type = "s3" # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}

dataset = FWDatasetClient.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)

Merging Related Datasets

If you have multiple datasets that have related tables you want to query together, you can merge the datasets into a single dataset.

NOTE: Federated Querying is not yet enabled across datasets. This is a work in progress.

Requirements
  1. The source dataset must have a valid tables directory structure.

  2. The source dataset must have a valid schemas directory structure.

    • Every table in the tables directory must have a valid corresponding schema file in the schemas directory.

    • The schema file must be named {table_name}.schema.json where {table_name} is the name of the table that the schema describes.

    • The schema file must be a valid JSON file with the minimum structure:

      {
          "schema": "http://json-schema.org/draft-07/schema#",
          "id": "{table_name}",
          "description": "",
          "properties": {},
          "required": [],
          "type": "object"
      }
      
  3. The destination dataset must have the same requirements as the source dataset.

  4. Tables and schemas selected from the source MUST NOT have the same names as existing ones in the destination.

Once the above requirements have been met, you may merge the datasets by copying or moving the selected tables and schemas from the source dataset to the destination dataset.
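
For datasets on a local filesystem, the copy step described above could be sketched as follows. This is a minimal illustration, not part of fw-dataset: `merge_tables` is a hypothetical helper, and both paths are assumed to point at directories that contain `tables/` and `schemas/` subdirectories.

```python
import shutil
from pathlib import Path


def merge_tables(source: Path, destination: Path, table_names: list[str]) -> None:
    """Copy selected tables and their schemas from source to destination.

    Hypothetical helper: both paths are assumed to contain `tables/` and
    `schemas/` subdirectories laid out as described in the requirements.
    """
    # Check every table first so a failed merge leaves the destination untouched.
    for name in table_names:
        src_table = source / "tables" / name
        src_schema = source / "schemas" / f"{name}.schema.json"
        if not src_table.is_dir() or not src_schema.is_file():
            raise FileNotFoundError(f"source is missing table or schema: {name}")

        dst_table = destination / "tables" / name
        dst_schema = destination / "schemas" / f"{name}.schema.json"
        if dst_table.exists() or dst_schema.exists():
            raise FileExistsError(f"destination already has table or schema: {name}")

    (destination / "schemas").mkdir(parents=True, exist_ok=True)
    for name in table_names:
        # copytree creates the destination table directory (and its parents).
        shutil.copytree(source / "tables" / name, destination / "tables" / name)
        shutil.copy2(
            source / "schemas" / f"{name}.schema.json",
            destination / "schemas" / f"{name}.schema.json",
        )
```

The name-collision check mirrors requirement 4. Moving instead of copying would swap `shutil.copytree` and `shutil.copy2` for `shutil.move`.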

Flywheel Project Requirements

For the Flywheel Dataset Client and the Dataset objects to function, the following requirements must be met:

Flywheel Project Structure

The Flywheel Project must have the following valid custom information metadata:

{
    "dataset": {
        "type": "s3",
        "bucket": "{bucket-name}",
        "prefix": "{path/to/dataset}",
        "storage_id": "storage-id-of-fw-storage-object"
    }
}

type

The type field must be one of the following:

  • s3: The dataset is stored in an S3 bucket.
  • gcs: The dataset is stored in a Google Cloud Storage bucket.
  • azure: The dataset is stored in an Azure Blob Storage container.
  • fs or local: The dataset is stored on a local filesystem.

bucket

The bucket field is the name of the bucket or container where the dataset is stored.

prefix

The prefix field is the path to the dataset within the bucket or container.

The directory structure beneath the prefix should be as described in the Dataset Structure section.

storage_id

The storage_id field is the Flywheel ID of the cloud storage record that describes the filesystem or cloud storage bucket that the dataset is stored in. This should be a valid storage object in the Flywheel database.
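
The field rules above can be checked before handing a project to the client. The sketch below is illustrative only: `validate_dataset_info` is a hypothetical helper rather than part of fw-dataset, and treating every field as a required non-empty string is an assumption.

```python
ALLOWED_TYPES = {"s3", "gcs", "azure", "fs", "local"}
REQUIRED_FIELDS = ("type", "bucket", "prefix", "storage_id")


def validate_dataset_info(info: dict) -> list[str]:
    """Return a list of problems found in a project's custom information.

    Hypothetical helper; an empty list means the metadata looks usable.
    """
    dataset = info.get("dataset")
    if not isinstance(dataset, dict):
        return ["missing 'dataset' object"]

    problems = []
    for field in REQUIRED_FIELDS:
        value = dataset.get(field)
        # Assumption: every field must be a non-empty string.
        if not isinstance(value, str) or not value:
            problems.append(f"'{field}' must be a non-empty string")

    dataset_type = dataset.get("type")
    if isinstance(dataset_type, str) and dataset_type and dataset_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {dataset_type}")
    return problems
```

This only checks the shape of the metadata. It cannot confirm that `storage_id` refers to an existing storage object in Flywheel.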

Dataset Structure

The dataset should be stored in the bucket or container with the following structure:

{bucket}/{prefix}/
└── versions/
    └── {version}/
        ├── provenance/
        │   ├── dataset_description.json
        │   ├── snapshot.db.gz
        │   ├── snapshot_info.json
        │   └── project.json
        ├── tables/
        │   └── {table_name}/
        │       └── {hash}.parquet
        └── schemas/
            └── {table_name}.schema.json

Each version is stored in a separate subdirectory named with its version identifier (typically a BSON ID like "66cf6701af1c6f3855f1ee61"). The "latest" version is determined dynamically by comparing the creation dates in each version's dataset_description.json file.

The above structure is more completely described in the Dataset Definition Document in the docs directory.

Schema Files

The schema files are JSON files that describe the tables in the dataset. They are stored in the schemas directory and named {table_name}.schema.json, where {table_name} is the name of the table that the schema describes.

Ideally, the schema files should be fully descriptive. However, if a minimal schema is desired merely to allow the dataset to be queried, the schema file can be as simple as:

{
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "Table derived from Tabular Data File: conditions.csv",
    "properties": {},
    "required": [],
    "type": "object"
}
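
When a local dataset has tables without schemas, minimal schema files like the one above can be generated in bulk. The sketch below assumes a version directory containing `tables/` and `schemas/`; `minimal_schema` and `write_missing_schemas` are hypothetical helpers, not part of fw-dataset.

```python
import json
from pathlib import Path


def minimal_schema(table_name: str, description: str = "") -> dict:
    """Build the minimal schema structure shown above for one table."""
    return {
        "schema": "http://json-schema.org/draft-07/schema#",
        "id": table_name,
        "description": description,
        "properties": {},
        "required": [],
        "type": "object",
    }


def write_missing_schemas(version_dir: Path) -> list[str]:
    """Create a minimal schema for every table directory that lacks one.

    Returns the names of the tables that received a new schema file.
    """
    tables_dir = version_dir / "tables"
    if not tables_dir.is_dir():
        return []

    schemas_dir = version_dir / "schemas"
    schemas_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for table_dir in sorted(p for p in tables_dir.iterdir() if p.is_dir()):
        schema_path = schemas_dir / f"{table_dir.name}.schema.json"
        if not schema_path.exists():  # never overwrite a descriptive schema
            schema_path.write_text(json.dumps(minimal_schema(table_dir.name), indent=4))
            created.append(table_dir.name)
    return created
```

Existing schema files are left alone, so a fully descriptive schema is never replaced by a minimal one.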

Appendix

Flywheel Data Model

The Flywheel Data Model is a hierarchical structure that organizes data in a Flywheel Project. It is organized as follows:

  • Project (has files and analyses)
  • Subject (has files and analyses)
  • Session (has files and analyses)
  • Acquisition (has files and analyses)
  • File
  • Analysis

The SQLite snapshot of the Flywheel Data Model has each of the above entities as tables. The tables consist of an id column and a data column. The data column is a binary string containing the JSON representation of each entity.
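
Reading entities back from such a snapshot could look like the sketch below, which uses only the standard library. The helpers are hypothetical, the table name is supplied by the caller because the exact table names are not listed here, and `open_snapshot` assumes that snapshot.db.gz is a plain gzip-compressed SQLite file.

```python
import gzip
import json
import shutil
import sqlite3


def open_snapshot(gz_path: str, db_path: str) -> sqlite3.Connection:
    """Decompress a snapshot to db_path and open it.

    Assumption: the .gz file is a gzip-compressed SQLite database.
    """
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return sqlite3.connect(db_path)


def iter_entities(connection: sqlite3.Connection, table: str):
    """Yield (id, decoded JSON) pairs from one snapshot table."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    for entity_id, data in connection.execute(f"SELECT id, data FROM {quoted}"):
        # json.loads accepts both bytes and str, so BLOB and TEXT columns work.
        yield entity_id, json.loads(data)
```

Because `data` holds the full JSON document for each entity, any filtering on entity fields happens in Python after decoding, or in SQL through SQLite's JSON functions where they are available.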
