A library for working with Flywheel datasets
# fw-dataset
This repository contains classes and functions for creating, managing, and serving Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from the Flywheel Data Model.
[[TOC]]
> [!IMPORTANT]
> This Python package is under active development and should be considered unstable. It is provided as-is, without any guarantee of support or maintenance at this stage. Features may be incomplete, change without notice, or be removed in future versions. Use at your own risk for experimental or development purposes only.
## Getting started
### Installation
The fw-dataset package requires Python 3.10 or later. It can be installed with pip:

```shell
pip install fw-dataset
```

or with poetry:

```shell
poetry add fw-dataset
```
## Usage
### Rendering Datasets
See `notebooks/quickstart_dataset_creation.ipynb` for a walkthrough of using the `DatasetBuilder` to render a Flywheel dataset.
### Accessing and Managing Datasets
See `notebooks/quickstart_dataset_management.ipynb` for a walkthrough of using the `FWDatasetClient` to access and query a Flywheel dataset.
### Unassociated Datasets
If you have a valid dataset that is not associated with a Flywheel project, you can still use the `FWDatasetClient` to access it. To instantiate and query the dataset, you must provide the filesystem type, bucket, prefix, and credentials of the cloud or local filesystem:
```python
from fw_dataset import FWDatasetClient

# There is no need to provide an API key or instantiate the dataset client
fs_type = "s3"  # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}
dataset = FWDatasetClient.get_dataset_from_filesystem(
    fs_type, bucket, prefix, credentials
)
```
### Merging Related Datasets
If you have multiple datasets that have related tables you want to query together, you can merge the datasets into a single dataset.
> [!NOTE]
> Federated querying across datasets is not yet enabled. This is a work in progress.
#### Requirements
- The `source` dataset must have a valid `tables` directory structure.
- The `source` dataset must have a valid `schemas` directory structure.
- Every table in the `tables` directory must have a valid corresponding schema file in the `schemas` directory.
- The schema file must be named `{table_name}.schema.json`, where `{table_name}` is the name of the table that the schema describes.
- The schema file must be a valid JSON file with the minimum structure:

  ```json
  {
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "",
    "properties": {},
    "required": [],
    "type": "object"
  }
  ```

- The `destination` dataset must meet the same requirements as the `source` dataset.
- Tables and schemas selected from the `source` MUST NOT have the same names as existing ones in the `destination`.
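These requirements can be checked mechanically before a merge. Below is a minimal sketch of such a validator, assuming a dataset version directory laid out as described in this document; the function and variable names are illustrative and not part of the fw-dataset API:

```python
import json
from pathlib import Path

# Minimum keys every schema file must contain (see the structure above).
REQUIRED_KEYS = {"schema", "id", "description", "properties", "required", "type"}

def validate_dataset_layout(root: Path) -> list[str]:
    """Return a list of problems found in a dataset version directory.

    Illustrative helper: expects `root` to contain the `tables/` and
    `schemas/` directories described in the requirements above.
    """
    problems = []
    tables_dir = root / "tables"
    schemas_dir = root / "schemas"
    for d, name in ((tables_dir, "tables"), (schemas_dir, "schemas")):
        if not d.is_dir():
            problems.append(f"missing {name}/ directory")
    if problems:
        return problems
    for table_dir in sorted(p for p in tables_dir.iterdir() if p.is_dir()):
        schema_file = schemas_dir / f"{table_dir.name}.schema.json"
        if not schema_file.is_file():
            problems.append(f"no schema for table {table_dir.name!r}")
            continue
        try:
            schema = json.loads(schema_file.read_text())
        except json.JSONDecodeError:
            problems.append(f"{schema_file.name} is not valid JSON")
            continue
        missing = REQUIRED_KEYS - schema.keys()
        if missing:
            problems.append(f"{schema_file.name} lacks keys: {sorted(missing)}")
    return problems
```

Running this against both the source and the destination before merging surfaces layout problems early, instead of at query time.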
Once the above requirements have been met, you may merge the datasets by copying or
moving the selected tables and schemas from the source dataset to the destination
dataset.
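Since merging is defined as copying or moving the selected tables and schemas, it can be sketched with the standard library alone. The helper below is a hypothetical illustration, not an fw-dataset function; it copies selected tables between two version directories and refuses name collisions in the destination, per the requirements above:

```python
import shutil
from pathlib import Path

def merge_tables(source: Path, destination: Path, table_names: list[str]) -> None:
    """Copy selected tables and their schemas from one dataset version
    directory to another (illustrative sketch, not an fw-dataset API).

    Raises ValueError if any selected name already exists in the
    destination, since merged names must not collide.
    """
    # Check all names first so a partial merge never happens.
    for name in table_names:
        dest_table = destination / "tables" / name
        dest_schema = destination / "schemas" / f"{name}.schema.json"
        if dest_table.exists() or dest_schema.exists():
            raise ValueError(f"table {name!r} already exists in the destination")
    for name in table_names:
        shutil.copytree(source / "tables" / name, destination / "tables" / name)
        shutil.copy2(source / "schemas" / f"{name}.schema.json",
                     destination / "schemas" / f"{name}.schema.json")
```

Checking every name before copying anything keeps the destination consistent even when one of the selected tables would collide.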
## Flywheel Project Requirements
For the Flywheel Dataset Client and the Dataset objects to function, the following requirements must be met:
### Flywheel Project Structure
The Flywheel project must have the following valid custom information metadata:

```json
{
  "dataset": {
    "type": "s3",
    "bucket": "{bucket-name}",
    "prefix": "{path/to/dataset}",
    "storage_id": "{storage-id-of-fw-storage-object}"
  }
}
```
#### type

The `type` field must be one of the following:

- `s3`: The dataset is stored in an S3 bucket.
- `gcs`: The dataset is stored in a Google Cloud Storage bucket.
- `azure`: The dataset is stored in an Azure Blob Storage container.
- `fs`, `local`: The dataset is stored on a local filesystem.
#### bucket

The `bucket` field is the name of the bucket or container where the dataset is stored.
#### prefix

The `prefix` field is the path to the dataset within the bucket or container. The directory structure beneath the prefix should be as described in the Dataset Structure section.
#### storage_id

The `storage_id` field is the Flywheel ID of the cloud storage record that describes the filesystem or cloud storage bucket in which the dataset is stored. This should be a valid storage object in the Flywheel database.
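A quick sanity check of this metadata block can catch a misconfigured project before the client is used. The following is an illustrative sketch, not part of the fw-dataset API:

```python
# Valid values for the "type" field, per the list above.
VALID_TYPES = {"s3", "gcs", "azure", "fs", "local"}

def check_dataset_info(info: dict) -> list[str]:
    """Return problems with a project's `dataset` custom-information block.

    Illustrative helper: `info` is the project's custom information dict,
    i.e. the JSON object shown above.
    """
    dataset = info.get("dataset")
    if not isinstance(dataset, dict):
        return ["missing 'dataset' key in custom information"]
    problems = []
    for key in ("type", "bucket", "prefix", "storage_id"):
        if not dataset.get(key):
            problems.append(f"missing or empty {key!r}")
    if dataset.get("type") not in VALID_TYPES:
        problems.append(f"type must be one of {sorted(VALID_TYPES)}")
    return problems
```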
### Dataset Structure
The dataset should be stored in the bucket or container with the following structure:

```text
{bucket}/{prefix}/
└── versions/
    └── {version}/
        ├── provenance/
        │   ├── dataset_description.json
        │   ├── snapshot.db.gz
        │   ├── snapshot_info.json
        │   └── project.json
        ├── tables/
        │   └── {table_name}/
        │       └── {hash}.parquet
        └── schemas/
            └── {table_name}.schema.json
```
Each version is stored in a separate subdirectory named with its version identifier
(typically a BSON ID like "66cf6701af1c6f3855f1ee61"). The "latest" version is
determined dynamically by comparing the creation dates in each version's
dataset_description.json file.
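That latest-version lookup can be sketched as follows. This is an illustrative example, not fw-dataset's implementation; in particular, the `created` key name inside `dataset_description.json` is an assumption:

```python
import json
from datetime import datetime
from pathlib import Path

def latest_version(dataset_root: Path) -> str:
    """Return the name of the version subdirectory whose
    provenance/dataset_description.json has the most recent creation date.

    Illustrative sketch: assumes the description file carries an
    ISO-formatted `created` timestamp.
    """
    def created(version_dir: Path) -> datetime:
        desc = json.loads(
            (version_dir / "provenance" / "dataset_description.json").read_text())
        return datetime.fromisoformat(desc["created"])

    versions = [p for p in (dataset_root / "versions").iterdir() if p.is_dir()]
    return max(versions, key=created).name
```

Because the version names are opaque identifiers, comparing the creation dates (rather than sorting the names) is what makes the lookup reliable.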
The above structure is described more completely in the Dataset Definition Document in the `docs` directory.
### Schema Files
Schema files are JSON files that describe the schema of the tables in the dataset. They are stored in the `schemas` directory and are named `{table_name}.schema.json`, where `{table_name}` is the name of the table that the schema describes.
Ideally, the schema files should be fully descriptive. However, if a minimal schema is desired merely to allow the dataset to be queried, the schema file can be as simple as:
```json
{
  "schema": "http://json-schema.org/draft-07/schema#",
  "id": "{table_name}",
  "description": "Table derived from Tabular Data File: conditions.csv",
  "properties": {},
  "required": [],
  "type": "object"
}
```
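Such a minimal schema can be generated programmatically for any table. The helper below is an illustrative sketch, not part of the fw-dataset API:

```python
import json

def minimal_schema(table_name: str, description: str = "") -> str:
    """Render the minimal schema JSON shown above for a given table.

    Illustrative helper: produces just enough schema for the table to be
    queried; a fully descriptive schema would also populate `properties`.
    """
    return json.dumps({
        "schema": "http://json-schema.org/draft-07/schema#",
        "id": table_name,
        "description": description,
        "properties": {},
        "required": [],
        "type": "object",
    }, indent=2)
```

Writing the result to `schemas/{table_name}.schema.json` satisfies the naming and minimum-structure requirements described earlier.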
## Appendix
### Flywheel Data Model
The Flywheel Data Model is a hierarchical structure that organizes data in a Flywheel project. It is organized as follows:
- Project (has files and analyses)
  - Subject (has files and analyses)
    - Session (has files and analyses)
      - Acquisition (has files and analyses)
        - File
        - Analysis
The SQLite snapshot of the Flywheel Data Model has each of the above entities as tables.
The tables consist of an `id` column and a `data` column. The `data` column is a binary string containing the JSON representation of each entity.
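Reading entities out of such a snapshot therefore takes only the standard library, since each row's `data` column is JSON. A minimal sketch, where the table and column names follow the description above and the helper itself is illustrative:

```python
import json
import sqlite3

def load_entities(conn: sqlite3.Connection, table: str) -> dict[str, dict]:
    """Decode the JSON stored in a snapshot table's `data` column,
    keyed by the `id` column (illustrative helper, not an fw-dataset API).
    """
    # The table name should come from the fixed list of entity tables
    # (project, subject, session, acquisition, ...), never from user input.
    rows = conn.execute(f"SELECT id, data FROM {table}")
    return {row_id: json.loads(data) for row_id, data in rows}
```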