A library for working with Flywheel datasets

fw-dataset

This repository contains classes and functions for creating, managing, and serving Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from the Flywheel Data Model.

[!important] This Python package is under active development and should be considered unstable. It is provided as-is, without any guarantee of support or maintenance at this stage. Features may be incomplete, change without notice, or be removed in future versions. Use at your own risk for experimental or development purposes only.

Getting started

Installation

The fw-dataset package has been built for use with Python 3.10 and above. It can be installed with pip:

pip install fw-dataset

or poetry:

poetry add fw-dataset

Usage

Rendering Datasets

See notebooks/quickstart_dataset_creation.ipynb for a walkthrough of using the DatasetBuilder to render a Flywheel dataset.

Accessing and Managing Datasets

See notebooks/quickstart_dataset_management.ipynb for a walkthrough of using the FWDatasetClient to access and query a Flywheel dataset.

Unassociated Datasets

If you have a valid dataset that is not associated with a Flywheel project, you can still use the FWDatasetClient to access it. You will need to provide the type, bucket, prefix, and credentials of the cloud or local filesystem to instantiate and query the dataset.

from fw_dataset import FWDatasetClient

# There is no need to provide an API-Key or instantiate the dataset client

fs_type = "s3" # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}

dataset = FWDatasetClient.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)

Merging Related Datasets

If you have multiple datasets that have related tables you want to query together, you can merge the datasets into a single dataset.

NOTE: Federated Querying is not yet enabled across datasets. This is a work in progress.

Requirements
  1. The source dataset must have a valid tables directory structure.

  2. The source dataset must have a valid schemas directory structure.

    • Every table in the tables directory must have a valid corresponding schema file in the schemas directory.

    • The schema file must be named {table_name}.schema.json where {table_name} is the name of the table that the schema describes.

    • The schema file must be a valid JSON file with the minimum structure:

      {
          "schema": "http://json-schema.org/draft-07/schema#",
          "id": "{table_name}",
          "description": "",
          "properties": {},
          "required": [],
          "type": "object"
      }
      
  3. The destination dataset must have the same requirements as the source dataset.

  4. Tables and schemas selected from the source MUST NOT have the same names as existing ones in the destination.

Once the above requirements have been met, you may merge the datasets by copying or moving the selected tables and schemas from the source dataset to the destination dataset.
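The requirements and the copy step above can be sketched for the simplest case, where both datasets live on a local filesystem. This is an illustration only: check_dataset and merge_tables are hypothetical helpers, not part of fw-dataset, and the schema check looks only at a subset of the minimum keys (properties, required, type).

```python
import json
import shutil
from pathlib import Path

# Subset of the minimum schema keys checked by this sketch (an assumption).
CHECKED_SCHEMA_KEYS = {"properties", "required", "type"}


def check_dataset(root: Path) -> list[str]:
    """Return the table names under root/tables, raising if a requirement fails."""
    tables_dir, schemas_dir = root / "tables", root / "schemas"
    if not tables_dir.is_dir() or not schemas_dir.is_dir():
        raise ValueError(f"{root} needs both a tables and a schemas directory")
    names = sorted(p.name for p in tables_dir.iterdir() if p.is_dir())
    for name in names:
        schema_path = schemas_dir / f"{name}.schema.json"
        if not schema_path.is_file():
            raise ValueError(f"missing schema file for table {name!r}")
        schema = json.loads(schema_path.read_text())
        missing = CHECKED_SCHEMA_KEYS - schema.keys()
        if missing:
            raise ValueError(f"schema for {name!r} lacks keys: {sorted(missing)}")
    return names


def merge_tables(source: Path, destination: Path, selected: list[str]) -> None:
    """Copy the selected tables and schemas from source into destination."""
    source_tables = check_dataset(source)
    destination_tables = check_dataset(destination)
    unknown = set(selected) - set(source_tables)
    if unknown:
        raise ValueError(f"not in source: {sorted(unknown)}")
    # Requirement 4: no name clashes with existing tables or schemas.
    for name in selected:
        schema_target = destination / "schemas" / f"{name}.schema.json"
        if name in destination_tables or schema_target.exists():
            raise ValueError(f"{name!r} already exists in the destination")
    for name in selected:
        shutil.copytree(source / "tables" / name, destination / "tables" / name)
        shutil.copy2(
            source / "schemas" / f"{name}.schema.json",
            destination / "schemas" / f"{name}.schema.json",
        )
```

For datasets in cloud storage the same checks apply, but the copy would go through the storage provider's own tooling rather than shutil.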

Flywheel Project Requirements

For the Flywheel Dataset Client and the Dataset objects to function, the following requirements must be met:

Flywheel Project Structure

The Flywheel Project must have the following valid custom information metadata:

{
    "dataset": {
        "type": "s3",
        "bucket": "{bucket-name}",
        "prefix": "{path/to/dataset}",
        "storage_id": "storage-id-of-fw-storage-object"
    }
}

type

The type field must be one of the following:

  • s3: The dataset is stored in an S3 bucket.
  • gcs: The dataset is stored in a Google Cloud Storage bucket.
  • azure: The dataset is stored in an Azure Blob Storage container.
  • fs, local: The dataset is stored on a local filesystem.

bucket

The bucket field is the name of the bucket or container where the dataset is stored.

prefix

The prefix field is the path to the dataset within the bucket or container.

The directory structure beneath the prefix should be as described in the Dataset Structure section.

storage_id

The storage_id field is the Flywheel ID of the cloud storage record that describes the filesystem or cloud storage bucket that the dataset is stored in. This should be a valid storage object in the Flywheel database.
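Before pointing the client at a project, it can help to check the custom information against the field descriptions above. The following is a minimal sketch; validate_dataset_info is a hypothetical helper, not part of fw-dataset, and it only checks that the four fields are present and that type is one of the listed values.

```python
ALLOWED_TYPES = {"s3", "gcs", "azure", "fs", "local"}
REQUIRED_FIELDS = ("type", "bucket", "prefix", "storage_id")


def validate_dataset_info(info: dict) -> dict:
    """Return the "dataset" block of a project's custom info, or raise ValueError."""
    dataset = info.get("dataset")
    if not isinstance(dataset, dict):
        raise ValueError('custom info must contain a "dataset" object')
    missing = [field for field in REQUIRED_FIELDS if field not in dataset]
    if missing:
        raise ValueError(f"dataset info is missing fields: {missing}")
    if dataset["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unsupported dataset type: {dataset['type']!r}")
    return dataset


# Example with placeholder values, as in the metadata shown above.
info = {
    "dataset": {
        "type": "s3",
        "bucket": "{bucket-name}",
        "prefix": "{path/to/dataset}",
        "storage_id": "storage-id-of-fw-storage-object",
    }
}
dataset = validate_dataset_info(info)
```

The sketch does not verify that storage_id refers to an existing storage object; that would require a call to the Flywheel instance.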

Dataset Structure

The dataset should be stored in the bucket or container with the following structure:

{bucket}/{prefix}/
├── latest/
│   └── latest/
│       ├── provenance/
│       │   └── dataset_description.json
│       ├── tables/
│       │   └── {table_name}/ (a directory structure of partitioned parquet files)
│       │       └── /{partitions}/{hash}.parquet
│       └── schemas/
│           └── {table_name}.schema.json
└── versions/
    ├── latest_version.json (provenance/dataset_description.json of versions/latest)
    └── {version}/
        ├── provenance/
        │   └── dataset_description.json
        ├── tables/
        │   └── {table_name}/ (a directory structure of partitioned parquet files)
        │       └── /{partitions}/{hash}.parquet
        └── schemas/
            └── {table_name}.schema.json

The latest_version.json file is a copy of provenance/dataset_description.json; both are minimal descriptions of a dataset version. The latest directory holds the latest version of the dataset. Archived versions are kept in the versions directory and can be deleted once they are no longer needed.

The above structure is more completely described in the Dataset Definition Document in the docs directory.
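For a dataset that has been copied to a local directory, the layout above can be explored with the standard library alone. This is a rough sketch under that assumption; list_versions, latest_description, and table_files are hypothetical helpers, not fw-dataset functions, and the contents of dataset_description.json are treated as opaque JSON.

```python
import json
from pathlib import Path


def list_versions(dataset_root: Path) -> list[str]:
    """Return the names of the version directories under versions/."""
    versions_dir = dataset_root / "versions"
    return sorted(p.name for p in versions_dir.iterdir() if p.is_dir())


def latest_description(dataset_root: Path) -> dict:
    """Load versions/latest_version.json as a dictionary."""
    path = dataset_root / "versions" / "latest_version.json"
    return json.loads(path.read_text())


def table_files(version_root: Path, table_name: str) -> list[Path]:
    """Return every parquet file of one table, across all partitions."""
    return sorted((version_root / "tables" / table_name).rglob("*.parquet"))
```

Here dataset_root corresponds to {bucket}/{prefix}/ and version_root to one {version}/ directory. Reading the parquet files themselves needs a parquet library, which this sketch deliberately leaves out.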

Schema Files

The schema files are JSON files that describe the tables in the dataset. They are stored in the schemas directory and are named {table_name}.schema.json, where {table_name} is the name of the table that the schema describes.

Ideally, the schema files should be fully descriptive. However, if a minimal schema is desired merely to allow the dataset to be queried, the schema file can be as simple as:

{
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "Table derived from Tabular Data File: conditions.csv",
    "properties": {},
    "required": [],
    "type": "object"
}
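When many tables need a placeholder schema, the minimal structure above can be generated rather than written by hand. The sketch below assumes a local schemas directory; minimal_schema and write_minimal_schema are hypothetical helpers, and the keys are reproduced exactly as they appear in the example above.

```python
import json
from pathlib import Path


def minimal_schema(table_name: str, description: str = "") -> dict:
    """Build the minimal schema structure for one table."""
    return {
        "schema": "http://json-schema.org/draft-07/schema#",
        "id": table_name,
        "description": description,
        "properties": {},
        "required": [],
        "type": "object",
    }


def write_minimal_schema(schemas_dir: Path, table_name: str, description: str = "") -> Path:
    """Write {table_name}.schema.json into schemas_dir and return its path."""
    schemas_dir.mkdir(parents=True, exist_ok=True)
    path = schemas_dir / f"{table_name}.schema.json"
    path.write_text(json.dumps(minimal_schema(table_name, description), indent=4))
    return path
```

A call such as write_minimal_schema(Path("schemas"), "conditions") would produce schemas/conditions.schema.json. A fully descriptive schema would additionally fill in properties and required.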

Appendix

Flywheel Data Model

The Flywheel Data Model is a hierarchical structure that organizes data in a Flywheel Project. It is organized as follows:

  • Project (has files and analyses)
  • Subject (has files and analyses)
  • Session (has files and analyses)
  • Acquisition (has files and analyses)
  • File
  • Analysis

The SQLite snapshot of the Flywheel Data Model has each of the above entities as tables. The tables consist of an id column and a data column. The data column is a binary string containing the JSON representation of each entity.
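The id/data layout described above can be read with Python's built-in sqlite3 and json modules. The sketch below builds a small in-memory database to stand in for a snapshot; the table name subject, the column types, and the sample record are assumptions made for illustration and may differ from a real snapshot.

```python
import json
import sqlite3

# Stand-in for a snapshot file: one table with an id column and a data column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id TEXT PRIMARY KEY, data BLOB)")
record = {"label": "example-subject"}  # made-up entity, for illustration only
conn.execute(
    "INSERT INTO subject (id, data) VALUES (?, ?)",
    ("example-id", json.dumps(record).encode("utf-8")),
)


def load_entities(connection: sqlite3.Connection, table: str) -> dict:
    """Decode the JSON stored in the data column, keyed by id."""
    rows = connection.execute(f'SELECT id, data FROM "{table}"').fetchall()
    return {row_id: json.loads(data) for row_id, data in rows}


subjects = load_entities(conn, "subject")
```

The table name is interpolated into the query because SQLite cannot bind identifiers as parameters, so it should only ever come from a trusted list of entity names.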
