A library for working with Flywheel datasets

fw-dataset

This repository contains classes and functions for creating, managing, and serving Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from the Flywheel Data Model.

Work In Progress

This is a work in progress. Not all functionality is implemented yet.

Getting started

Installation

The fw-dataset package has been built for use with Python 3.10 and above. It can be installed with pip:

pip install fw-dataset

or poetry:

poetry add fw-dataset

Usage

Accessing Datasets

The fw-dataset package provides a FWDatasetClient class that can be used to access existing Flywheel datasets on cloud storage or local filesystems.

from fw_dataset import FWDatasetClient

# Create a client with a Flywheel API-Key
api_key = "your-api-key"
dataset_client = FWDatasetClient(api_key=api_key)

# If you are in a Flywheel Jupyter Workspace with the environment variables
# FW_HOSTNAME and FW_WS_API_KEY set, the following will work:
# dataset_client = FWDatasetClient()

# list existing datasets (see below for Flywheel Project Requirements)
datasets = dataset_client.datasets()

# link to a specific project-associated dataset
# by project id
project_id = "your-project-id"
dataset = dataset_client.dataset(project_id=project_id)

# or by project path
group = "your-group"
project_label = "your-project-label"
dataset = dataset_client.dataset(project_path=f"fw://{group}/{project_label}")

# connect the dataset to all underlying data
conn = dataset.connect()

# query the dataset
SQL = "SELECT * FROM acquisitions"

# get the results
results = conn.execute(SQL)
result_df = results.df()
result_df.head()
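
The two ways of creating the client shown above (an explicit API key, or the workspace environment variables) can be folded into one small helper. This is a minimal sketch and not part of fw-dataset: client_kwargs is a hypothetical name, and the only details taken from the text are the api_key argument and the FW_HOSTNAME and FW_WS_API_KEY variables.

```python
import os


def client_kwargs(api_key=None, environ=None):
    """Return keyword arguments for FWDatasetClient (hypothetical helper)."""
    env = os.environ if environ is None else environ
    if api_key:
        # An explicit key always takes precedence.
        return {"api_key": api_key}
    if env.get("FW_HOSTNAME") and env.get("FW_WS_API_KEY"):
        # Inside a Flywheel Jupyter Workspace the client needs no arguments.
        return {}
    raise RuntimeError(
        "Pass api_key, or set FW_HOSTNAME and FW_WS_API_KEY in the environment"
    )


# Intended use (requires fw-dataset to be installed):
#   from fw_dataset import FWDatasetClient
#   dataset_client = FWDatasetClient(**client_kwargs(api_key="your-api-key"))
```

Failing early with a clear message avoids a less obvious error later, when the client first tries to reach Flywheel.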

Rendering Datasets

The fw-dataset package provides a DatasetBuilder class that can be used to render a dataset from a Flywheel project. The DatasetBuilder renders the dataset structure and metadata from a Flywheel project into a local or cloud storage structure.

from fw_dataset.admin.dataset_builder import DatasetBuilder

# Flywheel API key plus the IDs of the project to render and the target storage
api_key = "your-api-key"
project_id = "your-project-id"
storage_id = "your-storage-id"

# Initialize the dataset builder with an api-key, project-id, and storage-id
dataset_builder = DatasetBuilder(api_key=api_key, project_id=project_id, storage_id=storage_id)

# Render the dataset structure and metadata
dataset = dataset_builder.render_dataset()

# Connect to the dataset
conn = dataset.connect()

# Query the dataset
SQL = "SELECT * FROM subjects LIMIT 10"
conn.execute(SQL).df()

The Dataset Structure will be rendered in the storage bucket or local storage under the path specified by:

{bucket}/datasets/{instance}/{group}/{project_id}/latest/
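
To make the path template above concrete, here is a small sketch that fills it in. The helper name latest_dataset_path and all example values are hypothetical, and the exact form of the {instance} component is not stated here, so the value used for it is only an illustration.

```python
def latest_dataset_path(bucket, instance, group, project_id):
    """Fill in {bucket}/datasets/{instance}/{group}/{project_id}/latest/."""
    parts = [bucket, "datasets", instance, group, project_id, "latest"]
    # Strip stray slashes so components join cleanly.
    return "/".join(part.strip("/") for part in parts) + "/"


# Hypothetical values, for illustration only.
print(latest_dataset_path("my-bucket", "my-instance", "my-group", "my-project-id"))
```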

If the latest directory already exists and holds the version you are trying to render, the existing Dataset object is returned. If the latest directory does not exist, it is created and the Dataset object is returned. To create a new dataset from the current project snapshot, use the force_new parameter:

dataset = dataset_builder.render_dataset(force_new=True)

Additionally, to render a project's tabular data files and custom information into dataset tables and schemas, set the following flags:

dataset = dataset_builder.render_dataset(parse_tabular_data=True, parse_custom_info=True)

Unassociated Datasets

If you have a valid dataset that is not associated with a Flywheel project, you can still use the FWDatasetClient to access the dataset. You will need to provide the type, bucket, prefix, and credentials of the cloud or local filesystem to instantiate and query the dataset.

from fw_dataset import FWDatasetClient

# There is no need to provide an API-Key or instantiate the dataset client

fs_type = "s3" # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}

dataset = FWDatasetClient.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)

Merging Related Datasets

If you have multiple datasets that have related tables you want to query together, you can merge the datasets into a single dataset.

NOTE: Federated Querying is not yet enabled across datasets. This is a work in progress.

Requirements
  1. The source dataset must have a valid tables directory structure.

  2. The source dataset must have a valid schemas directory structure.

    • Every table in the tables directory must have a valid corresponding schema file in the schemas directory.

    • The schema file must be named {table_name}.schema.json where {table_name} is the name of the table that the schema describes.

    • The schema file must be a valid JSON file with the minimum structure:

      {
          "schema": "http://json-schema.org/draft-07/schema#",
          "id": "{table_name}",
          "description": "",
          "properties": {},
          "required": [],
          "type": "object"
      }
      
  3. The destination dataset must have the same requirements as the source dataset.

  4. Tables and schemas selected from the source MUST NOT have the same names as existing ones in the destination.

Once the above requirements have been met, you may merge the datasets by copying or moving the selected tables and schemas from the source dataset to the destination dataset.
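
For datasets on a local filesystem, the checks and the copy step above might be sketched as follows. This is an illustration under stated assumptions, not a function provided by fw-dataset: merge_tables and schema_path are hypothetical names, only the selected tables are checked, and cloud storage would need the corresponding storage client instead of shutil.

```python
import shutil
from pathlib import Path


def schema_path(dataset_dir: Path, table: str) -> Path:
    """Location of a table's schema file inside a dataset directory."""
    return dataset_dir / "schemas" / f"{table}.schema.json"


def merge_tables(source: Path, dest: Path, tables: list[str]) -> None:
    """Copy selected tables and their schemas from source to dest."""
    # Requirements 1-3: both datasets need tables/ and schemas/ directories.
    for root in (source, dest):
        for sub in ("tables", "schemas"):
            if not (root / sub).is_dir():
                raise FileNotFoundError(f"{root} has no {sub} directory")

    # Validate everything before copying so a failure leaves dest untouched.
    for table in tables:
        if not (source / "tables" / table).is_dir():
            raise FileNotFoundError(f"source has no table {table!r}")
        if not schema_path(source, table).is_file():
            raise FileNotFoundError(f"source has no schema for {table!r}")
        # Requirement 4: no name collisions in the destination.
        if (dest / "tables" / table).exists() or schema_path(dest, table).exists():
            raise ValueError(f"{table!r} already exists in the destination")

    for table in tables:
        shutil.copytree(source / "tables" / table, dest / "tables" / table)
        shutil.copy2(schema_path(source, table), schema_path(dest, table))
```

Copying rather than moving keeps the source dataset intact; swapping in shutil.move would give the move variant.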

Flywheel Project Requirements

For the Flywheel Dataset Client and the Dataset objects to function, the following requirements must be met:

Flywheel Project Structure

The Flywheel Project must have the following valid custom information metadata:

{
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object"
    }
}

type

The type field must be one of the following:

  • s3: The dataset is stored in an S3 bucket.
  • gcs: The dataset is stored in a Google Cloud Storage bucket.
  • azure: The dataset is stored in an Azure Blob Storage container.
  • fs, local: The dataset is stored on a local filesystem.

bucket

The bucket field is the name of the bucket or container where the dataset is stored.

prefix

The prefix field is the path to the dataset within the bucket or container.

The directory structure beneath the prefix should be as described in the Dataset Structure section.

storage_id

The storage_id field is the Flywheel ID of the cloud storage record that describes the filesystem or cloud storage bucket that the dataset is stored in. This should be a valid storage object in the Flywheel database.
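
A project's custom information can be checked against the four fields above before it is handed to the client. The sketch below is illustrative only: validate_dataset_info is a hypothetical helper, and treating every field as a required, non-empty string is an assumption rather than documented behavior.

```python
ALLOWED_TYPES = {"s3", "gcs", "azure", "fs", "local"}
REQUIRED_FIELDS = ("type", "bucket", "prefix", "storage_id")


def validate_dataset_info(custom_info: dict) -> list[str]:
    """Return a list of problems found in the 'dataset' custom info block."""
    dataset = custom_info.get("dataset")
    if not isinstance(dataset, dict):
        return ["missing 'dataset' object"]

    problems = []
    for field in REQUIRED_FIELDS:
        value = dataset.get(field)
        # Assumption: every field must be a non-empty string.
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty field: {field}")

    dataset_type = dataset.get("type")
    if isinstance(dataset_type, str) and dataset_type and dataset_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {dataset_type}")
    return problems


info = {
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object",
    }
}
print(validate_dataset_info(info))  # an empty list means no problems were found
```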

Dataset Structure

The dataset should be stored in the bucket or container with the following structure:

{bucket}/{prefix}/
├── latest/
│   └── latest/
│       ├── provenance/
│       │   └── dataset_description.json
│       ├── tables/
│       │   └── {table_name}/ (a directory structure of partitioned parquet files)
│       │       └── /{partitions}/{hash}.parquet
│       └── schemas/
│           └── {table_name}.schema.json
└── versions/
    ├── latest_version.json (provenance/dataset_description.json of versions/latest)
    └── {version}/
        ├── provenance/
        │   └── dataset_description.json
        ├── tables/
        │   └── {table_name}/ (a directory structure of partitioned parquet files)
        │       └── /{partitions}/{hash}.parquet
        └── schemas/
            └── {table_name}.schema.json

The latest_version.json file is a copy of provenance/dataset_description.json; both are minimal descriptions of a dataset version. The latest directory represents the latest version of the dataset. Archived versions are kept in the versions directory and can be deleted once they are no longer needed.

The above structure is more completely described in the Dataset Definition Document in the docs directory.
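
For a dataset on a local filesystem, a single version directory (latest or one of the {version} directories) can be checked against this layout with a few lines of standard-library Python. This is a rough sketch: check_version_dir is a hypothetical helper, and it only checks the pieces named in the tree above.

```python
from pathlib import Path


def check_version_dir(version_dir: Path) -> list[str]:
    """Return a list of layout problems found in one dataset version directory."""
    problems = []
    if not (version_dir / "provenance" / "dataset_description.json").is_file():
        problems.append("missing provenance/dataset_description.json")

    tables_dir = version_dir / "tables"
    schemas_dir = version_dir / "schemas"
    if not tables_dir.is_dir():
        problems.append("missing tables/")
    if not schemas_dir.is_dir():
        problems.append("missing schemas/")

    if tables_dir.is_dir():
        # Each table directory should have a matching schema file.
        for table in sorted(p for p in tables_dir.iterdir() if p.is_dir()):
            if not (schemas_dir / f"{table.name}.schema.json").is_file():
                problems.append(f"table {table.name} has no schema file")
    return problems
```

An empty list means the directory has the expected provenance file, the tables and schemas directories, and one schema file per table.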

Schema Files

The schema files are JSON files that describe the schema of the tables in the dataset. The schema files are stored in the schemas directory. The schema files are named {table_name}.schema.json where {table_name} is the name of the table that the schema describes.

Ideally, the schema files should be fully descriptive. However, if a minimal schema is desired merely to allow the dataset to be queried, the schema file can be as simple as:

{
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "Table derived from Tabular Data File: conditions.csv",
    "properties": {},
    "required": [],
    "type": "object"
}
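
A minimal schema file like the one above can be generated rather than written by hand. The sketch below uses hypothetical helper names (minimal_schema, write_minimal_schema) and copies the key names exactly as they appear in the example above; it is not part of fw-dataset.

```python
import json
from pathlib import Path


def minimal_schema(table_name: str, description: str = "") -> dict:
    """Build the minimal schema structure shown above for one table."""
    return {
        "schema": "http://json-schema.org/draft-07/schema#",
        "id": table_name,
        "description": description,
        "properties": {},
        "required": [],
        "type": "object",
    }


def write_minimal_schema(schemas_dir: Path, table_name: str, description: str = "") -> Path:
    """Write {table_name}.schema.json into schemas_dir and return its path."""
    schemas_dir.mkdir(parents=True, exist_ok=True)
    path = schemas_dir / f"{table_name}.schema.json"
    path.write_text(json.dumps(minimal_schema(table_name, description), indent=4))
    return path
```

For example, write_minimal_schema(Path("schemas"), "conditions", "Table derived from Tabular Data File: conditions.csv") would create schemas/conditions.schema.json.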

Future Development

Future development will include:

  • Dataset creation and management from library
    • Create a new dataset from a Flywheel project
    • Dataset will be structured on local or cloud storage
    • Dataset essentials will be stored in the Flywheel project metadata
    • Dataset versions can be deleted from the storage structure
    • Dataset versions can be archived
    • Dataset can be removed from a Flywheel project
