fw-dataset
This repository contains classes and functions for creating, managing, and serving Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from the Flywheel Data Model.
Work In Progress
This is a work in progress; not all functionality is implemented yet.
Getting started
Installation
Once the package is published, you can install it with pip:
pip install fw-dataset
or poetry:
poetry add fw-dataset
Usage
from fw_dataset import FWDatasetClient
# Create a client with a Flywheel API key
api_key = "your-api-key"
dataset_client = FWDatasetClient(api_key=api_key)
# list existing datasets (see below for Flywheel Project Requirements)
datasets = dataset_client.datasets()
# link to a specific project-associated dataset
# by project id
project_id = "your-project-id"
dataset = dataset_client.dataset(project_id=project_id)
# or by project path
group = "your-group"
project_label = "your-project-label"
dataset = dataset_client.dataset(project_path=f"fw://{group}/{project_label}")
# connect the dataset to all underlying data
conn = dataset.connect()
# query the dataset
SQL = "SELECT * FROM acquisitions"
# get the results
results = conn.execute(SQL)
result_df = results.df()
result_df.head()
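The execute(...).df() pattern above suggests a DuckDB-style connection. Assuming that holds (an assumption, not something the documentation states), parameterized queries are a safer way to build ad hoc filters than string formatting:
# Sketch, assuming the connection follows the DuckDB API:
# "?" placeholders avoid quoting problems in ad hoc filters
SQL = "SELECT * FROM acquisitions WHERE label LIKE ?"
filtered = conn.execute(SQL, ["%T1w%"])
filtered_df = filtered.df()
filtered_df.head()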
Unassociated Datasets
If you have a dataset that is not associated with a Flywheel project, you can still use the FWDatasetClient to access it. You will need to provide the type, bucket, prefix, and credentials of the cloud or local filesystem to instantiate and query the dataset.
from fw_dataset import FWDatasetClient
# Create a client with a Flywheel API key
# TODO: make this work with a client that doesn't require an API key
dataset_client = FWDatasetClient(api_key=api_key)
fs_type = "s3" # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}
# TODO: make this a class method (e.g. FWDatasetClient.get_dataset_from_filesystem)
dataset = dataset_client.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)
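The returned dataset can then be connected and queried exactly like a project-associated one; for example (again assuming a DuckDB-style connection, as above):
conn = dataset.connect()
# list the tables available in the dataset
tables_df = conn.execute("SHOW TABLES").df()
tables_df.head()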
Flywheel Project Requirements
For the Flywheel Dataset Client and the Dataset objects to function, the following requirements must be met:
Flywheel Project Structure
The Flywheel Project must have the following valid custom information metadata:
{
"dataset": {
"type": "s3",
"bucket": "bucket-name",
"prefix": "path/to/dataset",
"storage_id": "storage-id-of-fw-storage-object"
}
}
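One way to attach this metadata is with the Flywheel SDK. The following is a sketch, assuming the flywheel package and placeholder values; storage-id-of-fw-storage-object must be replaced with the ID of a real storage object:
import flywheel

# Sketch: write the dataset block into a project's custom information
client = flywheel.Client("your-api-key")
project = client.lookup("your-group/your-project-label")
project.update_info({
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object",
    }
})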
type
The type field must be one of the following:
- s3: The dataset is stored in an S3 bucket.
- gcs: The dataset is stored in a Google Cloud Storage bucket.
- azure: The dataset is stored in an Azure Blob Storage container.
- fs, local: The dataset is stored on a local filesystem.
bucket
The bucket field is the name of the bucket or container where the dataset is stored.
prefix
The prefix field is the path to the dataset within the bucket or container. The directory structure beneath the prefix should be as described in the Dataset Structure section.
storage_id
The storage_id field is the Flywheel ID of the storage record that describes the filesystem or cloud storage bucket in which the dataset is stored. This should be a valid storage object in the Flywheel database.
Dataset Structure
The dataset should be stored in the bucket or container with the following structure:
{bucket}/{prefix}/
├── latest_version.json (provenance/dataset_description.json of versions/latest)
└── versions/
└── latest/
├── provenance/
│ └── dataset_description.json
├── tables/
│ └── {table_name}/ (a directory structure of partitioned parquet files)
│ └── /{partitions}/{hash}.parquet
└── schemas/
└── {table_name}.schema.json
The latest_version.json file is a copy of provenance/dataset_description.json; both are minimal descriptions of a dataset version. The latest directory represents the latest version of the dataset. Archived versions of the dataset are also kept in the versions directory and can be deleted once they are no longer needed.
The above structure is described more completely in the Dataset Definition Document in the docs directory.
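Because latest_version.json lives at the top of the prefix, a dataset version can be inspected without the client at all. Here is a minimal sketch using fsspec (an assumption on my part; fw-dataset does not necessarily use fsspec internally):
import json
import fsspec

# Sketch: read the minimal version description straight from storage.
# Substitute your own bucket and prefix; s3:// URLs require s3fs to be installed.
with fsspec.open("s3://bucket-name/path/to/dataset/latest_version.json", "r") as f:
    latest_version = json.load(f)
print(latest_version)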
Schema Files
The schema files are JSON files that describe the schema of the tables in the dataset. They are stored in the schemas directory and named {table_name}.schema.json, where {table_name} is the name of the table the schema describes.
Ideally, the schema files should be fully descriptive. However, if a minimal schema is desired merely to allow the dataset to be queried, the schema file can be as simple as:
{
"schema": "http://json-schema.org/draft-07/schema#",
"id": "{table_name}",
"description": "Table derived from Tabular Data File: conditions.csv",
"properties": {},
"required": [],
"type": "object"
}
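If many tables need such stubs, a small helper can generate them. write_minimal_schema below is a hypothetical name for illustration, not part of fw-dataset:
import json
from pathlib import Path

def write_minimal_schema(table_name: str, source_file: str, out_dir: Path) -> Path:
    """Write a minimal schema stub with the shape shown above (illustrative only)."""
    schema = {
        "schema": "http://json-schema.org/draft-07/schema#",
        "id": table_name,
        "description": f"Table derived from Tabular Data File: {source_file}",
        "properties": {},
        "required": [],
        "type": "object",
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{table_name}.schema.json"
    path.write_text(json.dumps(schema, indent=2))
    return path

write_minimal_schema("conditions", "conditions.csv", Path("schemas"))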
Future Development
Future development will include:
- Dataset creation and management from the library
- Create a new dataset from a Flywheel project
- Dataset will be structured on local or cloud storage
- Dataset essentials will be stored in the Flywheel project metadata
- Dataset versions can be deleted from the storage structure
- Dataset versions can be archived
- Dataset can be removed from a Flywheel project