
Project description

boilerdata

Data processing pipeline for a nucleate pool boiling apparatus.

Overview

The data processing approach taken in this repository started over at pdpipewrench. It was initially conceptualized as a way to outfit pdpipe pipelines from configuration files, allowing for Pandas pipeline orchestration with minimal code. I have since adopted a less aggressive tack, where I still separate configuration out into YAML files (constants, file paths, pipeline function arguments, etc.), but pipeline logic is handled in pipeline.py. I have also done away with pdpipe in this approach, as it doesn't lend itself particularly well to ETL. Besides, my data processing needs don't quite match the flavor of statistical data science workflows that pdpipe supports.

This new approach maintains the benefits of writing logic in Python while still allowing configuration to live in files. I use Pydantic as the interface between my configs and my logic, which allows me to specify allowable values with Enums and other typing constructs. Expressing allowable configurations with Pydantic also allows for the generation of schemas for your config files, raising errors on typos or missing keys, for example. I also specify the "shape" of my input and output data in configs, and validate my dataframes with pandera. Once these components are in place, it is easy to implement new functionality in the pipeline.
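
As a rough illustration, here is a minimal sketch of this pattern. All of the names here (Config, Coupon, load_config, and the column names) are hypothetical stand-ins, not the actual models in this repository:

    # Minimal sketch of config-as-Pydantic-model plus pandera validation.
    # All names are hypothetical, not this repository's actual models.
    from enum import Enum
    from pathlib import Path

    import pandas as pd
    import pandera as pa
    import yaml
    from pydantic import BaseModel


    class Coupon(Enum):
        """Allowable values for a config field, enforced on load."""
        A1 = "A1"
        A2 = "A2"


    class Config(BaseModel):
        """Typed interface between YAML config files and pipeline logic."""
        data_dir: Path
        coupon: Coupon
        water_temp: float = 100.0  # a pipeline function argument


    def load_config(path: Path) -> Config:
        """Parse a YAML file, raising on typos, missing keys, or bad values."""
        return Config(**yaml.safe_load(path.read_text(encoding="utf-8")))


    # The "shape" of a dataframe, checked with pandera at stage boundaries.
    runs_schema = pa.DataFrameSchema(
        {
            "run": pa.Column(str),
            "T_surface": pa.Column(float, pa.Check.ge(0)),  # surface temperature
        }
    )


    def validate_runs(df: pd.DataFrame) -> pd.DataFrame:
        return runs_schema.validate(df)

A model like this can also emit a JSON schema for the config files (Config.schema_json() in Pydantic v1, Config.model_json_schema() in v2), which is what enables editors to flag typos and missing keys as you write the YAML.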

Usage

If you would like to adopt this approach for processing your own data, you may clone this repository and begin swapping in your own configs and logic, or use a similar architecture for your data process. To run a working example with some actual data from this study, perform the following steps:

  1. Clone this repository and open it in your terminal or IDE (e.g. git clone https://github.com/blakeNaccarato/boilerdata.git boilerdata).
  2. Navigate to the cloned directory in a terminal window (e.g. cd boilerdata).
  3. Create a Python 3.10 virtual environment (e.g. py -3.10 -m venv .venv on Windows with Python 3.10 installed from python.org).
  4. Activate the virtual environment (e.g. .venv/scripts/activate on Windows).
  5. Run pip install --editable . to install the boilerdata package in an editable fashion. This step may take a while.
  6. Delete the top-level data and config directories, then copy the config and data folders from tests/data to the root directory.
  7. Copy the .propshop folder from tests/data to your user folder (e.g. C:/Users/<you>/.propshop on Windows).
  8. Run dvc repro metrics to execute the data process up to that stage.

The data process should run the following stages: axes, modelfun, runs, parse_benchmarks, pipeline, and metrics. Some stages are skipped because we specified running just the stages needed up to metrics (the example data doesn't currently include the literature data). You may inspect the pipeline stages of the same name in src/boilerdata/stages, such as pipeline.py, to see the logic that runs during each stage. This example happens to use Python scripts, but you could define a stage in dvc.yaml that instead runs MATLAB scripts, or any arbitrary action. This approach allows the data process to be reliably reproduced over time, and to be easily modified and extended in a collaborative effort.
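
To make this concrete, here is a hypothetical sketch of the kind of script a dvc.yaml stage could invoke. The file paths and column names are illustrative and are not the actual contents of src/boilerdata/stages/pipeline.py:

    # Hypothetical stage script invoked by a stage in dvc.yaml. The paths and
    # column names are illustrative, not this repository's actual schema.
    from pathlib import Path

    import pandas as pd


    def main():
        # Read the output of an upstream stage (declared as a dependency in dvc.yaml).
        runs = pd.read_csv(Path("data/runs/runs.csv"))

        # Pipeline logic lives in plain Python rather than in the config files.
        runs["superheat"] = runs["T_surface"] - runs["T_saturation"]

        # Write this stage's output, which DVC tracks and caches.
        Path("data/results").mkdir(parents=True, exist_ok=True)
        runs.to_csv(Path("data/results/results.csv"), index=False)


    if __name__ == "__main__":
        main()

Because each stage declares its dependencies and outputs in dvc.yaml, DVC can skip any stage whose inputs haven't changed, which is what lets dvc repro metrics run only the necessary work.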

There are other details to this process, such as hosting the data folder in a Google Cloud Storage bucket (alternatively, it can be hosted on Google Drive). This addresses the need to store data, especially large datasets, outside of the repository, and to access it in an authenticated fashion.
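
For programmatic access, DVC also exposes a Python API that can read tracked files straight from the remote, with authentication handled by the credentials configured for that remote. The tracked path below is a hypothetical example, not a guaranteed file in this repository:

    # Read a DVC-tracked file from a project's remote via DVC's Python API.
    # The tracked path is hypothetical; authentication uses the credentials
    # configured for the remote (e.g. Google Cloud or Google Drive).
    import dvc.api

    with dvc.api.open(
        "data/runs/runs.csv",  # hypothetical DVC-tracked path
        repo="https://github.com/blakeNaccarato/boilerdata",
    ) as f:
        print(f.read()[:200])  # preview the first few hundred characters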

Project information

Contributors

Blake Naccarato (💻 code)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

boilerdata-2024.1.2.tar.gz (24.0 kB)

Uploaded Source

Built Distribution

boilerdata-2024.1.2-py3-none-any.whl (24.2 kB)

Uploaded Python 3

File details

Details for the file boilerdata-2024.1.2.tar.gz.

File metadata

  • Download URL: boilerdata-2024.1.2.tar.gz
  • Size: 24.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for boilerdata-2024.1.2.tar.gz

  • SHA256: 71fe8c864fb3248a2143e7d260f73e3ffc8e167b47ca45fddcee050bd7532253
  • MD5: 404ddc14d5472edfb032c07004e23f5c
  • BLAKE2b-256: b5d4f0efde1913f1b590eb59b89e7ef7aada69a5ed8e89ba971fd4647b576c7a


File details

Details for the file boilerdata-2024.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for boilerdata-2024.1.2-py3-none-any.whl

  • SHA256: 1675f9601f21bfcf2a6813962d045ff0f109b7e727d1defe8618d40674a861ca
  • MD5: 94db6653b9d9cf7e2b2a34ac9b71ab0c
  • BLAKE2b-256: 25c443f8c31dd83c1bfd2e81ead3a39fe85b7c5db7af78194611d20cd7e50783

