AutoML for Renewable Energy Industries.
An open source project from Data to AI Lab at MIT.
GreenGuard
- License: MIT
- Documentation: https://signals-dev.github.io/GreenGuard
- Homepage: https://github.com/signals-dev/GreenGuard
Overview
The GreenGuard project is a collection of end-to-end solutions for machine learning problems commonly found in monitoring wind energy production systems. Most tasks use sensor data coming from these monitoring systems. GreenGuard builds on the foundational innovations for automating machine learning developed at the Data to AI Lab at MIT.
The salient aspects of this customized project are:
- A set of ready to use, well tested pipelines for different machine learning tasks. These are vetted through testing across multiple publicly available datasets for the same task.
- An easy interface to specify the task and pipeline, generate results, and summarize them.
- A production ready, deployable pipeline.
- An easy interface to tune pipelines using the Bayesian Tuning and Bandits library.
- A community oriented infrastructure to incorporate new pipelines.
- A robust continuous integration and testing infrastructure.
- A learning database recording all past outcomes (tasks, pipelines, outcomes).
Resources
Install
Requirements
GreenGuard has been developed and runs on Python 3.6, 3.7 and 3.8.
Although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where you are trying to run GreenGuard.
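As a quick sanity check before installing, the interpreter version can be compared against the versions listed above. This is only a minimal sketch: the helper name `is_supported_python` is hypothetical and is not part of GreenGuard.

```python
import sys

# Python versions listed as supported in the Requirements section.
SUPPORTED_VERSIONS = {(3, 6), (3, 7), (3, 8)}


def is_supported_python(version_info=None):
    """Return True if (major, minor) is one of the listed versions.

    Hypothetical helper for illustration; it is not part of GreenGuard.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


print("Supported interpreter:", is_supported_python())
```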
Download and Install
GreenGuard can be installed locally using pip with the following command:
pip install greenguard
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
Docker usage
GreenGuard is prepared to be run inside a Docker environment. Please check the Docker documentation for details about how to run GreenGuard using Docker.
Data Format
The minimum input expected by the GreenGuard system consists of the following two elements, which need to be passed as pandas.DataFrame objects:
Target Times
A table containing the specification of the problem that we are solving, which has three columns:
- turbine_id: Unique identifier of the turbine which this label corresponds to.
- cutoff_time: Time associated with this target.
- target: The value that we want to predict. This can either be a numerical value or a categorical label. This column can also be skipped when preparing data that will be used only to make predictions and not to fit any pipeline.
| | turbine_id | cutoff_time | target |
|---|---|---|---|
| 0 | T1 | 2001-01-02 00:00:00 | 0 |
| 1 | T1 | 2001-01-03 00:00:00 | 1 |
| 2 | T2 | 2001-01-04 00:00:00 | 0 |
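The example table above could be built in memory along these lines. This is a minimal sketch that assumes `cutoff_time` is stored as a pandas datetime column, which the description above does not state explicitly.

```python
import pandas as pd

# Same three rows as the example Target Times table.
target_times = pd.DataFrame({
    "turbine_id": ["T1", "T1", "T2"],
    # Assumption: cutoff_time is kept as a datetime column.
    "cutoff_time": pd.to_datetime([
        "2001-01-02 00:00:00",
        "2001-01-03 00:00:00",
        "2001-01-04 00:00:00",
    ]),
    "target": [0, 1, 0],
})

# When the data is only used to make predictions, the target column can be skipped.
prediction_only = target_times.drop(columns=["target"])
```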
Readings
A table containing the signal data from the different sensors, with the following columns:
- turbine_id: Unique identifier of the turbine which this reading comes from.
- signal_id: Unique identifier of the signal which this reading comes from.
- timestamp (datetime): Time at which the reading took place, as a datetime.
- value (float): Numeric value of this reading.
| | turbine_id | signal_id | timestamp | value |
|---|---|---|---|---|
| 0 | T1 | S1 | 2001-01-01 00:00:00 | 1 |
| 1 | T1 | S1 | 2001-01-01 12:00:00 | 2 |
| 2 | T1 | S1 | 2001-01-02 00:00:00 | 3 |
| 3 | T1 | S1 | 2001-01-02 12:00:00 | 4 |
| 4 | T1 | S1 | 2001-01-03 00:00:00 | 5 |
| 5 | T1 | S1 | 2001-01-03 12:00:00 | 6 |
| 6 | T1 | S2 | 2001-01-01 00:00:00 | 7 |
| 7 | T1 | S2 | 2001-01-01 12:00:00 | 8 |
| 8 | T1 | S2 | 2001-01-02 00:00:00 | 9 |
| 9 | T1 | S2 | 2001-01-02 12:00:00 | 10 |
| 10 | T1 | S2 | 2001-01-03 00:00:00 | 11 |
| 11 | T1 | S2 | 2001-01-03 12:00:00 | 12 |
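A readings table with this layout can be assembled and checked before it is handed to a pipeline. The sketch below rebuilds the example table; the helper `check_readings` is hypothetical and only encodes the column descriptions given above, not any validation that GreenGuard itself performs.

```python
import pandas as pd

REQUIRED_READINGS_COLUMNS = ["turbine_id", "signal_id", "timestamp", "value"]


def check_readings(readings):
    """Raise ValueError if the table does not match the layout described above.

    Hypothetical helper for illustration; it is not part of GreenGuard.
    """
    missing = [c for c in REQUIRED_READINGS_COLUMNS if c not in readings.columns]
    if missing:
        raise ValueError("Missing columns: {}".format(missing))
    if not pd.api.types.is_datetime64_any_dtype(readings["timestamp"]):
        raise ValueError("'timestamp' must be a datetime column")
    if not pd.api.types.is_numeric_dtype(readings["value"]):
        raise ValueError("'value' must be numeric")


# Two signals of turbine T1, six readings each, taken 12 hours apart.
timestamps = pd.date_range("2001-01-01", periods=6, freq=pd.Timedelta(hours=12))
readings = pd.DataFrame({
    "turbine_id": ["T1"] * 12,
    "signal_id": ["S1"] * 6 + ["S2"] * 6,
    "timestamp": list(timestamps) * 2,
    "value": [float(v) for v in range(1, 13)],
})

check_readings(readings)  # passes silently for a well-formed table
```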
Turbines
Optionally, a third table can be added containing metadata about the turbines. The only requirement for this table is to have a turbine_id field, and it can have an arbitrary number of additional fields.
| | turbine_id | manufacturer | ... | ... | ... |
|---|---|---|---|---|---|
| 0 | T1 | Siemens | ... | ... | ... |
| 1 | T2 | Siemens | ... | ... | ... |
CSV Format
Apart from the in-memory data format explained above, which is limited by the memory available on the system where it is run, GreenGuard is also prepared to load and work with data stored as a collection of CSV files, drastically increasing the amount of data it can work with. Further details about this format can be found in the project documentation site.
Quickstart
In this example we will load some demo data and classify it using a GreenGuard Pipeline.
1. Load and split the demo data
The first step is to load the demo data.
For this, we will import and call the greenguard.demo.load_demo function without any arguments:
from greenguard.demo import load_demo
target_times, readings = load_demo()
The returned objects are:
- target_times: A pandas.DataFrame with the target_times table data:

| | turbine_id | cutoff_time | target |
|---|---|---|---|
| 0 | T001 | 2013-01-12 | 0 |
| 1 | T001 | 2013-01-13 | 0 |
| 2 | T001 | 2013-01-14 | 0 |
| 3 | T001 | 2013-01-15 | 1 |
| 4 | T001 | 2013-01-16 | 0 |

- readings: A pandas.DataFrame containing the time series data in the format explained above:

| | turbine_id | signal_id | timestamp | value |
|---|---|---|---|---|
| 0 | T001 | S01 | 2013-01-10 | 323.0 |
| 1 | T001 | S02 | 2013-01-10 | 320.0 |
| 2 | T001 | S03 | 2013-01-10 | 284.0 |
| 3 | T001 | S04 | 2013-01-10 | 348.0 |
| 4 | T001 | S05 | 2013-01-10 | 273.0 |
Once we have loaded the target_times and before proceeding to train any Machine Learning Pipeline, we have to split them into two partitions for training and testing.
In this case, we will split them using the train_test_split function from scikit-learn, but it can be done with any other suitable tool.
from sklearn.model_selection import train_test_split
train, test = train_test_split(target_times, test_size=0.25, random_state=0)
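Since any suitable tool can be used for the split, one alternative is a chronological split that keeps the most recent cutoff times for testing. The following is a minimal sketch using only pandas; the helper `split_by_time` is hypothetical, and the toy table merely mimics the columns of the demo target_times.

```python
import pandas as pd


def split_by_time(target_times, test_size=0.25):
    """Split target_times chronologically: earliest rows train, latest rows test.

    Hypothetical helper for illustration; it is not part of GreenGuard.
    """
    ordered = target_times.sort_values("cutoff_time")
    n_test = max(1, int(round(len(ordered) * test_size)))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]


# Toy table with the same columns as the demo target_times (made-up values).
target_times = pd.DataFrame({
    "turbine_id": ["T001"] * 8,
    "cutoff_time": pd.date_range("2013-01-12", periods=8, freq="D"),
    "target": [0, 0, 0, 1, 0, 0, 1, 0],
})

train, test = split_by_time(target_times, test_size=0.25)
```

Unlike a random split, this keeps every test cutoff time later than every training one, which may matter when the targets are ordered in time.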
Notice how we are only splitting the target_times data and not the readings.
This is because the pipelines will later on take care of selecting the parts of the readings table needed for the training based on the information found inside the train and test inputs.
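To make the idea of selecting parts of the readings table more concrete, the sketch below keeps, for each target row, the readings of the same turbine that fall inside a time window ending at its cutoff_time. The helper `select_readings`, the window length, and the choice to include readings taken exactly at the cutoff are all assumptions made for illustration; the GreenGuard pipelines may select data differently.

```python
import pandas as pd


def select_readings(target_times, readings, window=pd.Timedelta(days=1)):
    """Map each target row index to the readings of its turbine that fall
    in the interval (cutoff_time - window, cutoff_time].

    Hypothetical helper for illustration; not GreenGuard's implementation.
    """
    selected = {}
    for index, row in target_times.iterrows():
        same_turbine = readings["turbine_id"] == row["turbine_id"]
        in_window = (
            (readings["timestamp"] > row["cutoff_time"] - window)
            & (readings["timestamp"] <= row["cutoff_time"])
        )
        selected[index] = readings[same_turbine & in_window]
    return selected


# Small tables in the format described in the Data Format section.
target_times = pd.DataFrame({
    "turbine_id": ["T1", "T1"],
    "cutoff_time": pd.to_datetime(["2001-01-02", "2001-01-03"]),
})
readings = pd.DataFrame({
    "turbine_id": ["T1"] * 6,
    "signal_id": ["S1"] * 6,
    "timestamp": pd.date_range(
        "2001-01-01", periods=6, freq=pd.Timedelta(hours=12)
    ),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

selected = select_readings(target_times, readings)
```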
Additionally, if we want to calculate a goodness-of-fit score later on, we can separate the testing target values from the test table by popping them from it:
test_targets = test.pop('target')
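As a small illustration of what popping does, the sketch below uses a made-up two-row table: `pop` removes the column from the table in place and returns it as a pandas Series.

```python
import pandas as pd

# Made-up test partition with the same columns as target_times.
test = pd.DataFrame({
    "turbine_id": ["T001", "T001"],
    "cutoff_time": pd.to_datetime(["2013-01-15", "2013-01-16"]),
    "target": [1, 0],
})

# The target column is removed from `test` and returned as a Series.
test_targets = test.pop("target")

print(list(test.columns))
print(list(test_targets))
```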
2. Exploring the available Pipelines
Once we have the data ready, we need to find a suitable pipeline.
The list of available GreenGuard Pipelines can be obtained using the greenguard.get_pipelines function.
from greenguard import get_pipelines
pipelines = get_pipelines()
The returned pipelines variable will be a list containing the names of all the pipelines available in the GreenGuard system:
['classes.unstack_double_lstm_timeseries_classifier',
'classes.unstack_lstm_timeseries_classifier',
'classes.unstack_normalize_dfs_xgb_classifier',
'classes.unstack_dfs_xgb_classifier',
'classes.normalize_dfs_xgb_classifier']
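Because the result is a plain list of strings, it can be filtered with ordinary Python. The sketch below hardcodes the names shown above and groups them by the fragments "dfs" and "lstm" that appear in them; treating these fragments as pipeline families is an assumption based only on the names.

```python
# Pipeline names as listed above (hardcoded so the example is self-contained).
pipelines = [
    'classes.unstack_double_lstm_timeseries_classifier',
    'classes.unstack_lstm_timeseries_classifier',
    'classes.unstack_normalize_dfs_xgb_classifier',
    'classes.unstack_dfs_xgb_classifier',
    'classes.normalize_dfs_xgb_classifier',
]

# Group the names by the fragment that appears in them.
dfs_pipelines = [name for name in pipelines if '_dfs_' in name]
lstm_pipelines = [name for name in pipelines if '_lstm_' in name]

print(dfs_pipelines)
print(lstm_pipelines)
```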
For the rest of this tutorial, we will select and use the pipeline classes.normalize_dfs_xgb_classifier as our template.
pipeline_name = 'classes.normalize_dfs_xgb_classifier'
3. Fitting the Pipeline
Once we have loaded the data and selected the pipeline that we will use, we have to fit it.
For this, we will create an instance of a GreenGuardPipeline object passing the name of the pipeline that we want to use:
from greenguard.pipeline import GreenGuardPipeline
pipeline = GreenGuardPipeline(pipeline_name)
And then we can directly fit it to our data by calling its fit method and passing in the training target_times and the complete readings table:
pipeline.fit(train, readings)
4. Make predictions
After fitting the pipeline, we are ready to make predictions on new data by calling the pipeline.predict method passing the testing target_times and, again, the complete readings table.
predictions = pipeline.predict(test, readings)
5. Evaluate the goodness-of-fit
Finally, after making predictions we can evaluate how good the prediction was using any suitable metric.
from sklearn.metrics import f1_score
f1_score(test_targets, predictions)
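For readers unfamiliar with the metric, the F1 score for binary labels is the harmonic mean of precision and recall. The sketch below computes it from first principles on made-up labels; the helper `f1_from_labels` is hypothetical, and the labels are not results produced by GreenGuard.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the binary F1 score as the harmonic mean of precision and recall.

    Hypothetical helper for illustration; returns 0.0 when there are no
    true positives.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

With these labels both precision and recall are 2/3, so the score is also 2/3.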
What's next?
For more details about GreenGuard and all its possibilities and features, please check the project documentation site. Also, do not forget to have a look at the tutorials!
History
0.3.0 - 2021-01-22
This release increases the supported version of python to 3.8 and also includes changes in the installation requirements, where the pandas and scikit-optimize packages have been updated to support higher versions. These changes come together with the newer versions of MLBlocks and MLPrimitives.
Internal Improvements
- Fix run_benchmark generating properly the init_hyperparameters for the pipelines.
- New FPR metric.
- New roc_auc_score metric.
- Multiple benchmarking metrics allowed.
- Multiple tpr or threshold values allowed for the benchmark.
0.2.6 - 2020-10-23
- Fix mkdir when exporting the benchmark results to a csv file.
- Intermediate steps for the pipelines with demo notebooks for each pipeline.
Resolved Issues
- Issue #50: Expose partial outputs and executions in the GreenGuardPipeline.
0.2.5 - 2020-10-09
With this release we include:
- run_benchmark: A function within the module benchmark that allows the user to evaluate templates against problems with different window size and resample rules.
- summarize_results: A function that given a csv file generates a xlsx file with a summary tab and a detailed tab with the results from run_benchmark.
0.2.4 - 2020-09-25
- Fix dependency errors
0.2.3 - 2020-08-10
- Added benchmarking module.
0.2.2 - 2020-07-10
Internal Improvements
- Added github actions.
Resolved Issues
- Issue #27: Cache Splits pre-processed data on disk
0.2.1 - 2020-06-16
This release lets the user specify more than one template when creating a GreenGuardPipeline. When the tune method of the pipeline is called, it returns an instance of BTBSession, which is in charge of selecting the templates and tuning their hyperparameters to find the best pipeline.
Internal Improvements
- Resample by filename inside the CSVLoader to avoid oversampling of data that will not be used.
- Select targets now allows them to be equal.
- Fixed the csv filename format.
- Upgraded to BTB.
Bug Fixes
- Issue #33: Wrong default datetime format
Resolved Issues
- Issue #35: Select targets is too strict
- Issue #36: resample by filename inside csvloader
- Issue #39: Upgrade BTB
- Issue #41: Fix CSV filename format
0.2.0 - 2020-02-14
First stable release:
- efficient data loading and preprocessing
- initial collection of dfs and lstm based pipelines
- optimized pipeline tuning
- documentation and tutorials
0.1.0
- First release on PyPI
Download files
Source Distribution
Built Distribution
File details
Details for the file greenguard-0.3.0.tar.gz.
File metadata
- Download URL: greenguard-0.3.0.tar.gz
- Upload date:
- Size: 916.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ff52dcdea1c15c4eea1b9427001abf16341778e89615f36018d8479bd9af867 |
| MD5 | 37992e2ec4995c835462f55c84edb151 |
| BLAKE2b-256 | c4190378a98166d083fb771c23efcc530c1e109cc272fb7b2e3fe38bee3160a5 |
File details
Details for the file greenguard-0.3.0-py2.py3-none-any.whl.
File metadata
- Download URL: greenguard-0.3.0-py2.py3-none-any.whl
- Upload date:
- Size: 51.0 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3fd1c7cb50e3ec61de376f55185c85216f79dbb9b33d63e9710555166150149d |
| MD5 | 6e2a4c093e50e799d029180b7cddc34e |
| BLAKE2b-256 | dc17f4c38e1928f7258a55f10b91c60c1c028d9372d9306426a059b53b9bbcd8 |