AutoML for Time Series.

An open source project from Data to AI Lab at MIT.

Draco

License: MIT
Documentation: https://sintel-dev.github.io/Draco
Homepage: https://github.com/sintel-dev/Draco

Overview

The Draco project is a collection of end-to-end solutions for machine learning problems commonly found in time series monitoring systems. Most tasks use sensor data emanating from monitoring systems. We build on the foundational innovations developed for the automation of machine learning at the Data to AI Lab at MIT.

The salient aspects of this customized project are:

A set of ready to use, well tested pipelines for different machine learning tasks. These are vetted through testing across multiple publicly available datasets for the same task.
An easy interface to specify the task and pipeline, generate results, and summarize them.
A production ready, deployable pipeline.
An easy interface to tune pipelines using the Bayesian Tuning and Bandits (BTB) library.
A community oriented infrastructure to incorporate new pipelines.
A robust continuous integration and testing infrastructure.
A learning database recording all past runs: tasks, pipelines, and outcomes.

Resources

Install

Requirements

Draco has been developed and runs on Python 3.6, 3.7 and 3.8.

Also, although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where you run Draco.
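As a quick sanity check before installing, the interpreter version can be compared against the versions listed above. This is a minimal sketch, not part of Draco: the `SUPPORTED_VERSIONS` constant and the `is_supported` helper are hypothetical names introduced only for this illustration.

```python
import sys

# Versions listed in the Requirements section above.
SUPPORTED_VERSIONS = {(3, 6), (3, 7), (3, 8)}


def is_supported(version_info=None):
    """Return True if the major.minor version is one of the listed versions.

    Hypothetical helper for illustration; it is not part of Draco.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if not is_supported():
    print('Warning: this Python version is not among the ones listed for Draco.')
```

Other Python versions may or may not work; the check only reflects the versions named in this document.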

Download and Install

Draco can be installed locally using pip with the following command:

pip install draco-ml

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Data Format

The minimum input expected by the Draco system consists of the following two elements, which need to be passed as pandas.DataFrame objects:

Target Times

A table containing the specification of the problem that we are solving, which has three columns:

turbine_id: Unique identifier of the turbine which this label corresponds to.
cutoff_time: Time associated with this target.
target: The value that we want to predict. This can either be a numerical value or a categorical label. This column can also be skipped when preparing data that will be used only to make predictions and not to fit any pipeline.

	turbine_id	cutoff_time	target
0	T1	2001-01-02 00:00:00	0
1	T1	2001-01-03 00:00:00	1
2	T2	2001-01-04 00:00:00	0
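The example table above could be built in memory roughly as follows. This is a sketch that assumes `cutoff_time` is stored as a pandas datetime column; the `prediction_times` name is introduced here only to illustrate the case where the target column is skipped.

```python
import pandas as pd

# Rows copied from the example target times table above.
target_times = pd.DataFrame({
    'turbine_id': ['T1', 'T1', 'T2'],
    'cutoff_time': pd.to_datetime(['2001-01-02', '2001-01-03', '2001-01-04']),
    'target': [0, 1, 0],
})

# For data that is only used to make predictions, the target column
# can be left out (prediction_times is a name made up for this example).
prediction_times = target_times.drop(columns=['target'])
```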

Readings

A table containing the signal data from the different sensors, with the following columns:

turbine_id: Unique identifier of the turbine which this reading comes from.
signal_id: Unique identifier of the signal which this reading comes from.
timestamp (datetime): Time when the reading took place, as a datetime.
value (float): Numeric value of this reading.

	turbine_id	signal_id	timestamp	value
0	T1	S1	2001-01-01 00:00:00	1
1	T1	S1	2001-01-01 12:00:00	2
2	T1	S1	2001-01-02 00:00:00	3
3	T1	S1	2001-01-02 12:00:00	4
4	T1	S1	2001-01-03 00:00:00	5
5	T1	S1	2001-01-03 12:00:00	6
6	T1	S2	2001-01-01 00:00:00	7
7	T1	S2	2001-01-01 12:00:00	8
8	T1	S2	2001-01-02 00:00:00	9
9	T1	S2	2001-01-02 12:00:00	10
10	T1	S2	2001-01-03 00:00:00	11
11	T1	S2	2001-01-03 12:00:00	12
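For reference, the readings table shown above can be reproduced with plain pandas. This is only a sketch of the long format (one row per turbine, signal and timestamp), and it casts `value` to float because that is the type stated in the column description.

```python
import pandas as pd

# Six readings per signal, one every 12 hours, as in the example table above.
timestamps = pd.date_range('2001-01-01', periods=6, freq=pd.Timedelta(hours=12))

readings = pd.concat([
    pd.DataFrame({
        'turbine_id': 'T1',
        'signal_id': signal_id,
        'timestamp': timestamps,
        'value': values,
    })
    for signal_id, values in [('S1', list(range(1, 7))), ('S2', list(range(7, 13)))]
], ignore_index=True)

# The value column is described as float, so cast it explicitly.
readings['value'] = readings['value'].astype(float)
```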

Turbines

Optionally, a third table can be added containing metadata about the turbines. The only requirement for this table is to have a turbine_id field, and it can have an arbitrary number of additional fields.

	turbine_id	manufacturer	...	...	...
0	T1	Siemens	...	...	...
1	T2	Siemens	...	...	...
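The column requirements described in this section can be checked before passing the tables to a pipeline. The helper below is a hypothetical sketch, not a Draco function: `REQUIRED_COLUMNS` and `missing_columns` are names made up for this example, and the column lists are taken from the table descriptions above.

```python
import pandas as pd

# Column names taken from the table descriptions above.
REQUIRED_COLUMNS = {
    'target_times': ['turbine_id', 'cutoff_time'],  # target is optional for prediction
    'readings': ['turbine_id', 'signal_id', 'timestamp', 'value'],
    'turbines': ['turbine_id'],
}


def missing_columns(table, kind):
    """Return the required columns of the given kind that the table lacks.

    Hypothetical helper for illustration; it is not part of Draco.
    """
    return [column for column in REQUIRED_COLUMNS[kind] if column not in table.columns]


turbines = pd.DataFrame({
    'turbine_id': ['T1', 'T2'],
    'manufacturer': ['Siemens', 'Siemens'],
})

# A readings table that lacks two of its required columns.
incomplete = pd.DataFrame({'turbine_id': ['T1'], 'value': [1.0]})
```

With these inputs, `missing_columns(turbines, 'turbines')` returns an empty list, while `missing_columns(incomplete, 'readings')` reports `signal_id` and `timestamp`.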

CSV Format

Apart from the in-memory data format explained above, which is limited by the memory available on the system where it is run, Draco can also load and work with data stored as a collection of CSV files, drastically increasing the amount of data it can work with. Further details about this format can be found in the project documentation site.
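To illustrate why reading from CSV helps with memory, the sketch below streams a readings file in chunks with plain pandas and keeps only the rows of one turbine. It does not use Draco's CSV loader, and the layout is an assumption made for this example: a single CSV with the same four columns as the readings table, which may differ from the layout described in the project documentation. The `load_turbine_readings` helper is hypothetical.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the single-file layout is an assumption.
csv_file = io.StringIO(
    'turbine_id,signal_id,timestamp,value\n'
    'T1,S1,2001-01-01 00:00:00,1\n'
    'T1,S1,2001-01-01 12:00:00,2\n'
    'T2,S1,2001-01-01 00:00:00,3\n'
    'T1,S2,2001-01-01 00:00:00,7\n'
)


def load_turbine_readings(source, turbine_id, chunksize=2):
    """Read a readings CSV in chunks, keeping only the rows of one turbine.

    Hypothetical helper for illustration; it is not part of Draco.
    """
    chunks = pd.read_csv(source, parse_dates=['timestamp'], chunksize=chunksize)
    selected = [chunk[chunk['turbine_id'] == turbine_id] for chunk in chunks]
    return pd.concat(selected, ignore_index=True)


t1_readings = load_turbine_readings(csv_file, 'T1')
```

Only one chunk is held in memory at a time while reading, so the peak memory use depends on `chunksize` and on the size of the selected subset rather than on the size of the whole file.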

Quickstart

In this example we will load some demo data and classify it using a Draco Pipeline.

1. Load and split the demo data

The first step is to load the demo data.

For this, we will import and call the draco.demo.load_demo function without any arguments:

from draco.demo import load_demo

target_times, readings = load_demo()

The returned objects are:

target_times: A pandas.DataFrame with the target_times table data:

  turbine_id cutoff_time  target
0       T001  2013-01-12       0
1       T001  2013-01-13       0
2       T001  2013-01-14       0
3       T001  2013-01-15       1
4       T001  2013-01-16       0

readings: A pandas.DataFrame containing the time series data in the format explained above.

  turbine_id signal_id  timestamp  value
0       T001       S01 2013-01-10  323.0
1       T001       S02 2013-01-10  320.0
2       T001       S03 2013-01-10  284.0
3       T001       S04 2013-01-10  348.0
4       T001       S05 2013-01-10  273.0

Once we have loaded the target_times, and before training any Machine Learning Pipeline, we have to split them into 2 partitions for training and testing.

In this case, we will split them using the train_test_split function from scikit-learn, but it can be done with any other suitable tool.

from sklearn.model_selection import train_test_split

train, test = train_test_split(target_times, test_size=0.25, random_state=0)

Notice how we are only splitting the target_times data and not the readings. This is because the pipelines will later on take care of selecting the parts of the readings table needed for the training based on the information found inside the train and test inputs.
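The idea of selecting readings from the information in target_times can be pictured with a small pandas sketch. This is not Draco's implementation: the `readings_for_target` helper, the two-day window and the choice to include readings taken exactly at the cutoff time are all assumptions made for this illustration.

```python
import pandas as pd

# Toy readings table, made up for this example.
readings = pd.DataFrame({
    'turbine_id': ['T1'] * 4 + ['T2'] * 2,
    'signal_id': ['S1'] * 6,
    'timestamp': pd.to_datetime([
        '2001-01-01', '2001-01-02', '2001-01-03', '2001-01-04',
        '2001-01-01', '2001-01-02',
    ]),
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})


def readings_for_target(readings, turbine_id, cutoff_time, window_days=2):
    """Select one turbine's readings inside the window that ends at cutoff_time.

    Hypothetical helper; the window size and boundaries are assumptions.
    """
    cutoff_time = pd.Timestamp(cutoff_time)
    start = cutoff_time - pd.Timedelta(days=window_days)
    mask = (
        (readings['turbine_id'] == turbine_id)
        & (readings['timestamp'] > start)
        & (readings['timestamp'] <= cutoff_time)
    )
    return readings[mask]


# Readings relevant for the target (turbine T1, cutoff 2001-01-03).
selected = readings_for_target(readings, 'T1', '2001-01-03')
```

Because each target row carries its own turbine_id and cutoff_time, the same complete readings table can serve both the train and the test partitions.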

Additionally, if we want to calculate a goodness-of-fit score later on, we can separate the testing target values from the test table by popping them from it:

test_targets = test.pop('target')
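The effect of `pop` can be seen on a toy table. The rows below are made up for this example and only mimic the shape of the demo data.

```python
import pandas as pd

# Toy test partition, made up for this example.
test = pd.DataFrame({
    'turbine_id': ['T001', 'T001'],
    'cutoff_time': pd.to_datetime(['2013-01-15', '2013-01-16']),
    'target': [1, 0],
})

# pop removes the column from the table in place and returns it as a Series.
test_targets = test.pop('target')
```

After the call, `test` only contains turbine_id and cutoff_time, and `test_targets` keeps the same index as `test`, so predictions can later be compared row by row.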

2. Exploring the available Pipelines

Once we have the data ready, we need to find a suitable pipeline.

The list of available Draco Pipelines can be obtained using the draco.get_pipelines function.

from draco import get_pipelines

pipelines = get_pipelines()

The returned pipelines variable will be a list containing the names of all the pipelines available in the Draco system:

['lstm',
 'lstm_with_unstack',
 'double_lstm',
 'double_lstm_with_unstack']

For the rest of this tutorial, we will select and use the pipeline lstm_with_unstack as our template.

pipeline_name = 'lstm_with_unstack'

3. Fitting the Pipeline

Once we have loaded the data and selected the pipeline that we will use, we have to fit it.

For this, we will create an instance of a DracoPipeline object passing the name of the pipeline that we want to use:

from draco.pipeline import DracoPipeline

pipeline = DracoPipeline(pipeline_name)

And then we can directly fit it to our data by calling its fit method and passing in the training target_times and the complete readings table:

pipeline.fit(train, readings)

4. Make predictions

After fitting the pipeline, we are ready to make predictions on new data by calling the pipeline.predict method passing the testing target_times and, again, the complete readings table.

predictions = pipeline.predict(test, readings)

5. Evaluate the goodness-of-fit

Finally, after making predictions we can evaluate how good the prediction was using any suitable metric.

from sklearn.metrics import f1_score

f1_score(test_targets, predictions)
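To make the metric concrete, here is a minimal pure-Python sketch of what a binary F1 score computes, treating 1 as the positive class. The `binary_f1` function and the example labels are hypothetical and only meant for illustration; in practice the scikit-learn function used above is the better choice.

```python
def binary_f1(y_true, y_pred):
    """F1 score for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for true, pred in pairs if true == 1 and pred == 1)
    false_positives = sum(1 for true, pred in pairs if true == 0 and pred == 1)
    false_negatives = sum(1 for true, pred in pairs if true == 1 and pred == 0)

    # F1 = 2 * TP / (2 * TP + FP + FN), the harmonic mean of precision and recall.
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0  # no positives in either the labels or the predictions

    return 2 * true_positives / denominator


# Hypothetical labels and predictions, made up for this illustration.
example_targets = [0, 0, 1, 1, 0, 1]
example_predictions = [0, 1, 1, 0, 0, 1]

score = binary_f1(example_targets, example_predictions)
```

In this example there are 2 true positives, 1 false positive and 1 false negative, so the score is 4 / 6, roughly 0.67.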

What's next?

For more details about Draco and all its possibilities and features, please check the project documentation site. Also, do not forget to have a look at the tutorials!

History

0.3.0 - 2022-07-31

This release switches from MLPrimitives to ml-stars. Moreover, we remove all pipelines using deep feature synthesis.

Update demo bucket - Issue #76 by @sarahmish
Remove dfs based pipelines - Issue #73 by @sarahmish
Move from MLPrimitives to ml-stars - Issue #72 by @sarahmish

0.2.0 - 2022-04-12

This release features a reorganization and renaming of Draco pipelines. In addition, we update some of the dependencies for general housekeeping.

Update Draco dependencies - Issue #66 by @sarahmish
Reorganize pipelines - Issue #63 by @sarahmish

0.1.0 - 2022-01-01

First release of draco-ml on PyPI

Previous GreenGuard development

0.3.0 - 2021-01-22

This release increases the supported version of Python to 3.8 and also includes changes in the installation requirements, where the pandas and scikit-optimize packages have been updated to support higher versions. These changes come together with the newer versions of MLBlocks and MLPrimitives.

Internal Improvements

Fix run_benchmark so that it properly generates the init_hyperparameters for the pipelines.
New FPR metric.
New roc_auc_score metric.
Multiple benchmarking metrics allowed.
Multiple tpr or threshold values allowed for the benchmark.

0.2.6 - 2020-10-23

Fix mkdir when exporting the benchmark results to a csv file.
Intermediate steps for the pipelines with demo notebooks for each pipeline.

Resolved Issues

Issue #50: Expose partial outputs and executions in the GreenGuardPipeline.

0.2.5 - 2020-10-09

With this release we include:

run_benchmark: A function within the module benchmark that allows the user to evaluate templates against problems with different window size and resample rules.
summarize_results: A function that, given a csv file, generates an xlsx file with a summary tab and a detailed tab with the results from run_benchmark.

0.2.4 - 2020-09-25

Fix dependency errors

0.2.3 - 2020-08-10

Added benchmarking module.

0.2.2 - 2020-07-10

Internal Improvements

Added github actions.

Resolved Issues

Issue #27: Cache Splits pre-processed data on disk

0.2.1 - 2020-06-16

With this release the user can specify more than one template when creating a GreenGuardPipeline. When its tune method is called, an instance of BTBSession is returned, which is in charge of selecting the templates and tuning their hyperparameters until the best pipeline is found.

Internal Improvements

Resample by filename inside the CSVLoader to avoid oversampling of data that will not be used.
Select targets now allows them to be equal.
Fixed the csv filename format.
Upgraded to BTB.

Bug Fixes

Issue #33: Wrong default datetime format

Resolved Issues

Issue #35: Select targets is too strict
Issue #36: resample by filename inside csvloader
Issue #39: Upgrade BTB
Issue #41: Fix CSV filename format

0.2.0 - 2020-02-14

First stable release:

efficient data loading and preprocessing
initial collection of dfs and lstm based pipelines
optimized pipeline tuning
documentation and tutorials

0.1.0

First release on PyPI
