Pipeline Profiler tool. Enables the exploration of D3M pipelines in Jupyter Notebooks
PipelineProfiler
An AutoML pipeline exploration tool for Jupyter Notebooks. It supports the auto-sklearn and D3M pipeline formats.
(Shift click to select multiple pipelines)
Paper: https://arxiv.org/abs/2005.00160
Video: https://youtu.be/2WSYoaxLLJ8
Blog: Medium post
Demo
Live demo (Google Colab):
In Jupyter Notebook:
import PipelineProfiler
data = PipelineProfiler.get_heartstatlog_data()
PipelineProfiler.plot_pipeline_matrix(data)
Install
Option 1: install via pip:
pip install pipelineprofiler
Option 2: build and run the Docker image:
docker build -t pipelineprofiler .
docker run -p 9999:8888 pipelineprofiler
Then copy the access token and log in to Jupyter in your browser at:
localhost:9999
Data preprocessing
PipelineProfiler reads data from the D3M Metalearning database. You can download this data from: https://metalearning.datadrivendiscovery.org/dumps/2020/03/04/metalearningdb_dump_20200304.tar.gz
To explore the pipelines, you first need to merge two files, pipelines.json and pipeline_runs.json. To do so, run:
python -m PipelineProfiler.pipeline_merge [-n NUMBER_PIPELINES] pipeline_runs_file pipelines_file output_file
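As a rough illustration of what merging the two files means, the sketch below joins run records to the pipelines they executed. It is not the code of PipelineProfiler.pipeline_merge: the function name, the "runs" key, and the assumed layout (each pipeline has an "id", each run points to its pipeline through run["pipeline"]["id"]) are hypothetical, and the limit argument only loosely mirrors the -n option.

```python
from collections import defaultdict


def merge_runs_into_pipelines(pipelines, pipeline_runs, limit=None):
    """Attach each run to the pipeline it executed (illustrative join only).

    Assumed, hypothetical layout: every pipeline has an "id", and every run
    references its pipeline through run["pipeline"]["id"].
    """
    # Index the runs by the id of the pipeline they belong to
    runs_by_pipeline = defaultdict(list)
    for run in pipeline_runs:
        runs_by_pipeline[run["pipeline"]["id"]].append(run)

    merged = []
    for pipeline in pipelines:
        runs = runs_by_pipeline.get(pipeline["id"], [])
        if not runs:
            continue  # skip pipelines that were never run
        merged.append({**pipeline, "runs": runs})
        if limit is not None and len(merged) >= limit:
            break
    return merged


# Tiny synthetic inputs, not taken from the real database dump
example_pipelines = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
example_runs = [
    {"id": "r1", "pipeline": {"id": "p1"}},
    {"id": "r2", "pipeline": {"id": "p1"}},
    {"id": "r3", "pipeline": {"id": "p3"}},
]
merged_example = merge_runs_into_pipelines(example_pipelines, example_runs)
```

Here merged_example keeps p1 (with two runs) and p3 (with one run), and drops p2 because no run refers to it.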
Pipeline exploration
import PipelineProfiler
import json
In a Jupyter notebook, load the output_file:
with open("output_file.json", "r") as f:
    pipelines = json.load(f)
and then plot it using:
PipelineProfiler.plot_pipeline_matrix(pipelines[:10])
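The slice pipelines[:10] takes the first ten entries regardless of which problem they solve, so it can help to check which problems are represented before plotting. A minimal sketch, assuming each entry carries the pipeline['problem']['id'] field used by the postprocessing code in this README; the helper name and the sample records are made up for illustration:

```python
from collections import Counter


def count_pipelines_per_problem(pipelines):
    """Return how many pipelines were recorded for each problem id."""
    return Counter(pipeline["problem"]["id"] for pipeline in pipelines)


# Synthetic stand-in for the list loaded from the merged output file
sample_pipelines = [
    {"problem": {"id": "problem_1"}},
    {"problem": {"id": "problem_1"}},
    {"problem": {"id": "problem_2"}},
]
problem_counts = count_pipelines_per_problem(sample_pipelines)
```

With the real data, problem_counts.most_common(5) would list the five problems that have the most pipelines.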
Data postprocessing
You might want to group pipelines by problem type and keep only the top k pipelines from each team. To do so, use the following code:
from collections import defaultdict

def get_top_k_pipelines_team(pipelines, k):
    # Group the pipelines by the team (source) that produced them
    team_pipelines = defaultdict(list)
    for pipeline in pipelines:
        source = pipeline['pipeline_source']['name']
        team_pipelines[source].append(pipeline)
    # Keep the k pipelines with the highest normalized score for each team
    for team in team_pipelines.keys():
        team_pipelines[team] = sorted(team_pipelines[team], key=lambda x: x['scores'][0]['normalized'], reverse=True)
        team_pipelines[team] = team_pipelines[team][:k]
    new_pipelines = []
    for team in team_pipelines.keys():
        new_pipelines.extend(team_pipelines[team])
    return new_pipelines

def sort_pipeline_scores(pipelines):
    # Sort by the raw score value, highest first
    return sorted(pipelines, key=lambda x: x['scores'][0]['value'], reverse=True)

# Group the pipelines by problem id
pipelines_problem = {}
for pipeline in pipelines:
    problem_id = pipeline['problem']['id']
    if problem_id not in pipelines_problem:
        pipelines_problem[problem_id] = []
    pipelines_problem[problem_id].append(pipeline)
# For each problem, keep the top 100 pipelines per team, sorted by score
for problem in pipelines_problem.keys():
    pipelines_problem[problem] = sort_pipeline_scores(get_top_k_pipelines_team(pipelines_problem[problem], k=100))
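To see what this postprocessing produces, the sketch below applies the same idea to a handful of synthetic records. It is a compact, self-contained variant rather than the code above: top_k_per_team and make_record are hypothetical helpers, and the records contain only the fields the postprocessing code reads, with made-up team names and scores.

```python
from collections import defaultdict


def top_k_per_team(pipelines, k):
    """Keep the k pipelines with the highest normalized score for each team."""
    by_team = defaultdict(list)
    for pipeline in pipelines:
        by_team[pipeline["pipeline_source"]["name"]].append(pipeline)
    selected = []
    for team_pipelines in by_team.values():
        ranked = sorted(
            team_pipelines,
            key=lambda p: p["scores"][0]["normalized"],
            reverse=True,
        )
        selected.extend(ranked[:k])
    return selected


def make_record(team, problem_id, score):
    # Synthetic record holding only the fields the postprocessing code reads
    return {
        "pipeline_source": {"name": team},
        "problem": {"id": problem_id},
        "scores": [{"value": score, "normalized": score}],
    }


records = [
    make_record("team_a", "problem_1", 0.70),
    make_record("team_a", "problem_1", 0.90),
    make_record("team_a", "problem_1", 0.80),
    make_record("team_b", "problem_1", 0.60),
    make_record("team_b", "problem_2", 0.95),
]

# Group by problem, then keep the top 2 pipelines per team, highest score first
by_problem = defaultdict(list)
for record in records:
    by_problem[record["problem"]["id"]].append(record)

result = {
    problem_id: sorted(
        top_k_per_team(group, k=2),
        key=lambda p: p["scores"][0]["value"],
        reverse=True,
    )
    for problem_id, group in by_problem.items()
}
problem_1_scores = [p["scores"][0]["value"] for p in result["problem_1"]]
# problem_1_scores == [0.9, 0.8, 0.6]: team_a's 0.70 pipeline was dropped
```

With real data, each list in the grouped dictionary can be passed to PipelineProfiler.plot_pipeline_matrix to compare the pipelines of a single problem.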