DataRobot Provider for Apache Airflow
This package provides operators, sensors, and a hook that integrate DataRobot into Apache Airflow. Using these components, you can build the essential DataRobot pipeline: create a project, train models, deploy a model, and score predictions against the model deployment.
Installation
Prerequisites: Apache Airflow
Install the DataRobot provider:
pip install airflow-provider-datarobot
Connection
In the Airflow user interface, create a new DataRobot connection in Admin > Connections:
- Connection Type: DataRobot
- Connection Id: datarobot_default (default)
- API Key: your-datarobot-api-key
- DataRobot endpoint URL: https://app.datarobot.com/api/v2 (default)
Create the API Key on the DataRobot Developer Tools page, API Keys section (see the DataRobot Docs for more details).
By default, all components use the datarobot_default connection ID.
Config JSON for DAG run
Operators and sensors use parameters from the config JSON that must be submitted when triggering the DAG. Example config JSON with the required parameters:
{
    "training_data": "s3-pre-signed-url-of-training-data",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted"
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://path/to/scoring-data/Diabetes10k.csv",
            "credential_id": "62160b511fb29da8dd5f2c81"
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
            "credential_id": "62160b511fb29da8dd5f2c81"
        }
    }
}
These config values can be accessed in the execute() method of any operator in the DAG via the context["params"] variable. For example, to get the training data URL you would use this in the operator:
def execute(self, context: Dict[str, Any]) -> str:
    ...
    training_data = context["params"]["training_data"]
    ...
Modules
Operators
- CreateProjectOperator - creates a DataRobot project and returns its ID.
  Required config params:
    training_data: str - pre-signed S3 URL to the training dataset
    project_name: str - project name
  The training_data value must be a pre-signed AWS S3 URL.
  For more project settings see the DataRobot docs.
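  A minimal usage sketch, assuming the import path datarobot_provider.operators.datarobot used by the provider's example DAGs (the task_id is illustrative):

  from datarobot_provider.operators.datarobot import CreateProjectOperator

  # Reads "training_data" and "project_name" from the DAG run config
  # and returns (pushes to XCom) the new project's ID.
  create_project_op = CreateProjectOperator(task_id="create_project")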
- TrainModelsOperator - triggers DataRobot Autopilot to train models.
  Parameters:
    project_id: str - DataRobot project ID
  Required config params:
    "autopilot_settings": {
        "target": "readmitted"
    }
  target is a required parameter; it is the name of the column that defines the modeling target.
  For more Autopilot settings see the DataRobot docs.
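  A minimal sketch showing how the project ID from the previous task can be passed in via XCom (assuming project_id is a templated field, as in the provider's example DAGs; the task ids are illustrative):

  from datarobot_provider.operators.datarobot import TrainModelsOperator

  # Starts Autopilot for the project created by the "create_project" task;
  # "autopilot_settings" (including the required "target") come from the DAG run config.
  train_models_op = TrainModelsOperator(
      task_id="train_models",
      project_id="{{ ti.xcom_pull(task_ids='create_project') }}",
  )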
- DeployModelOperator - deploys a specified model to production and returns the deployment ID.
  Parameters:
    model_id: str - DataRobot model ID
  Required config params:
    deployment_label: str - deployment label/name
  For more deployment settings see the DataRobot docs.
- DeployRecommendedModelOperator - deploys the recommended model to production and returns the deployment ID.
  Parameters:
    project_id: str - DataRobot project ID
  Required config params:
    deployment_label: str - deployment label
  For more deployment settings see the DataRobot docs.
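  A minimal sketch for DeployModelOperator, assuming the same import path; DeployRecommendedModelOperator is used the same way but takes project_id instead of model_id (see the pipeline sketch further below):

  from datarobot_provider.operators.datarobot import DeployModelOperator

  # Deploys a specific model; "deployment_label" is read from the DAG run config.
  # The model ID below is a placeholder - in practice it would come from XCom or the config.
  deploy_model_op = DeployModelOperator(
      task_id="deploy_model",
      model_id="<your-model-id>",
  )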
- ScorePredictionsOperator - scores predictions against the deployment and returns a batch prediction job ID.
  Prerequisites: S3 credentials added to DataRobot via the Python API client; you need the creds.credential_id for the credential_id parameter in the config (see the sketch below).
  Parameters:
    deployment_id: str - DataRobot deployment ID
  Required config params:
    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://my-bucket/Diabetes10k.csv",
            "credential_id": "62160b511fb29da8dd5f2c81"
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://my-bucket/Diabetes10k_predictions.csv",
            "credential_id": "62160b511fb29da8dd5f2c81"
        }
    }
  For more batch prediction settings see the DataRobot docs.
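  A sketch of the credential prerequisite using the datarobot Python client's Credential.create_s3 (check your client version for the exact signature); the returned credential_id is the value used in score_settings above:

  import datarobot as dr

  # Connect to DataRobot, then store S3 credentials; the credential_id of the
  # created credential is what the "credential_id" fields in score_settings refer to.
  dr.Client(token="your-datarobot-api-key", endpoint="https://app.datarobot.com/api/v2")
  creds = dr.Credential.create_s3(
      name="airflow-scoring-creds",
      aws_access_key_id="<aws-access-key-id>",
      aws_secret_access_key="<aws-secret-access-key>",
  )
  print(creds.credential_id)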
Sensors
- AutopilotCompleteSensor - checks whether Autopilot has completed.
  Parameters:
    project_id: str - DataRobot project ID
- ScoringCompleteSensor - checks whether batch scoring has completed.
  Parameters:
    job_id: str - batch prediction job ID
Hooks
- DataRobotHook - a hook for initializing the DataRobot Public API client.
Pipeline
The modules described above allow you to construct a standard DataRobot pipeline in an Airflow DAG:
create_project_op >> train_models_op >> autopilot_complete_sensor >> deploy_model_op >> score_predictions_op >> scoring_complete_sensor
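A minimal sketch of such a DAG, assuming the import paths used by the provider's example DAGs and that the ID parameters are templated fields so the XCom templates resolve at runtime; the dag_id and task ids are illustrative:

from datetime import datetime

from airflow import DAG
from datarobot_provider.operators.datarobot import (
    CreateProjectOperator,
    DeployRecommendedModelOperator,
    ScorePredictionsOperator,
    TrainModelsOperator,
)
from datarobot_provider.sensors.datarobot import (
    AutopilotCompleteSensor,
    ScoringCompleteSensor,
)

with DAG(
    dag_id="datarobot_pipeline",      # illustrative DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,           # triggered manually with the config JSON above
    tags=["datarobot"],
) as dag:
    # Project, Autopilot, deployment, and scoring settings are all read
    # from the DAG run config (context["params"]).
    create_project_op = CreateProjectOperator(task_id="create_project")

    project_id = "{{ ti.xcom_pull(task_ids='create_project') }}"

    train_models_op = TrainModelsOperator(task_id="train_models", project_id=project_id)

    autopilot_complete_sensor = AutopilotCompleteSensor(
        task_id="autopilot_complete", project_id=project_id
    )

    deploy_model_op = DeployRecommendedModelOperator(
        task_id="deploy_recommended_model", project_id=project_id
    )

    score_predictions_op = ScorePredictionsOperator(
        task_id="score_predictions",
        deployment_id="{{ ti.xcom_pull(task_ids='deploy_recommended_model') }}",
    )

    scoring_complete_sensor = ScoringCompleteSensor(
        task_id="scoring_complete",
        job_id="{{ ti.xcom_pull(task_ids='score_predictions') }}",
    )

    (
        create_project_op
        >> train_models_op
        >> autopilot_complete_sensor
        >> deploy_model_op
        >> score_predictions_op
        >> scoring_complete_sensor
    )

The DAG can then be triggered from the Airflow UI ("Trigger DAG w/ config") or the Airflow CLI, passing the config JSON shown above.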
Examples
See the examples directory for the example DAGs.
Issues
Please submit issues and pull requests in our official repo: https://github.com/datarobot/airflow-provider-datarobot
We are happy to hear from you. Please email any feedback to the authors at support@datarobot.com.
Copyright Notice
Copyright 2022 DataRobot, Inc. and its affiliates.
All rights reserved.
This is proprietary source code of DataRobot, Inc. and its affiliates.
Released under the terms of DataRobot Tool and Utility Agreement.