AWS EMR Launch modules
EMR Launch
An AWS Professional Service open source initiative | aws-proserve-opensource@amazon.com
The intent of the EMR Launch library is to simplify the development experience for Builders defining, deploying, managing, and using EMR Clusters by:
- defining reusable Security, Resource, and Launch Configurations enabling developers to Define Once and Reuse
- separating the definition of Cluster Security Configurations and Cluster Resource Configurations into reusable and shareable Constructs
- providing a suite of Tools to simplify the construction of Orchestration Pipelines using Step Functions and EMR Clusters
Concepts (and Constructs)
This library utilizes the AWS CDK for deployment and management of resources. It is recommended that users familiarize themselves with the CDK's basic concepts and usage.
EMR Profile
An EMR Profile (emr_profile) is a reusable definition of the security profile used by an EMR Cluster. This includes:
- Service Role: an IAM Role used by the EMR Service to manage the Cluster
- Instance Role: an IAM Role used by the EC2 Instances in an EMR Cluster
- AutoScaling Role: an IAM Role used to autoscale and resize an EMR Cluster
- Service Group: a Security Group granting the EMR Service basic access to the EC2 Instances in the Cluster. This is required to deploy Instances into a Private Subnet.
- Master Group: the Security Group assigned to the EMR Master Instance
- Workers Group: the Security Group assigned to the EMR Worker Instances (Core and Task nodes)
- Security Configuration: the Security Configuration used by the Cluster
- Kerberos Attributes: the attributes required to enable Kerberos authentication
Each emr_profile requires a unique profile_name. This name and the namespace uniquely identify a profile. The namespace is a logical grouping of profiles and has a default value of "default".
Deploying an emr_profile creates these resources and stores the profile definition and metadata in the Parameter Store. The Profile can either be used immediately in the Stack where it is defined, or reused in other Stacks by loading the Profile definition by profile_name and namespace.
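The identity rule above (namespace plus profile_name) can be pictured as a Parameter Store key. The following is a minimal sketch, not the library's code: the prefix value and the profile_parameter_name helper are assumptions made purely for illustration.

```python
# Hypothetical helper: derives a Parameter Store name from namespace and profile_name.
# ASSUMED_PREFIX is an illustrative assumption, not a documented value.
ASSUMED_PREFIX = "/emr_launch/emr_profiles"


def profile_parameter_name(profile_name: str, namespace: str = "default") -> str:
    """Return a key that is unique per (namespace, profile_name) pair."""
    for part in (profile_name, namespace):
        if not part or "/" in part:
            raise ValueError("names must be non-empty and must not contain '/'")
    return f"{ASSUMED_PREFIX}/{namespace}/{profile_name}"


# Two profiles with the same name in different namespaces get distinct keys.
default_key = profile_parameter_name("analytics")
team_key = profile_parameter_name("analytics", namespace="team-a")
```

Because the namespace is part of the key, the same profile_name can be reused across namespaces without collision.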
Cluster Configuration
A Cluster Configuration (cluster_configuration) is a reusable definition of the physical resources in an EMR Cluster. This includes:
- EMR Release Label: the EMR release version (e.g. emr-5.28.0)
- Applications: the Applications to install on the Cluster (e.g. Hadoop, Hive, Spark)
- Bootstrap Actions: the Bootstrap Actions to execute on each node before Applications are installed
- Configurations: configuration parameters to set for the various Applications installed
- Step Concurrency Level: the number of concurrent Steps the Cluster is configured to run
- Instances: the configuration of the Master, Core, and Task nodes in the Cluster (e.g. Master Instance Type, Core Instance Type, Core Instance Count, etc.)
Like the emr_profile, each cluster_configuration requires a unique configuration_name. This name and the namespace uniquely identify a configuration.
Deploying a cluster_configuration stores the configuration definition and metadata in the Parameter Store. The Configuration can either be used immediately in the Stack where it is defined, or reused in other Stacks by loading the Configuration definition by configuration_name and namespace.
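To make the "store, then reload by name" idea concrete, here is a minimal sketch of a configuration definition that round-trips through JSON. The ClusterConfigurationSketch class, its field names, and the default instance types are all hypothetical; the library's own Construct and stored format may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ClusterConfigurationSketch:
    """Hypothetical container mirroring some of the fields listed above."""

    configuration_name: str
    namespace: str = "default"
    release_label: str = "emr-5.28.0"
    applications: List[str] = field(default_factory=lambda: ["Hadoop", "Hive", "Spark"])
    step_concurrency_level: int = 1
    master_instance_type: str = "m5.xlarge"  # assumed value for illustration
    core_instance_type: str = "m5.xlarge"    # assumed value for illustration
    core_instance_count: int = 2

    def to_json(self) -> str:
        """Serialize the definition so it could be stored as a parameter value."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ClusterConfigurationSketch":
        """Rebuild the definition from its stored JSON form."""
        return cls(**json.loads(payload))


original = ClusterConfigurationSketch(configuration_name="small-spark")
reloaded = ClusterConfigurationSketch.from_json(original.to_json())
```

A definition written by one Stack and read back by another would need to survive exactly this kind of round trip.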
EMR Launch Function
An EMR Launch Function (emr_launch_function) is an AWS Step Functions State Machine that launches an EMR Cluster. The Launch Function is defined with an emr_profile, cluster_configuration, cluster_name, and tags. When the function is executed, it creates an EMR Cluster with the given name, tags, security profile, and physical resources, then synchronously monitors the cluster until it has started successfully.
To be clear, deploying an emr_launch_function does not create an EMR Cluster; it only creates the State Machine. The cluster is created when the State Machine is executed.
The emr_launch_function is a mechanism for easily combining the reusable emr_profile and cluster_configuration.
Like the emr_profile and cluster_configuration, each emr_launch_function requires a unique launch_function_name. This name and the namespace uniquely identify the launch function.
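The way a launch function combines a security profile with a resource configuration can be sketched as a plain merge into the request shape accepted by the EMR RunJobFlow API. This is an illustration under stated assumptions, not the library's implementation: the build_run_job_flow_request helper and the layout of the profile and configuration dictionaries are hypothetical, while the output keys are RunJobFlow parameter names.

```python
from typing import Any, Dict


def build_run_job_flow_request(
    cluster_name: str,
    tags: Dict[str, str],
    profile: Dict[str, Any],
    configuration: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge a security profile and a resource configuration into one request."""
    return {
        "Name": cluster_name,
        # Physical resources come from the cluster configuration.
        "ReleaseLabel": configuration["ReleaseLabel"],
        "Applications": [{"Name": app} for app in configuration["Applications"]],
        "StepConcurrencyLevel": configuration.get("StepConcurrencyLevel", 1),
        "Instances": dict(configuration["Instances"]),
        # Roles come from the security profile.
        "ServiceRole": profile["ServiceRole"],
        "JobFlowRole": profile["InstanceRole"],
        "AutoScalingRole": profile["AutoScalingRole"],
        "Tags": [{"Key": k, "Value": v} for k, v in sorted(tags.items())],
    }


# Hypothetical inputs; role names and instance settings are placeholders.
example_profile = {
    "ServiceRole": "example-service-role",
    "InstanceRole": "example-instance-role",
    "AutoScalingRole": "example-autoscaling-role",
}
example_configuration = {
    "ReleaseLabel": "emr-5.28.0",
    "Applications": ["Hadoop", "Hive", "Spark"],
    "Instances": {"KeepJobFlowAliveWhenNoSteps": True},
}
request = build_run_job_flow_request(
    "example-cluster", {"team": "data", "env": "dev"}, example_profile, example_configuration
)
```

Keeping the two inputs separate is what lets one profile be paired with many configurations, and vice versa.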
Chains and Tasks
Chains and Tasks are preconfigured components that simplify the use of AWS Step Functions State Machines as orchestrators of data processing pipelines. These components allow the developer to easily build complex, serverless pipelines using EMR Clusters (both Transient and Persistent), Lambdas, and nested State Machines.
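For orientation, the kind of pipeline these components assemble can be written out by hand in the Amazon States Language. The sketch below is not generated by the library and does not use its Chains or Tasks; it builds a transient-cluster pipeline (launch, run one Step, terminate) as a Python dictionary. The build_pipeline_definition helper, the state names, and the assumption that the nested launch execution returns a ClusterId field in its output are all hypothetical.

```python
import json
from typing import Any, Dict


def build_pipeline_definition(launch_function_arn: str) -> Dict[str, Any]:
    """Return a States Language definition: launch -> run step -> terminate."""
    # Assumed location of the cluster id in the nested execution's output.
    cluster_id_path = "$.Launch.Output.ClusterId"
    return {
        "StartAt": "LaunchCluster",
        "States": {
            "LaunchCluster": {
                "Type": "Task",
                # Runs a nested State Machine and waits for it to finish.
                "Resource": "arn:aws:states:::states:startExecution.sync:2",
                "Parameters": {"StateMachineArn": launch_function_arn, "Input": {}},
                "ResultPath": "$.Launch",
                "Next": "RunStep",
            },
            "RunStep": {
                "Type": "Task",
                # Submits an EMR Step and waits for it to complete.
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": cluster_id_path,
                    "Step": {
                        "Name": "ExampleStep",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "--version"],
                        },
                    },
                },
                "ResultPath": "$.Step",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": cluster_id_path},
                "End": True,
            },
        },
    }


definition = build_pipeline_definition(
    "arn:aws:states:us-east-1:111122223333:stateMachine:example-launch"
)
definition_json = json.dumps(definition, indent=2)
```

A Persistent cluster pipeline would drop the LaunchCluster and TerminateCluster states and pass an existing cluster id into RunStep instead.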
Security
Care is taken to ensure that emr_launch_functions and emr_profiles can't be used to create clusters with elevated or unintended privileges.
- IAM policies can be used to restrict the Users and Roles that can create EMR Clusters by granting states:StartExecution to specific State Machine ARNs.
- By storing the metadata and configuration of emr_profiles, cluster_configurations, and emr_launch_functions in the Systems Manager Parameter Store, IAM Policies can be used to grant or restrict Read/Write access to these
  - Access can be managed for ALL metadata and configurations, specific namespaces, or individual ARNs
- Each emr_launch_function uses a specific AWS Lambda function to load and combine its specific emr_profile and cluster_configuration. The IAM Policy associated with this Lambda allows it to read only these specific ARNs from the Parameter Store.
- Each emr_launch_function is granted iam:PassRole to the specific EMR Roles defined in the emr_profile assigned to the launch function. Attempting to change the Roles used by directly modifying the metadata of the emr_profile in the Parameter Store will result in a cluster launch failure.
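The first two points can be illustrated with an IAM policy document built as a Python dictionary. This is a minimal sketch: the launch_policy helper, the example ARNs, and the parameter path are hypothetical, while states:StartExecution and ssm:GetParameter are the IAM actions for starting an execution and reading a parameter.

```python
import json
from typing import Any, Dict


def launch_policy(state_machine_arn: str, parameter_arn_prefix: str) -> Dict[str, Any]:
    """Allow starting one launch function and reading one namespace of parameters."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "StartOneLaunchFunction",
                "Effect": "Allow",
                "Action": "states:StartExecution",
                "Resource": state_machine_arn,
            },
            {
                "Sid": "ReadOneNamespace",
                "Effect": "Allow",
                "Action": ["ssm:GetParameter"],
                # Trailing wildcard scopes access to everything under the prefix.
                "Resource": f"{parameter_arn_prefix}/*",
            },
        ],
    }


# Placeholder ARNs; the parameter path is an assumption for illustration.
policy = launch_policy(
    "arn:aws:states:us-east-1:111122223333:stateMachine:example-launch",
    "arn:aws:ssm:us-east-1:111122223333:parameter/example/namespace",
)
policy_json = json.dumps(policy, indent=2)
```

Narrowing the second Resource to a single parameter ARN, or widening it to a shared prefix, corresponds to the "individual ARNs" and "ALL metadata" options described above.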
Usage
This library acts as a plugin to the AWS CDK providing additional L2 Constructs.
To avoid circular references with CDK dependencies, this package will not install CDK and Boto3. These should be installed manually from one of the requirements.txt files (depending on the version of aws-emr-launch).
It is recommended that a Python3 venv be used for all CDK builds and deployments.
To get up and running quickly:
Prerequisites
The AWS CDK v2.x utilizes containers to automate some tasks. EMR Launch uses and deploys a CDK PythonLayerVersion; this Construct uses a container to create the bundle for the Lambda Layer. As such, a docker runtime is required to deploy.
Deployment
1. Install the CDK CLI:
   npm install -g aws-cdk
2. Use your mechanism of choice to create and activate a Python3 venv:
   python3 -m venv .env
   source .env/bin/activate
3. Install the CDK and Boto3 minimum requirements:
   pip install -r requirements-2.x.txt
4. Install the aws-emr-launch package:
   pip install aws-emr-launch
Development
Follow Steps 1 - 3 above to configure an environment and install requirements.
After activating your venv:
1. Install development requirements:
   pip install -r requirements-dev.txt
2. Install the library locally:
   pip install -e .
Managing Layer Packages
Update aws_emr_launch/lambda_sources/layers/emr_config_utils/requirements.txt, adding, updating, or removing packages as needed.
Testing
To run the test suite (from within the venv):
   pytest
After running tests
View test coverage reports by opening htmlcov/index.html in your web browser.
To write a test
- start a file named test_[the module you want to test].py
- import the module you want to test at the top of the file
- write test case functions that match either test_* or *_test
For more information, refer to the pytest docs.
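A test file following these rules might look like the sketch below. The file name test_naming.py and the make_parameter_name function are hypothetical stand-ins; in a real test the function would be imported from the module under test rather than defined inline.

```python
# test_naming.py: hypothetical test module following the naming rules above.


def make_parameter_name(namespace: str, name: str) -> str:
    """Stand-in for the function under test; normally this would be imported."""
    return f"/{namespace}/{name}"


def test_make_parameter_name_joins_namespace_and_name():
    # The function name starts with test_, so pytest collects it automatically.
    assert make_parameter_name("default", "my-profile") == "/default/my-profile"


def test_names_differ_across_namespaces():
    assert make_parameter_name("a", "p") != make_parameter_name("b", "p")
```

Running pytest from the project root would collect both functions, since the file name and the function names match the default discovery patterns.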
Contributing
See CONTRIBUTING for more information.
License
This project is licensed under the terms of the Apache 2.0 license. See LICENSE.
Included AWS Lambda functions are licensed under the MIT-0 license. See LICENSE-LAMBDA.