EMRRunner
A powerful command-line tool for managing and deploying Python-based (e.g., PySpark) data pipeline jobs on Amazon EMR clusters.
Features
- Command-line interface for quick job submission
- Basic POST API for fast job submission
- Support for both client and cluster deploy modes
Prerequisites
- Python 3.9+
- AWS Account with EMR access
- Configured AWS credentials
- Active EMR cluster
Installation
From PyPI
pip install emrrunner
From Source
# Clone the repository
git clone https://github.com/Haabiy/EMRRunner.git && cd EMRRunner
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate
# Install the package
pip install -e .
Configuration
AWS Configuration
Create a .env file in the project root with your AWS configuration or export these variables in your terminal before running:
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path" # Path to your jobs on S3 (the location containing your job folders and their job_package.zip files); see `S3 Job Structure` below
A better approach: instead of exporting these variables in each terminal session, add them permanently to your shell by editing your ~/.zshrc file:
- Open your ~/.zshrc file: nano ~/.zshrc
- Add the following lines at the end of the file (replace with your own AWS credentials):
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path"
- Save and exit the file (Ctrl + X).
- Apply the changes immediately:
source ~/.zshrc
Now you won't have to export the variables manually in each session, and they'll be available whenever you open a new terminal.
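If you go with the .env approach instead, the variables just need to end up in the process environment before EMRRunner reads them. A minimal sketch of loading them in Python (assuming the python-dotenv package; EMRRunner's own config.py may load them differently):
from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # reads KEY=value pairs from a .env file in the current directory
cluster_id = os.environ["EMR_CLUSTER_ID"]  # e.g. "j-XXXXXXXXXXXXX"
s3_path = os.environ["S3_PATH"]            # e.g. "s3://your-bucket/path"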
Bootstrap Actions
For EMR cluster setup with the required dependencies, create a bootstrap script (e.g., bootstrap.sh):
#!/bin/bash -xe
# Example structure of a bootstrap script
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install system dependencies
sudo yum install python3-pip -y
sudo yum install -y [your-system-packages]
# Install Python packages
pip3 install [your-required-packages]
deactivate
For example:
#!/bin/bash -xe
# Create and activate a virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install pip for Python 3.x
sudo yum install python3-pip -y
# Install required packages
pip3 install pyspark==3.5.5
# Leave the virtual environment
deactivate
Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
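EMRRunner itself targets an already-running cluster, but for reference, this is roughly how a bootstrap script stored in S3 can be attached when creating a cluster with boto3 (the cluster name, release label, instance types, roles, and S3 path below are placeholders):
import boto3

emr = boto3.client("emr", region_name="your_region")
response = emr.run_job_flow(
    Name="emrrunner-cluster",
    ReleaseLabel="emr-7.1.0",  # placeholder EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-job-dependencies",
            "ScriptBootstrapAction": {"Path": "s3://your-bucket/bootstrap/bootstrap.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])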
Project Structure
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py            # Command-line interface
│   ├── config.py         # Configuration management
│   ├── emr_client.py     # EMR interaction logic
│   ├── emr_job_api.py    # Flask API endpoints
│   ├── run_api.py        # API server runner
│   └── schema.py         # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh      # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
S3 Job Structure
The S3_PATH in your configuration should point to a bucket with the following structure:
s3://your-bucket/
└── jobs/
    ├── job1/
    │   └── job_package.zip
    └── job2/
        └── job_package.zip
Each job folder contains a single job_package.zip that bundles the job's shared functions and utilities; make sure the main script inside it is named `main.py` and the archive itself is named `job_package.zip`.
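One way to build and upload such a package (a sketch; the module names, bucket, and job folder are placeholders):
import zipfile
import boto3

# Bundle the entry point and shared code into job_package.zip.
with zipfile.ZipFile("job_package.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("main.py")          # entry point (must be named main.py)
    zf.write("dependencies.py")  # placeholder for your shared functions/utilities

# Upload it to the job's folder under S3_PATH (here s3://your-bucket/jobs/job1/).
boto3.client("s3").upload_file("job_package.zip", "your-bucket", "jobs/job1/job_package.zip")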
Job Script (main.py)
Your job script should include the necessary logic for executing the tasks in your data pipeline, using functions from your dependencies.
Example of main.py:
from dependencies import clean, transform, sink  # Import your core job functions

def main():
    # Step 1: Clean the data
    clean()
    # Step 2: Transform the data
    transform()
    # Step 3: Sink (store) the processed data
    sink()

if __name__ == "__main__":
    main()  # Execute the main function when the script is run
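For a PySpark job, main() will typically also create a SparkSession; a minimal sketch (the input/output paths are placeholders, and the actual clean/transform/sink logic lives in your own modules):
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("job1").getOrCreate()
    df = spark.read.parquet("s3://your-bucket/input/")               # placeholder input path
    # ... clean and transform df using your shared functions ...
    df.write.mode("overwrite").parquet("s3://your-bucket/output/")   # placeholder output path
    spark.stop()

if __name__ == "__main__":
    main()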
Usage
Command Line Interface
Start a job in client mode:
emrrunner start --job job1
Start a job in cluster mode:
emrrunner start --job job1 --deploy-mode cluster
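Conceptually, submitting a job to a running cluster amounts to adding an EMR step that runs spark-submit against the packaged job. The exact step EMRRunner builds is internal to emr_client.py, but a hand-rolled equivalent with boto3 would look roughly like this (the spark-submit arguments shown are illustrative assumptions):
import boto3

emr = boto3.client("emr", region_name="your_region")
emr.add_job_flow_steps(
    JobFlowId="your_cluster_id",
    Steps=[{
        "Name": "job1",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--py-files", "s3://your-bucket/jobs/job1/job_package.zip",  # shared code from the package
                "s3://your-bucket/jobs/job1/main.py",                        # illustrative entry-point path
            ],
        },
    }],
)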
API Endpoints
Start a job via API in client mode (default):
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1"}'
Start a job via API in cluster mode:
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1", "deploy_mode": "cluster"}'
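The same endpoint can be called from Python; a small sketch using the requests library (assumes the API is running locally on port 8000, as in the curl examples above):
import requests

# Start job1 in cluster mode via the EMRRunner API.
resp = requests.post(
    "http://localhost:8000/emrrunner/start",
    json={"job": "job1", "deploy_mode": "cluster"},
)
print(resp.status_code, resp.json())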
Development
To contribute to EMRRunner:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Best Practices
- Bootstrap Actions
  - Keep bootstrap scripts modular
  - Version control your dependencies
  - Use specific package versions
  - Test bootstrap scripts locally when possible
  - Store bootstrap scripts in S3 with versioning enabled
- Job Dependencies
  - Maintain a requirements.txt for each job
  - Document system-level dependencies
  - Test dependencies in a clean environment
- Job Organization
  - Follow the standard structure for jobs
  - Use clear naming conventions
  - Document all functions and modules
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Bug Reports
If you discover any bugs, please create an issue on GitHub with:
- Any details about your local setup that might be helpful in troubleshooting
- Detailed steps to reproduce the bug
Built with ❤️ using Python and AWS EMR