EMRRunner
A powerful command-line tool for managing and deploying Python-based (e.g., PySpark) data pipeline jobs on Amazon EMR clusters.
Features
- Command-line interface for quick job submission
- Basic POST API for fast job submission
- Support for both client and cluster deploy modes
Prerequisites
- Python 3.9+
- AWS Account with EMR access
- Configured AWS credentials
- Active EMR cluster
Installation
From PyPI
pip install emrrunner
From Source
# Clone the repository
git clone https://github.com/Haabiy/EMRRunner.git && cd EMRRunner
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate
# Install the package
pip install -e .
Configuration
AWS Configuration
Create a .env file in the project root with your AWS configuration or export these variables in your terminal before running:
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path" # Path to your jobs on S3 (the location containing your job folders and their job_package.zip files); see `S3 Job Structure` below
A better approach: instead of exporting these variables in each terminal session, add them permanently to your shell by editing your ~/.zshrc file:
- Open your ~/.zshrc file: nano ~/.zshrc
- Add the following lines at the end of the file (replace with your own AWS credentials):
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path"
- Save and exit the file (Ctrl + X).
- Apply the changes immediately:
source ~/.zshrc
Now you won't have to export the variables manually in each session, and they'll be available whenever you open a new terminal.
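If you go with the .env approach instead, the variables just need to end up in the process environment before EMRRunner reads them. A minimal sketch of loading them in Python (assuming the python-dotenv package; EMRRunner's own config.py may load them differently):
from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # reads KEY=value pairs from a .env file in the current directory
cluster_id = os.environ["EMR_CLUSTER_ID"]  # e.g. "j-XXXXXXXXXXXXX"
s3_path = os.environ["S3_PATH"]            # e.g. "s3://your-bucket/path"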
Bootstrap Actions
For EMR cluster setup with the required dependencies, create a bootstrap script (e.g., bootstrap.sh):
#!/bin/bash -xe
# Example structure of a bootstrap script
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install system dependencies
sudo yum install python3-pip -y
sudo yum install -y [your-system-packages]
# Install Python packages
pip3 install [your-required-packages]
deactivate
For example:
#!/bin/bash -xe
# Create and activate a virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install pip for Python 3.x
sudo yum install python3-pip -y
# Install required packages
pip3 install pyspark==3.5.5
# Leave the virtual environment
deactivate
Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
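EMRRunner itself targets an already-running cluster, but for reference, this is roughly how a bootstrap script stored in S3 can be attached when creating a cluster with boto3 (the cluster name, release label, instance types, roles, and S3 path below are placeholders):
import boto3

emr = boto3.client("emr", region_name="your_region")
response = emr.run_job_flow(
    Name="emrrunner-cluster",
    ReleaseLabel="emr-7.1.0",  # placeholder EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-job-dependencies",
            "ScriptBootstrapAction": {"Path": "s3://your-bucket/bootstrap/bootstrap.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])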
Project Structure
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py            # Command-line interface
│   ├── config.py         # Configuration management
│   ├── emr_client.py     # EMR interaction logic
│   ├── emr_job_api.py    # Flask API endpoints
│   ├── run_api.py        # API server runner
│   └── schema.py         # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh      # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
S3 Job Structure
The S3_PATH in your configuration should point to a bucket with the following structure:
s3://your-bucket/
└── jobs/
    ├── job1/
    │   └── job_package.zip
    └── job2/
        └── job_package.zip
Each job folder contains a single job_package.zip that bundles the job's shared functions and utilities; make sure the main script inside it is named `main.py` and the archive itself is named `job_package.zip`.
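One way to build and upload such a package (a sketch; the module names, bucket, and job folder are placeholders):
import zipfile
import boto3

# Bundle the entry point and shared code into job_package.zip.
with zipfile.ZipFile("job_package.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("main.py")          # entry point (must be named main.py)
    zf.write("dependencies.py")  # placeholder for your shared functions/utilities

# Upload it to the job's folder under S3_PATH (here s3://your-bucket/jobs/job1/).
boto3.client("s3").upload_file("job_package.zip", "your-bucket", "jobs/job1/job_package.zip")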
Job Script (main.py)
Your job script should include the necessary logic for executing the tasks in your data pipeline, using functions from your dependencies.
Example of main.py:
from dependencies import clean, transform, sink  # Import your core job functions

def main():
    # Step 1: Clean the data
    clean()
    # Step 2: Transform the data
    transform()
    # Step 3: Sink (store) the processed data
    sink()

if __name__ == "__main__":
    main()  # Execute the main function when the script is run
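For a PySpark job, main() will typically also create a SparkSession; a minimal sketch (the input/output paths are placeholders, and the actual clean/transform/sink logic lives in your own modules):
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("job1").getOrCreate()
    df = spark.read.parquet("s3://your-bucket/input/")               # placeholder input path
    # ... clean and transform df using your shared functions ...
    df.write.mode("overwrite").parquet("s3://your-bucket/output/")   # placeholder output path
    spark.stop()

if __name__ == "__main__":
    main()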
Usage
Command Line Interface
Start a job in client mode:
emrrunner start --job job1
Start a job in cluster mode:
emrrunner start --job job1 --deploy-mode cluster
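Conceptually, submitting a job to a running cluster amounts to adding an EMR step that runs spark-submit against the packaged job. The exact step EMRRunner builds is internal to emr_client.py, but a hand-rolled equivalent with boto3 would look roughly like this (the spark-submit arguments shown are illustrative assumptions):
import boto3

emr = boto3.client("emr", region_name="your_region")
emr.add_job_flow_steps(
    JobFlowId="your_cluster_id",
    Steps=[{
        "Name": "job1",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--py-files", "s3://your-bucket/jobs/job1/job_package.zip",  # shared code from the package
                "s3://your-bucket/jobs/job1/main.py",                        # illustrative entry-point path
            ],
        },
    }],
)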
API Endpoints
Start a job via API in client mode (default):
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1"}'
Start a job via API in cluster mode:
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1", "deploy_mode": "cluster"}'
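The same endpoint can be called from Python; a small sketch using the requests library (assumes the API is running locally on port 8000, as in the curl examples above):
import requests

# Start job1 in cluster mode via the EMRRunner API.
resp = requests.post(
    "http://localhost:8000/emrrunner/start",
    json={"job": "job1", "deploy_mode": "cluster"},
)
print(resp.status_code, resp.json())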
Development
To contribute to EMRRunner:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Best Practices
- Bootstrap Actions
  - Keep bootstrap scripts modular
  - Version control your dependencies
  - Use specific package versions
  - Test bootstrap scripts locally when possible
  - Store bootstrap scripts in S3 with versioning enabled
- Job Dependencies
  - Maintain a requirements.txt for each job
  - Document system-level dependencies
  - Test dependencies in a clean environment
- Job Organization
  - Follow the standard structure for jobs
  - Use clear naming conventions
  - Document all functions and modules
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Bug Reports
If you discover any bugs, please create an issue on GitHub with:
- Any details about your local setup that might be helpful in troubleshooting
- Detailed steps to reproduce the bug
Built with ❤️ using Python and AWS EMR