
EMRRunner (EMR Job Runner)


A powerful command-line tool and API for managing and deploying Spark jobs on Amazon EMR clusters. EMRRunner simplifies the process of submitting and managing Spark jobs while handling all the necessary environment setup.

🚀 Features

  • Command-line interface for quick job submission
  • RESTful API for programmatic access
  • Support for both client and cluster deploy modes
  • Automatic S3 synchronization of job files
  • Configurable job parameters
  • Easy dependency management
  • Bootstrap action support for cluster setup

📋 Prerequisites

  • Python 3.9+
  • AWS Account with EMR access
  • Configured AWS credentials
  • Active EMR cluster

๐Ÿ› ๏ธ Installation

From PyPI

pip install emrrunner

From Source

# Clone the repository
git clone https://github.com/yourusername/EMRRunner.git
cd EMRRunner

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: .\venv\Scripts\activate

# Install the package
pip install -e .

โš™๏ธ Configuration

AWS Configuration

Create a .env file in the project root with your AWS configuration, or export the same variables in your terminal before running:

export AWS_ACCESS_KEY=your_access_key
export AWS_SECRET_KEY=your_secret_key
export AWS_REGION=your_region
export EMR_CLUSTER_ID=your_cluster_id
export S3_PATH=s3://your-bucket/path

(In a .env file, omit the export keyword and use plain KEY=value lines.)
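EMRRunner reads these settings from the environment. A minimal sketch of collecting and checking them in Python (the load_config helper is illustrative, not EMRRunner's actual config.py):

```python
import os

# Variable names from the configuration list above
REQUIRED_VARS = [
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "AWS_REGION",
    "EMR_CLUSTER_ID",
    "S3_PATH",
]

def load_config():
    """Collect the required settings from the environment, failing fast on gaps."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast on missing variables surfaces configuration problems at startup rather than mid-job.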

Bootstrap Actions

For EMR cluster setup with required dependencies, create a bootstrap script (bootstrap.sh):

#!/bin/bash -xe

# Example structure of a bootstrap script
# Install system dependencies first
sudo yum install -y python3-pip
sudo yum install -y [your-system-packages]

# Create and activate a virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate

# Install Python packages into the virtual environment
pip install [your-required-packages]

deactivate

Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
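Both steps can be done with the AWS CLI. A sketch, assuming a bucket and cluster settings of your own (the flags shown are standard AWS CLI options, not EMRRunner commands):

```shell
# Upload the bootstrap script to S3 (a versioned bucket is recommended)
aws s3 cp bootstrap/bootstrap.sh s3://your-bucket/bootstrap/bootstrap.sh

# Reference it when launching a cluster
aws emr create-cluster \
    --name "emrrunner-cluster" \
    --release-label emr-7.1.0 \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://your-bucket/bootstrap/bootstrap.sh,Name=Setup
```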

๐Ÿ“ Project Structure

EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration management
│   ├── emr_client.py       # EMR interaction logic
│   ├── emr_job_api.py      # Flask API endpoints
│   ├── run_api.py          # API server runner
│   └── schema.py           # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh        # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py

📦 S3 Job Structure

The S3_PATH in your configuration should point to a bucket with the following structure:

s3://your-bucket/
├── jobs/
│   ├── job1/
│   │   ├── dependencies.py   # Shared functions and utilities
│   │   └── job.py            # Main job execution script
│   └── job2/
│       ├── dependencies.py
│       └── job.py
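Given this layout, a job's S3 keys follow directly from S3_PATH and the job name. A small sketch of that mapping (the job_paths helper is illustrative, not part of EMRRunner's API):

```python
def job_paths(s3_path, job_name):
    """Return the S3 locations of a job's entry script and shared dependencies.

    Follows the bucket layout shown above: <S3_PATH>/jobs/<job_name>/{job.py,dependencies.py}.
    """
    base = f"{s3_path.rstrip('/')}/jobs/{job_name}"
    return {
        "job": f"{base}/job.py",
        "dependencies": f"{base}/dependencies.py",
    }
```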

Job Organization

Each job in the S3 bucket follows a standard structure:

  1. dependencies.py

    • Contains reusable functions and utilities specific to the job
    • Example functions:
      def process_data(df):
          # Data processing logic
          pass
      
      def validate_input(data):
          # Input validation logic
          pass
      
      def transform_output(result):
          # Output transformation logic
          pass
      
  2. job.py

    • Main execution script that uses functions from dependencies.py
    • Standard structure:
      from pyspark.sql import SparkSession

      from dependencies import process_data, validate_input, transform_output
      
      def main():
          # 0. Create (or reuse) the SparkSession
          spark = SparkSession.builder.appName("job1").getOrCreate()
          
          # 1. Read input data
          input_data = spark.read.parquet("s3://input-path")
          
          # 2. Validate input
          validate_input(input_data)
          
          # 3. Process data
          processed_data = process_data(input_data)
          
          # 4. Transform output
          final_output = transform_output(processed_data)
          
          # 5. Write results
          final_output.write.parquet("s3://output-path")
      
      if __name__ == "__main__":
          main()
      

💻 Usage

Command Line Interface

Start a job in client mode:

emrrunner start --job job1 --step process_daily_data

Start a job in cluster mode:

emrrunner start --job job1 --step process_daily_data --deploy-mode cluster

API Endpoints

Start a job via API in client mode (default):

curl -X POST http://localhost:8000/api/v1/emr/job/start \
     -H "Content-Type: application/json" \
     -d '{"job_name": "job1", "step": "process_daily_data"}'

Start a job via API in cluster mode:

curl -X POST http://localhost:8000/api/v1/emr/job/start \
     -H "Content-Type: application/json" \
     -d '{"job_name": "job1", "step": "process_daily_data", "deploy_mode": "cluster"}'
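The same endpoint can be called programmatically. A stdlib-only sketch that mirrors the curl examples above (the helper names are illustrative, not part of EMRRunner):

```python
import json
import urllib.request

# Endpoint from the curl examples above
API_URL = "http://localhost:8000/api/v1/emr/job/start"

def build_start_request(job_name, step, deploy_mode="client"):
    """Build the POST request that starts a job, matching the JSON body shown above."""
    payload = json.dumps(
        {"job_name": job_name, "step": step, "deploy_mode": deploy_mode}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_job(job_name, step, deploy_mode="client"):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_start_request(job_name, step, deploy_mode)) as resp:
        return json.loads(resp.read())
```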

🔧 Development

To contribute to EMRRunner:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

💡 Best Practices

  1. Bootstrap Actions

    • Keep bootstrap scripts modular
    • Version control your dependencies
    • Use specific package versions
    • Test bootstrap scripts locally when possible
    • Store bootstrap scripts in S3 with versioning enabled
  2. Job Dependencies

    • Maintain a requirements.txt for each job
    • Use virtual environments
    • Document system-level dependencies
    • Test dependencies in a clean environment
  3. Job Organization

    • Follow the standard structure for jobs
    • Keep dependencies.py focused and modular
    • Use clear naming conventions
    • Document all functions and modules

🔒 Security

  • AWS credentials are read from environment variables rather than hard-coded
  • API request parameters are validated against request schemas
  • Secure handling of bootstrap scripts

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE.md file for details.

👥 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

๐Ÿ› Bug Reports

If you discover any bugs, please create an issue on GitHub with:

  • Your operating system name and version
  • Any details about your local setup that might be helpful in troubleshooting
  • Detailed steps to reproduce the bug

Built with โค๏ธ using Python and AWS EMR

