
Create an enterprise-grade Hadoop cluster in AWS
================================================

author: Rakesh Varma

Overview
--------

Create an enterprise-grade Hadoop cluster in AWS in minutes.

Installation / Usage
--------------------

Make sure [Terraform](https://www.terraform.io/intro/getting-started/install.html) is installed; it is required to run this solution.

Make sure AWS credentials exist in your local `~/.aws/credentials` file.
If you are using an `AWS_PROFILE` called `test`, then your `credentials` file should look like this:

```ini
[test]
aws_access_key_id = SOMEAWSACCESSKEYID
aws_secret_access_key = SOMEAWSSECRETACCESSKEY
```
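
To confirm the profile is picked up correctly, a quick check with `boto3` can help (a minimal sketch; `boto3` is assumed to be installed and is not a stated requirement of this project):

```python
import boto3

# Use the same profile name that will go into config.ini
session = boto3.Session(profile_name='test')

# get_caller_identity only succeeds if the credentials are valid
identity = session.client('sts').get_caller_identity()
print(identity['Account'], identity['Arn'])
```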

Create a `config.ini` with the appropriate settings.

```ini
[default]

# AWS settings
aws_region = us-east-1
aws_profile = test
terraform_s3_bucket = hadoop-terraform-state
ssh_private_key = key.pem
vpc_id = vpc-883883883
vpc_subnets = [
    'subnet-89dad652',
    'subnet-7887z892',
    'subnet-f300b8z8'
    ]
hadoop_namenode_instance_type = t2.micro
hadoop_secondarynamenode_instance_type = t2.micro
hadoop_datanodes_instance_type = t2.micro
hadoop_datanodes_count = 2

# Hadoop settings
hadoop_replication_factor = 2
```
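
Before running the install, you can sanity-check the file with Python's standard `configparser` (a minimal sketch; the keys checked here are just the core ones from the example above, not an authoritative list):

```python
import configparser

# Core settings from the example config.ini above (not an exhaustive list)
REQUIRED = ['aws_region', 'aws_profile', 'terraform_s3_bucket', 'vpc_id', 'vpc_subnets']

config = configparser.ConfigParser()
if not config.read('config.ini'):
    raise SystemExit('config.ini not found')
if not config.has_section('default'):
    raise SystemExit('config.ini has no [default] section')

missing = [key for key in REQUIRED if key not in config['default']]
if missing:
    raise SystemExit('config.ini is missing: ' + ', '.join(missing))
print('config.ini looks complete')
```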

Once the `config.ini` file is ready, install the library and run it. It is recommended to use a virtualenv.

```sh
pip install aws-hadoop
```
Run this in Python to create a Hadoop cluster:
```python
from aws_hadoop.install import Install
Install().create()
```

To run from the source directly:

```sh
pip install -r requirements.txt
```
```python
from aws_hadoop.install import Install
Install().create()
```

### Configuration Settings

This section describes each of the settings that go into the config file. Note that some of the settings are optional.

##### aws_region

The AWS region where your Terraform state bucket and your Hadoop resources are created (e.g. `us-east-1`).

##### aws_profile

The aws_profile that is used in your local `~/.aws/credentials` file.

##### terraform_s3_bucket

The Terraform state will be maintained in the specified S3 bucket. Make sure the `aws_profile` has write access to this bucket.
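
To verify write access up front, one option is to write and delete a small marker object (a hedged sketch, again assuming `boto3` is available; the profile and bucket names are the ones from the example `config.ini`):

```python
import boto3

s3 = boto3.Session(profile_name='test').client('s3')

# Round-trip a tiny object to prove the profile can write to the state bucket
s3.put_object(Bucket='hadoop-terraform-state', Key='aws-hadoop-write-check', Body=b'ok')
s3.delete_object(Bucket='hadoop-terraform-state', Key='aws-hadoop-write-check')
print('write access confirmed')
```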

##### ssh_key_pair

For Hadoop provisioning, aws_hadoop needs to connect to the Hadoop nodes over SSH. The specified `ssh_key_pair` allows the Hadoop EC2 instances to be created with the corresponding public key.
Make sure your machine has the matching private key in your `~/.ssh/` directory.

##### vpc_id

Specify the VPC ID, within your AWS region, in which the Terraform resources should be created.

##### vpc_subnets

vpc_subnets is a list containing one or more subnet IDs. You can specify as many subnet IDs as you want; the Hadoop EC2 instances will be created across the specified subnets.

##### hadoop_namenode_instance_type (optional)

Specify the instance type of the Hadoop namenode. If not specified, the default instance type is `t2.micro`.

##### hadoop_secondarynamenode_instance_type (optional)

Specify the instance type of the Hadoop secondarynamenode. If not specified, the default instance type is `t2.micro`.

##### hadoop_datanodes_instance_type (optional)

Specify the instance type of the Hadoop datanodes. If not specified, the default instance type is `t2.micro`.

##### hadoop_datanodes_count (optional)

Specify the number of Hadoop datanodes that should be created. If not specified, the default value is 2.

##### hadoop_replication_factor (optional)

Specify the HDFS replication factor. If not specified, the default value is 2.

The following are SSH settings, used to SSH into the nodes; an example snippet is shown after this list.

##### ssh_user (optional)
The SSH user, e.g. `ubuntu`.

##### ssh_use_ssh_config (optional)
Set this to `True` if you want to use the settings in your `~/.ssh/config`.

##### ssh_key_file (optional)
This is the location of the key file. SSH login is done through a private/public key pair.

##### ssh_proxy (optional)
Use this setting if you are connecting through an SSH proxy server (such as a bastion host).
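
Put together, the optional SSH settings might look like this in `config.ini` (illustrative values only; the host name and key path are placeholders):

```ini
[default]
# ...core settings as shown earlier...

# Optional SSH settings
ssh_user = ubuntu
ssh_use_ssh_config = False
ssh_key_file = ~/.ssh/key.pem
ssh_proxy = bastion.example.com
```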

Logging
------

A log file `hadoop-cluster.log` is created in the local directory.

