bethel.clustermgmt

Zope Cluster Management facilities

This package contains support for managing and monitoring nodes in a cluster. When deploying changes to a Zope cluster, it is necessary to proceed linearly across all nodes, and each node should be taken out of service before its service is disrupted. Load balancers typically use a configurable HTTP health check; if that check fails enough times within a certain window, the node is taken out of service (this is how Varnish works). Before deploying changes, we simulate a service disruption on the node, causing the load balancer to take it out of service.

This package contains a “health status” object which the load balancers call for health checks. We can inform the health status object that a node is to be taken out of service. It will then report the node as down (returning an error for the health check), and the load balancer will take it out of service.
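The idea can be sketched in a few lines of Python. This is an illustration only, not the package's implementation: the function name `health_status`, the set of offline nodes, and the choice of HTTP 503 as the error status are all assumptions.

```python
# Illustrative sketch only; the names and the 503 status code are assumptions,
# not the actual bethel.clustermgmt implementation.

def health_status(node, offline_nodes):
    """Return an (HTTP status, body) pair for a load balancer health check."""
    if node in offline_nodes:
        # Simulate a service disruption so the load balancer drains the node.
        return 503, "%s is offline" % node
    return 200, "%s is online" % node


offline = {"node2"}
print(health_status("node1", offline))
print(health_status("node2", offline))
```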

Because a load balancer may not send enough information to the backend zope node to enable it to effectively determine which node it is (sounds odd, right?), nodes need to be manually entered via the ZMI manage screen. On the same screen, these nodes can be marked as offline.

Installation

Add bethel.clustermgmt to the eggs and zcml lists in the [instance] part of buildout.cfg, then rerun buildout.

This package uses silva.core functionality to register itself with the zope infrastructure. As such it is listed as an extension in the Silva extension service. It does not need activation in order to be used.

A ‘Cluster Health Reporter’ can now be found in the ‘add’ list in the ZMI.

Configuration

The management screen for a cluster health reporter has two sections. The first is the list of nodes, and the second provides an interface for taking nodes offline.

List of Nodes

Enter the list of nodes in the cluster, one per line. This does not need to be the fqdn of the node, but each node does need a unique entry.
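A minimal sketch of how such a one-per-line field could be validated is shown here. The function name `parse_nodes` is hypothetical, and rejecting duplicates with an exception is an assumption; the package may handle them differently.

```python
# Hypothetical helper; not part of bethel.clustermgmt.

def parse_nodes(text):
    """Split a one-node-per-line text field into a list of unique node names."""
    nodes = []
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines
        if name in nodes:
            raise ValueError("duplicate node entry: %r" % name)
        nodes.append(name)
    return nodes


print(parse_nodes("node1\nnode2\n"))
```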

Offline Nodes

The list of nodes is represented here with checkboxes. A node is offline (out of service) if its box is checked. To manually change the service status of a node (putting it online or taking it offline), check or uncheck the box for that node and click “Save Offline Nodes”.

Use for Monitoring

The load balancer should be configured to query the health status object. If the zope node fails, the health status check will return a system error, or return no response at all (hang). The load balancer will then automatically take the node out of service.

Upon recovery the health status checks will succeed, and the load balancer will automatically bring the node back into service.
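The same logic can be sketched from the point of view of a monitoring client. This is not what a load balancer actually runs; it is a Python 3 illustration in which the function name `node_is_healthy`, the 0.3 second default timeout, and the injectable `opener` parameter (there to make the function testable without a network) are assumptions.

```python
import http.client
import urllib.request


def node_is_healthy(url, timeout=0.3, opener=urllib.request.urlopen):
    """Return True only if the health check answers with HTTP 200 in time."""
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.getcode() == 200
        finally:
            response.close()
    except (OSError, http.client.HTTPException):
        # HTTP errors, refused connections and timeouts all count as "down".
        return False
```

A monitoring script could call `node_is_healthy("http://node1.example.com:8080/health?node=node1")`, using the example host and path from the Varnish configuration in this document.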

Load Balancer configuration (varnish)

Configuring Varnish as a load balancer and leveraging this health reporter is easy. Let’s assume the following:

there are two nodes in your cluster, node1.example.com:8080 and node2.example.com:8080
the cluster health reporter is located at /health

Add a director for these two nodes in the varnish VCL file:

director zope random {
  {
    .backend = {
      .host = "node1.example.com";
      .port = "8080";
      .first_byte_timeout = 30s;
    }
    .weight = 1;
  }
  {
    .backend = {
      .host = "node2.example.com";
      .port = "8080";
      .first_byte_timeout = 30s;
    }
    .weight = 1;
  }
}

A health check is called a “probe” in VCL. Adding a probe to each backend, the VCL now looks like:

director zope random {
{
  .backend = {
    .host = "node1.example.com";
    .port = "8080";
    .first_byte_timeout = 30s;
    .probe = {
      .url = "/health?node=node1";
      .timeout = 0.3 s;
      .window = 8;
      .threshold = 3;
      .initial = 3;
    }
  }
  .weight = 1;
}
{
  .backend = {
    .host = "node2.example.com";
    .port = "8080";
    .first_byte_timeout = 30s;
    .probe = {
      .url = "/health?node=node2";
      .timeout = 0.3 s;
      .window = 8;
      .threshold = 3;
      .initial = 3;
    }
  }
  .weight = 1;
}
}

See the varnish configuration for more information.
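To see how the probe settings interact, here is a simplified Python model. It assumes the semantics described in the Varnish documentation: a backend counts as healthy while at least `.threshold` of the last `.window` probes succeeded, and `.initial` probes are assumed good at startup. The class name `ProbeWindow` is hypothetical, and this is a sketch of the bookkeeping, not Varnish's own code.

```python
from collections import deque


class ProbeWindow:
    """Track the most recent probe results for one backend."""

    def __init__(self, window=8, threshold=3, initial=3):
        self.threshold = threshold
        # Start with `initial` probes assumed good, the rest assumed bad.
        history = [False] * (window - initial) + [True] * initial
        self.results = deque(history, maxlen=window)

    def record(self, ok):
        """Add one probe result, discarding the oldest one."""
        self.results.append(bool(ok))

    @property
    def healthy(self):
        """Healthy while enough of the remembered probes succeeded."""
        return sum(self.results) >= self.threshold


backend = ProbeWindow(window=8, threshold=3, initial=3)
print(backend.healthy)      # state at startup
for _ in range(6):
    backend.record(False)   # node marked offline: probes now fail
print(backend.healthy)      # state after six consecutive failures
```

Under this model, with the values used above, a node is not dropped on the first failed probe; it leaves service only once fewer than three of the last eight probes were good.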

Use for Deployments

Using a health status object, rather than an arbitrary web page, for the load balancer’s health check makes it useful for automatic service removal during system deployments.

The node can be marked as ‘out of service’ via the ZMI, or using REST. The REST approach is useful for automated deployment scripts.

Automated deployments

REST API

This object also responds to REST requests to adjust the service status. Using this method, automated deployment scripts (e.g. using fabric) can take nodes out of service before deploying updates.

Access to the REST API calls is protected using the ‘bethel.clustermgmt.rest’ permission. To access the API calls, the request needs to be authenticated as a manager, or as a user in a role granting this permission.

The REST api has two methods.

Get the status of all nodes (HTTP GET):
```
/path/to/health/++rest++nodestatus
```
Returns a json-formatted dictionary of all nodes, and their status (either online or offline), like this:
```
{"nodeA": {"status": "offline"}, "nodeB": {"status": "online"}}
```
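A deployment script might inspect that response before acting. The following Python 3 sketch assumes the reporter emits standard JSON with quoted keys and values; the helper name `offline_nodes` is hypothetical.

```python
import json


def offline_nodes(payload):
    """Return the sorted names of all nodes reported as offline."""
    statuses = json.loads(payload)
    return sorted(name for name, info in statuses.items()
                  if info.get("status") == "offline")


# Example response body, assuming the reporter emits standard JSON.
body = '{"nodeA": {"status": "offline"}, "nodeB": {"status": "online"}}'
print(offline_nodes(body))  # ['nodeA']
```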
Alter the status of one or more nodes (HTTP POST):
```
/path/to/health/++rest++setstatus
```
POST data instructs the reporter on the new status for the given nodes. Due to infrae.rest’s lack of support for accepting json payloads, the json input is passed in via a POST parameter named “change”. See the unittests for more info.

The input format is the same as the output from ++rest++nodestatus.
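Building the POST body can be sketched as follows in Python 3. Only the “change” parameter name comes from the description above; the helper name `build_change_body` and the simplified `{node: status}` input mapping are assumptions made for the example.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_change_body(changes):
    """Encode {node: new_status} as the form body for ++rest++setstatus."""
    payload = {node: {"status": status} for node, status in changes.items()}
    # The JSON document travels inside a form field named "change".
    return urlencode({"change": json.dumps(payload, sort_keys=True)})


body = build_change_body({"node1": "offline"})
print(body)
# Decoding the form field again recovers the JSON document.
print(parse_qs(body)["change"][0])
```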

Use in Fabric

A simple python function can trigger a status change for a node. This in turn can be converted into a fabric task. The following is the fabric task we use at Bethel for changing the service status of a node:

import base64
import json
import urllib
import urllib2

from fabric.api import env

env.roledefs = {
  'prod': ['node1.example.com', 'node2.example.com'],
  'dev': ['test-node.example.com']
}
env.buildout_root = "/home/zope/silva23/buildout"

def alter_service_status(newstatus):
  #alter the service status of a zope node,
  #either putting online or offline
  host = env['host_string']
  #the node name is the first label of the host name
  node = host.split('.')[0]
  url = 'http://%s:8080/silva/varnish_node_is_up/++rest++setstatus'%host
  #the json payload is passed in the "change" POST parameter
  query = {'change': json.dumps({node: {'status': newstatus}}),
           'skip-bethel-auth': 1}
  #supplying url-encoded data makes urllib2 issue a POST request
  req = urllib2.Request(url, urllib.urlencode(query))
  #rest_creds is a (username, password) tuple, loaded elsewhere
  authh = "Basic " + base64.b64encode('%s:%s'%rest_creds)
  req.add_header("Authorization", authh)
  #urlopen raises on an HTTP error, so reaching the return means success
  response = urllib2.urlopen(req)
  response.read()
  return 'OK'

The username and password are read from a protected file when the fabfile is loaded.

This task in turn can be used as a component of a larger automated deployment task (this is the rest of Bethel’s fabfile):

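#imports needed by the tasks below
from time import sleep
from types import StringTypes
from fabric.api import cd, prefix, puts, run, sudo

def buildout():
  with prefix("export HOME=/home/zope/"):
      with cd(env.buildout_root):
          sudo("hg --debug pull -u", user="zope")
          sudo("./bin/buildout", user="zope")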

def restart_apache():
  #using the sudo command does not work; it issues the following:
  # sudo -S -p 'sudo password:'  /bin/bash -l -c "/etc/init.d/httpd restart"
  # which runs a shell executing the command in quotes.  Ross was not
  # able to configure sudo to allow multiple httpd options with
  # one line, but suggested the run command instead.
  #sudo("/etc/init.d/httpd restart")
  run("sudo /etc/init.d/httpd restart")

def push_buildout(apache_restart=True):
  #per-task arguments given on the command line arrive as strings
  if isinstance(apache_restart, StringTypes):
      apache_restart = (apache_restart == 'True')

  change_status = False
  if env.host_string in env.roledefs['prod']:
      change_status = True

  #take out of service, it takes less time to take out of service than
  #  it does to put back into service
  if change_status:
      puts("taking offline; sleeping 20 seconds")
      alter_service_status('offline')
      sleep(20)

  buildout()
  if apache_restart:
      restart_apache()

  #TODO: test some urls, loading up the local ZODB cache before bringing
  #  back in to service


  #put back into service
  if change_status:
      puts("bringing online; sleeping 30 seconds")
      alter_service_status('online')
      sleep(30)

Adding fabric to your buildout is detailed here: http://www.vlent.nl/weblog/2010/09/27/fabric-easy-deployment/

This fabfile is located in the buildout root. Running an automated deployment of our production environment is simple:

./bin/fab -R prod push_buildout

When using mod_wsgi to serve Zope, a restart of apache is required for changes to take effect. If for any reason you’d want to push buildout but not restart apache, pass in False to the apache_restart per-task argument:

./bin/fab -R prod push_buildout:apache_restart=False

The combination of fabric and bethel.clustermgmt has decreased deployment time considerably. It is now one command run in the background, whereas before it was a 5-10 minute long repetitive rinse/repeat cycle for each node in the cluster.