Providing High Availability for Stratos with Pacemaker & Heartbeat

This article describes how High Availability (HA) can be configured with Pacemaker & Heartbeat for Apache Stratos. In general this concept can be applied for any server application which needs HA and does not require any data replication. If data replication is needed you may need to consider using DRBD with Pacemaker. First of all we will see what Pacemaker and Heartbeat are and go through a series of steps on configuring those.

What is Pacemaker?

Pacemaker is a Cluster Resource Manager (CRM) which can detect and recover from failures of nodes and resources. It basically can start, stop, check the status of a resource and take decisions for recovering them from failures.

What is a resource? A resource either can be a server application, an IP address or any other software/hardware resource that you can think of. These resources are managed through Resource Agents which is an abstraction layer that Pacemaker make use of to communicate with different types of resources. Out of the box Pacemaker provides Resource Agents for OCF and LSB services. In this example we will be using LSB Resource Agent to manage Apache Stratos as an init.d service.

What is Heartbeat?

Heartbeat is a daemon that provides messaging infrastructure for Pacemaker. It manages the communication between nodes and allows to know the presence of resources in the cluster.

Prerequisites

  • Oracle VirtualBox or any other virtualization technology
  • Ubuntu 12.04 server (64-bit) virtual machine image
  • Pacemaker 1.1.6 or above
  • Heartbeat 3.0.5 or above

Steps for Configuring Pacemaker & Heartbeat for Apache Stratos:

  • Start two instances of Ubuntu 12.04 server virtual machines.

  • Switch to root user: sudo su

  • Install Pacemaker and Heartbeat

    apt-get install pacemaker heartbeat
    
  • Create Heartbeat configuration file at the following location: /etc/ha.d/ha.cf

    # enable pacemaker, without stonith
    crm yes
    # define log file
    logfile /var/log/ha-log
    # warning of soon be dead
    warntime 10
    # declare a host (the other node) dead after:
    deadtime 20
    # dead time on boot (could take some time until net is up)
    initdead 120
    # time between heartbeats
    keepalive 2
    # the nodes
    node node1 # set node1 hostname
    node node2 # set node2 hostname
    # heartbeats, over dedicated replication interface
    ucast eth1 10.186.175.16 # set node1 network-interface and ip address
    ucast eth1 54.211.110.217 # set node2 network-interface and ip address
  • Create authentication key file and set permissions in one of the hosts:

    ( echo -ne "auth 1\n1 sha1 "; \
    dd if=/dev/urandom bs=512 count=1 | openssl md5 ) \
    > /etc/ha.d/authkeys
    chmod 0600 /etc/ha.d/authkeys
    
  • Copy the above authkeys file to each host (/etc/ha.d/authkeys).

  • Restart heartbeat service: service heartbeat restart

  • Now check the status of the Pacemaker cluster using CRM, here all nodes in the cluster should be in online state. If not check the heartbeat configuration again.

    crm status
    ============
    Last updated: Wed Oct 15 11:25:05 2014
    Last change: Wed Oct 15 11:21:51 2014 via crmd on ip-10-186-175-16
    Stack: Heartbeat
    Current DC: ip-10-186-175-16 (d16ccc5c-2641-42b6-b46a-57a0b32fddc9) - partition with quorum
    Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
    2 Nodes configured, unknown expected votes
    0 Resources configured.
    ============
    Online: [ ip-10-186-175-16 ip-10-153-165-178 ]
    
  • Disable STONITH: crm configure property stonith-enabled=false

  • Create a Failover IP resource to manage the virtual IP address:

    crm configure primitive FAILOVER-IP ocf:heartbeat:IPaddr params ip=192.168.10.20 cidr_netmask="255.255.255.0" op monitor interval=10s
    
  • SCP java and Apache Stratos packages to both hosts and extract them under /opt folder.

  • Create an init.d script for Stratos using following code and update USER, JAVA_HOME and PRODUCT_HOME variable values:

    ```` https://gist.github.com/imesh/5256272cd71b74a06581

    #!/bin/sh

    BEGIN INIT INFO

    Provides: stratos

    Required-Start: $local_fs $remote_fs $network $syslog $named

    Required-Stop: $local_fs $remote_fs $network $syslog $named

    Default-Start: 2 3 4 5

    Default-Stop: 0 1 6

    X-Interactive: true

    Short-Description: Start/stop stratos server

    END INIT INFO

    USER=“vagrant” PRODUCT_NAME=“stratos” JAVA_HOME=“/opt/jdk1.7.0_60” PRODUCT_HOME=“/opt/apache_stratos_4.1.0_SNAPSHOT” PID_FILE=“${PRODUCT_HOME}/wso2carbon.pid” CMD=“${PRODUCT_HOME}/bin/stratos.sh”

    LSB exit codes:

    ftp://ftp.nomadlinux.com/nomad-2/dist/heartbeat-1.2.5/include/clplumbing/lsb_exitcodes.h

    LSB_EXIT_OK=0 LSB_EXIT_GENERIC=1 LSB_EXIT_EINVAL=2 LSB_EXIT_ENOTSUPPORTED=3 LSB_EXIT_EPERM=4 LSB_EXIT_NOTINSTALLED=5 LSB_EXIT_NOTCONFIGED=6 LSB_EXIT_NOTRUNNING=7

    is_service_running() { if [ -e ${PID_FILE} ]; then PID=cat ${PID_FILE} if ps -p $PID >&- ; then # service is running return 0 else # service is stopped return 1 fi else # pid file was not found, may be server was not started before return 1 fi }

    Status the service

    status() { is_service_running service_status=$?

    if [ "${service_status}" -eq 0 ]; then
        echo "${PRODUCT_NAME} service is running"
        return ${LSB_EXIT_OK}
        elif [ "${service_status}" -eq 1 ]; then
        echo "$PRODUCT_NAME service is stopped"
        return ${LSB_EXIT_OK}
        else
        echo "$PRODUCT_NAME service status is unknown"
        return ${LSB_EXIT_GENERIC}
        fi
    }
    
    
    # Start the service
    start() {
        if is_service_running; then
        echo "${PRODUCT_NAME} service is already running"
        return ${LSB_EXIT_OK}
        fi
    
    
        echo "starting ${PRODUCT_NAME} service..."
        su - ${USER} -c "export JAVA_HOME=${JAVA_HOME}; ${CMD} start"
    
    
        is_service_running
        service_status=$?
        while [ "$service_status" -ne "0" ]
        do
        sleep 1;
        is_service_running
        service_status=$?
        done
    
    
        echo "${PRODUCT_NAME} service started"
        return ${LSB_EXIT_OK}
    }
    
    
    # Restart the service
    restart() {
        echo "restarting ${PRODUCT_NAME} service..."
        su - ${USER} -c "export JAVA_HOME=${JAVA_HOME}; ${CMD} restart"
        echo "${PRODUCT_NAME} service restarted"
        return ${LSB_EXIT_OK}
    }
    
    
    # Stop the service
    stop() {
        if ! is_service_running; then
        echo "${PRODUCT_NAME} service is already stopped"
        return ${LSB_EXIT_OK}
        fi
    
    
        echo "stopping ${PRODUCT_NAME} service..."
        su - ${USER} -c "export JAVA_HOME=${JAVA_HOME}; ${CMD} stop"
    
    
        is_service_running
        service_status=$?
        while [ "$service_status" -eq "0" ]
        do
        sleep 1;
        is_service_running
        service_status=$?
        done
    
    
        echo "${PRODUCT_NAME} service stopped"
        return ${LSB_EXIT_OK}
    }
    ### main logic ###
    case "$1" in
    start)
        start
    ;;
    stop|graceful-stop)
        stop
    ;;
    status)
        status
    ;;
    restart|reload|force-reload)
        restart
    ;;
    *)
    echo $"usage: $0 {start|stop|graceful-stop|restart|reload|force-reload|status}"
    exit 1
    esac
    exit $?
    ````
    
  • Create a CRM resource for stratos:

crm configure primitive STRATOS lsb::stratos op monitor interval=15s
  • Create a CRM resource group and add FAILOVER-IP and STRATOS resources:
crm configure group FAILOVER-IP-RESOURCE-GROUP FAILOVER-IP STRATOS
  • Configure a colocation dependency between FAILOVER-IP and STRATOS. This will make sure that both FAILOVER-IP and STRATOS resources will stay in the same host.
crm configure colocation FAILOVER-IP-RESOURCE-GROUP-COLOCATION inf: FAILOVER-IP STRATOS

Imesh Gunaratne

Read more posts by this author.

Colombo, LK http://imesh.io