Ahead of the Pack: the Pacemaker High-Availability Stack

Corosync

Corosync's configuration files live in /etc/corosync, and the central configuration is in /etc/corosync/corosync.conf. Here's an example of the contents of this file:


totem {
  # Enable node authentication & encryption
  secauth: on

  # Redundant ring protocol: none, active, passive.
  rrp_mode: active
  
  # Redundant communications interfaces
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.0.0
    mcastaddr: 239.255.29.144
    mcastport: 5405
  }
  interface {
    ringnumber: 1
    bindnetaddr: 192.168.42.0
    mcastaddr: 239.255.42.0
    mcastport: 5405
  }
}

amf {
  mode: disabled
}

service {
  # Load Pacemaker
  ver: 1
  name: pacemaker
}

logging {
  fileline: off
  to_stderr: yes
  to_logfile: no
  to_syslog: yes
  syslog_facility: daemon
  debug: off
  timestamp: on
}

The important bits here are the two interface declarations enabling redundant cluster communications, together with the corresponding rrp_mode declaration. Mutual node authentication and encryption (secauth: on) is good security practice. Finally, the service stanza loads the Pacemaker cluster manager as a Corosync plugin; with ver: 1, Pacemaker's own daemons are not started by Corosync but separately, as we do below.

With secauth enabled, Corosync also requires a shared secret for mutual node authentication. Corosync uses a simple 128-byte secret that it stores in /etc/corosync/authkey, and which you can easily generate with the corosync-keygen utility.
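For example, on one node (a minimal sketch, run as root; corosync-keygen reads from /dev/random, so it may pause until the kernel has gathered enough entropy):


# Writes the 128-byte shared secret to /etc/corosync/authkey,
# readable by root only
corosync-keygen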

Once corosync.conf and authkey are in shape, copy them over to all nodes in your prospective cluster. Then, fire up Corosync cluster communications—a simple service corosync start will do.
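Assuming the other nodes are named bob and charlie (the hostnames used later in this article), distribution and startup might look like this:


# Copy the configuration and the shared key to the other nodes
scp -p /etc/corosync/corosync.conf /etc/corosync/authkey root@bob:/etc/corosync/
scp -p /etc/corosync/corosync.conf /etc/corosync/authkey root@charlie:/etc/corosync/

# Then, on every node:
service corosync start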

Once the service is running on all nodes, the command corosync-cfgtool -s will display both rings as healthy, and the cluster is ready to communicate:


Printing ring status.
Local node ID 303938909
RING ID 0
        id      = 192.168.0.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.42.1
        status  = ring 1 active with no faults

Pacemaker

Once Corosync runs, we can start Pacemaker with the service pacemaker start command. After a few seconds, Pacemaker elects a Designated Coordinator (DC) node among the three available nodes and commences full cluster operations. The crm_mon utility, executable on any cluster node, then produces output similar to this:


============
Last updated: Fri Feb  3 18:40:15 2012
Stack: openais
Current DC: bob - partition with quorum
Version: 1.1.6-4.el6-89678d4947c5bd466e2f31acd58ea4e1edb854d5
3 Nodes configured, 3 expected votes
0 Resources configured.
============

The output produced by crm_mon is a more user-friendly representation of the internal cluster configuration and status, which is stored in a distributed XML database called the Cluster Information Base (CIB). Those interested and brave enough to care about the internal representation are welcome to make use of the cibadmin -Q command. But be warned: before you do so, you may want to send the junior sysadmin next to you out for coffee, because the avalanche of XML that cibadmin produces can be intimidating to the uninitiated.
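If you do want a peek at (or a backup copy of) the raw CIB, a couple of harmless, read-only invocations are:


# Dump the complete CIB as XML to the terminal
cibadmin -Q

# Or stash a copy away before making changes
cibadmin -Q > /root/cib-backup.xml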

Much less intimidating is the standard configuration facility for Pacemaker, the crm shell. This self-documenting, hierarchical, scriptable subshell is the simplest and most universal way of manipulating Pacemaker clusters. In its configure submenu, the shell allows us to load and import configuration snippets—or even complete configurations, as below:


primitive p_iscsi ocf:heartbeat:iscsi \
  params portal="192.168.122.100:3260" \
    target="iqn.2011-09.com.hastexo:virtcluster" \
  op monitor interval="10"
primitive p_xray ocf:heartbeat:VirtualDomain \
  params config="/etc/libvirt/qemu/xray.xml" \
  op monitor interval="30" timeout="30" \
  op start interval="0" timeout="120" \
  op stop interval="0" timeout="120"
primitive p_yankee ocf:heartbeat:VirtualDomain \
  params config="/etc/libvirt/qemu/yankee.xml" \
  op monitor interval="30" timeout="30" \
  op start interval="0" timeout="120" \
  op stop interval="0" timeout="120"
primitive p_zulu ocf:heartbeat:VirtualDomain \
  params config="/etc/libvirt/qemu/zulu.xml" \
  op monitor interval="30" timeout="30" \
  op start interval="0" timeout="120" \
  op stop interval="0" timeout="120"
clone cl_iscsi p_iscsi
colocation c_xray_on_iscsi inf: p_xray cl_iscsi
colocation c_yankee_on_iscsi inf: p_yankee cl_iscsi
colocation c_zulu_on_iscsi inf: p_zulu cl_iscsi
order o_iscsi_before_xray inf: cl_iscsi p_xray
order o_iscsi_before_yankee inf: cl_iscsi p_yankee
order o_iscsi_before_zulu inf: cl_iscsi p_zulu

Besides defining our three virtual domains as resources under full cluster management and monitoring (p_xray, p_yankee and p_zulu), this configuration also ensures that all domains can find their storage (the cl_iscsi clone), and that they wait until iSCSI storage becomes available (the order and colocation constraints).
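To apply a snippet like the one above, you can either paste it into an interactive crm configure session and commit, or load it from a file. Assuming it has been saved to a file named cluster.crm (a name chosen here purely for illustration), a reasonably recent crm shell lets you do:


# Merge the snippet into the live cluster configuration
crm configure load update cluster.crm

# Review the result
crm configure show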

This being a single-instance storage cluster, it's imperative that we also employ safeguards against shredding our data. This is commonly known as node fencing, but Pacemaker uses the more endearing term STONITH (Shoot The Other Node In The Head) for the same concept. A ubiquitous means of node fencing is controlling nodes via their IPMI Baseboard Management Controllers, and Pacemaker supports this natively:


primitive p_ipmi_alice stonith:external/ipmi \
  params hostname="alice" ipaddr="192.168.15.1" \
    userid="admin" passwd="foobar" \
  op start interval="0" timeout="60" \
  op monitor interval="120" timeout="60"
primitive p_ipmi_bob stonith:external/ipmi \
  params hostname="bob" ipaddr="192.168.15.2" \
    userid="admin" passwd="foobar" \
  op start interval="0" timeout="60" \
  op monitor interval="120" timeout="60"
primitive p_ipmi_charlie stonith:external/ipmi \
  params hostname="charlie" ipaddr="192.168.15.3" \
    userid="admin" passwd="foobar" \
  op start interval="0" timeout="60" \
  op monitor interval="120" timeout="60"
location l_ipmi_alice p_ipmi_alice -inf: alice
location l_ipmi_bob p_ipmi_bob -inf: bob
location l_ipmi_charlie p_ipmi_charlie -inf: charlie
property stonith-enabled="true"

The three location constraints here ensure that no node has to shoot itself.
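Before relying on fencing in anger, it is worth verifying that it actually works. One way to do so, assuming you pick a node that is not currently running anything important (this really does power-cycle the target), is Pacemaker's stonith_admin utility:


# Ask the cluster to fence (reboot) node charlie via its IPMI BMC
stonith_admin --reboot charlie

# Afterward, confirm that charlie rejoins the cluster
crm_mon -1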

Once that configuration is active, Pacemaker fires up resources as determined by the cluster configuration. Again, we can query the cluster state with the crm_mon command, which now produces much more interesting output than before:


============
Last updated: Fri Feb  3 19:46:29 2012
Stack: openais
Current DC: bob - partition with quorum
Version: 1.1.6-4.el6-89678d4947c5bd466e2f31acd58ea4e1edb854d5
3 Nodes configured, 3 expected votes
9 Resources configured.
============

Online: [ alice bob charlie ]

 Clone Set: cl_iscsi [p_iscsi]
     Started: [ alice bob charlie ]
 p_ipmi_alice   (stonith:external/ipmi):        Started bob
 p_ipmi_bob     (stonith:external/ipmi):        Started charlie
 p_ipmi_charlie (stonith:external/ipmi):        Started alice
 p_xray         (ocf::heartbeat:VirtualDomain): Started alice
 p_yankee       (ocf::heartbeat:VirtualDomain): Started bob
 p_zulu         (ocf::heartbeat:VirtualDomain): Started charlie

Note that, by default, Pacemaker clusters are symmetric: any resource may run on any node, and the resource manager distributes resources across the cluster nodes to balance the load.
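If you would rather nudge placement yourself than rely on the default balancing, an ordinary location constraint with a positive, finite score does the trick. For example, using the resource and node names from above, the following sketch expresses a preference, not a hard requirement, for running the xray domain on alice:


# Prefer alice for p_xray; the resource can still run elsewhere
# if alice is unavailable
location l_xray_prefers_alice p_xray 100: alice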
