HACMP clstrmgrES termination error
Introduction
The HACMP cluster failed over to another node, and the previously active node was halted.
After the node was brought back up, the issue was investigated by analyzing the following log files.
The AIX error log
errpt -a | more
------------------------------------------------------------------------
LABEL:           SRC_SVKO
IDENTIFIER:      BC3BE5A3

Date/Time:       Mon 8 Mar 17:10:07 2010
Sequence Number: 139533
Machine Id:      00C7DFFE4C03
Node Id:         node1
Class:           S
Type:            PERM
Resource Name:   SRC

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
512
SOFTWARE ERROR CODE
-9017
ERROR CODE
0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
clstrmgrES
------------------------------------------------------------------------
LABEL:           J2_FS_FULL
IDENTIFIER:      F7FA22C9

Date/Time:       Mon 8 Mar 16:39:57 2010
Sequence Number: 139532
Machine Id:      00C7DFFE4C03
Node Id:         node1
Class:           O
Type:            INFO
Resource Name:   SYSJ2

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

Recommended Actions
INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
REMOVE UNNECESSARY DATA FROM FILE SYSTEM
USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0008
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp
------------------------------------------------------------------------
/usr/es/adm/cluster.log
The file /usr/es/adm/cluster.log contained the following entries.
Mar 8 17:10:07 node1 daemon:err|error snmpd[274636]: EXCEPTIONS: no response after 200 seconds (SMUX 127.0.0.1+45860+6)
Mar 8 17:10:07 node1 user:notice HACMP for AIX: clexit.rc : Unexpected termination of clstrmgrES.
Mar 8 17:10:07 node1 user:notice HACMP for AIX: clexit.rc : Halting system immediately!!!
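As the J2_FS_FULL entry in the AIX error log shows, the /tmp file system (/dev/hd3) was full at 16:39:57, about 30 minutes before clstrmgrES terminated at 17:10:07.
On this server the log file clstrmgr.debug is written to /tmp. When clstrmgrES could no longer write to this log file, it terminated, and clexit.rc halted the node, causing the cluster to fail over.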
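Because the failure started with /tmp filling up, one preventive measure is to watch the usage of any file system that holds cluster logs. The following is a minimal sketch and not part of HACMP: the 90% threshold, the function names fs_usage_pct and over_threshold, and the list of paths are assumptions chosen for illustration. It relies only on df -P (POSIX output format) and awk.

```shell
#!/bin/sh
# Hypothetical threshold (percent); choose a value that suits your site.
THRESHOLD=90

# Print the usage percentage (integer, without the % sign) of the
# file system that holds the path given as $1.
fs_usage_pct() {
    df -Pk "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) when usage $1 is at or above threshold $2.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Assumed list of paths to watch; /tmp is where clstrmgr.debug
# was written on the server described here.
for fs in /tmp; do
    pct=$(fs_usage_pct "$fs")
    if over_threshold "$pct" "$THRESHOLD"; then
        echo "WARNING: $fs is ${pct}% full"
    fi
done
```

A script like this could be run periodically, for example from cron, so that a filling file system is noticed before the cluster manager fails to open its log.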
Fix
IBM APAR IZ05428 covers this issue. Its details are reproduced below.
IZ05428: HACMP UNEXPECTED EXIT DURING LOG CYCLE IN FULL FILESYSTEM

APAR status
Closed as program error.

Error description
HACMP cluster manager will exit when there is a failure opening a new log file. This behaviour was designed to protect against unknown issues with running a cluster without sufficient logging space. This behaviour is being changed because the consequences of exiting are the same, or greater than any possible unknown issues with cluster manager continuing to run.

Local fix
Properly maintain enough space in your logging directories to be able to maintain logs for important RAS information. It is also important to remember that when the cluster manager continues processing without logging, it will be difficult, or impossible to determine the flow of events for debugging, or understanding any cluster actions.

Problem summary
When HACMP cluster is up and running, cluster manager will call exit when there is a failure opening a log file resulting in a node failure.

Problem conclusion
Avoid cluster manager calling exit in case of log file opening issue by disabling logging and allowing cluster manager to continue. Logging will be disabled till the issue resulting in log file open failure is resolved. Example: if log file creation has failed, then logging will be disabled till adequate file system space is provided for the log file to be created. Every time an attempt is made to open a log file we notify by an error through errpt and stderr. It is also important to remember that when the cluster manager continues processing without logging, it will be difficult, or impossible to determine the flow of events for debugging, or understanding any cluster actions.
Further details can be found at the following link.
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ05428