RAC – Failure 1 contacting Cluster Synchronization Services daemon

After a recent storage migration, a two-node cluster would not start following a reboot.  I ran a ‘crsctl start crs’ command as root, but no services had started after several minutes.  Although I suspected something had gone wrong with the storage migration, I needed more proof.  I then checked the crs as root:

# /u01/oracle/product/crs/bin/crsctl check crs
Failure 1 contacting Cluster Synchronization Services daemon
Cannot communicate with Cluster Ready Services
Cannot communicate with Event Manager

After a quick search, I found a site with good information on this problem here.  I will paraphrase a bit of the information I found there.

Check the processes:

# ps -aef | grep "init\."
root      4124     1  0 12:08 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root      4125     1  0 12:08 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      4126     1  0 12:08 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root      4710  4124  0 12:08 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      5031  4125  0 12:08 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      5289  4126  0 12:08 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

The ‘startcheck’ processes are a good sign, indicating that the crs is not disabled; if the crs were disabled, these processes would not be present.  You can enable and disable the crs by executing ‘crsctl <enable|disable> crs’ as root.  Disabling the crs so that it does not start automatically is a good precaution when the underlying OS needs to be patched, or when maintenance could cause multiple reboots of the nodes.
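As a sketch, the enable/disable cycle around OS maintenance looks like the following (the CRS_HOME path is the one from this system; run as root on each node):

```shell
# Keep CRS from auto-starting while the OS is being patched.
/u01/oracle/product/crs/bin/crsctl disable crs

# ... patch the OS, reboot as many times as needed ...

# Re-enable auto-start and bring the stack back up.
/u01/oracle/product/crs/bin/crsctl enable crs
/u01/oracle/product/crs/bin/crsctl start crs
```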

At this point most DBAs would proceed to the log files in the CRS_HOME, but Surachat Opun, in the URL above, suggested reviewing the /var/log/messages file on one of the nodes.  Viewing this log, I found many messages similar to the following:

Apr  3 12:11:59 oratest01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.5289.
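A quick way to pull those entries out of the syslog (standard Linux location assumed) is:

```shell
# Show recent CRS-related syslog entries; each one names a per-PID
# diagnostic file under /tmp.
grep -i 'Cluster Ready Services' /var/log/messages | tail
```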

Checking one of these files under /tmp revealed more information:

Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device: PROC-26: Error while accessing the physical storage
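To dump every one of those diagnostic files at once, a small loop works (the numeric suffix is the PID of the wrapper that wrote it, so the names vary from run to run):

```shell
# Print each CRS diagnostic file in /tmp with a header; safe to run
# even when no such files exist.
for f in /tmp/crsctl.*; do
  [ -f "$f" ] && { echo "== $f =="; cat "$f"; }
done
```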

So, the problem was with the OCR.  I viewed the contents of the /etc/oracle/ocr.loc file to see where the cluster expected to find the OCR:
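The actual file contents didn’t make it into this post, but on Linux an ocr.loc with a mirrored OCR typically contains entries along these lines (the raw device paths here match the devices checked below):

```
ocrconfig_loc=/dev/raw/raw101
ocrmirrorconfig_loc=/dev/raw/raw201
local_only=FALSE
```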


I then reviewed the individual devices:

root@dfw1wui1 [/etc/init.d]
# ll /dev/raw/raw101
crw------- 1 oracle dba 162, 101 Mar 11 19:07 /dev/raw/raw101

# ll /dev/raw/raw201
ls: /dev/raw/raw201: No such file or directory

Notice the use of raw devices, which are no longer supported in newer versions of Oracle.  So, the cluster could not find the OCR mirror location.  Fortunately, I had worked with a system engineer during the migration and was able to rely on their assistance.  The problem turned out to be a typo for the second disk in the rc.local file.  Once the system engineer corrected the typo and rebooted the nodes, the cluster came up automatically as expected.
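I don’t have the engineer’s actual file, but raw-device bindings of that era in /etc/rc.local typically look like the following (the block device names are illustrative; the ownership and mode match what ‘ll’ showed above, and the typo was in the line for the second device):

```shell
# Illustrative /etc/rc.local raw bindings -- actual block devices unknown.
raw /dev/raw/raw101 /dev/sdb1   # OCR primary
raw /dev/raw/raw201 /dev/sdc1   # OCR mirror -- a typo here left raw201 missing
chown oracle:dba /dev/raw/raw101 /dev/raw/raw201
chmod 600 /dev/raw/raw101 /dev/raw/raw201
```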

Thanks to Surachat Opun’s information, I learned something new about the ‘startcheck’ processes and discovered that Oracle writes error files to /tmp.  I also picked up a new method of troubleshooting cluster issues.  Previously, I would have gone to the CRS_HOME logs after determining that the cluster would not start, but checking the /var/log/messages file first may save me some time in reaching the same conclusion.