After a recent storage migration, a version 18.104.22.168.0 two-node cluster would not start after a reboot. I ran ‘crsctl start crs’ as root, but no services had started after several minutes. Although I suspected something had gone wrong with the storage migration, I needed more proof. I then checked the crs as root:
# /u01/oracle/product/crs/bin/crsctl check crs
Failure 1 contacting Cluster Synchronization Services daemon
Cannot communicate with Cluster Ready Services
Cannot communicate with Event Manage
After a quick search, I found a site with good information on this problem. I will paraphrase some of what I found there.
Check the processes:
# ps -aef | grep "init\."
root 4124 1 0 12:08 ? 00:00:00 /bin/sh /etc/init.d/init.evmd run
root 4125 1 0 12:08 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 4126 1 0 12:08 ? 00:00:00 /bin/sh /etc/init.d/init.crsd run
root 4710 4124 0 12:08 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 5031 4125 0 12:08 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 5289 4126 0 12:08 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
The ‘startcheck’ processes are a good sign: they indicate that the crs is not disabled. If the crs were disabled, these processes would not be present. You can enable or disable the crs by executing ‘crsctl enable crs’ or ‘crsctl disable crs’ as root. Disabling the crs so that it does not start automatically is a good precaution when the underlying OS needs to be patched, or when other maintenance could cause multiple reboots of the nodes.
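As a rough sketch of the check described above, the helper below reads ps output on stdin and reports whether any init.cssd ‘startcheck’ processes are present. The function name startcheck_status is made up for illustration and is not part of any Oracle tooling; the wording of its messages is deliberately cautious, since it only reflects the rule of thumb stated in this post.

```shell
#!/bin/sh
# startcheck_status: hypothetical helper, not an Oracle utility.
# Reads `ps -ef` style output on stdin and counts the lines that show
# an init.cssd process running with the "startcheck" argument.
startcheck_status() {
    # grep -c prints the match count; it exits non-zero when the count is 0,
    # so "|| true" keeps the substitution from failing in that case.
    count=$(grep -c 'init\.cssd startcheck' || true)
    if [ "$count" -gt 0 ]; then
        echo "startcheck processes found: $count (CRS autostart appears enabled)"
    else
        echo "no startcheck processes (CRS autostart may be disabled)"
    fi
}

# Example use on a live node (run as root):
#   ps -ef | startcheck_status
```

The pattern escapes the dot for the same reason the original command greps for "init\.": the grep process itself then does not match its own entry in the ps listing.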
At this point most DBAs would proceed to the log files in the CRS_HOME, but Surachat Opun in the URL above suggested reviewing the /var/log/messages file on one of the nodes. Viewing this log, I found many messages similar to the following:
Apr 3 12:11:59 oratest01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.5289.
Checking one of these files under /tmp revealed more information:
Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device: PROC-26: Error while accessing the physical storage
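The two steps above (find the ‘Diagnostics in /tmp/crsctl.N’ messages, then read the files they point to) can be sketched as a pair of small shell functions. The names crs_diag_files and show_crs_diag are hypothetical, and the pattern assumes the log lines look like the sample shown above; other versions may word the message differently.

```shell
#!/bin/sh
# crs_diag_files [LOGFILE]: hypothetical helper.
# Prints each distinct /tmp/crsctl.<pid> path mentioned in the log once.
# Assumes messages of the form "... Diagnostics in /tmp/crsctl.5289."
crs_diag_files() {
    log=${1:-/var/log/messages}
    sed -n 's|.*Diagnostics in \(/tmp/crsctl\.[0-9][0-9]*\).*|\1|p' "$log" | sort -u
}

# show_crs_diag [LOGFILE]: hypothetical helper.
# Prints the contents of every diagnostics file that is still readable.
show_crs_diag() {
    crs_diag_files "$1" | while read -r f; do
        if [ -r "$f" ]; then
            echo "== $f"
            cat "$f"
        fi
    done
}

# Example use on a node (run as root):
#   show_crs_diag /var/log/messages
```

Because the same message repeats many times while the cluster waits, de-duplicating the paths with sort -u keeps the output short.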
So, the problem was with the OCR. I viewed the contents of the /etc/oracle/ocr.loc file to find where the cluster was trying to find the OCR:
ocrconfig_loc=/dev/raw/raw101
ocrmirrorconfig_loc=/dev/raw/raw201
local_only=FALSE
I then reviewed the individual devices:
root@dfw1wui1 [/etc/init.d] # ll /dev/raw/raw101
crw------- 1 oracle dba 162, 101 Mar 11 19:07 /dev/raw/raw101
# ll /dev/raw/raw201
ls: /dev/raw/raw201: No such file or directory
Note the use of raw devices, which are no longer supported in newer versions of Oracle. The second device, /dev/raw/raw201, did not exist, so the cluster could not find the OCR mirror location. Fortunately, I was working with a system engineer during the migration and was able to rely on their assistance. The problem was a typo for the second disk in the rc.local file. Once the system engineer corrected the typo and rebooted the nodes, the cluster came up automatically as expected.
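The manual device check above could be scripted along these lines. This is only a sketch: check_ocr_devices is a made-up name, it looks only at the two location keys that appear in the ocr.loc shown earlier, and it tests whether each path exists, not whether the cluster can actually read it.

```shell
#!/bin/sh
# check_ocr_devices [OCR_LOC_FILE]: hypothetical helper.
# Reads key=value lines and reports whether the path named by
# ocrconfig_loc and ocrmirrorconfig_loc exists on this node.
# Returns 1 if any listed path is missing, 0 otherwise.
check_ocr_devices() {
    loc_file=${1:-/etc/oracle/ocr.loc}
    status=0
    # "|| [ -n "$key" ]" also handles a last line without a trailing newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            ocrconfig_loc|ocrmirrorconfig_loc)
                if [ -e "$value" ]; then
                    echo "OK $key=$value"
                else
                    echo "MISSING $key=$value"
                    status=1
                fi
                ;;
        esac
    done < "$loc_file"
    return $status
}

# Example use on a node (run as root):
#   check_ocr_devices /etc/oracle/ocr.loc
```

Run against the configuration in this post, a check like this would be expected to flag the mirror entry, since /dev/raw/raw201 did not exist until the rc.local typo was fixed.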
Thanks to Surachat Opun’s information, I learned something new about the ‘startcheck’ processes and that Oracle writes diagnostic files to /tmp. I also learned a new method of troubleshooting cluster issues. Previously, I would have gone to the CRS_HOME logs after determining that the cluster would not start, but going to the /var/log/messages file first may have saved me some time in reaching the same conclusion.