Resolving a RAC Node Startup Failure Caused by a HAIP Problem

A reader asked me about a problem with his 11.2.0.2 RAC (on AIX), which had no patches or PSUs installed. One of the nodes could not start normally after a reboot. The ocssd log showed the following:

2014-08-09 14:21:46.094: [CSSD][5414]clssnmSendingThread: sent 4 join msgs to all nodes
2014-08-09 14:21:46.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:47.042: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958157, LATS 1518247992, lastSeqNo 255958154, uniqueness 1406064021, timestamp 1407565306/1501758072
2014-08-09 14:21:47.051: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958158, LATS 1518248002, lastSeqNo 255958155, uniqueness 1406064021, timestamp 1407565306/1501758190
2014-08-09 14:21:47.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:48.042: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958160, LATS 1518248993, lastSeqNo 255958157, uniqueness 1406064021, timestamp 1407565307/1501759080
2014-08-09 14:21:48.052: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958161, LATS 1518249002, lastSeqNo 255958158, uniqueness 1406064021, timestamp 1407565307/1501759191
2014-08-09 14:21:48.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:49.043: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958163, LATS 1518249993, lastSeqNo 255958160, uniqueness 1406064021, timestamp 1407565308/1501760082
2014-08-09 14:21:49.056: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958164, LATS 1518250007, lastSeqNo 255958161, uniqueness 1406064021, timestamp 1407565308/1501760193
2014-08-09 14:21:49.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:50.044: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958166, LATS 1518250994, lastSeqNo 255958163, uniqueness 1406064021, timestamp 1407565309/1501761090
2014-08-09 14:21:50.057: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958167, LATS 1518251007, lastSeqNo 255958164, uniqueness 1406064021, timestamp 1407565309/1501761195
2014-08-09 14:21:50.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:51.046: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958169, LATS 1518251996, lastSeqNo 255958166, uniqueness 1406064021, timestamp 1407565310/1501762100
2014-08-09 14:21:51.057: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958170, LATS 1518252008, lastSeqNo 255958167, uniqueness 1406064021, timestamp 1407565310/1501762205
2014-08-09 14:21:51.102: [CSSD][5414]clssnmSendingThread: sending join msg to all nodes
2014-08-09 14:21:51.102: [CSSD][5414]clssnmSendingThread: sent 5 join msgs to all nodes
2014-08-09 14:21:51.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:52.050: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958172, LATS 1518253000, lastSeqNo 255958169, uniqueness 1406064021, timestamp 1407565311/1501763110
2014-08-09 14:21:52.058: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958173, LATS 1518253008, lastSeqNo 255958170, uniqueness 1406064021, timestamp 1407565311/1501763230
2014-08-09 14:21:52.089: [CSSD][5671]clssnmRcfgMgrThread: Local Join
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: set curtime (1518253039) for my node
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: scanning 32 nodes
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: Node rac01, number 1, is in an existing cluster with disk state 3
2014-08-09 14:21:52.090: [CSSD][5671]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2014-08-09 14:21:52.431: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0

From this output it is easy to conclude that the heartbeat is the problem. That reading is not wrong; it is just that the heartbeat here is not the traditional heartbeat network we usually mean. I asked him to run the following query on the node where CRS was still healthy, and the cause became clear:

SQL> select name, ip_address from v$cluster_interconnects;

NAME            IP_ADDRESS
--------------- ----------------
en0             169.254.116.242

Why is the interconnect IP in the 169.254 range? It clearly does not match what is configured in /etc/hosts. Why?

This is where the HAIP feature, introduced in Oracle 11gR2, comes in. Oracle introduced HAIP to provide interconnect redundancy with its own technology, rather than relying on third-party mechanisms such as Linux bonding. Before 11.2.0.2, if the interconnect NICs were bonded at the OS level, Oracle used the OS-bonded interface. Starting with 11.2.0.2, if no interconnect redundancy is configured at the OS level, Oracle's own HAIP is enabled. So although you configured 192.168.1.100, Oracle actually communicates over an address in the 169.254.x.x range. You can confirm this in the alert log, so I will not dwell on it here.

And indeed, the healthy node showed an IP in the 169.254 range, while the problem node showed no such address; a quick way to check this on both nodes is sketched below.
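As a minimal sketch of that check, assuming the private interface is en0 (as reported by v$cluster_interconnects above) and that the crsctl command is run as root from the Grid home:

# Does a link-local HAIP address (169.254.x.x) exist on this node?
ifconfig -a | grep 169.254

# Which interfaces has Clusterware registered, and with which roles?
oifcfg getif

# State of the HAIP resource in the lower (init) stack
crsctl stat res ora.cluster_interconnect.haip -init

On the healthy node the first command returns the 169.254 address and the HAIP resource shows ONLINE; on the problem node the address is absent.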
Oracle MOS offers one workaround, as follows:

crsctl start res ora.cluster_interconnect.haip -init

In our testing, however, this did not help even when run as root.

For HAIP failing to start, the Oracle MOS documentation attributes it to a few usual causes:

1) a faulty interconnect NIC
2) multicast not working
3) a firewall in the way
4) an Oracle bug

As for a faulty interconnect NIC: if there is only one interconnect NIC, pinging the other node's private IP is enough to verify it, so this cause is easy to rule out. As for multicast, it can be checked with the mcasttest.pl script provided by Oracle (refer to the MOS note Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement); a usage sketch follows.
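A hedged usage sketch of mcasttest.pl: the script and its flags come from the MOS note above; the second node name rac02 and the interface en0 are assumptions for this environment (the log only tells us the local node is node 2 and its peer is rac01):

# Download mcasttest.pl per the MOS note, then run it from one node.
# -n takes the comma-separated cluster node names, -i the private interface(s).
perl mcasttest.pl -n rac01,rac02 -i en0

The script reports whether multicast traffic gets through between the listed nodes on the groups Grid Infrastructure uses (230.0.1.0, and 224.0.0.251 with the relevant patch); if both groups fail, HAIP cannot start until multicast is fixed.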