Cluster Time Synchronization (time synchronization between RAC nodes)

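Once again, I ran into a time synchronization problem while installing RAC on AIX. Node 2 went down during a data import with imp. The alert log is shown below.

Node 2: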

2010-05-10 14:38:20.599
[ctssd(5243264)]CRS-2404:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.
2010-05-10 14:38:20.599
[ctssd(5243264)]CRS-2408:The clock on host jzdbiufo has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2010-05-10 15:11:25.818
[ctssd(5243264)]CRS-2404:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.
2010-05-10 15:11:25.818
[ctssd(5243264)]CRS-2408:The clock on host jzdbiufo has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2010-05-10 15:21:46.319
[cssd(4259964)]CRS-1612:Network communication with node jzdbnc (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.926 seconds
2010-05-10 15:21:54.322
[cssd(4259964)]CRS-1611:Network communication with node jzdbnc (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.923 seconds
2010-05-10 15:21:58.348
[cssd(4259964)]CRS-1610:Network communication with node jzdbnc (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.897 seconds
2010-05-10 15:22:01.253
[cssd(4259964)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:22:01.254
[cssd(4259964)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:22:01.339
[cssd(4259964)]CRS-1652:Starting clean up of CRSD resources.
2010-05-10 15:22:01.927
[cssd(4259964)]CRS-1608:This node was evicted by node 1, jzdbnc; details at (:CSSNM00005:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.

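As the log shows, node 2 was evicted from the cluster by node 1 at 15:22:01.927. Before that, node 2's local time differed significantly from the mean cluster time, so CTSS kept updating the node's clock. Earlier entries in the alert log report the same kind of problem over and over.

Now look at the alert log on node 1: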

2011-05-10 14:27:51.428
[cssd(4391224)]CRS-1612:Network communication with node jzdbiufo (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.051 seconds
2011-05-10 14:27:58.463
[cssd(4391224)]CRS-1611:Network communication with node jzdbiufo (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.017 seconds
2011-05-10 14:28:03.473
[cssd(4391224)]CRS-1610:Network communication with node jzdbiufo (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.007 seconds
2011-05-10 14:28:05.485
[cssd(4391224)]CRS-1607:Node jzdbiufo is being evicted in cluster incarnation 200459455; details at (:CSSNM00007:) in /oracle/app/11.2.0/grid/log/jzdbnc/cssd/ocssd.log.
2011-05-10 14:28:08.501
[cssd(4391224)]CRS-1625:Node jzdbiufo, number 2, was manually shut down

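Before this, of course, node 1 had also been trying to update its time.

The last "Network communication missing" messages on the two nodes were logged at 14:28:03.473 and 15:21:58.348 respectively. In theory, a difference of about an hour should not bring a node down.

(In fact, with the logs posted so far, careful readers can already spot the problem. If you have not, read on.)

Next, the octssd log on node 2: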

2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude [31532764119367] of the offset [-31532764119367 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764119367] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]

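Node 2 received a time sync message from the master (node 1). The offset is 31532764119367 usec, which exceeds the CTSS limit of 86400000000 usec. The offset is therefore clamped to the max adjtime limit, and the node catches up by only 131071 usec in this adjustment.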
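To read these three numbers without squinting at the log, here is a minimal Python sketch that pulls them out of the two octssd.log lines quoted above. The regular expressions and the helper name `parse_ctss` are my own assumptions, written only against the message wording seen in this excerpt; they are not an Oracle-provided format or tool.

```python
import re

# Two lines copied from the node 2 octssd.log excerpt above.
LOG_LINES = [
    "2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude [31532764119367] of the offset [-31532764119367 usec] is larger than [86400000000 usec] sec which is the CTSS limit.",
    "2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764119367] usec will be changed to max adjtime limit [+/- 131071].",
]

# Hypothetical patterns, matched only against the wording shown in this post.
OFFSET_RE = re.compile(r"offset \[(-?\d+) usec\] is larger than \[(\d+) usec\]")
ADJTIME_RE = re.compile(r"max adjtime limit \[\+/- (\d+)\]")


def parse_ctss(lines):
    """Return (offset_usec, ctss_limit_usec, max_adjtime_usec) found in the lines."""
    offset = limit = max_adj = None
    for line in lines:
        m = OFFSET_RE.search(line)
        if m:
            offset, limit = int(m.group(1)), int(m.group(2))
        m = ADJTIME_RE.search(line)
        if m:
            max_adj = int(m.group(1))
    return offset, limit, max_adj


offset_usec, limit_usec, max_adj_usec = parse_ctss(LOG_LINES)
print("offset:", offset_usec, "usec")
print("CTSS limit:", limit_usec, "usec")
print("max adjtime:", max_adj_usec, "usec")
print("over the limit:", abs(offset_usec) > limit_usec)
```

The negative sign on the offset matches the log's own statement that the CTSS master is ahead of this node.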

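Setting the synchronization problem aside for a moment, the log tells us the limit on the time offset between nodes: 86400000000 usec, that is 86400 seconds, or in plain terms one day. (No wonder an offset of an hour or two does not affect normal cluster operation.)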

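The log says the offset exceeds one day, yet what we saw was clearly an offset of only about an hour. Do the math: 31532764119367 usec is ... just under 365 days (364 days and 23 hours)!!!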
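The arithmetic is easy to check with Python's `datetime` module. The sketch below converts the CTSS limit and the reported offset into durations, then subtracts the two "last missing" timestamps from the alert logs with their full dates instead of only the time of day. The variable names are mine; the numbers and timestamps are the ones quoted in the logs above.

```python
from datetime import datetime, timedelta

# Values taken from the octssd.log excerpt.
CTSS_LIMIT = timedelta(microseconds=86_400_000_000)
offset = timedelta(microseconds=31_532_764_119_367)

# Last "Network communication missing" timestamps, including the year.
FMT = "%Y-%m-%d %H:%M:%S.%f"
node1_last = datetime.strptime("2011-05-10 14:28:03.473", FMT)
node2_last = datetime.strptime("2010-05-10 15:21:58.348", FMT)

print("CTSS limit:       ", CTSS_LIMIT)
print("reported offset:  ", offset)
print("offset in days:   ", offset / CTSS_LIMIT)
print("alert log gap:    ", node1_last - node2_last)
```

Both the reported offset and the gap between the alert log timestamps come out at 364 days plus roughly 23 hours, which is why comparing only the time of day made it look like an offset of about one hour.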

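Right!

Node 2 is in 2010 and node 1 is in 2011!!! It turned out that the SA had set the time incorrectly by mistake.

After we asked the SA to correct the time, the instance on node 2 started automatically.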

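Summary: the CTSS limit is one day. A time difference between nodes of less than one day can be tolerated. Beyond that limit, as the octssd log in this case shows, the clock is adjusted by only about 0.13 seconds (131071 usec) at a time. Also, the SA made two mistakes in this case, which is a reminder that we should verify the result ourselves after the SA finishes an operation. That we failed to find the answer to a problem like this in the first logs is our own fault.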
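To get a feel for why adjusting 131071 usec at a time could never close a gap of this size, here is a rough Python estimate of how many adjustments it would take. The interval between adjustments is not something the logs in this post state, so `ASSUMED_INTERVAL_SEC` is a purely hypothetical figure used only to turn the step count into a duration.

```python
# Values taken from the octssd.log excerpt.
OFFSET_USEC = 31_532_764_119_367
MAX_ADJTIME_USEC = 131_071

# Hypothetical: one adjustment per second. This rate is NOT taken from the logs.
ASSUMED_INTERVAL_SEC = 1.0

# Integer ceiling division: number of max-size adjustments needed.
steps = -(-OFFSET_USEC // MAX_ADJTIME_USEC)

years = steps * ASSUMED_INTERVAL_SEC / (365 * 86400)

print("adjustments needed:", steps)
print("years at the assumed rate:", round(years, 1))
```

Even under that generous assumption the catch-up would take years, so correcting the system clock by hand, as the SA eventually did, was the only practical fix.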
