Cluster Troubleshooting
7.3.4 Changing the Failure Detection Rate
Use the SMIT Change/Show a Cluster Network Module screen to change the
failure detection rate for your network module only if enabling I/O pacing
or extending the syncd frequency did not resolve deadman problems in your
cluster. By changing the failure detection rate to “Slow”, you can extend the
time required before the deadman switch is invoked on a hung node and
before a takeover node detects a node failure and acquires a hung node’s
resources. See the HACMP for AIX, Version 4.3: Administration Guide,
SC23-4279, for more information and instructions on changing the Failure
Detection Rate.
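
The trade-off described above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a node is declared failed after a fixed number of consecutive missed keepalives. The field names and the numeric values for Fast, Normal, and Slow are hypothetical and are not HACMP defaults; consult the Administration Guide for the real settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRate:
    """Hypothetical failure detection setting (illustrative values only)."""
    name: str
    keepalive_interval_s: float   # assumed seconds between keepalives
    missed_before_failure: int    # assumed consecutive misses tolerated

    def detection_time_s(self) -> float:
        # Time a node may stay silent before its partner declares it failed.
        return self.keepalive_interval_s * self.missed_before_failure


# Hypothetical numbers chosen only to show the ordering Fast < Normal < Slow.
RATES = {
    "Fast": DetectionRate("Fast", 0.5, 8),
    "Normal": DetectionRate("Normal", 1.0, 10),
    "Slow": DetectionRate("Slow", 2.0, 12),
}


def declared_failed(rate: DetectionRate, hang_duration_s: float) -> bool:
    """Return True if a node hung for hang_duration_s would be treated as failed."""
    return hang_duration_s >= rate.detection_time_s()


if __name__ == "__main__":
    hang = 15.0  # a temporary hang, for example during heavy I/O
    for rate in RATES.values():
        print(rate.name, rate.detection_time_s(), declared_failed(rate, hang))
```

Under these assumed numbers, a 15-second hang is treated as a failure at the Fast and Normal settings but is tolerated at Slow. The cost is that a genuine failure also takes longer to detect.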
7.4 Node Isolation and Partitioned Clusters
Node isolation occurs when all networks connecting nodes fail but the nodes
remain up and running. One or more nodes can then be completely isolated
from the others. A cluster in which this has happened is called a
partitioned cluster. A partitioned cluster has two groups of nodes (one or
more in each), neither of which can communicate with the other. Let’s
consider a two-node cluster where all networks have failed between the two
nodes, but each node remains up and running.
The problem with a partitioned cluster is that each node interprets the
absence of keepalives from its partner to mean that the other node has failed,
and then generates node failure events. Once this occurs, each node
attempts to take over resources from a node that is still active and therefore
still legitimately owns those resources. These attempted takeovers can cause
unpredictable results in the cluster—for example, data corruption due to a
disk being reset.
To guard against a TCP/IP subsystem failure causing node isolation, the
nodes should also be connected by a point-to-point serial network. This
connection reduces the chance of node isolation by allowing the Cluster
Managers to communicate even when all TCP/IP-based networks fail.
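
The following Python sketch models the two-node scenario under simplifying assumptions: every node stays up, and a node generates a node failure event for its partner only when keepalives stop arriving on every network. The class, function, node, and network names are hypothetical and do not correspond to any HACMP interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Network:
    """Hypothetical model of one network between the two nodes."""
    name: str
    kind: str        # "tcpip" or "serial"
    up: bool = True


def fail_tcpip(networks: List[Network]) -> List[Network]:
    """Return a copy of the networks with every TCP/IP-based network down."""
    return [
        Network(net.name, net.kind, up=False) if net.kind == "tcpip" else net
        for net in networks
    ]


def keepalives_arrive(networks: List[Network]) -> bool:
    """Keepalives get through if at least one network of any kind is up."""
    return any(net.up for net in networks)


def node_failure_events(nodes: List[str], networks: List[Network]) -> List[str]:
    """List the nodes that would wrongly report their partner as failed.

    All nodes are assumed to be up and running, so any event is a false alarm.
    """
    if keepalives_arrive(networks):
        return []
    return list(nodes)


if __name__ == "__main__":
    nodes = ["nodeA", "nodeB"]
    tcpip_only = [Network("ether1", "tcpip"), Network("token1", "tcpip")]
    with_serial = tcpip_only + [Network("rs232", "serial")]

    print(node_failure_events(nodes, fail_tcpip(tcpip_only)))   # both nodes
    print(node_failure_events(nodes, fail_tcpip(with_serial)))  # no events
```

With only TCP/IP networks, a TCP/IP subsystem failure leaves both nodes believing the other has failed, so both attempt a takeover. With the serial network added, keepalives still get through and neither node generates a failure event.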
It is important to understand that the serial network does not carry TCP/IP
communication between nodes; it only allows nodes to exchange keepalives

Note: I/O pacing must be enabled before completing these procedures; it
regulates the number of I/O data transfers. Also, keep in mind that the
Slow setting for the Failure Detection Rate is network specific.
IBM Certification Study Guide AIX HACMP, SG24-5131-00. Printed in the U.S.A.