This soft copy for use by IBM employees only.
6. All the ports on a chip are reporting a problem
If all of the ports on a chip are reporting a problem, and they are not all
connected to the same switch board, it is most likely a problem with the
chip. If they are connected to the same switch board, you will probably
discover that all of the ports connected to that other switch board have
problems.
7. A single port is reporting a problem
When a single port finds a problem, the maintenance manual is usually quite
adequate in making the call.
The error to watch out for in a single port pattern is the one where the
primary node cannot get into the switch. This can point to a power sequence
or clock problem on the primary node. Check the diagnostics file at
/var/adm/SPlogs/css/dtbx.trace. You may find that it is on an incorrect
clock. Of course, it could still be the adapter, the cable, or the switch port to
which the primary node's adapter is connected.
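The reasoning in item 6 can be written down as a small decision function. The
Python sketch below is illustrative only: the function name, the input shape (a
mapping from each port on the chip to an identifier for the switch board it is
connected to), and the returned strings are assumptions made for this example,
not part of any SP tool.

```python
def triage_chip_ports(port_to_board, faulting_ports):
    """Suggest where to look when ports on one switch chip report problems.

    port_to_board  -- dict: port on the chip -> identifier of the switch
                      board it is connected to (hypothetical identifiers)
    faulting_ports -- iterable of ports on the chip that report a problem
    """
    all_ports = set(port_to_board)
    faulting = set(faulting_ports)

    if not faulting:
        return "no ports reporting problems"
    if faulting != all_ports:
        # Only some ports are affected, so this is not the pattern in item 6.
        return "not the all-ports pattern"

    boards = {port_to_board[port] for port in all_ports}
    if len(boards) > 1:
        # Every port faults, yet they lead to different boards:
        # the chip itself is the most likely cause.
        return "suspect the chip"
    # Every port faults and all lead to the same board:
    # expect the ports on that other board to show problems too.
    return "check the other switch board"
```

For instance, calling triage_chip_ports({0: "A", 1: "B"}, [0, 1]) returns
"suspect the chip", because the two faulting ports lead to different boards.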
4.7.3 The flt Log File
Important
This section and the following sections on the flt file apply only to the High
Performance Switch.
Faults are not as important or critical for the SP Switch as they are for the
HiPS switch, so the flt file has changed with the new switch, and it is no
longer an important diagnostic tool.
It is very important to note that the flt file is not a diagnosis of the problem. It
tells you which device saw the problem. The problem could be with that device,
or it could be with the device to which it is connected.
What you are looking for in the flt file is a pattern of errors over time. One error
in the flt file does not imply a problem. In fact, there is a finite probability that a
fault will occur. What you are looking for is a device that continually faults, say
every day, or whenever a certain application runs, or in some other consistent
fashion.
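One way to look for such a pattern is to count, for each device, the number of
distinct days on which it faulted. The Python sketch below is a minimal
illustration. It assumes the fault records have already been extracted from the
flt file into (timestamp, device) pairs, since the layout of the file is not
described here. The names recurring_faulters and min_days, and the threshold of
three days, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def recurring_faulters(records, min_days=3):
    """Return {device: number_of_distinct_fault_days} for repeat offenders.

    records  -- iterable of (datetime, device_name) pairs, already parsed
                from the fault log by some other means (assumption)
    min_days -- minimum number of distinct days with a fault before a
                device is reported (hypothetical threshold)
    """
    days_by_device = defaultdict(set)
    for timestamp, device in records:
        # Several faults on the same day count once, so a single burst
        # does not look like a long-running pattern.
        days_by_device[device].add(timestamp.date())

    return {
        device: len(days)
        for device, days in days_by_device.items()
        if len(days) >= min_days
    }


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        (datetime(1996, 5, 1, 9, 0), "device-a"),
        (datetime(1996, 5, 2, 9, 5), "device-a"),
        (datetime(1996, 5, 3, 8, 55), "device-a"),
        (datetime(1996, 5, 2, 14, 0), "device-b"),
    ]
    print(recurring_faulters(sample))
```

A device that faults once is dropped from the result, which matches the point
that a single error does not imply a problem. Grouping by application run
instead of by day would follow the same shape with a different key.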
When looking for patterns, you should realize that the flt file records all
information that the fault daemon on the primary node collects from the switch.
This includes nodes being powered off for service, or whenever someone runs
Estart. You need to recognize such events so that you discard them from your
pattern recognition.
• Estart is likely to have occurred when you see a VDC_FLT=0000 or 0080,
  and there is a report like this:
  No fault isolated, VSP VDC_FLT=0000
• You must also be able to note when it is likely that a node was powered off
  or a cable was pulled. If you see a fault reported when you know a node
  was powered off, you should ignore it. If you see a fault that indicates a
  node might be causing the fault, you should check the time stamp and
  see if the node was powered off around that time.
• If you know that the system was in a state of turmoil during a certain period
  of time, you should suspect that faults during that time are to be expected.
  During such periods of turmoil, it is often the practice of system