IBM RS/6000 SP Problem Determination Manual

This soft copy for use by IBM employees only.

6. All the ports on a chip are reporting a problem

If all of the ports on a chip are reporting a problem, and they are not all
connected to the same switch board, the chip itself is the most likely cause.
If they are all connected to the same switch board, you will probably
discover that all of the ports connected to that other switch board have
problems.

7. A single port is reporting a problem

When a single port reports a problem, the maintenance manual is usually
adequate for making the call.

The error to watch out for in a single port pattern is the one where the
primary node cannot get into the switch. This can point to a power sequence
or clock problem on the primary node. Check the diagnostics file at
/var/adm/SPlogs/css/dtbx.trace. You may find that it is on an incorrect
clock. Of course, it could still be the adapter, the cable, or the switch port to
which the primary node's adapter is connected.

4.7.3 The flt Log File

Important

This section and the following sections on the flt file apply only to the High
Performance Switch.

Faults are not as important or critical for the SP Switch as they are for the
HiPS switch, so the flt file has changed with the new switch, and it is no
longer an important diagnostic tool.

It is very important to note that the flt file is not a diagnosis of the problem. It
tells you which device saw the problem. The problem could be with that device,
or it could be with the device to which it is connected.

What you are looking for in the flt file is a pattern of errors over time. One error
in the flt file does not imply a problem. In fact, there is always some
probability that an isolated fault will occur. What you are looking for is a
device that faults repeatedly, say every day, or whenever a certain application
runs, or in some other consistent fashion.
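The idea of looking for a device that faults repeatedly over time can be
sketched as follows. This is a minimal illustration in Python, not an IBM tool.
It assumes you have already extracted (date, device) pairs from the flt file,
by hand or with your own parser, because the record layout is not described
here. The function name, the min_days threshold, and the device labels are all
hypothetical.

```python
from collections import defaultdict
from datetime import date


def recurring_devices(faults, min_days=3):
    """Return devices that faulted on at least min_days distinct days.

    faults: iterable of (date, device_label) pairs taken from the flt file.
    Several faults from one device on the same day count as a single day.
    """
    days_by_device = defaultdict(set)
    for day, device in faults:
        days_by_device[device].add(day)
    return sorted(
        device
        for device, days in days_by_device.items()
        if len(days) >= min_days
    )


# Hypothetical sample data; the labels are placeholders, not real flt names.
sample = [
    (date(1998, 5, 1), "device-A"),
    (date(1998, 5, 1), "device-A"),
    (date(1998, 5, 2), "device-A"),
    (date(1998, 5, 3), "device-A"),
    (date(1998, 5, 2), "device-B"),
]

print(recurring_devices(sample))
```

Counting distinct days rather than raw fault counts keeps a single burst of
faults from looking like a long-running pattern.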

When looking for patterns, you should realize that the flt file records all
information that the fault daemon on the primary node collects from the switch.
This includes nodes being powered off for service, or whenever someone runs
Estart. You need to recognize such events so that you discard them from your
pattern recognition.
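One way to discard such events before looking for patterns is sketched below
in Python. It is an illustration under stated assumptions, not a description of
the flt file format: each record is assumed to be a (timestamp, text) pair that
you have extracted yourself, and the service windows are times you know a node
was powered off or a cable was pulled. The Estart check uses the report text
"No fault isolated" together with VDC_FLT=0000 or 0080; the function names and
the placeholder fault texts are hypothetical.

```python
import re
from datetime import datetime

# Report text that suggests Estart was run (VDC_FLT of 0000 or 0080).
ESTART_RE = re.compile(r"No fault isolated.*VDC_FLT=(0000|0080)\b")


def looks_like_estart(text):
    """True if the report text matches the Estart signature."""
    return ESTART_RE.search(text) is not None


def in_any_window(ts, windows):
    """True if ts falls inside any (start, end) service window."""
    return any(start <= ts <= end for start, end in windows)


def faults_worth_studying(records, service_windows):
    """Drop Estart reports and faults logged during known service windows."""
    return [
        (ts, text)
        for ts, text in records
        if not looks_like_estart(text)
        and not in_any_window(ts, service_windows)
    ]


# Hypothetical data: one Estart report, one fault during service, one real.
windows = [(datetime(1998, 5, 2, 9, 0), datetime(1998, 5, 2, 11, 0))]
records = [
    (datetime(1998, 5, 1, 8, 0), "No fault isolated, VSP VDC_FLT=0000"),
    (datetime(1998, 5, 2, 10, 0), "placeholder fault report A"),
    (datetime(1998, 5, 3, 14, 0), "placeholder fault report B"),
]

remaining = faults_worth_studying(records, windows)
print(remaining)
```

Only the records that survive this filtering are worth feeding into any
pattern search.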

• Estart is likely to have occurred when you see a VDC_FLT=0000 or 0080,
and there is a report like this:

No fault isolated, VSP VDC_FLT=0000

• You must also be able to note when it is likely that a node was powered off
or a cable was pulled. If you see a fault reported when you know a node
was powered off, you should ignore it. If you see a fault that indicates a
node might be causing the fault, you should check the time stamp and
see if the node was powered off around that time.

• If you know that the system was in a state of turmoil during a certain period
of time, you should expect faults during that time.
During such periods of turmoil, it is often the practice of system