pshpstuningguidewp040105.doc, Page 28
MAC WOF (2F870): Bit: 1
[. . .]
5.12.4 Packets dropped in the switch hardware
If a packet is dropped within the switch hardware itself (for example, when traversing the link
between two switch chips), evidence of the drop is recorded on the HMC, where the switch
Federation Network Manager (FNM) runs. The FNM code handles errors associated with packet
drops in the switch. To collect this data, run /opt/hsc/bin/fnm.snap, which creates a snap
archive in /var/hsc/log (for example, /var/hsc/log/c704hmc1.2004-11-19.12.50.33.snap.tar.gz).
You must have root access or set up proper authentication to run the command. In the snap
data, check the FNM_Recov.* logs for switch errors. If a certain type of error has reached a
threshold in the hardware, reporting for that type of error might be disabled, so packet loss
might not be reported. For this reason, when you are looking for packet loss, it is generally a
good idea to restart the FNM code first to ensure that error reporting is reset.
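As a minimal sketch of the "check the FNM_Recov.* logs" step, the following Python function lists the members of a snap archive whose file names match FNM_Recov.*. It assumes only that the archive is a gzip-compressed tar file, as the .snap.tar.gz suffix suggests. The directory layout inside the archive and the format of the log entries are not described here, so the function matches on the base file name and does not try to interpret the contents. The function name list_recovery_logs is hypothetical.

```python
import fnmatch
import os
import tarfile


def list_recovery_logs(snap_path, pattern="FNM_Recov.*"):
    """Return the sorted names of regular files in the snap archive whose
    base name matches the pattern.

    Assumption: the snap archive is a gzip-compressed tar file. The layout
    inside the archive is unknown, so only the base name is compared.
    """
    with tarfile.open(snap_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile()
            and fnmatch.fnmatch(os.path.basename(member.name), pattern)
        )
```

For example, calling list_recovery_logs("/var/hsc/log/c704hmc1.2004-11-19.12.50.33.snap.tar.gz") on a copy of the archive returns the matching member names, which can then be extracted and inspected by hand.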
5.13 MP_INFOLEVEL
You can get additional information from an MPI job by setting the MP_INFOLEVEL variable to 2.
In addition, if you set the MP_LABELIO variable to yes, you can get information for each task:
every line written by a task is prefixed with that task's ID. Here is an example of the output
using these settings:
INFO: 0031-364 Contacting LoadLeveler to set and query information for interactive job
INFO: 0031-380 LoadLeveler step ID is test_mach1.customer.com.2507.0
INFO: 0031-118 Host test_mach1.customer.com requested for task 0
INFO: 0031-118 Host test_mach2.customer.com requested for task 1
INFO: 0031-119 Host test_mach1.customer.com allocated for task 0
INFO: 0031-120 Host address 10.10.10.1 allocated for task 0
INFO: 0031-377 Using sn1 for MPI euidevice for task 0
INFO: 0031-119 Host test_mach2.customer.com allocated for task 1
INFO: 0031-120 Host address 10.10.10.2 allocated for task 1
INFO: 0031-377 Using sn1 for MPI euidevice for task 1
1:INFO: 0031-724 Executing program: <spark-thread-bind.lp>
0:INFO: 0031-724 Executing program: <spark-thread-bind.lp>
1:LAPI version #7.9 2004/11/05 1.144 src/rsct/lapi/lapi.c, lapi, rsct_rir2, rir20446a 32bit(us) library compiled on Wed Nov 10 06:44:38 2004
1:LAPI is using lightweight lock.
1:Bulk Transfer is enabled.
1:Shared memory not used on this node due to sole task running.
1:The LAPI lock is used for the job
0:INFO: 0031-619 32bit(us) MPCI shared object was compiled at Tue Nov 9 12:36:54 2004
0:LAPI version #7.9 2004/11/05 1.144 src/rsct/lapi/lapi.c, lapi, rsct_rir2, rir20446a 32bit(us) library compiled on Wed Nov 10 06:44:38 2004
0:LAPI is using lightweight lock.
0:Bulk Transfer is enabled.
0:Shared memory not used on this node due to sole task running.
0:The LAPI lock is used for the job
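Because the output of tasks 0 and 1 is interleaved, it can help to separate it by task when comparing nodes. The Python sketch below shows one way to do that, under two assumptions drawn from the sample output: a line written by a task starts with the task ID followed by a colon, and lines without such a prefix (the leading INFO messages) are not tied to one task. The helper names poe_environment and group_by_task are hypothetical. The first helper only builds an environment with the two variables set; how the job is launched with that environment is left to the caller.

```python
import os
import re
from collections import defaultdict

# A labeled line looks like "1:Bulk Transfer is enabled."
TASK_PREFIX = re.compile(r"^(\d+):(.*)$")


def poe_environment(base=None):
    """Return a copy of the environment with the two settings applied."""
    env = dict(os.environ if base is None else base)
    env["MP_INFOLEVEL"] = "2"
    env["MP_LABELIO"] = "yes"
    return env


def group_by_task(lines):
    """Map each task ID (int) to its lines with the label removed.

    Lines that carry no task label are collected under the key None.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        match = TASK_PREFIX.match(line)
        if match:
            grouped[int(match.group(1))].append(match.group(2))
        else:
            grouped[None].append(line)
    return dict(grouped)
```

With the sample output saved to a file, group_by_task(open(path)) returns one list per task, so a message such as "Bulk Transfer is enabled." can be checked for every task in turn.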