
Scali Manage™ On SGI® Altix® ICE System Quick Reference Guide

007–5450–001

Summary of Contents for Altix ICE 8000 Series

Page 1: ...Scali ManageTM On SGI Altix ICE System Quick Reference Guide 007 5450 001 ...

Page 2: ...ovided with limited rights as defined in 52 227 14 TRADEMARKS AND ATTRIBUTIONS SGI the SGI logo and Altix are registered trademarks and SGI ProPack is a trademark of SGI in the United States and or other countries worldwide Altair is a registered trademark and PBS Professional is a trademark of Altair Engineering Inc Intel Xeon and Itanium are trademarks or registered trademarks of Intel Corporati...

Page 3: ...Record of Revision Version Description 001 April 2008 Original publication 007 5450 001 iii ...


Page 5: ... System Building Blocks 1 InfiniBand Fabric 3 Gigabit Ethernet Network 4 Individual Rack Unit 4 Power Supply 4 Four tier Hierarchical Framework 5 Chassis Manager 6 System Nodes 7 System Admin Controller 8 Rack Leader Controller 8 Chassis Management Control CMC Blade 9 Compute Node 9 Individual Rack Unit 10 Login Service Node 10 Batch Service Node 10 Gateway Service Node 11 007 5450 001 v ...

Page 6: ...ntrol CMC Blade 25 Compute Nodes 25 2 Getting Started with Scali Manage 27 Installing or Updating Software 27 Administrative Tips 28 System Password Information 28 Power on or Power off System Components or Obtain Status 28 Scali Manage Installer Directory 29 Scali Manage Command CLI Help 30 Configuring the Scali Manage Server 31 Defining New Racks or Service Nodes 31 Discovering Service and Leade...

Page 7: ...Compute Node RPMs on RHEL 41 3 System Fabric Management 43 Appendix A InfiniBand Fabric Details 45 InfiniBand Fabric Management Configuration and Operation Overview 45 Configuring and Initializing the InfiniBand Fabric Manually 51 Appendix B InfiniBand Fabric Troubleshooting 55 Useful Utilities and Diagnostics 55 ibstat Command 56 ibstatus Command 57 perfquery Command 58 ibnetdiscover Command 59 i...


Page 9: ..._GBE and VLAN_BMC Network Connections IRU View 18 Figure 1 7 VLAN_GBE and VLAN_BMC Network Connections Rack View 19 Figure 1 8 VLAN_HEAD Network Connections 20 Figure 1 9 Two InfiniBand Fabrics in a System with Two IRUs 22 Figure 2 1 Example Starting Screen for the Scali Manage GUI 38 Figure 2 2 Cluster Components Selection Screen Example 39 Figure A 1 Two InfiniBand Fabrics in a System with Two I...


Page 11: ...Examples Example A 1 opensm ib0 conf and opensm ib conf Configuration Files 45 007 5450 001 xi ...


Page 13: ...Procedures Procedure A 1 Configuring and Initializing the InfiniBand Fabric Manually 51 007 5450 001 xiii ...


Page 15: ...Fabric Management on page 43 Appendix A InfiniBand Fabric Details on page 45 Appendix B InfiniBand Fabric Troubleshooting on page 55 Related Publications This section describes documentation you may find useful as follows SGI Altix ICE 8000 System User's Guide This is the hardware user's guide for the SGI Altix 8000 series systems It describes the features of the SGI Altix ICE 8000 series system as...

Page 16: ... in the following ways See the SGI Technical Publications Library at http docs sgi com Various formats are available This library contains the most recent and most comprehensive set of online books release notes man pages and other information Online versions of the SGI ProPack 5 for Linux Service Pack 4 Start Here the SGI ProPack 5 SP4 release notes which contain the latest information about soft...

Page 17: ... command or directive line Ellipses indicate that a preceding element can be repeated Reader Comments If you have comments about the technical accuracy content or organization of this publication contact SGI Be sure to include the title and document number of the publication with your comments Online the document number is located in the front matter of the publication In printed publications the ...

Page 18: ...About This Guide Sunnyvale CA 94085 4602 SGI values your comments and will respond to them promptly xviii 007 5450 001 ...

Page 19: ...lowing topics Basic System Building Blocks on page 1 System Nodes on page 7 For a detailed hardware description see the SGI Altix ICE 8000 Series System Hardware User s Guide Basic System Building Blocks The SGI Altix ICE 8000 series system is a blade based scalable high density compute system The basic building block is the individual rack unit IRU The IRU provides power cooling system control an...

Page 20: ... manager Independent Rack Unit IRU Power supplies Rack leader controller 42U High Rack IRU IRU Admin server Figure 1 1 Basic System Building Blocks This hardware overview section covers the following topics InfiniBand Fabric on page 3 Gigabit Ethernet Network on page 4 Individual Rack Unit on page 4 2 007 5450 001 ...

Page 21: ...d for storage related traffic The default configuration for MPI is to use only the ib0 fabric For more information on the InfiniBand fabric see Chapter 3 System Fabric Management on page 43 Note The ib0 fabric is a convenient shorthand for the fabric which is connected to the ib0 interface on most of the nodes In the case of the storage service node there are four interfaces called ib0 through ib3...

Page 22: ...tly below compute blade slot 0 as shown in Figure 1 1 on page 2 This is the chassis manager that performs environmental control and monitoring of the IRU The CMC controls master power to the compute blades under direction of the rack leader controller leader node The leader node can also query the CMC for monitored environmental data temperatures fan speeds and so on for the IRU Power control for ...

Page 23: ...is management controller CMC one per IRU Baseboard Management Controller BMC one per compute node admin node leader node and managed service node Unlike traditional flat clusters the SGI Altix ICE 8000 series system does not have a head node The head node is replaced by a hierarchy of nodes that enables system resources to scale as you add processors This hierarchy is as follows System admin contr...

Page 24: ... information see VLANs on page 16 and Network Interface Naming Conventions on page 22 The rack leader controller leader node and admin node are described in the section that follows System Nodes on page 7 Chassis Manager Figure 1 2 on page 7 shows chassis manager cabling Note All nodes reside in the Altix ICE custom designed rack Figure 1 2 on page 7 and Figure 1 3 on page 12 show how systems are ...

Page 25: ...Cabling Figure 1 3 on page 12 shows cabling for a service node and storage service node NAS cube System Nodes This section describes the system nodes that are part of SGI Altix ICE 8000 series system and covers the following topics System Admin Controller on page 8 Rack Leader Controller on page 8 Chassis Management Control CMC Blade on page 9 007 5450 001 7 ...

Page 26: ...odes compute nodes and service nodes It receives and holds aggregated management data from the leader nodes The admin node is an appliance node It always runs software specified by SGI The kernels initrds and root filesystems which together make up an image reside on the admin node When compute nodes are first set up with a new image the leader nodes will cache this information to reduce the netwo...

Page 27: ... each blade is handled by the Baseboard Management Controller BMC also under direction of the rack leader controller Once the leader node has asked the CMC to enable master power the leader node can then command each BMC to power up its associated blade The leader node can also query each BMC to obtain some environmental and error log information about each blade Compute Node Figure 1 1 on page 2 ...

Page 28: ...1 1 on page 2 It is described in detail in Basic System Building Blocks on page 1 Login Service Node The login service node allows users to log in to the system to create, compile, and run applications The login node is usually combined with batch and gateway service nodes for most configurations The login service node is connected to the Altix ICE system via the InfiniBand fabric and GigE to the p...

Page 29: ...ice Node The storage service node is a network attached storage NAS appliance bundle that provides InfiniBand attached storage for the Altix ICE system There can be multiple storage service nodes for larger Altix ICE system configurations Figure 1 3 on page 12 shows a service node and a storage service node NAS cube Note All nodes reside in the Altix ICE custom designed rack Figure 1 2 on page 7 a...

Page 30: ... node Figure 1 3 Service Nodes Networks This section describes the Gigabit Ethernet GigE and 10 100 Ethernet connections and the InfiniBand fabric in an SGI Altix ICE 8000 series system and covers the following topics Networks Overview on page 13 Gigabit Ethernet GigE and 10 100 Ethernet Connections on page 14 VLANs on page 16 12 007 5450 001 ...

Page 31: ...connected to all blades and service nodes via InfiniBand fabric Leader nodes have access to compute nodes in other racks via the leader node in that rack The gateway service node is the gateway from the InfiniBand fabric to services on the public network such as storage lightweight directory access protocol LDAP services file transfer protocol FTP Typically it is combined with the login batch serv...

Page 32: ...t GigE and 10 100 Ethernet Connections The SGI Altix ICE 8000 series system has several Ethernet networks that facilitate booting and managing the system These networks are built onto the backplane of each IRU for connection to the compute blades and transverse cables between IRUs and between racks Each compute blade has a Gigabit Ethernet GigE and 10 100 Ethernet connection to the backplane The G...

Page 33: ...witch and the sixteen blade BMCs connect to the 10 100 switch The GigE connections also connect the service nodes including service storage nodes The GigE switches in each IRU are stacked using a special stacking connection between each IRU in a rack This connection runs a special intra switch protocol All switches in a rack are ganged together to form one large 96 port switch The connections from...

Page 34: ...s CMC to the one immediately to the left If this is the left most rack this jack is unconnected R58 This is a connection for the IEEE 1588 timing protocol from this CMC to the one immediately to the right If this is the right most rack this jack is unconnected A NAS cube storage service node uses both the LL and RL jacks to connect to the Altix ICE system as shown in Figure 1 3 on page 12 For smal...

Page 35: ...BE VLAN connects the leader nodes to the GigE interfaces of all the compute blades VLAN_GBE and VLAN_BMC do not extend outside of any rack Therefore traffic on those VLANs stays local to each rack Only VLAN_HEAD extends rack to rack It is the network used by the admin node to communicate to the leader node of each rack and to each service node The rack leader controllers leader nodes must run 802 ...

Page 36: ...N_BMC Network Connections IRU View The VLAN_GBE and VLAN_BMC networks connect the leader node in a given rack with the compute nodes blades In the case of VLAN_BMC the network also connects the CMC with the compute blades and rack leader controller leader node 18 007 5450 001 ...

Page 37: ...I Altix ICE System Quick Reference Guide UID RESET Leader node Rack 01 UID RESET Leader node Rack 02 UID RESET Admin node Login node UID RESET Figure 1 7 VLAN_GBE and VLAN_BMC Network Connections Rack View 007 5450 001 19 ...

Page 38: ...Figure 1 8 VLAN_HEAD Network Connections In an SGI Altix ICE system with just one IRU, the CMC's R58 and L58 ports are assigned to VLAN_HEAD by a field-configurable setting. This provides two additional Ethernet ports that can be used to connect service nodes to your system. ...

Page 39: ...he bottom IRU in the rack Each IRU has two 24 port switches see Switch blade in Figure 1 9 on page 22 Each switch is on a separate fabric On each switch 16 ports go to the 16 compute blades Each compute blade has two single port HCAs and each HCA connects to a fabric Therefore both switches connect to each blade Of the remaining eight ports on each switch currently six of them are used to connect ...

Page 40: ...h Two IRUs Network Interface Naming Conventions To simplify the deployment and management of the Altix ICE system the scaaltixice package includes functionality to automatically configure the system according to a fixed policy tailored for the hierarchical topology used in the Altix ICE system see Hardware Overview on page 1 The network policy implemented by the scaaltixice package is described in...

Page 41: ...ck leader controllers leader nodes This is the inter rack communication network Head BMC network Network for communication between admin node and the BMCs on service nodes and leader nodes This network is on the same VLAN as the head network Rack networks One per rack and provides the intra rack network for communication between the leader node and all the blades in a rack Rack BMC network One per ...

Page 42: ...h1 Connected to the head network IP 172 16 0 1 name admin and head BMC network IP 172 17 0 1 name admin mgm Service Nodes The service node networks implemented are as follows BMC Connected to the head BMC network IP 172 17 0 2 255 eth0 Connected to the head network IP 172 16 0 2 255 name hostname eth1 Optionally connected to the corporate network IP address and subnet mask set by the customer Name...

Page 43: ...ed vlan 2 connected to rack network IP 192 168 0 1 name rXXlead int ib0 Connected to IB subnet1 IP 10 0 XX 1 name rXXlead ib0 ib1 Connected to IB subnet2 IP 10 1 XX 1 name rXXlead ib1 Chassis Management Control CMC Blade The chassis management controller ethernet switch networks implemented are as follows Hostname is rXXcmc 01 04 Connected to the rack BMC network IP 192 168 1 2 5 name rXXcmc 01 04...

Page 44: ...Connected to the rack network IP 192 168 0 11 74 name r 01 xx i 01 04 n 01 16 eth0 ib0 Connected to IB subnet1 IP 10 0 XX 11 74 name r 01 xx i 01 04 n 01 16 ib1 Connected to IB subnet2 IP 10 1 XX 11 74 name r 01 xx i 01 04 n 01 16 ib1 26 007 5450 001 ...
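The address plan in this section can be summarized as a lookup from IP address to network role. The following is a minimal Python sketch, not part of Scali Manage. The address blocks come from this guide, but the prefix lengths (/16 and /24) and the names NETWORKS and classify are assumptions made for illustration.

```python
import ipaddress

# Address blocks as listed in this guide. The prefix lengths are
# assumptions; the guide only lists address ranges.
NETWORKS = {
    "head": ipaddress.ip_network("172.16.0.0/16"),
    "head-bmc": ipaddress.ip_network("172.17.0.0/16"),
    "rack": ipaddress.ip_network("192.168.0.0/24"),
    "rack-bmc": ipaddress.ip_network("192.168.1.0/24"),
    "ib-subnet1": ipaddress.ip_network("10.0.0.0/16"),
    "ib-subnet2": ipaddress.ip_network("10.1.0.0/16"),
}


def classify(address):
    """Return the network role for an address, or None if it matches none."""
    ip = ipaddress.ip_address(address)
    for role, network in NETWORKS.items():
        if ip in network:
            return role
    return None


if __name__ == "__main__":
    for addr in ("172.16.0.1", "172.17.0.1", "192.168.0.11", "10.1.0.1"):
        print(addr, classify(addr))
```

A lookup like this can help when reading log files or mount output, where only the address appears and the network it belongs to is not named.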

Page 45: ...age 33 Installing Compute Nodes on page 33 Configuration Session Example on page 37 Scali Manage Troubleshooting Tips on page 39 Compute Node RPMs on page 40 Note SGI Altix ICE systems running Scali Manage software are shipped pre installed Instructions in this section for defining and discovering nodes can be used if you are expanding the initially delivered cluster or reinstalling your software ...

Page 46: ...trative information includes Root password sgisgi system admin controller admin node and compute nodes ipmitool user password information User ADMIN Password ADMIN Power on or Power off System Components or Obtain Status To power on or power off system components use the Scali Manage power command To get a system console use the Scali Manage console command See the The Power Interface and The Conso...

Page 47: ...f the compute node For information on Scali Manage networking conventions used with the power commands see Network Interface Naming Conventions on page 22 Scali Manage Installer Directory The Scali Manage installer directory usr local Scali is the location of the code used to install Scali Cluster management Software The Factory Install directory is located on the admin node server at usr local Fa...

Page 48: ...rvicenode Define Altix ICE Service node s discoveraltixicecmc Discover CMC and Blade MAC addresses discoveraltixiceservicenode Find BMC MAC addresses for systems initaltixicesms Initate Scali Manage Server for Altix ICE poweraltixiceiru Control power to an Altix ICE IRU restartaltixiceopensm Restart the OpenSM subnetmanagers on the leadnodes Type help followed by command name for full documentatio...

Page 49: ... actions on your system Define all the network subnets according to the policy see Network Interface Naming Conventions on page 22 Add the eth1 and eth1 headbmc interfaces with preset IP addresses Load the SGI ProPack software stack Defining New Racks or Service Nodes To add one or more racks of compute nodes perform the following scalimanage cli definealtixicerack racknumbers To add multiple rack...

Page 50: ...ice node s and leader nodes BMCs must be discovered and configured see Discovering Service and Leader Nodes on page 32 Discovering Service and Leader Nodes Before new service or leader nodes can be installed the associated BMCs MAC addresses must be discovered and IP addresses must be assigned To do this perform the following scalimanage cli discoveraltixiceservicenode systemnames This will perfor...

Page 51: ... on SGI Altix ICE systems the InfiniBand network ib1 is to be used for storage traffic the InfiniBand network ib0 is to be used for MPI traffic and the Ethernet network is used only for system administration The node from which the compute node installation image is created must not have a service ib1 home NFS mount entry in the etc fstab file It is very likely that you will need to manually delet...

Page 52: ...There may be other NFS entries included in etc fstab file that also need deleting such as a service ib1 data entry or an entry or entries for off cluster NFS servers 3 Create an installation image from this node From the Altix ICE admin node run the scalimanage cli captureimage command You can get a usage statement for this command as follows scalimanage cli help captureimage captureimage systemna...

Page 53: ... follows scalimanage cli help addremotefs addremotefs systemnames fstype src mntpoint options _netdev Add mounting for remote filesystem on system s Arguments systemnames name of system s fstype type of filesystem legal values nfs lustre src source mntpoint mountpoint options options to mount command to be given as o options to mount By default options _netdev Options values should be comma sepera...

Page 54: ...odes in the GUI and select right click Node On Off Power Off 7 From the rack leader node use power command as follows r01lead power r01i01n0 2 6 off r01i01n02 SUCCESS r01i01n03 SUCCESS r01i01n04 SUCCESS r01i01n05 SUCCESS r01i01n06 SUCCESS r01lead power r01i01n0 2 6 status r01i01n02 OFF r01i01n03 OFF r01i01n04 OFF r01i01n05 OFF r01i01n06 OFF 8 From the rack leader node you can use the ipmitool as f...
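The power command in this example acts on a range of nodes and reports r01i01n02 through r01i01n06. The following Python sketch shows how such a range could be expanded into individual node names. It assumes the range was written in bracket notation (r01i01n0[2-6]) before the brackets were lost in extraction. The function name expand_nodes is hypothetical and is not a Scali Manage interface.

```python
import re


def expand_nodes(pattern):
    """Expand one bracketed numeric range into a list of node names.

    A pattern without a bracketed range is returned unchanged in a
    single-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, low, high, suffix = match.groups()
    width = len(low)  # keep any zero padding used in the range itself
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(low), int(high) + 1)
    ]


if __name__ == "__main__":
    for name in expand_nodes("r01i01n0[2-6]"):
        print(name)
```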

Page 55: ... 1 r01i01n05 eth0 service1 ib1 home on home type nfs rw _netdev addr 10 1 0 1 r01i01n06 eth0 service1 ib1 home on home type nfs rw _netdev addr 10 1 0 1 Configuration Session Example This section shows a complete SGI Altix ICE configuration example as follows scalimanage cli initaltixicesms tmp ofed stout5sp2 rpms tgz scalimanage cli definealtixicerack 1 1 2 1 2 scalimanage cli definealtixicese...

Page 56: ... Manage GUI Displaying Cluster Components Cluster components are shown in Figure 2 2 on page 39 r01 is rack 01 and r02 is rack 02 i01 is IRU 1 and n01 and n02 are nodes 1 and 2 r01lead and r02lead are the rack leader controllers leader nodes for the cluster service1 is the service node for the cluster System naming conventions when using Scali Manage are described in Network Interface Naming Conv...
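The naming scheme described here (rack, IRU, and node numbers packed into one host name) can be decoded mechanically. The following is a minimal Python sketch. The helper names parse_compute_name and leader_name are hypothetical, and the two-digit zero padding follows the names shown in this guide.

```python
import re

# rNN = rack, iNN = IRU, nNN = node, as in r01i01n02.
_COMPUTE_NAME = re.compile(r"r(\d+)i(\d+)n(\d+)")


def parse_compute_name(name):
    """Split a compute node name into rack, IRU, and node numbers."""
    match = _COMPUTE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a compute node name: {name!r}")
    rack, iru, node = (int(group) for group in match.groups())
    return {"rack": rack, "iru": iru, "node": node}


def leader_name(rack):
    """Return the leader node name for a rack number, for example r01lead."""
    return f"r{rack:02d}lead"


if __name__ == "__main__":
    info = parse_compute_name("r01i01n02")
    print(info, leader_name(info["rack"]))
```

Names that do not follow the compute node pattern, such as service1 or r01lead, raise ValueError instead of being guessed at.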

Page 57: ...s well as emergency procedures Whenever a Scali cluster parameter is changed it is necessary to apply the configuration This can be done either through the graphical user interface GUI by selecting Provisioning Apply All Configuration Changes or via the command line interface CLI as follows scalimanage cli reconfigure all Changes can be made in batches and then applied all at once 007 5450 001 39 ...

Page 58: ...figuration files etc array arrayd auth conf with links to usr lib arrayd conf auth When you update your system configuration and later reboot the compute node s your configuration will be lost because the compute nodes are stateless You need to capture another image after changing configuration files Compute Node RPMs The following section describes what packages are installed on the compute node ...

Page 59: ...kSGI mpitests_mpt msr tool mstflint numatools ofed docs ofed scripts openib diags Compute Node RPMs on RHEL The following RPMs reside on the compute node when you run Scali Manage on top of Red Hat Enterprise Linux 5 RHEL5 cpuset utils dapl dapl devel dapl utils environment modules ibutils intel cluster runtime ipoibtools kernel ib ice kmod numatools kmod ofa_kernel kmod xpmem libbitmask libcpuset...

Page 60: ... devel libibverbs utils libmthca libopensm libosmcomp libosmvendor librdmacm librdmacm utils lkSGI mpitests_mpt msr tool mstflint numatools ofed docs ofed scripts openib diags pcp open perftest rds tools sgi arraysvcs sgi mpt sgi procset sgi release sgi support tools tvflash xpmem 42 007 5450 001 ...

Page 61: ...r controller leader node runs an instance of SM to manage the ib0 fabric and a second leader node runs an instance of SM to manage the ib1 fabric On a system with a single rack both instances of opensm run on the same rack leader node Each instance of SM on the rack leader controller is controlled by the etc opensm ib0 conf or etc opensm ib1 conf configuration file Rack leader controllers run the ...

Page 62: ...or etc opensm ib1 conf configuration files Note Currently the InfiniBand fabric ib0 is reserved for MPI or interprocess communication traffic and the InfiniBand fabric ib1 is reserved for storage For more information on the InfiniBand fabric see Appendix A InfiniBand Fabric Details on page 45 and Appendix B InfiniBand Fabric Troubleshooting on page 55 44 007 5450 001 ...

Page 63: ...during a light sweep such as the addition or deletion of a node it performs a heavy sweep The heavy sweep actually changes the fabric configuration to reflect the current state of the system A sample opensm ibx conf configuration file is as follows Example A 1 opensm ib0 conf and opensm ib conf Configuration Files DEBUG mode This option specifies a debug option These options are not normally neede...

Page 64: ...lows unlimited outstanding SMPs Without maxsmps OpenSM defaults to a maximum of one outstanding SMP MAXSMPS 0 REASSIGN_LIDS This option causes OpenSM to reassign LIDs to all end nodes Specifying REASSIGN_LIDS yes on a running subnet may disrupt subnet traffic With REASSIGN_LIDS no OpenSM attempts to preserve existing LID assignments resolving multiple use of same LID REASSIGN_LIDS yes SWEEP This o...

Page 65: ... instead of the Min Hop algorithm which is default Valid routing engines are Min Hop updn file ftree lash To switch to different routing engine set the engine name in ROUTING_ENGINE i e ROUTING_ENGINE lash For Min Hop use ROUTING_ENGINE none or ROUTING_ENGINE ROUTING_ENGINE none GUID_FILE This option only allowed when UPDN algorithm is activated It specifies the guid list file from which to fetch ...

Page 66: ...LID This option forces OpenSM to honor the guid2lid file when it comes out of Standby state if such file exists under OSM_CACHE_DIR and is valid Set to honor_guid2lid or x to enable By default this is FALSE Will be set automatically to honor_guid2lid if OSM_HOSTS includes list of more than one IP addresses HONORE_GUID2LID x RCP This option used by SLDD daemon for handover mechanism to copy local c...

Page 67: ...BOOT To start OpenSM automatically set ONBOOT yes ONBOOT yes MULTI_FABRIC Allow multiple fabrics and copies of OpenSM on the same SM host MULTI_FABRIC yes Each fabric is addressed by a globally unique identifier GUID and unique HCA port see Figure A 1 on page 50 Each fabric has a unique GUID set in its respective configuration file ...

Page 68: ...ally based on the number of racks in the system For up to two racks the Min Hop algorithm is used For more than two racks the lash algorithm is used which enables LAyered SHortest Path Routing LASH When the lash routing algorithm is used the subnet managers need to be restarted after the entire Altix ICE system is up To restart the subnet managers perform the following command scalimanage cli rest...
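The selection rule above (Min Hop for up to two racks, lash for more) can be written as a small function. The following Python sketch is illustrative only. The function name routing_engine_for is hypothetical, and the value none for Min Hop follows the ROUTING_ENGINE description in Example A-1.

```python
def routing_engine_for(rack_count):
    """Return the ROUTING_ENGINE value this guide describes for a rack count.

    Up to two racks use Min Hop, which the sample configuration file
    expresses as ROUTING_ENGINE none. Larger systems use lash.
    """
    if rack_count < 1:
        raise ValueError("rack_count must be at least 1")
    return "none" if rack_count <= 2 else "lash"


if __name__ == "__main__":
    for racks in (1, 2, 3, 8):
        print(racks, routing_engine_for(racks))
```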

Page 69: ... to the leader node of rack 1 as follows ssh r01lead Note Before attempting to initialize the InfiniBand fabric make sure all compute nodes are booted and operational 2 From the admin node determine and record the IP addresses of the leader nodes as follows ping c 1 r01lead PING r01lead ice americas sgi com 172 16 0 2 56 84 bytes of data 64 bytes from r01lead ice americas sgi com 172 16 0 2 ic...

Page 70: ...mitted 1 received 0 packet loss time 0ms rtt min avg max mdev 0 136 0 136 0 136 0 000 ms 3 From the leader node issue an ibstat command to determine the Port GUID values as follows r01lead ibstat CA mthca0 CA type MT23108 Number of ports 2 Firmware version 3 3 3 Hardware version a1 Node GUID 0x0008f1040397b03c System image GUID 0x0008f1040397b03f Port 1 State Active Physical state LinkUp Rate 10 B...

Page 71: ...ditor open the opensm ib1 conf file and enter the Port GUID value in this example 0x0008f1040397b03e as follows GUID 0x0008f1040397b03e 7 In both the opensm ib0 conf file and opensm ib1 conf file enable the failover handover mechanism on the leader nodes by adding the IP addresses recorded in step 2 to the OSM_HOSTS variable as follows OSM_HOSTS 172 16 0 2 172 16 0 3 172 16 0 4 172 16 0 5 8 For sy...

Page 72: ...HCA 1 Ca 0x0030487aa7840000 ports 1 devid 0x6274 vendid 0x2c9 HCA 1 Ca 0x0030487aa79c0000 ports 1 devid 0x6274 vendid 0x2c9 HCA 1 Ca 0x0030487aa7900000 ports 1 devid 0x6274 vendid 0x2c9 HCA 1 Ca 0x0030487aa7980000 ports 1 devid 0x6274 vendid 0x2c9 HCA 1 Ca 0x0008f104039881a8 ports 2 devid 0x6278 vendid 0x8f1 HCA 1 Note Get usage information on the ibnetdiscover command as follows r01lead ibnetdisc...

Page 73: ...scover pl ibnetdiscover ib_rdma_bw ibstatus ibcheckerrors ibcheckwidth ibdmchk ibnlparse ib_rdma_lat ibswitches ibcheckerrs ibclearcounters ibdmsh ibnodes ib_read_bw ibsysstat ibchecknet ibclearerrors ibdmtr ibping ib_read_lat ibtopodiff ibchecknode ib_clock_test ibfindnodesusing pl ibportstate ibroute ibtracert ibcheckport ibdiagnet ibhosts ibprintca pl ib_send_bw ibv_asyncwatch ibcheckportstate ...

Page 74: ...n a0 Node GUID 0x0008f104039881a8 System image GUID 0x0008f104039881ab Port 1 State Initializing Physical state LinkUp Rate 20 Base lid 0 LMC 0 SM lid 0 Capability mask 0x02510a68 Port GUID 0x0008f104039881a9 Port 2 State Initializing Physical state LinkUp Rate 20 Base lid 0 LMC 0 SM lid 0 Capability mask 0x02510a68 Port GUID 0x0008f104039881aa The following shows output from the ibstat command af...
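When the Port GUID values are needed for the opensm configuration files, they can be pulled out of saved ibstat output with a short script. The following Python sketch assumes the usual colon-separated ibstat layout ("Port 1:" and "Port GUID: 0x..."), which the extracted text in this guide no longer shows. The SAMPLE text is abbreviated, reuses the GUID values printed above, and the function name port_guids is hypothetical.

```python
import re

# Abbreviated sample in the assumed ibstat layout, using the GUID
# values shown in this guide.
SAMPLE = """\
CA 'mthca0'
    Node GUID: 0x0008f104039881a8
    Port 1:
        State: Initializing
        Port GUID: 0x0008f104039881a9
    Port 2:
        State: Initializing
        Port GUID: 0x0008f104039881aa
"""


def port_guids(ibstat_text):
    """Map each port number to its Port GUID string."""
    guids = {}
    current_port = None
    for raw_line in ibstat_text.splitlines():
        line = raw_line.strip()
        port_match = re.fullmatch(r"Port (\d+):", line)
        if port_match:
            current_port = int(port_match.group(1))
            continue
        guid_match = re.fullmatch(r"Port GUID:\s*(0x[0-9a-fA-F]+)", line)
        if guid_match and current_port is not None:
            guids[current_port] = guid_match.group(1)
    return guids


if __name__ == "__main__":
    for port, guid in sorted(port_guids(SAMPLE).items()):
        print(f"port {port}: {guid}")
```

The Node GUID line is deliberately ignored, because only the per-port values are entered in the configuration files.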

Page 75: ... 1 Capability mask 0x02510a6a Port GUID 0x0008f104039881aa ibstatus Command You can use the ibstatus command (less verbose than ibstat) to show the link rate as follows r01lead opt sgi sbin ibstatus Infiniband device mthca0 port 1 status default gid fe80 0000 0000 0000 0008 f104 0398 81a9 base lid 0x1 sm lid 0x1 state 4 ACTIVE phys state 5 LinkUp rate 20 Gb sec 4X DDR Infiniband device mthca0 port 2...

Page 76: ... performance counters from lid 32 port 1 perfquery e 32 1 read extended performance counters from lid 32 port 1 perfquery a 32 read performance counters from lid 32 all ports perfquery r 32 1 read performance counters and reset perfquery e r 32 1 read extended performance counters and reset perfquery R 0x20 1 reset performance counters of port 1 only perfquery e R 0x20 1 reset extended performance...

Page 77: ...tdiscover help Usage ibnetdiscover d ebug e rr_show v erbose s how l ist g rouping H ca_list S witch_list V ersion C ca_name P ca_port t imeout timeout_ms switch map switch map topology file switch map switch map specify a switch map file Note Only abbreviated output is shown in this example Some sample output from the ibnetdiscover command is as follows r01lead opt sgi sbin ibnetdiscover Topo...

Page 78: ...orts 1 devid 0x6274 vendid 0x2c9 r1i1n0 ib0 HCA 1 Ca 0x0030487aa7900000 ports 1 devid 0x6274 vendid 0x2c9 r1i1n8 ib0 HCA 1 Ca 0x0030487aa7980000 ports 1 devid 0x6274 vendid 0x2c9 r1i1n1 ib0 HCA 1 Ca 0x0008f104039881a8 ports 2 devid 0x6278 vendid 0x8f1 HCA 1 ibdiagnet Command The ibdiagnet command is a useful diagnostic tool To see a usage statement for the ibdiagnet command perform the following r...

Page 79: ... according to the c option to detect possible problematic paths on which packets may be lost Such paths are explored and a report of the suspected bad links is displayed on the standard output After scanning the fabric if the r option is provided a full report of the fabric qualities is displayed This report includes SM report Number of nodes and systems Hop count information maximal hop count an ...

Page 80: ...ion of the tool vars Prints the tool s environment variables and their values ERROR CODES 1 Failed to fully discover the fabric 2 Failed to parse command line options 3 Failed to interact with IB fabric 4 Failed to use local device or local port 5 Failed to use Topology File 6 Failed to load required Package Output which shows no errors means the system is operating correctly r01lead opt sgi sbin ...
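The error codes listed above can be kept in a small lookup table for use in wrapper scripts. The following Python sketch is illustrative. The messages are copied from the list above, while treating 0 as "no error" and treating these codes as the process exit status are assumptions. The function name describe_ibdiagnet_code is hypothetical.

```python
# Messages copied from the ERROR CODES list in this guide.
IBDIAGNET_ERRORS = {
    1: "Failed to fully discover the fabric",
    2: "Failed to parse command line options",
    3: "Failed to interact with IB fabric",
    4: "Failed to use local device or local port",
    5: "Failed to use Topology File",
    6: "Failed to load required Package",
}


def describe_ibdiagnet_code(code):
    """Return a readable description for an ibdiagnet error code."""
    if code == 0:
        # Assumption: zero means the run completed without an error code.
        return "no error reported"
    return IBDIAGNET_ERRORS.get(code, f"unknown error code {code}")


if __name__ == "__main__":
    for code in (0, 1, 5, 9):
        print(code, describe_ibdiagnet_code(code))
```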

Page 81: ...t to load the fabric to test it like this r01lead opt sgi sbin ibdiagnet c 5000 Loading IBDIAGNET from usr lib64 ibdiagnet1 2 Loading IBDM from usr lib64 ibdm1 2 W Topology file is not specified Reports regarding cluster links will use direct routes W A few ports of local device are up Since port num was not specified p option port 1 of device 1 will be used as the local port I Discovering the sub...

Page 82: ...oubleshooting I I No bad Links with logical state INIT were found I I PM Counters Info I I No illegal PM counters values were found I I Bad Links Info I I No bad link were found I Done Run time was 8 seconds 64 007 5450 001 ...

Page 83: ...of nodes 5 I individual rack unit IRU 10 InfiniBand fabric 21 configuration and operation overview 45 diagnostic commands ibdiagnet 60 ibnetdiscover 59 ibstat 56 ibstatus 56 perfquery 58 overview 43 utilities and diagnostics 55 introduction 1 L login service node 10 M main power 4 MPI default configuration Scali MPI 3 SGI MPT 3 N network interface naming conventions 22 networks Gigabit Ethernet Gi...

Page 84: ...MC 4 compute blades 5 main power 5 R rack leader controller 5 8 restarting the InfiniBand fabric after a system reboot 43 S Scali Manage management software 1 setting up serial over LAN connection 14 storage service node 11 system admin controller 5 8 system overview 1 V virtual local area networks VLANs 16 VLAN_1588 17 VLAN_BMC 17 VLAN_GBE 17 VLAN_HEAD 17 66 007 5450 001 ...
