Appendix B. Command outputs 

Figure B-5   Miswire example 6 - External miswire (swapped cables)

Figure B-6 on page 334 shows a D-Link miswire example in which the D1 links on hubs 4-7 of cage 8 are affected.

[Figure B-5 diagram: D-Link Miswire Example 6. Cage 12, hubs 4 and 5, D3 links affected. External miswire (swapped cables). The diagram shows hubs H0 through H7, with D-link ports labeled D0/SN15 through D15/SN0.]
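The swapped-cable case named in the figure can be illustrated with a small check. The following is a minimal sketch, not the method used by any IBM tool: it assumes a hypothetical table of expected neighbors per (hub, D-link port) and a table of observed neighbors, and it reports pairs of ports whose observed neighbors are each other's expected neighbors. The function name `find_swapped_pairs`, the neighbor names, and the table layout are all invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (hub number, D-link port name), for example (4, "D3").
Port = Tuple[int, str]


def find_swapped_pairs(
    expected: Dict[Port, str], observed: Dict[Port, str]
) -> List[Tuple[Port, Port]]:
    """Return pairs of ports whose observed neighbors are each other's expected neighbors."""
    # A port is miswired when its observed neighbor differs from the expected one.
    miswired = sorted(p for p in expected if observed.get(p) != expected[p])
    swapped = []
    for i, a in enumerate(miswired):
        for b in miswired[i + 1:]:
            # Swapped cables: a sees b's expected neighbor and b sees a's.
            if observed.get(a) == expected[b] and observed.get(b) == expected[a]:
                swapped.append((a, b))
    return swapped


# Illustrative data only: the D3 links on hubs 4 and 5 have exchanged cables.
expected = {(4, "D3"): "remote-A", (5, "D3"): "remote-B", (6, "D3"): "remote-C"}
observed = {(4, "D3"): "remote-B", (5, "D3"): "remote-A", (6, "D3"): "remote-C"}

print(find_swapped_pairs(expected, observed))
```

With the sample tables, the only pair reported is the D3 port on hub 4 together with the D3 port on hub 5, which mirrors the situation the figure describes. A miswire that is not a clean two-way swap would show up as miswired but would not be paired by this check.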


IBM Power Systems 775 for AIX and Linux HPC Solution, International Technical Support Organization, October 2012, SG24-8003-00

Page 4: ...ication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp First Edition October 2012 This edition applies to IBM AIX 7 1 xCAT 2 6 6 IBM GPFS 3 4 IBM LoadLelever Parallel Environment...

Page 5: ...nit CAU 12 1 3 4 Nest memory management unit NMMU 14 1 3 5 Integrated switch router ISR 15 1 3 6 SuperNOVA 16 1 3 7 Hub module 17 1 3 8 Memory subsystem 20 1 3 9 Quad chip module QCM 21 1 3 10 Octant...

Page 6: ...mized for Power 775 clusters 113 2 4 Parallel Environment optimizations for Power 775 116 2 4 1 Considerations for using Host Fabric Interface HFI 116 2 4 2 Considerations for data striping with PE 12...

Page 7: ...ty 235 3 6 2 Integrated Switch Network Manager ISNM 235 Chapter 4 Problem determination 237 4 1 xCAT 238 4 1 1 xcatdebug 238 4 1 2 Resolving xCAT configuration issues 238 4 1 3 Node does not respond t...

Page 8: ...9 5 4 3 Availability Plus A resources in a Power 775 Cluster 289 5 4 4 How to identify a A resource 290 5 4 5 Availability Plus definitions 291 5 4 6 Availability plus components and recovery procedur...

Page 9: ...editions of the publication IBM may make improvements and or changes in the product s and or the program s described in this publication at any time without notice Any references in this information...

Page 10: ...h AIX 5L AIX BladeCenter DB2 developerWorks Electronic Service Agent Focal Point Global Technology Services GPFS HACMP IBM LoadLeveler Power Systems POWER6 POWER6 POWER7 PowerPC POWER pSeries Redbooks...

Page 11: ...R6 AIX SLES and Red Hat clusters and the new Power 775 system She has 12 years of experience at IBM with eight years in IBM Global Services as an AIX Administrator and Service Delivery Manager Puneet...

Page 12: ...allation and on going maintenance phases of clustered RISC and pSeries servers for numerous government and commercial customers beginning with SP2 and continuing through the current Power 775 HPC solu...

Page 13: ...published author too Here s an opportunity to spotlight your skills grow your career and become a published author all at the same time Join an ITSO residency project and help write a book in your are...

Page 14: ...comments to IBM Corporation International Technical Support Organization Dept HYTD Mail Station P099 2455 South Road Poughkeepsie NY 12601 5400 Stay connected to IBM Redbooks Find us on Facebook http...

Page 15: ...t scenarios that include xCAT configuration issues Integrated Switch Network Manager ISNM Host Fabric Interface HFI GPFS and LoadLeveler These scenarios show the flow of how to determine the cause of...

Page 16: ...8 POWER7 processors to do the work to solve the most challenging problems A total of 7 86 TF per CEC and 94 4 TF per rack highlights the capabilities of this high performance computing solution The ha...

Page 17: ...system to local diskless nodes Login nodes Provides a centralized login to the cluster Optional utility node Tape subsystem server For more information see this website http publib boulder ibm com in...

Page 18: ...Computing clients with the following benefits Sustained performance and low energy consumption for climate modeling and forecasting Massive scalability for cell and organism process analysis in life s...

Page 19: ...Read 1B Write 2B Read 1B Write 2B Read Core 4 FXU 4 FPU 4T SMT L2 256KB L3 4MB Core 4 FXU 4 FPU 4T SMT L2 256KB L3 4MB Core 4 FXU 4 FPU 4T SMT L2 256KB L3 4MB Core 4 FXU 4 FPU 4T SMT L2 256KB L3 4MB...

Page 20: ...on off ramps Asynchronous interface to chiplets and off chip interconnect Differential memory controllers 2 6 4 GHz Interface to Super Nova SN DDR3 support max 1067 Mhz Minimum Memory 2 channels 1 SN...

Page 21: ...and non coherent memory access IO operations interrupt communication and system controller communication The PowerBus provides all of the interfaces buffering and sequencing of command and data opera...

Page 22: ...Mem 3 PHY s Mem 3 PHY s MC0 MC1 Power Bus PSI A D HTM ICP PLLs EI 3 PHY s Z BUS 8B 8B W BUS 8B 8B X Bus 8B 8B Y Bus 8B 8B SMP Interconnect HUB Attach POR PSI JTAG FSI I2C ViDBUS I2C SEEPROM M1A 22b 1...

Page 23: ...rrors Protection key support for AIX L1 I D Cache Error Recovery and Handling Instruction retry for soft errors Alternate processor recovery for hard errors Guarding of core for core and L1 L2 cache e...

Page 24: ...rs of four POWER7 chips via the WXYZ links EI 3 PHYs Torrent Diff PHYs L local HUB To HUB Copper Board Wiring L remote 4 Drawer Interconnect to Create a Supernode Optical LR0 Bus Optical 6x 6x LR23 Bu...

Page 25: ...that interconnect the hub chips form the cluster network Figure 1 5 HFI attachment scheme POWER7 chips The set of four POWER7 chips QCM its associated memory and a hub chip form the building block for...

Page 26: ...POWER7 chip Packet ordering The HFIs and cluster network provide no ordering guarantees among packets Packets that are sent from the same source window and node to the same destination window and nod...

Page 27: ...U per four POWER7 chips or one CAU per 32 C1 cores Software organizes the CAUs in the system collective trees The arrival of an input on one link causes its forwarding on all other links when there is...

Page 28: ...e memory of a remote node that is written to memory by the HFI of the remote node The memory of a local node that is written to memory by the HFI of the local node A remote CAU Proc caches WXYZ link P...

Page 29: ...g interconnect packets that reference memory such as RDMA packets and packets that perform atomic operations contain an effective address and information that pinpoints the context in which to transla...

Page 30: ...software Target switch crossbar bandwidth greater than 1 TB per second input and output 96 Gbps WXYZ busses 4 24 Gbps from P7 chips unidirectional 168 Gbps local L busses 7 24 Gbps between octants in...

Page 31: ...r per CMD port Four differential memory clock pairs to support up to four DDR3 registered dual in line memory modules RDIMMs Data Flow Modes include the following features Expansion memory channel dai...

Page 32: ...ls are the same before and after spare lane invocation Spare lanes are selected by one of the following means The spare lanes are selected during initialization by loading host and SuperNOVA configura...

Page 33: ...ical 12x 12x 16 D Buses 28 I2C I2C_0 Int I2C_27 Int I2C To Optical Modules TOD Sync 8B Z Bus 8B Z Bus TOD Sync 8B Y Bus 8B Y Bus TOD Sync 8B X Bus 8B X Bus TOD Sync 8B W Bus 8B W Bus 24 Fiber D Link C...

Page 34: ...boards is 3 TBps the same bandwidth as 1500 10 Gb Ethernet links Connects up to 512 groups of four planers SuperNodes together via the D optical buses with 3 TBs of exiting bandwidth per planer Optic...

Page 35: ...erential pairs 10 GB per second 24 signals for each TX optics module 12 differential pairs 10 GB per second 24 signals for each RX module Three wire I2C TWS Serial Data Address Serial Clock Interrupt...

Page 36: ...DRAM DRAM DRAM DRAM DRAM DRAM DRAM 8B DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM 8B DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM SN DRAM...

Page 37: ...RAM ranks per SuperNova Port Water cooled jacketed design 50 watt max DIMM power Available in sizes 8 GB 16 GB and 32 GB RPQ For best performance it is recommended that all 16 DIMM slots are plugged i...

Page 38: ...ich contains one QCM one Hub module and up to 16 associated memory modules 8c uP 8c uP 8c uP 8c uP A Clk Grp B Clk Grp C Clk Grp D Clk Grp A Clk Grp B Clk Grp C Clk Grp D Clk Grp A Clk Grp B Clk Grp C...

Page 39: ...lk Grp C Clk Grp D Clk Grp A Clk Grp B Clk Grp C Clk Grp D Clk Grp A Clk Grp B Clk Grp C Clk Grp D Clk Grp A Clk Grp B Clk Grp C Clk Grp D Clk Grp 48GB s pk 48GB s pk 48GB s 48GB s 48GB s 48GB s 44GB...

Page 40: ...ences 22x25mm 550 sqmm HUB 1 2 3 4 5 6 7 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 8 9 61x96mm Substrate Water Cooled Hub Module 2 1 View 8x Optical Connector Light Pipe HUB 0 61x96mm P...

Page 41: ...at are spread across several racks This configuration means multiple octants are available in which every octant is connected to every other octant on the system The following levels of interconnect a...

Page 42: ...of the Power 775 physically represented by the drawer also commonly referred as CEC A node is composed of eight octants and their local interconnect Figure 1 15 shows the CEC drawer from the front Fig...

Page 43: ...interconnect L Local L Local LL connects the eight octants in the CEC drawer together via the HUB module by using copper board wiring Every octant in the node is connected to every other octant as sh...

Page 44: ...ocessor features the following characteristics The clocking source follows the topology of the service processor N 1 redundancy of the service processor and clock source logic use two inputs on each p...

Page 45: ...Chapter 1 Understanding the IBM Power Systems 775 Cluster 31 Figure 1 18 Board 2nd level interconnect 1 024 cores The second level wiring connector count is shown in Figure 1 19 on page 32...

Page 46: ...the 8 Octants in Node 4 This requires 24 connection from each of the 8 Octant in Node 1 So every Octant has 24 connections from it and there are 8 Octants resulting in 192 connections 8 x 24 192 Conn...

Page 47: ...ully integrated switch fabric The fabric is a multitier hierarchical implementation that connects eight logical nodes octants together in the physical node server drawer or CEC by using copper L local...

Page 48: ...stem Figure 1 22 shows an example configuration of a 242 TFLOP Power 775 cluster that uses eight Supernodes and direct graph interconnect In this configuration there are 28 D Link cable paths to route...

Page 49: ...shown in Figure 1 23 Figure 1 23 Fully interconnected system example Network topology Optical D links connect Supernodes in different connection patterns Figure 1 24 on page 36 shows an example of 32...

Page 50: ...ux HPC Solution Figure 1 24 Supernode connection using 32D topology Figure 1 25 shows another example in which there is one D link between supernode pairs which supports up to 512 supernodes in a 1D t...

Page 51: ...th that is based on this information Packets that is injected into the interconnect by the HFI employ source route tables with the route partially determined Per port route tables are used to route pa...

Page 52: ...one L hop every hub within a supernode is directly connected via Lremote link to every other hub in the supernode LNMC needs to know only the local link status in this CEC Figure 1 27 L hop If an L r...

Page 53: ...at most are D hops This configuration often is represented as an L D L D L route The following methods are used to select a route is when multiples routes exist Software specifies the intermediate su...

Page 54: ...ng This section provides information about the IBM Power Systems 775 power packaging and cooling features 1 5 1 Frame The front view of an IBM Power Systems 775 frame is shown in Figure 1 30 Figure 1...

Page 55: ...Chapter 1 Understanding the IBM Power Systems 775 Cluster 41 Figure 1 31 Frame front view The rear view of the Power 775 frame is shown in Figure 1 32 on page 42...

Page 56: ...each rated at 27 KW 360 VDC One to six BPRs are populated in the BPCE depending on CEC drawers and storage enclosures in frame and the type of power cord redundancy that is wanted Bulk Power Control a...

Page 57: ...tely 260 KW N 1 Bulk Power mode means one of the four power cords are disconnected and the cabinet continues to operate normally BPCE concurrent maintenance is not conducted in N 1 bulk power mode unl...

Page 58: ...9 T36 1 Gb Ethernet for HMC connectors T19 T36 2x 10 100 Mb Ethernet port to plug in a notebook while the frame is serviced 2x 10 100 Mb spare connectors contain both eNet and Duplex RS422 T11 T19 T20...

Page 59: ...50 to 60Hz Rated current amps per phase 125A 100A Acceptable voltage tolerance machine power cord 180 259Vac 333 508Vac Acceptable frequency tolerance machine power cord 47 63Hz 47 63Hz 2X Bulk Power...

Page 60: ...se and manifolds assemblies and WCUs are shown in Figure 1 36 Figure 1 36 Hose and manifold assemblies The components of the WCU are shown in Figure 1 37 on page 47 Supply Manifold Return Manifold 1 x...

Page 61: ...WCU schematics Dual Float Sensor Asm Orientation in frame Pressure Relief Valve Vacuum Breaker System Supply to electronics Ball Valve Quick Connect System Return from electronics Chilled Water Return...

Page 62: ...ers per STOR Two Port Cards per STOR each with 3 SECs 32 SAS x4 ports four lanes each on 16 Port Cards Data Rates Serial Attach SCSI SAS SDR 3 0 Gbps per lane SEC to Drive Serial Attach SCSI SAS DDR 6...

Page 63: ...enclosure front view 1 6 2 High level description Figure 1 40 on page 50 represents the top view of a disk enclosure and highlights the front view of a STOR Each STOR includes 12 carrier cards six at...

Page 64: ...four lanes at 6 Gbps lane These incoming SAS lanes connect to the input SEC which directs the SAS traffic to the drives Each drive is connected to one of the output SECs on the port card with SAS SDR...

Page 65: ...cond port card in the STOR remains cabled to the CEC so that the STOR remains operational A customer minimum configuration is two SAS cables per STOR and four UPIC power cables per drawer to ensure pr...

Page 66: ...52 IBM Power Systems 775 for AIX and Linux HPC Solution Figure 1 42 Disk Enclosure internal view The disk carrier is shown in Figure 1 43 Figure 1 43 Disk carrier...

Page 67: ...network switches 1 7 1 Hardware Management Console The HMC runs on a single server and is used to help manage the Power 775 servers The traditional HMC functions for configuring and controlling the s...

Page 68: ...re to be deployed by using xCAT 1 7 3 Service node Systems management throughout the cluster is a hierarchical structure see Figure 1 44 to achieve the scaling and performance necessary for a large cl...

Page 69: ...on is shown on Figure 1 45 The SAS PCIe must reside in PCIe slot 16 and the HDDs in slots 15 and 14 The service node also contains a 1 Gb Enet PCIe card that is in PCIe slot 17 Figure 1 45 Service nod...

Page 70: ...when they are booted up Frame 3 CEC U25 DCCA B DCCA B DCCA A DCCA A CEC U19 DCCA B DCCA B DCCA A DCCA A CEC U17 DCCA A DCCA B DCCA A DCCA A CEC U15 DCCA B DCCA B DCCA A DCCA A CEC U11 DCCA B DCCA B D...

Page 71: ...memory that is attached to them A maximum of one LPAR per POWER7 chip supported A single LPAR resides on one two three or four POWER7 chips This configuration results in an Octant with the capability...

Page 72: ...therefore 1536 32 48 There are always redundant utility nodes that reside in different frames when possible If there is only one frame and multiple SuperNodes the utility node resides in different Sup...

Page 73: ...Chapter 1 Understanding the IBM Power Systems 775 Cluster 59 Figure 1 47 Eight octant utility node definition...

Page 74: ...cluster The management rack for a POWER 775 Cluster houses the different components such as the EMS servers IBM POWER 750 HMCs network switches I O drawers for the EMS data disks keyboard and mouse T...

Page 75: ...s means that when one EMS server goes down for any reason the other EMS server accesses the data The EMS servers are redundant from in this scenario but there is no automated high availability process...

Page 76: ...it Eclipse Tools APPLICATION xlC C OpenMP xlF Fortran OpenMP MPI MATH Libraries ESSL xlUPC X10 LAPI Open shmem TCP UDP SOCKETS Multi Link bonding LL Resource Mgr Pre emption HMC HFI CNM TotalView IBM...

Page 77: ...Parallel ESSL http publib boulder ibm com infocenter clre sctr vxrx topic com ibm cluster pessl doc pes slbooks html HPCS Toolkit http domino research ibm com comm research_ projects nsf pages hpcst i...

Page 78: ...html Linux http www ibm com systems power software lin ux Network Management InfiniBand Vendor supported tools http www infinibandta org HFI Switch management CNM Route management Failover recovery P...

Page 79: ...obal counter configuration Phased installation and Optical Link Connectivity Test OLCT ISR network hardware status Monitors for ISR HFI link and optical module events Command line queries to display n...

Page 80: ...management controller machine code MCRSA enables a single site or worldwide enterprise customer to maintain machine code entitlement to remote call in support for ISNM throughout the life of the MCRS...

Page 81: ...twork Management Controller The LNMC present on each node features the following primary functions Event management Aggregates local hardware events local routing events and remote routing events Rout...

Page 82: ...C functional blocks The LNMC also interacts with the EMS and with the ISR hardware to support the execution of vital management functions Figure 1 54 on page 69 provides a high level visualization of...

Page 83: ...k topology Supernode identification Drawer identification within the Supernode Frame identification from BPA via FSP Cage identification from BPA via FSP Expected neighbors table for mis wire detectio...

Page 84: ...ment Route management generates a set of appropriate route table updates and potentially some PRT1 and PRT2 events of its own Changed routes are downloaded via hardware access Event management sends o...

Page 85: ...sets up and monitors the hardware global counter The component also maintains information about the location of the ISR master counter and configured backups Response Async DB Queue Response Async Rou...

Page 86: ...S runs on the EMS and on each xCAT service node 1 9 2 DB2 IBM DB2 Workgroup Server Edition 9 7 for High Performance Computing HPC V1 1 is a scalable relational database that is designed for use in a l...

Page 87: ...a cluster Creating and managing stateless and diskless clusters Figure 1 56 xCAT architecture xCAT supports both Intel and POWER based architectures which provide operating system support for AIX Lin...

Page 88: ...ty to extend RMC monitoring efficiently Performance monitoring and aggregation that is based on TEAL and RMC Automating the deployment process Automate creation of LPARs in every CEC Automate set up o...

Page 89: ...at a time allowing the other nodes to still be running jobs 1 9 4 Toolkit for Event Analysis and Logging The Toolkit for Event Analysis and Logging TEAL is a robust framework for low level system even...

Page 90: ...s This RSCT component provides the security infrastructure that enables RSCT components to authenticate the identity of other parties Topology services subsystem This RSCT component provides node and...

Page 91: ...irtual disk NSDs Overview GPFS Native RAID integrates the functionality of an advanced storage controller into the GPFS NSD server Unlike an external storage controller in which configuration LUN defi...

Page 92: ...and small write data are automatically logged to solid state disks for improved performance GPFS Native RAID features This section describes three key features of GPFS Native RAID and how the function...

Page 93: ...aused by disks or other system components that transport or manipulate the data When an NSD client is writing data a checksum of 8 bytes is calculated and appended to the data before it is transported...

Page 94: ...disks each data and replica strips and a spare disk for rebuilding To decluster this array the disks are divided into seven tracks two strips per array The strips from each group are then spread acro...

Page 95: ...lerant 1 1 conventional array on the left the redundant disk is read and copied onto the spare disk which requires a throughput of seven strip I O operations When a disk fails in the declustered array...

Page 96: ...e physical disks pdisks in a recovery group across which data redundancy information and spare space are declustered The number of disks in a declustered array is determined by the RAID code width of...

Page 97: ...a VDisk plus the equivalent spare disk capacity of a declustered array For example a VDisk that uses the 11 strip wide 8 3p Reed Solomon code requires at least 13 pdisks in a declustered array with t...

Page 98: ...and Reporting Technology SMART data including the number of internal sector remapping events for each pdisk Pdisk discovery GPFS Native RAID discovers all connected pdisks when it starts and then reg...

Page 99: ...y Group DE configuration S S S S S S S S 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...

Page 100: ...3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 2 4...

Page 101: ...5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 2 4 6 2...

Page 102: ...Performance Computing Toolkit for analyzing performance of parallel and serial applications For more information about cluster products see this website http publib boulder ibm com infocenter clresct...

Page 103: ...ides scalable support to more than one million tasks Hierarchical tree support Corefile support Separate corefile per task Supports lightweight corefile that can dump to output Supports third party sc...

Page 104: ...port OpenMP interoperability Support for third party resource manager APIs Support application Checkpoint Restart by using application containers AIX 6 1 Linux target is 2011 but dependent on communit...

Page 105: ...ubset of the debugged tasks PDB allows users to invoke a POE job or attach to a running POE job and place it under debug control pdb starts a remote dbx gdb session for each task of the POE job that i...

Page 106: ...in the application and for locating relationships between functions in your application to help you better understand the performance of the application HPCS Toolkit Figure 1 68 on page 93 shows the f...

Page 107: ...h serial or parallel job and offers several reporting options to track jobs and utilization by user group account or type over a specified time To support charge back for resource use LoadLeveler inco...

Page 108: ...on the AIX and Linux operating systems For more information about ESSL see this website http publib boulder ibm com infocenter clresctr vxrx topic com ibm cluster es sl doc esslbooks html 1 9 10 Paral...

Page 109: ...US or IP applications on all types of nodes However you cannot simultaneously call Parallel ESSL from multiple threads For more information about Parallel ESSL see this website http publib boulder ib...

Page 110: ...tions are rewritten entirely to these new languages We need to provide a mechanism for mature applications to to scale up to the size of clusters used today and use new hardware capabilities A range o...

Page 111: ...ogical Single Address Space Logical Multiple Address Spaces Multiple Machine Address Spaces Homogeneous Cores Heterogeneous Roadrunner Open MP OpenCL BG Open MP SM MPI HW Cache Power PERC Open MP SM M...

Page 112: ...hown in Figure 1 73 the Parallel Tools Platform PTP is an open source project that provides a highly integrated environment for parallel application development The environment is based on a standard...

Page 113: ...narios This chapter describes the application level characteristics of a Power 775 clustered environment and provides guidance to better take advantage of the new features that are introduced with thi...

Page 114: ...source that is used to store node specific information Each node mounts its own root directory and preserves its state in individual mounted root file systems When the node is shut down and rebooted a...

Page 115: ...table must be populated before mkdsklsnode for AIX or genimage for Linux is run Example 2 1 shows a litefile table for a Power 775 system Example 2 1 litefile table example tabdump litefile image fil...

Page 116: ...ur image You must reset the node configuration file so that after the statelite config is updated you rerun mkdsklsnode for AIX and genimage for Linux You verify the run by reviewing the install nim s...

Page 117: ...icular node the data is shown in Example 2 3 Example 2 3 Command to gather the data that is represented on the nodes df awk print 1 7 grep errlog statelite persistent node hostname var adm ras errlog...

Page 118: ...endspace 9437184 no o tcp_recvspace 9437184 no o udp_sendspace 131072 no o udp_recvspace 1048576 cp p xcatpost admin_files passwd etc passwd cp p xcatpost admin_files security user etc security user c...

Page 119: ...nd respective memory resources The remaining 25 one POWER7 processor and its resources is designated to the service node SN For more information about resource requirements for service and utility nod...

Page 120: ...n the first adapter 2 3 Application development There are several tool chain enhancements to aid parallel program development on Power 775 systems The enhancements provide support for the newest POWER...

Page 121: ...tecture for which code is generated The qtune compiler option tunes instruction selection scheduling and other architecture dependent performance enhancements to run best on a specific hardware archit...

Page 122: ...s in the hub chip 2 3 3 Unified Parallel C Unified Parallel C UPC is an explicitly parallel extension of the C programming language that is based on the PGAS programming model UPC preserves the effici...

Page 123: ...el loop optimizations Concepts of data affinity Data affinity refers to the logical association between a portion of shared data and a thread Each partition of shared memory space has affinity to a pa...

Page 124: ...By default a c file is compiled as a UPC program unless the option qsourcetype c is specified Execution environments The execution environment of a UPC program is static or dynamic In the static exec...

Page 125: ...ou must use the IBM Parallel Operating Environment POE For example to run the executable program a out enter the following command a out procs 3 msg_api pgas hostfile hosts In this command the followi...

Page 126: ...d more than once The default value for num is 80 If any thread exceeds its local stack size the executable file issues a message and immediately stops the execution of the application when possible If...

Page 127: ...e parameter filename is the name of a binding file which must be a plain text file If the binding file is not in current directory you must specify the path to that file The path is an absolute path o...

Page 128: ...entire option string in quotation marks if XLPGASOPTS contains any embedded white spaces For example specifying options in any of the following ways has the same effect XLPGASOPTS stacksize 1 mb nosta...

Page 129: ...ASOPTS stacksize 0x800kb is equivalent to XLPGASOPTS stacksize 2mb Compiling and running an example program This section provides a simple UPC program the commands to compile and execute the program a...

Page 130: ...r the following servers and processors IBM BladeCenter JS2 IBM POWERPC 450 IBM POWERPC 450D IBM POWER5 IBM POWER5 IBM POWERPC970 processors IBM Blue Gene P Subroutines ESSL 5 1 is the last release to...

Page 131: ...quation Subroutines PDTRTRI and PZTRTRI Triangular Matrix Inverse New Fourier Transform Subroutines PSCFTD and PDCFTD Multidimensional Complex Fourier Transforms PSRCFTD and PDRCFTD Multidimensional R...

Page 132: ...commands to and collecting distributed statistics from running jobs PESH also allows PNSD commands to run concurrently on multiple nodes Enhanced scalability such that PE is now designed to run one m...

Page 133: ...user This section focus on the enhancement for Power 775 cluster systems that have HFI CAU and some run jobs with large numbers of tasks 2 4 1 Considerations for using HFI The HFI provides the non co...

Page 134: ...SR Translates the packets into transactions Pushes the transactions to PowerBus attached devices in a POWER7 chip Packet ordering The HFIs and cluster network provide no ordering guarantees among pack...

Page 135: ...d MP_RDMA_ROUTE_MODE route modes include the following attributes Hardware direct hw_direct Software indirect sw_indirect Hardware direct striped hw_direct_striped Hardware indirect hw_indirect You ch...

Page 136: ...fers per HFI and these buffers are shared between windows If there is only a single LPAR on the octant there are 128 available hardware buffers An octant with four LPARs has 32 available hardware buff...

Page 137: ...f computation and communication for non blocking requests By default MP_FORCED_INTERRUPTS_ENABLED is set to no 2 4 2 Considerations for data striping with PE PE MPI depends on PAMI or LAPI as a lower...

Page 138: ...s communication among multiple adapter windows By using resources that are allocated by a resource manager PAMI LAPI opens multiple user space windows for communication Every task of the job opens the...

Page 139: ...the adapters To activate striping or failover for an interactive parallel job you must use the following settings for the MP_EUIDEVICE and MP_INSTANCES environment variables For instances from multipl...

Page 140: ...e 2 11 hfi_read_regs sample output hfi_read_regs p l hfi0 Nonwindow Registers Nonwindow General number_of_windows 0 8 0x100 256 isr_id 0x0000000000000000 0 rcxt_cache_window_flush_req 0 9 0x0 0 RCWFR...

Page 141: ...lid 0 0 0x0 Invalid PMR6 reserved1 1 17 0x0 0 PMR6 new real addr 18 51 0x0 0 PMR6 reserved2 52 56 0x0 0 PMR6 read target 57 57 0x0 Old PMR6 page size 58 63 0x0 Reserved page_migration_regs 7 0x0000000...

Page 142: ...000000000 0 agg ip pkt sent count 0x0000000000dad28f 14340751 agg cau pkt sent count 0x0000000000000000 0 agg gups pkt sent count 0x0000000000000000 0 addr xlat wait count 0x000000031589d5d5 132462565...
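
Each performance counter is printed twice, as a hexadecimal value and as a decimal value (for example, 0xdad28f and 14340751). The following sketch parses one counter line and cross-checks the two representations. The line layout (name, hexadecimal value, then decimal value, optionally in brackets) is an assumption that is based on the excerpt, and parse_counter is a hypothetical helper.

```python
import re

# Assumed layout: "<name> 0x<hex> <decimal>", with optional filler dots after
# the name and optional square brackets around the decimal value.
COUNTER_RE = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<hex>0x[0-9a-fA-F]+)\s+\[?(?P<dec>\d+)\]?\s*$"
)


def parse_counter(line):
    """Return (name, value), or None when the line is not a counter line."""
    match = COUNTER_RE.match(line)
    if match is None:
        return None
    value = int(match["hex"], 16)
    if value != int(match["dec"]):
        raise ValueError(f"hex and decimal disagree in: {line!r}")
    return match["name"].rstrip(" ."), value
```

Parsed counters from two runs can then be subtracted to obtain the packets that were sent or dropped during an interval.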

Page 143: ...ing 0 0 0 0 0 agg pkts received 1627181 865097 868630 877309 877294 agg pkts dropped receiving 0 0 0 0 0 agg imm send pkt count 24607 12304 12304 12304 12304 agg send recv pkt count 98409 98398 98394...

Page 144: ...orms the limited processing The system supports many trees simultaneously and each CAU supports 64 independent trees Software defines the tree to the hardware by programming the MMIO registers of each...

Page 145: ...more processors for a neighbor often reside on the same node as one of the processors CAUs that do not have any processors as a neighbor such as C1 C3 and C5 as shown in Figure 2 11 are on any node i...

Page 146: ...e run architectural limits for other components for example HFI windows might reduce the number of tasks supported To prevent performance degradation jobs that include large numbers of tasks need spec...

Page 147: ...default these files are written to tmp but the system administrator optionally uses the MP_DBG_TASKINFO_DIR entry in the etc poe limits file to change the directory into which these files are stored T...

Page 148: ...2ap05 hf0 c250f10c12ap05 hf0 c250f10c12ap05 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0 c250f10c12ap09 hf0...

Page 149: ...10c12ap09 hf0 c250f10c12ap05 hf0 c250f10c12ap09 hf0 c250f10c12ap05 hf0 c250f10c12ap09 hf0 c250f10c12ap05 hf0 c250f10c12ap09 hf0 c250f10c12ap13 hf0 c250f10c12ap13 hf0 c250f10c12ap13 hf0 c250f10c12ap13...

Page 150: ...function calls for MPI and PAMI functions Run source code analysis to find parallel application coding problems such as mismatched MPI barriers Compile the application Run the application choosing res...

Page 151: ...these environment variables directly Instead you set them by using Eclipse dialogs The IBM HPC Toolkit requires your application to be compiled and linked by using the g flag If your application is no...

Page 152: ...ng OpenMP profiling Application I O profiling The peekperf GUI maps performance measurements to your source code and allows you to sort and filter the data Peekperf is started by issuing the command p...

Page 153: ...ites for each function Figure 2 16 on page 140 shows Peekperf MPI profiling data collection window The data visualization window shows the number of times each function call was executed the total tim...

Page 154: ...k access to the profiled data which helps you identify the functions that are the most CPU intensive By using the GUI you also manipulate the display to focus on the critical areas of the application...

Page 155: ...lication integration 141 Figure 2 17 Xprof Flat Profile window As shown in Figure 2 18 on page 142 the source code window shows you the source code file for the function you specified from the Flat Pr...

Page 156: ...performance of the application on POWER7 hardware with the intent of finding the bottlenecks of performance and looking for opportunities to optimize the target application code By using hpccount you...

Page 157: ...system performed INPUT 0 Number of times file system performed OUTPUT 0 Number of IPC messages sent 0 Number of IPC messages received 0 Number of signals delivered 0 Number of voluntary context switc...

Page 158: ...ed however if the keyword is not used LoadLeveler assumes that the job command file is the executable As shown in Example 2 14 the llclass command is used to ascertain class information about the curr...

Page 159: ...PI job bin ksh job_name myjob openmp job_type parallel class X_Class output job_name out error job_name err task_affinity core 4 cpus_per_core 1 parallel_threads 4 queue export OMP_NUM_THREADS 4 expor...
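
The job command file above is a shell script in which LoadLeveler keywords appear as "# @ keyword = value" comment lines, followed by a queue statement. The following sketch renders such a file from a dictionary. The render_job_command_file helper is hypothetical, and the punctuation of the values in the usage note is an assumption because it is not visible in the excerpt.

```python
def render_job_command_file(keywords, shell="/bin/ksh", body=()):
    """Render LoadLeveler keywords as '# @ key = value' lines plus a queue line.

    Hypothetical helper: it does not validate keyword names or values.
    """
    lines = [f"#!{shell}"]
    for key, value in keywords.items():
        lines.append(f"# @ {key} = {value}")
    lines.append("# @ queue")  # marks the end of the job step definition
    lines.extend(body)         # shell commands that run as the job
    return "\n".join(lines) + "\n"
```

For example, passing {"job_type": "parallel", "class": "X_Class", "parallel_threads": 4} with body=["export OMP_NUM_THREADS=4"] produces a file that can be reviewed before it is submitted with llsubmit.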

Page 160: ...ging jobs Querying and managing jobs is described in this section Querying job status After a job is submitted by using the llsubmit command you use the llq command to query and display the LoadLevele...

Page 161: ...50f10c12ap02 hf0 ppd pok ibm com 49 0 Job Step Id c250f10c12ap02 hf0 ppd pok ibm com 49 0 Job Name myjob mpi Step Name 0 Structure Version 10 Owner itsohpc Queue Date Fri Oct 28 10 10 12 2011 Status R...

Page 162: ...Dependency Resources ConsumableCpus 1 Node Resources CollectiveGroups 64 Step Resources Step Type General Parallel Node Usage shared Submitting Host c250f10c12ap02 hf0 ppd pok ibm com Schedd Host c250...

Page 163: ...ks False Monitor Program Node Name Requirements Preferences Node minimum 2 Node maximum 2 Node actual 2 Allocated Hosts c250f10c12ap09 hf0 ppd pok ibm com MCM0 CPU 0 MCM0 CPU 1 MCM0 CPU 2 MCM0 CPU 3 M...

Page 164: ...Task Instance c250f10c12ap09 hf0 18 MCM0 CPU 18 Task Instance c250f10c12ap09 hf0 19 MCM0 CPU 19 Task Instance c250f10c12ap09 hf0 20 MCM0 CPU 20 Task Instance c250f10c12ap09 hf0 21 MCM0 CPU 21 Task In...

Page 165: ...ance c250f10c12ap13 hf0 63 MCM3 CPU 55 1 job step s in query 0 waiting 0 pending 1 running 0 held 0 preempted Querying machine status You use the llstatus command to ascertain the status information a...
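
The last line of the llq output summarizes the job steps by state. The following sketch extracts those counts so that a script can wait until nothing is pending. The parse_llq_summary helper is hypothetical, and the exact punctuation of the summary line is an assumption, so the pattern tolerates both the excerpt form and a form with parentheses and commas.

```python
import re

STATES = ("waiting", "pending", "running", "held", "preempted")


def parse_llq_summary(line):
    """Return a dict of counts from the llq summary line, or None."""
    total = re.search(r"(\d+)\s+job step\W*s?\W*in query", line)
    if total is None:
        return None
    counts = {
        state: int(number)
        for number, state in re.findall(r"(\d+)\s+(" + "|".join(STATES) + r")", line)
    }
    counts["total"] = int(total.group(1))
    return counts
```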

Page 166: ...nning tasks Total Machines 5 machines 1 jobs 64 running tasks The Central Manager is defined on c250f10c12ap01 hf0 ppd pok ibm com The BACKFILL scheduler is in use The following machine is absent c250...

Page 167: ...u use the keywords that are listed in Table 2 3 to specify how LoadLeveler assigns tasks to nodes Various task assignment keywords are used in combination and other keywords are mutually exclusive Tab...

Page 168: ...ine one of the types of affinity options task_mcm_allocation memory_affinity and adapter_affinity which result in the following values task_mcm_allocation options mcm_accumulate tasks are placed on th...

Page 169: ...n on the same nodes that are in use you reduce the setting so other jobs are able to use the nodes You check the current available collective_groups on the machine in the LoadLeveler cluster by using...

Page 170: ...156 IBM Power Systems 775 for AIX and Linux HPC Solution...

Page 171: ...general and key component tests list configurations and access monitored data for post processing in external systems This chapter describes the monitoring tools for the following components LoadLeve...

Page 172: ...at index php title Setting_up_LoadLeveler_in_a_Stateful_Cluster Initialize_and_Configure_LoadLeveler HTML SourceForge Setting up LoadLeveler in a Stateful Cluster Tivoli Workload Scheduler LoadLev...

Page 173: ...ix cre 2Fcre kickoff htm HTML PDF Information Center AIX 7 1 Information Center Linux Red Hat http docs redhat com docs en US index html HTML PDF RedHat all products ISNM Management Guide PDF High per...

Page 174: ...iki xcat index php title XCAT_AIX_Diskless_Nodes HTML SourceForge Diskless resources Linux http://sourceforge.net/apps/mediawiki/xcat/index.php?title=XCAT_Linux_Statelite TEAL Check Table 3 4 on page...

Page 175: ...system mmlsfs Displays file system attributes mmlsmgr check Displays which node is the file System Manager for the specified file systems or which node is the cluster manager mmlsmount Lists the nodes...

Page 176: ...e nodes rscan Collects node information from one or more hardware control points DB2 db2_local_ps Display DB2 processes status db2ilist List DB2 instances db2level Displays information about the code...

Page 177: ...ormation about miswired links lsnwtopo Displays cluster network topology information nwlinkdiag Diagnoses an ISR network link HFI 775 only hfi_read Reads data from HFI logical device hfi_read_regs Dis...

Page 178: ...scription as shown in Figure 3 2 on page 165 Typical output examples are shown in Example 2 18 on page 151 Example 2 18 on page 151 and Example 2 19 on page 152 Diskless resource Network File System s...

Page 179: ...e mmlsdisk mmlsfs mmlsmount mmlsnsd These commands are described in the GPFS information in Table 3 1 on page 158 Although these commands belong to the GPFS base product the output from the commands i...

Page 180: ...A2DATA 6442024960 1 no yes 6442008576 100 13824 0 000DE22BOTDA3DATA 6442024960 1 no yes 6442008576 100 13824 0 000DE22BOTDA4DATA 6442024960 1 no yes 6442008576 100 13824 0 000DE22TOPDA1DATA 6442024960...

Page 181: ...r 2 Number of quorum nodes active in the cluster 2 Quorum 2 Quorum achieved mmlsdisk Example 3 3 shows the mmlsdisk command output Example 3 3 mmlsdisk command output mmlsdisk gpfs1 disk driver sector...

Page 182: ...es m 1 Default number of metadata replicas M 2 Maximum number of metadata replicas r 1 Default number of data replicas R 2 Maximum number of data replicas j scatter Block allocation type D nfs4 File l...

Page 183: ...l0 ppd pok ibm com mmlsnsd Example 3 6 shows the mmlsnsd command output Example 3 6 mmlsnsd command output mmlsnsd a L File system Disk name NSD volume ID NSD servers gpfs1 000DE22BOTDA1META 140A0C0D4...

Page 184: ...ut examples are shown in Example 3 7 on page 171 Example 3 8 on page 172 Example 3 9 on page 172 and Example 3 10 on page 174 Figure 3 3 mmlsvdisk command flag description The mmlsvdisk command lists...

Page 185: ...TA 8 3p 000DE22BOT DA1 16384 000DE22BOTDA1META 4WayReplication 000DE22BOT DA1 2048 000DE22BOTDA2DATA 8 3p 000DE22BOT DA2 16384 000DE22BOTDA2META 4WayReplication 000DE22BOT DA2 2048 000DE22BOTDA3DATA 8...

Page 186: ...ry group 000DE22TOP vdisk name 000DE22TOPLOG raidCode 3WayReplication recoveryGroup 000DE22TOP declusteredArray LOG blockSizeInKib 512 size 8240 MiB state ok remarks log vdisk name 000DE22TOPDA1META r...

Page 187: ...SizeInKib 2048 size 1000 GiB state ok remarks vdisk name 000DE22TOPDA3DATA raidCode 8 3p recoveryGroup 000DE22TOP declusteredArray DA3 blockSizeInKib 16384 size 6143 GiB state ok remarks vdisk name 00...

Page 188: ...mation about all of the connected DEs to the LPAR where the command is executed The command generates the following input and output values Command mmgetpdisktopology Flags None Outputs See Example B...

Page 189: ...ist Portcards detected ses Portcard logical SAS device controller one for each row of disks four rows total mpt_sas Power 775 SAS Card Link logical disks Disks visible in a specific path diskset_id Di...

Page 190: ...sees both portcards P1 C60 P1 C61 Portcard P1 C60 ses32 0154 mpt2sas2 mpt2sas1 24 diskset 27327 ses33 0154 mpt2sas2 mpt2sas1 24 diskset 43826 Portcard P1 C61 ses40 0154 mpt2sas3 24 diskset 27327 ses41...

Page 191: ...commands that are commonly used to get the status of some services nodes and hardware availability We also introduce a new component specific to the IBM Power 775 cluster power consumption statistics...

Page 192: ...onnection state for the primary FSP BPA are NOT LINE UP The command requires that the connection for the primary FSP BPA should be LINE UP Example 3 13 rpower output example for all xcat cec group nod...

Page 193: ...flag description The nodestat command returns information about the status of the nodes operating system and power state and updates the xCAT database with information This command also supports custo...

Page 194: ...02ap09 hf0 noping Not Activated c250f06c02ap13 hf0 noping Not Activated c250f06c02ap17 hf0 noping Not Activated c250f06c02ap21 hf0 noping Not Activated c250f06c02ap25 hf0 noping Not Activated c250f06c...

Page 195: ...ageDC ambienttemp exhausttemp CPUspeed renergy noderange V savingstatus on off cappingstatus on off cappingwatt watt cappingperc percentage Power 7 server specific renergy noderange V all savingstatus...

Page 196: ...tts between and including cappingmin and cappingmax values permitted cappingvalue cappingperc 0 100 Capping energy consumption value in percentage of the value between the cappingmin and cappingmax va...
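
The cappingperc attribute expresses the cap as a percentage of the range between cappingmin and cappingmax. The following sketch converts such a percentage to watts. The linear interpolation is one reading of the description above and is an assumption, and capping_watts is a hypothetical helper that is not part of xCAT.

```python
def capping_watts(cappingperc, cappingmin, cappingmax):
    """Map a 0-100 percentage onto the [cappingmin, cappingmax] range in watts.

    Assumption: 0 corresponds to cappingmin and 100 to cappingmax, with
    linear interpolation in between.
    """
    if not 0 <= cappingperc <= 100:
        raise ValueError("cappingperc must be between 0 and 100")
    if cappingmin > cappingmax:
        raise ValueError("cappingmin must not exceed cappingmax")
    return cappingmin + (cappingmax - cappingmin) * cappingperc / 100
```

Under this reading, a cappingperc value of 0 selects the lowest permitted cap and a value of 100 selects the highest.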

Page 197: ...10 12 1 ffoNorm 3836 MHz 40 10 12 1 ffovalue 3836 MHz Example 3 18 renergy output example for all the energy values turning savingstatus on renergy fsp savingstatus on 40 10 12 1 Set savingstatus succ...

Page 198: ...foNorm 3836 MHz 40 10 12 1 ffovalue 3836 MHz Example 3 20 renergy output example setting up cappingvalue and turning cappingstatus on renergy fsp cappingperc 0 40 10 12 1 Set cappingperc succeeded 40...

Page 199: ...format Outputs no flags Device type model serial number side ip addresses hostname device type model serial number side ip addresses hostname Attributes device BPA FSP CEC HMC type model XXXX YYY mod...

Page 200: ...2C 02C6906 B 0 40 6 3 2 40 6 3 2 FSP 9125 F2C 02C6946 A 0 40 6 4 1 40 6 4 1 FSP 9125 F2C 02C6946 B 0 40 6 4 2 40 6 4 2 FSP 9125 F2C 02C6986 A 0 40 6 5 1 40 6 5 1 FSP 9125 F2C 02C6986 B 0 40 6 5 2 40 6...

Page 201: ...e model XXXX YYY model type serial number IBM 7 hexadecimal serial number The serial number associates all the LPARs belonging to a determined FSP side A B address IP for the hardware component Exampl...

Page 202: ...leaveMode pending_interleave_mode CurrentMemoryInterleaveMode active_interleave_mode Designations PumpMode The pump mode value includes the following valid options 1 Node Pump Mode 2 Chip Pump Mode Oc...

Page 203: ...2 13 17 553 U78A9 001 1122233 P1 C5 0x21010229 2 17 17 552 U78A9 001 1122233 P1 C6 0x21010228 2 17 17 545 U78A9 001 1122233 P1 C7 0x21010221 2 17 17 544 U78A9 001 1122233 P1 C8 0x21010220 2 17 2 521 U...

Page 204: ...oltype are fnm or lpar Default is lpar s Displays the connection with HMC V Displays the time that is needed to retrieve the information Outputs T noderange fsp sp primary secondary ipadd A A A A alt_...

Page 205: ...lt_ipadd unavailable state LINE UP f06cec02 40 6 2 2 sp primary ipadd 40 6 2 2 alt_ipadd unavailable state LINE UP f06cec03 40 6 3 1 sp secondary ipadd 40 6 3 1 alt_ipadd unavailable state LINE UP f06...
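
Each lshwconn line reports one connection with its sp, ipadd, alt_ipadd, and state attributes. The following sketch lists the connections whose state is not LINE UP. The field separators (colons between fields, commas and equal signs inside the attribute list) are an assumption because the punctuation is not visible in the excerpt, and both helpers are hypothetical.

```python
def parse_lshwconn_line(line):
    """Split 'node: ip: key=value,key=value' into (node, attribute dict)."""
    node, rest = line.split(":", 1)
    attrs_text = rest.rsplit(":", 1)[-1]  # the attribute list is the last field
    attrs = dict(
        item.strip().split("=", 1)
        for item in attrs_text.split(",")
        if "=" in item
    )
    return node.strip(), attrs


def not_line_up(lines):
    """Return (node, ipadd, state) for every connection that is not LINE UP."""
    problems = []
    for line in lines:
        if ":" not in line:
            continue
        node, attrs = parse_lshwconn_line(line)
        if attrs.get("state") != "LINE UP":
            problems.append((node, attrs.get("ipadd"), attrs.get("state")))
    return problems
```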

Page 206: ...12 2 sp primary ipadd 40 6 12 2 alt_ipadd unavailable state LINE UP 40 6 2 1 40 6 2 1 sp secondary ipadd 40 6 2 1 alt_ipadd unavailable state LINE UP 40 6 2 2 40 6 2 2 sp primary ipadd 40 6 2 2 alt_ip...

Page 207: ...on for DB2_instance usr local bin isql v DB2_instance Example 3 30 db2ilist output example when DB2 instances are listed db2ilist xcatdb Example 3 31 db2level output when DB2 version details and produ...

Page 208: ...3 AIX and Linux system monitoring commands equivalencies AIX Linux Command Description nmon nmon Nigel Monitor for General system overview topas top OS specific system overview ps ps Displays process...

Page 209: ...nel AIX NMON views Host Fabric Interface HFI logical networks for IP communication only in AIX as shown in Figure 3 12 Figure 3 12 NMON example for IP over HFI monitoring in AIX Linux NMON views HFI lo...

Page 210: ...llowing commands we show output data that helps in identifying problems or specific debugging cases iostat mount vmstat netstat entstat Example 3 35 iostat output example iostat System configuration l...

Page 211: ...10 58 rw log dev loglv00 dev mntdb2lv mntdb2 jfs2 Oct 04 10 59 rw log dev loglv00 Example 3 37 vmstat output example vmstat System configuration lcpu 32 mem 31616MB kthr memory page faults cpu r b av...

Page 212: ...eue Overflow 0 Current S W H W Transmit Queue Length 0 Broadcast Packets 10314 Broadcast Packets 2980396 Multicast Packets 10605 Multicast Packets 3028 No Carrier Sense 0 CRC Errors 0 DMA Underrun 0 D...

Page 213: ...is shown in Figure 3 14 A typical output is shown in Example 3 40 on page 200 Figure 3 14 lsnwcomponents command flag description The lsnwcomponents command lists integrated switch network components...

Page 214: ...2 MTMS 9125 F2C 02C6A46 FR006 CG12 SN005 DR1 FSP Primary ip 40 6 11 2 MTMS 9125 F2C 02C6A66 FR006 CG13 SN005 DR2 FSP Primary ip 40 6 12 2 MTMS 9125 F2C 02C6A86 FR006 CG14 SN005 DR3 FSP Primary ip 40 6...

Page 215: ...HB0 LR23 DOWN_FAULTY Service_Location U78A9 001 99200FX P1 T9 Link FR006 CG10 SN004 DR3 HB0 LR17 DOWN_FAULTY Service_Location U78A9 001 P1 T9 Link FR006 CG10 SN004 DR3 HB1 LR17 DOWN_FAULTY Service_Loc...

Page 216: ...one of the link connection expected_physical_location_final_link_endpoint Expected endpoint two of the same link connection actual_physical_location_link_endpoint Actual configured endpoint two of the...

Page 217: ...red Backups number_backup_nodes FRAME CAGE SUPERNODE DRAWER HUB Broadcast Frequency in 2 Ghz counter cycles broadcast_freq Takeover Trigger Number of Broadcasts takeover_trigger Invalid Trigger Number...

Page 218: ...kinfo command flag description The lsnwlinkinfo command lists the integrated switch network links information The input and output values are the same as the values from the lsnwexpnbrs command The ls...

Page 219: ...ME CAGE SUPERNODE DRAWER FRAME Frame number FRXXX CAGE Cage number CGXX SUPERNODE Supernode number SNXXX DRAWER Drawer number DR 0 3 Example 3 46 lsnwloc command output FR006 CG13 SN005 DR2 STANDBY FR...
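
The location format that is described above combines a frame (FRXXX), a cage (CGXX), a supernode (SNXXX), and a drawer (DR 0 - 3). The following sketch splits such a location into numeric fields. The separator between the fields is an assumption, so the pattern accepts any non-alphanumeric separator, and parse_location is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class Location(NamedTuple):
    frame: int
    cage: int
    supernode: int
    drawer: int


# Field widths follow the FRXXX / CGXX / SNXXX / DR 0-3 description.
LOCATION_RE = re.compile(r"FR(\d{3})\W+CG(\d{2})\W+SN(\d{3})\W+DR([0-3])\b")


def parse_location(text: str) -> Optional[Location]:
    """Return the first location found in text, or None."""
    match = LOCATION_RE.search(text)
    if match is None:
        return None
    return Location(*(int(group) for group in match.groups()))
```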

Page 220: ...[Figure: D-link ports D0 through D15 paired with supernodes SN15 through SN0]...

Page 221: ...more information see Network topology on page 35 This scheme represents X links of type D links As a result this attribute is represented as XD Example 3 47 lsnwtopo command output lsnwtopo ISR netwo...

Page 222: ...0 RCWFR win flush busy 0 0 0x0 Inactive RCWFR reserved 1 7 0x0 0 RCWFR window id 8 15 0x0 0 Nonwindow Page Migration page_migration_regs 0 0x0000000000000000 0 PMR0 valid 0 0 0x0 Invalid PMR0 reserve...

Page 223: ...0000000000 0 PMR5 valid 0 0 0x0 Invalid PMR5 reserved1 1 17 0x0 0 PMR5 new real addr 18 51 0x0 0 PMR5 reserved2 52 56 0x0 0 PMR5 read target 57 57 0x0 Old PMR5 page size 58 63 0x0 Reserved page_migrat...

Page 224: ...Performance Counters HFI agg pkts sent 0x0000000008ff67c3 150955971 agg pkts dropped sendng 0x0000000000000000 0 agg pkts received 0x0000000008f168e5 150038757 agg pkts dropped rcving 0x0000000000000...

Page 225: ...xTkn 0 _SEL Errors 0 Info 0 API 0 Buffer 0 Perf 0 _SEI Error 0 API 0 Mapping 0 Milestone 0 Diag 0 _SRA API 0 Errors 0 Wherever 0 _SKD Errors 0 Info 0 Debug 0 _SEA Errors 0 Info 0 API 0 Buffer 0 SVCTKN...

Page 226: ...IBM MicroSensorRM 1 11 1 1 0 IBM SensorRM 1 12 25 1 0 IBM ServiceRM 1 13 18 1 0 IBM TestRM 1 14 1 1 0 IBM WLMRM 1 15 1 1 0 Highest file descriptor in use is 25 Internal Daemon Counters 0000 GS init at...

Page 227: ...reed 0 ACL s created 3 ACL s freed 0 0150 Sec rec methods 20 Sec authent 6124 Missed sec rsps 0 Wake sec thread 6145 0160 Wake main thread 6143 Enq sec request 6145 Deq sec request 6145 Enq sec respon...

Page 228: ...tion lsnim h usage To display predefined and help information about NIM objects types classes and subclasses lsnim p P S lsnim p P c object class s object subclass t object type l o O Z lsnim p P a at...

Page 229: ...c_base resources installp_bundle IBMhpc_all resources installp_bundle GOLD_71Bdskls_util resources spot GOLD_71Bdskls_util_shared_root resources shared_root Example 3 53 lsnim output example for resol...

Page 230: ...must be configured to take advantage of the operating system For more information about how to configure this support see Table 3 1 on page 158 For the Linux based cluster you must configure the devi...

Page 231: ...nfs processes status Linux only service nfs status rpc svcgssd is stopped rpc mountd pid 2700 is running nfsd pid 2697 2696 2695 2694 2693 2692 2691 2690 is running rpc rquotad pid 2684 is running TFT...

Page 232: ...erated and then forwarded to a listener The listeners also act as plug ins to the base framework of TEAL The following plug ins are supported RMC sensor File based file stderr or stdout SMTP to send e...

Page 233: ...teal pnsd ISNM CNM and hardware server teal isnm CNM and Hardware Server events Command Command description teal Runs TEAL tool python script for real time and historical analysis tlchalert Closes an...

Page 234: ...For this command we list the help description that is shown in Figure 3 26 on page 221 A typical output is shown in Example 3 59 on page 222 tlchalert log Command log files tllsalert log tllschkpt log...

Page 235: ...so Valid query values and their operations and formats rec_id A single id or a comma separated list of ids equals only alert_id A single id or a comma separated list of ids equals only creation_time A...

Page 236: ...CG14 SN008 DR3 HB2 HF0 RM0 10 BDFF0070 2011 10 27 09 28 02 211962 H FR010 CG14 SN008 DR3 HB0 HF1 RM3 tllsckpt For this command we list the help description shown in Figure 3 27 A typical output is sh...

Page 237: ...rations and formats rec_id A single id or a comma separated list of ids equals only event_id A single id or comma separated list of event ids time_occurred A time value in the format YYYY MM DD HH MM...

Page 238: ...erent types of cluster nodes Table 3 8 Component distribution over node type For more information For more information about high performance clustering by using the 9125 F2C see this website https ww...

Page 239: ...information such as the status of HFI adapters the status of HFI interfaces HFI IP address HFI MAC address HFI statistics data from HFI logical device HFI registers values and dump information of HFI...

Page 240: ...teal f TEAL tllsckpt f text g RSCT lssrc a grep rsct h NIM lssrc g nim i SSH lssrc g ssh j NFS lssrc g nfs k TCPIP lssrc g tcpip Check used daemons such as dhcpsd xntpd inetd tftpd named and snmpd l...

Page 241: ...machine 3 4 EMS Availability This section describes the Availability functionality that is implemented with the 775 cluster systems when two EMS physical machines are used For more information For mo...

Page 242: ...Unexport the following xCAT NFS directories exportfs ua 4 Stop DB2 database a su xcatdb b db2 connect reset c db2 force applications all d db2 terminate e db2stop f exit 5 Unmount all shared disk file...

Page 243: ...nstall iii mount xcat iv mount databaseloc b On Linux i mount dev sdc1 etc xcat ii mount dev sdc2 install iii mount dev sdc3 xcat iv mount dev sdc4 databaseloc 6 Update DB2 configuration a AIX opt IBM...

Page 244: ...ode to perform operating system OS deployment tasks Complete the following steps to use the primary management node to perform OS deployment tasks 1 Create operating system images For Linux The operat...

Page 245: ...SH issues when you are attempting to secure the compute nodes or any other nodes the hostname in the SSH keys under the HOME ssh directory must be updated e Run nimnodeset or mkdsklsnode Before run ni...

Page 246: ...s Table 3 9 summarizes the available configuration listing commands Commands overview Table 3 9 Available commands for component configuration listing Component Command AIX Linux Description LoadLevel...

Page 247: ...mon is configured to start lsnwconfig Displays the active network management configuration parameters HFI 775 only lslpp L grep i hfi Displays the current HFI driver version installed hfi snap Gathers...

Page 248: ...figuration to this compute node For more information see Table 3 1 on page 158 3 5 2 GPFS We list the GPFS cluster configurations by using the GPFS commands that are shown in Table 3 9 on page 232 By...

Page 249: ...c flag must be used with i flag h help Display usage message i attr list Comma separated list of attribute names to display l long List the complete object definition s short Only list the object name...

Page 250: ...f xCAT built SVN build number date Build date xcat_user Configuration user xcat_password Configuration user password database_instance Database instance name database_name Database name database_path...

Page 251: ...25 hf0 node c250f10c12ap29 hf0 node cec12 node frame10 node llservice node lsxcatd For this command we list the help description that is shown in Figure 3 30 A typical output is shown in Example 3 64...

Page 252: ...ation for the IBM Power Systems 775 cluster that requires monitoring attention However if such monitoring is required there is more documentation for DB2 settings as shown in Table 3 1 on page 158 3 5...

Page 253: ...on parameter values from Cluster Database Performance Data Interval 300 seconds Performance Data collection Save Period 168 hours No of Previous Performance Summary Data 1 RMC Monitoring Support ON 1...
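
The two values above determine how much performance data is kept: a 300-second interval that is retained for 168 hours (seven days). The following sketch computes the number of retained samples. It assumes that exactly one sample is stored per interval, which the excerpt does not state.

```python
def retained_samples(interval_seconds=300, save_period_hours=168):
    """Number of samples kept, assuming one sample per collection interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (save_period_hours * 3600) // interval_seconds
```

With the values that are shown, the result is 2016 samples, or 288 samples per day.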

Page 254: ...files are listed as shown in Table 3 9 on page 232 3 5 8 Reliable Scalable Cluster Technology For the AIX and Linux operating systems the only subsystem startup configuration that must be checked is...

Page 255: ...name is missing in etc hosts Three LPARs are inactive the operating system is not running One LPAR is running but is not pinged Example 3 68 Three nodestat output problems nodestat lpar p c250f10c12ap...

Page 257: ...The problems might be discovered as a result of running a command or performing a monitoring procedure as described in Chapter 2 Application integration on page 99 Potential problems are also provide...

Page 258: ...e specified in the input files in the opt xcat share xcat tools tracelevel files For more information see the xcatdebug man pages The Perl Debug Trace rpm is available in the latest xcat packages More...

Page 259: ...eps 2 Verify whether xdsh node date runs without prompting for password 3 Run the updatenode k command to update the keys or certificates on the nodes as shown in Example 4 3 Example 4 3 To update the...

Page 260: ...m 10 0 0 138 c 5 aixremoteshell servicenode 2 1 c250f06c10ap01 Redeliver certificates has completed Running internal xCAT command makeknownhosts Running command on c250mgrs38 pvt cat ssh known_hosts s...

Page 261: ...ec02 40 6 2 1 sp secondary ipadd 40 6 2 1 alt_ipadd unavailable state LINE UP f06cec02 40 6 2 2 sp primary ipadd 40 6 2 2 alt_ipadd unavailable state LINE UP f06cec11 40 6 11 2 sp primary ipadd 40 6 1...

Page 262: ...rver is not configured Complete the following steps to monitor the network installation of a node 1 Configure conserver by using the following command makeconservercf 2 Open a remote console to the no...

Page 263: ...tus of hdwr_svr and cnm ps ef grep isnm root 6750532 1 0 Oct 12 34 54 opt isnm hdwr_svr bin hdwr_svr Dhcdaemon Dfsp run_hmc_only root 7668424 1 0 Oct 14 93 05 opt isnm cnm bin cnmd To recycle the hdwr...

Page 264: ...starting now 13 19 24 485371 cnm_glob cpp 261 Unable to Connect to XCATDB 13 19 24 490633 cnm_glob cpp 277 Unable to get configuration parameters from isnm_config table 13 19 24 492224 cnm_glob cpp 3...

Page 265: ...1 Unable to Connect to XCATDB 13 19 24 490633 cnm_glob cpp 277 Unable to get configuration parameters from isnm_config table 13 19 24 492224 cnm_glob cpp 313 Unable to get cluster information from XCA...

Page 266: ...ion issue with slightly different messages posted in var opt isnm cnm log eventSummary log is shown in Example 4 12 Example 4 12 CNM to DB2 communication scenario case tail var opt isnm cnm log eventS...

Page 267: ...at are defined in var opt isnm hdwr_svr data HmcNetConfig One of the connections is for xCAT and includes a connection type tooltype of lpar The other connection is for CNM and includes an FNM connect...

Page 268: ...8E88196C506235B4DA840D532A81CEDB7868B5E3B5CBF59B9 419613766C28BB0ED8F262F827E95CB3BAECDD8C5700F75EA058C5DF158AE26419C6E5255 authtoklastupdtimestamp 1319637503483 mtms 78AC 100 992003H slot B ignorepwd...

Page 269: ...25 F2C 02C6946 FR006 CG06 SN051 DR3 FSP Backup ip 40 6 9 1 MTMS 9125 F2C 02C6A26 FSP Primary ip 40 6 12 2 MTMS 9125 F2C 02C6A86 FR006 CG14 SN005 DR3 FSP Primary ip 40 6 6 2 MTMS 9125 F2C 02C69B6 FR006...

Page 270: ...006 CG07 SN004 DR0 HB0 3 FR006 CG07 SN004 DR0 HB1 3 3 FR006 CG07 SN004 DR0 HB1 2 FR006 CG07 SN004 DR0 HB6 4 3 FR006 CG07 SN004 DR0 HB1 7 FR006 CG07 SN004 DR0 HB7 7 2 FR006 CG07 SN004 DR0 HB0 42 FR006...

Page 271: ...8D Supernode 5 Drawer 2 Frame 6 Cage 12 Topology 8D Supernode 5 Drawer 1 Frame 6 Cage 10 Topology 8D Supernode 4 Drawer 3 Frame 6 Cage 11 Topology 128D Supernode 5 Drawer 0 RUNTIME EXCLUDED Frame 6 C...

Page 272: ...ect LNMC topology is also identified via the var opt isnm cnm log eventSummary log as shown in Example 4 23 on page 258 Example 4 23 Identified incorrect LNMC topology tail var opt isnm cnm log eventS...

Page 273: ...in use by CNM 8D lsnwtopo f 6 c 11 Frame 6 Cage 11 Topology 8D Supernode 5 Drawer 0 4 3 HFI This section provides an overview of some of the commands tools and diagnostic tests that are used for ident...

Page 274: ...displays the faulty HFIs and lsnwdown a displays the network hardware which is in any state other than UP_OPERATIONAL For more information about the lsnwdownhw command see the man pages The lsnwlinki...
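
Because the interesting hardware is whatever is in a state other than UP_OPERATIONAL, saved link listings can be filtered in the same way. The following sketch keeps only the links that need attention. The line layout ("Link", a hyphen-joined location, then the status) is an assumption that is based on the DOWN_FAULTY lines shown earlier, and links_needing_attention is a hypothetical helper.

```python
import re

# Assumed layout: "Link <location> <STATUS> ...".
LINK_RE = re.compile(r"^\s*Link\s+(?P<loc>\S+)\s+(?P<status>[A-Z][A-Z_]+)\b")


def links_needing_attention(lines):
    """Return (location, status) for links that are not UP_OPERATIONAL."""
    result = []
    for line in lines:
        match = LINK_RE.match(line)
        if match and match["status"] != "UP_OPERATIONAL":
            result.append((match["loc"], match["status"]))
    return result
```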

Page 275: ...device and reinstall the cable at FR006 CG04 SN051 DR1 HB1 D11 9 Because the wrap test passed on both ports the problem is most likely a missing poorly seated or defective cable 10 Rerun lsnwlinkinfo...

Page 276: ...moved to FR004 CG07 SN010 DR0 HB4 D4 After the mis cables are corrected rerun the lsnwmiswire command to check that the links are no longer reported as DOWN_MISWIRED 4 3 3 SMS ping test fails over HF...

Page 277: ...d service node Are any file systems full on the EMS or service node Do rpower and lshwconn return the expected values Are the directories in etc exports mountable 4 3 5 Other HFI issues In rare instan...

Page 279: ...ceability This chapter describes topics that are related to IBM Power Systems 775 maintenance and serviceability This chapter includes the following topics Managing service updates Power 775 xCAT star...

Page 280: ...vice packs at this website http www ibm com developerworks wikis display hpccentral IBM High Performance Computing Clusters Service Packs IBMHighPerformanceComputingClustersServicePacks IBMHPCCluster...

Page 281: ...command to determine the current level and the rflash command to update the code level then validate that the update is successful Complete the following command sequence to apply the firmware update...

Page 282: ...less If you want a stateful node you must use an NIM root resource If you want a stateless node you must use an NIM shared_root resource Stateful node A stateful node is a node that maintains its stat...

Page 283: ...ftware and the names of the software to install The command uses the location of the lpp_resource to determine where to find any defined rpm installp or emgr packages The default value for installp_fl...

Page 284: ...ection describes how AIX diskless nodes are updated by using xCAT and AIX NIM commands The section describes the process to switch the node to a different image or update the current image The followi...

Page 285: ...mand uses only these values The xCAT osimage definition is not used or updated when you install software into a SPOT by using the following command mknimimage u my61dskls installp_bundle mybndlres1 my...

Page 286: ...CAT software to communicate directly to the service processor of the Power system without the need for the HMC management Frame node This node features the hwtype set to frame which represents a high...

Page 287: ...wer 775 cluster interrelationships and dependencies in the hardware and software architecture require that the startup is performed in a specific order In this publication we explain these relationshi...

Page 288: ...is a gray area in this process as some network switch hardware is part of the HPC cluster and others might be outside the cluster For this discussion we make the assumption that all network switches t...

Page 289: ...ack is verified, the cluster startup is complete. Other node types that each customer defines to meet specific needs might also be required. Some examples are nodes responsible for areas such as login...

Page 290: ...he primary EMS and the HMCs The backup EMS is started after the cold start is complete and the cluster is operational The backup EMS is not needed for the cluster start and spending time to start it t...

Page 291: ...trator also verifies that the daemons are running by using the Linux service command with the status attribute, as shown in Example 5-5. Example 5-5 Verifying the daemons are running lsxcatd -a ve...

Page 292: ...tandby. It takes approximately 5 minutes for the frame BPAs to reach the standby state. The administrator runs rpower frame state to make sure that the frame status is set as Both BPAs at stan...

Page 293: ...d the CEC FSPs are booted, we bring the CECs to the on standby state. This state powers on the CECs but does not autostart the power to the LPARs. This configuration is required because we need to coordinate...

Page 294: ...or executes the rpower command by using the service node group to power on all of the xCAT service nodes as shown in Example 5 15 Example 5 15 Powering on all of the xCAT service nodes rpower service...

Page 295: ...status of the service nodes by using the following rpower command rpower storage state verify that the storage node state is successful The administrator needs to reference the GPFS documentation to...

Page 296: ...e sections describe the general hardware roles and interdependencies that affect the start and shut down of the cluster The examples in this publication are for an AIX environment All commands assume...

Page 297: ...utdown is a scheduled shutdown with sufficient time for the jobs to complete, then draining the jobs is the best practice. A shutdown that does not allow for all of the jobs to complete requires that th...

Page 298: ...e jobs in a service node xdsh f01sv01 -v llstatus Stopping LoadLeveler Shutting down LoadLeveler early in the process reduces any chances of job submission and eliminates any LoadLeveler dependencies o...

Page 299: ...ting the connection state for the BPAs and FSPs and listing the CEC link status Verify that CNM successfully contacted all BPAs and FSPs by issuing the command that is shown in Example 5 25 Example 5...

Page 300: ...down, as shown in Example 5-32. Example 5-32 Shutting down the storage nodes xdsh storage -v shutdown -h now The command that is shown in Example 5-33 verifies that the storage nodes are shut down. Exampl...

Page 301: ...hey are ready for power off. Manually turn off the red switch for each frame. Shutting down EMS and HMCs After all of the nodes and CECs are down and the frames are in rack standby, the EMS and HMCs are s...

Page 302: ...off the external disks that are attached to the EMS After the primary and backup EMS are shut down, turn off the external disk drives. Optional Turning off breakers or disconnecting power Now that a...

Page 303: ...php title XCAT_Power_775_Hardware_Management Define_the_LPAR_Nodes_and_Create_the_Service 2FUtility_LPARs Hardware nodes In hardware roles, the following types of nodes are included in the Power 775...

Page 304: ...54e64a mgt bpa mtm 78AC 100 nodetype ppc parent frame14 postbootscripts otherpkgs postscripts syslog aixremoteshell syncfiles serial 992003L side A 0 Example 5 41 includes the following attribute mean...

Page 305: ...ks wikis display hpccentral IBM HPC Clustering w ith Power 775 Recommended Installation Sequence Version 1 0 IBMHPCClusteringw ithPower775RecommendedInstallationSequence Version1 0 ISNMInstallation FS...

Page 306: ...SP which is retrieved from lsslp hidden Set to 1 means that xCAT hides the node by default in nodels and lsdef output FSP nodes are hidden because you use only the CEC nodes for management To see the...

Page 307: ...information of an I O node Example 5 47 The topology info of an IO node topsummary gpfslog pdisktopology out P7IH DE enclosures found 000DE22 Enclosure 000DE22 Enclosure 000DE22 STOR P1 C4 P1 C5 sees...

Page 308: ...nclosure 000DE22 STOR P1 C84 P1 C85 sees both portcards P1 C84 P1 C85 Portcard P1 C84 ses18 0154 mpt2sas0 24 diskset 33798 ses19 0154 mpt2sas0 24 diskset 40494 Portcard P1 C85 ses26 0154 mpt2sas2 mpt2...

Page 309: ...od es_on_System_P There are several ways to create xCAT cluster node definitions Use the following method that best fits your situation Using the mkdef command You use the xCAT mkdef command to create...

Page 310: ...grouptype static members node01 node02 node03 You then manage the nodes by groups with xCAT commands 5 3 3 Removing nodes from a cluster This section describes the process of removing the nodes from...

Page 311: ...that continues to run the system operation. Although the A+ event does not require a service action, such as hardware replacement, some administrative actions must be performed. The failed hardware remain...

Page 312: ...e degrades over time with A+ resources failing. For more information, see Figure 5-2 on page 307. 5.4.3 A+ resources in a Power 775 Cluster The following resources in the CEC Cluster are used for A+ Syst...

Page 313: ...es For more information, see this website: https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Cluster_Recovery An A+ resource is defined when the following conditions occur: The FRU location includ...

Page 314: ...er user to gather problem data or recover from failures Definition Description A Fail in Place Component All A features including Octants and fiber optic interfaces A Fail in Place Event A failure eve...

Page 315: ...1 10 39 Start gatherfip 10 07 2011 10 39 gatherfip Version 1 0 10 07 2011 10 39 xCAT Executive Management Server hostname is c250mgrs40 itso 10 07 2011 10 39 Writing hardware service alerts in TEAL to...

Page 316: ...ween two nodes rinv rinv FRAMENAME firm rinv CECNAME firm Collects firmware information rpower rpower FRAMENAME stat Shows system status lsdef lsdef FRAMENAME i hcp id mtm serial lsdef CECNAME i hcp i...

Page 317: ...0220 0 0 29 569 U78A9 001 1122233 P1 C1 0x21010239 0 0 29 568 U78A9 001 1122233 P1 C2 0x21010238 0 0 29 561 U78A9 001 1122233 P1 C3 0x21010231 0 0 29 560 U78A9 001 1122233 P1 C4 0x21010230 0 0 cec12 P...

Page 318: ...ation of the failure Record the location code for example U787A 001 1234567 On the EMS use xCAT to cross reference the unit location to frame and drawer Use xCAT procedure to cross reference the unit...

Page 319: ...33 36 6 U P1 R 37 40 7 U P1 R1 0 U P1 R2 1 U P1 R3 2 U P1 R4 3 U P1 R5 4 U P1 R6 5 U P1 R7 6 U P1 R8 7 Failed resource Compute node Non compute node QCM Availability Plus Recovering a Compute node Ava...

Page 320: ...the first approach to move resources within the CEC drawer. If a failure occurs in a non-compute QCM that is used for GPFS, the functions and possibly the associated PCI slots must be moved to a fully functio...

Page 321: ...ility 307 Figure 5-2 Original QCM layout Figure 5-3 on page 308 shows that a non-compute QCM0 used for GPFS fails. The QCM0 functionality is moved to QCM1. QCM0 is redefined as a compute QCM and resides...

Page 322: ...nd Linux HPC Solution Figure 5-3 QCM0 to QCM1 move If a failure occurs in the GPFS QCM1, QCM1 is moved to QCM2, and QCM1 is defined as a compute QCM again and resides in the defective A+ resource region...

Page 323: ...a CEC move is arranged The alternate CEC locations are defined by IBM A complete CEC is moved to a different CEC which includes the PCI cards that are used by the other CEC This move also requires th...

Page 324: ...torage node features SAS adapters that are assigned to it If the GPFS storage node is still operational ensure that there is an operational backup node or nodes for the set of disks that it owns befor...

Page 325: ...more information about performing these tasks see the Power Systems High performance clustering using the 9125 F2C Service Guide at this website https www ibm com developerworks wikis download attachm...

Page 327: ...n the Service Focal Point in the Hardware Management Console HMC When such error data is provided the data is uploaded automatically from the HMC to the IBM system or the data is manually collected In...

Page 328: ...the call and works with the data A serviceable event on the HMC shows details about the involved hardware Field Replaceable Units FRUs time date error signature and so on In the manage serviceable eve...

Page 329: ...plicable Return Code 0x0000B2DF Action Flags Report to Operating System Service Action Required HMC Call Home Primary System Reference Code Section Version 1 Sub section type 1 Created by hfug SRC For...

Page 330: ...al Number YH10HA112345 Normal Hardware FRU Priority Medium Priority Location Code U78A9 001 9912345 P1 R11 Part Number 41U9500 CCIN 2E00 Serial Number YH10HA112345 Normal Hardware FRU Priority Medium...

Page 331: ...14 42 36 SMGR Next State SMGR_HOST_STARTED SMGR Next State Timestamp 10 11 2011 14 42 36 IPLP Diagnostic Level IPLP_DIAGNOSTICS_NORMAL User Defined Data Section Version 0 Sub section type 0 Created by...

Page 332: ...108784C4 00000210 00000000 00000000 00000000 00000000 00000220 00000000 01000000 10010101 01010101 00000230 01010101 01010101 01010101 01010101 00000240 01010101 01010101 01000000 00000000 00000250 0...

Page 333: ...ystem Management Interface ASMI of that specific node it is possible to view the deconfigured resources The xCAT command rinv NODENAME deconfig also is available to view the deconfigured resources Fig...

Page 334: ...h for it in the Information Center The procedure guides you to Power Systems Information POWER7 Systems 9125 F2C Troubleshooting service and support Isolation Procedures Service Processor Isolation Pr...

Page 335: ...resource from the current compute or non-compute node configuration. The resource is moved into the failed resources group. For more information about the A+ procedures, see 5.4 Power 775 Availability Plu...

Page 336: ...tems 775 for AIX and Linux HPC Solution The complete service path is in the Cluster Service Guide at this website https www ibm com developerworks wikis download attachments 162267485 p775_se rvice_gu...

Page 337: ...M Corp 2012 All rights reserved 323 Appendix B Command outputs This appendix provides long command output examples The following topics are described in this appendix General Parallel File System nati...

Page 338: ...C76 P1 C77 P1 C101 D3 2SS6 5000c5001ce45dc5 5000c5001ce45dc6 hdisk667 hdisk763 0 dev rhdisk667 dev rhdisk763 140A0C0D4EA0773E c056d2 000DE22B OT 140A0C0D4EA07897 DA2 naa 5000C5001CE339BF U78AD 001 000...

Page 339: ...9 04 00 00 IBM 7 8AD 001 0154 YH10UE13G022 74Y02380 7890011122233191 000DE22 P1 C29 DB2 An output from the command db2 get database configuration for DB2_instance is shown in Example B 2 Example B 2 C...

Page 340: ...e size 4KB CATALOGCACHE_SZ 50 Log buffer size 4KB LOGBUFSZ 98 Utilities heap size 4KB UTIL_HEAP_SZ 50193 Buffer pool size pages BUFFPAGE 1000 SQL statement heap 4KB STMTHEAP AUTOMATIC 8192 Default app...

Page 341: ...n mode HADR_SYNCMODE NEARSYNC HADR peer window duration seconds HADR_PEER_WINDOW 0 First log archive method LOGARCHMETH1 OFF Options for logarchmeth1 LOGARCHOPT1 Second log archive method LOGARCHMETH2...

Page 342: ...rk events MON_UOW_DATA NONE Lock timeout events MON_LOCKTIMEOUT NONE Deadlock events MON_DEADLOCK WITHOUT_HIST Lock wait events MON_LOCKWAIT NONE Lock wait event threshold MON_LW_THRESH 5000000 Number...

Page 343 to Page 348: [Figure residue: miswire example diagrams. The recoverable labels pair D-link ports D0 through D15 with SN15 through SN0.]

Page 349: ...for Linux on POWER V5.1 Guide and Reference, SA23-2268-01 ESSL for AIX V5.1 Installation Guide, GA32-0767-00 ESSL for Linux on POWER V5.1 Installation Guide, GA32-0768-00 Parallel ESSL for AIX V4.1 Instal...

Page 350: ...Power Systems 775 for AIX and Linux HPC Solution Parallel Tools Platform http www eclipse org ptp Help from IBM IBM Support and downloads http www ibm com support IBM Global Services http www ibm com...

Page 351: ...s 163 208 hmcshutdown t now 287 hpccount 142 ifhfi_dump 163 installp 269 inutoc 271 iostat 194 196 isql v 162 istat 194 llcancel 153 284 llclass 160 llconfig 232 234 llctl 232 llctl ckconfig 232 llfs...

Page 352: ...267 rflash frame p activate disruptive 267 rinv 267 302 rinv frame firm 267 rinv NODENAME deconfig 319 rmdef t node noderange 297 rmdsklsnode V f 296 rmhwconn cec 228 rmhwconn frame 228 rnetboot 248 r...

Page 353: ...tetree 101 Local links LL links 2 Local Network Management Controller LNMC 64 Local Network Manager LNMC 67 Local remote links LR links 2 Log var adm ras errlog 102 var adm ras gpfslog 102 var log mes...

Page 354: ...340 IBM Power Systems 775 for AIX and Linux HPC Solution V Vector Multimedia eXtension VMX 106 Vector Scalar eXtension VSX 106 X xCAT DFM 272...

Page 358: ...r AIX and Linux HPC Solution Unleashes computing power for HPC workloads Provides architectural solution overview Contains sample scenarios This IBM Redbooks publication contains information about the...
