
Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs)

Figure 7. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs)
38  Power Systems: Problem analysis, system parts, and locations for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX

Contents of 8335-GTG

Page 1: ...Power Systems: Problem analysis, system parts, and locations for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX IBM ...

Page 2: ...l, G229-9054, and the IBM Environmental Notices and User Guide, Z125-5823. This edition applies to IBM Power Systems servers that contain the POWER9 processor and to all associated models. Copyright International Business Machines Corporation 2017, 2019. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. ...

Page 3: ...ocation of the GPU, 16; Identifying the location of the NVMe Flash adapter, 16; Identifying the location of the storage device, 17; User guides for GPUs and PCIe adapters, 17; Resolving an over-temperature problem for a water-cooled 8335-GTW or 8335-GTX system, 18; Determining and setting the thermal mode for an 8335-GTG, 8335-GTH, or 8335-GTX system, 20; Identifying a service action by using system event logs ...


Page 5: ...800-300-8751. German safety information: Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der Bildschirmarbeitsverordnung geeignet. Laser safety information: IBM servers can use I/O cards or features that are fiber optic based and that utilize lasers or LEDs. Laser compliance: IBM servers may be installed inside or outside of an IT equipment rack. DANGER: When working on or aro...

Page 6: ... unless instructed otherwise. 2. For AC power, remove the power cords from the outlets. 3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the PDP and remove the power from the Customer's DC power source. 4. Remove the signal cables from the connectors. 5. Remove all cables from the devices. To Connect: 1. Turn off everything unless instructed otherwise. 2. Attach all c...

Page 7: ... devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock. (R001 part 1 of 2) (R001 part 2 of 2) CAUTION: Do not install a unit in a rack where the internal rack ambient temperatures will exceed the manufacturer's recommended ambient temperature for all your rack-mounted devices. Do not install a un...

Page 8: ...hoose can support the weight of the loaded rack cabinet. Refer to the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet. Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.). Ensure that all devices, shelves, drawers, doors, and cables are secure. Ensure that the four leveling pads are raised to their highest position. Ensure that there is no stabilizer b...

Page 9: ...y position (for example, when working from a ladder). Stability hazard: The rack may tip over, causing serious personal injury. Before extending the rack to the installation position, read the installation instructions. Do not put any load on the slide-rail mounted equipment mounted in the installation position. Do not leave the slide-rail mounted equipment in the installation position. (L002) (L003) or or or Sa...

Page 10: ... power cords or multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and power cables. (L003) (L007) CAUTION: A hot surface nearby. (L007) ...

Page 11: ...king into the other end of a disconnected optical fiber to verify the continuity of optic fibers may not injure the eye, this procedure is potentially dangerous. Therefore, verifying the continuity of optical fibers by shining light into one end and looking at the other end is not recommended. To verify continuity of a fiber optic cable, use an optical light source and power meter. (C027) CAUTION: This pro...

Page 12: ...iable force, so take care not to push or lean. Keep riser tilt (adjustable angling platform) option flat at all times except for final minor angle adjustment when needed. Do not stand under overhanging load. Do not use on uneven surface, incline, or decline (major ramps). Do not stack loads. Do not operate while under the influence of drugs or alcohol. Do not support ladder against LIFT TOOL unless the specifi...

Page 13: ...uitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The...


Page 15: ...m occurred. Go to Resolving a BMC access problem on page 2. The system does not power on (the power button or the BMC power-on command does not power on the system). Go to Resolving a power problem on page 3. A system firmware boot failure occurred (the system started, but was not able to boot to the Petitboot menu). Go to Resolving a system firmware boot failure on page 4. A video graphics array (VGA) monitor...

Page 16: ... This ends the procedure. Resolving a BMC access problem: Learn how to identify the service action that is needed to resolve a baseboard management controller (BMC) access problem. Procedure: 1. Ensure that the BMC password is not set to the default password. For information about changing the default password, see Logging on to the OpenBMC GUI. Does the problem persist? Yes: Continue with the next st...

Page 17: ...the procedure. No: Continue with the next step. 6. Complete the service action that is indicated for your system. If your system is an 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX, replace the BMC card. Go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on page 25 to identify the physical location and the removal and replacement procedure. This ends the procedure. Resolving a power problem ...

Page 18: ...ed: a. Resolve any serviceable alerts that are in the event log. Go to Resolving a hardware problem. b. Ensure that the power supply is fully seated in the system. c. Ensure that the power supply fan is not blocked. d. Replace the power supply. If your system is an 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX, go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on page 25 to identify the physi...

Page 19: ...n and the removal and replacement procedure. This ends the procedure. 6. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC power cords and power on the system. Does the system boot successfully? Yes: This ends the procedure. No: Go to Resolving a hardware problem on page 7. This ends the procedure. Resolving a VGA monitor problem: Learn how to identify the se...

Page 20: ...on page 6. 3. Complete the following actions, one at a time, until the problem is resolved: a. Ensure that a problem does not exist with the connection to the network location. b. Ensure that the adapter has a valid IP address for the network. c. Replace the network adapter. If your system is an 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX, go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on...

Page 21: ...cting diagnostic data on page 23. Then, go to Contacting IBM service and support on page 24. This ends the procedure. 7. Did the service action fix the problem? Yes: This ends the procedure. No: Go to Collecting diagnostic data on page 23. Then, go to Contacting IBM service and support on page 24. This ends the procedure. Resolving a GPU, PCIe adapter, or device problem: Learn how to access log files, info...

Page 22: ...ce action. Type of problem; Service procedure. eth1, eth2, eth3, enPxxxxx (where xxxxx indicates the network port), Failed to re-initialize device: Network; go to Resolving a network adapter problem on page 9. mlx5_core, Link Down, health_care, handling bad device here: Network; go to Resolving a network adapter problem on page 9. tg3, PCI I/O error detected, Link is Down: Network; go to Resolving a network adapter pro...
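The table fragment above maps system log entries to a service procedure. As a minimal sketch (not part of the IBM procedure), the visible rows could be encoded as a lookup on the driver name. The function name `service_procedure_for` is hypothetical, only the driver names `mlx5_core` and `tg3` that appear in this excerpt are matched, and rows that are cut off in the excerpt are not guessed at.

```python
from typing import Optional

# Driver names that the visible table rows associate with a network problem.
# Other rows of the table are truncated in this excerpt and are not modeled.
NETWORK_DRIVERS = ("mlx5_core", "tg3")


def service_procedure_for(log_line: str) -> Optional[str]:
    """Return the service procedure named in the table, or None if no row matches."""
    lowered = log_line.lower()
    if any(driver in lowered for driver in NETWORK_DRIVERS):
        return "Resolving a network adapter problem"
    return None


if __name__ == "__main__":
    # Illustrative log line; real messages carry timestamps and device addresses.
    print(service_procedure_for("mlx5_core: Link Down"))
```

A lookup keyed on the driver name keeps the sketch independent of the exact message wording, which is not fully recoverable from this excerpt.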

Page 23: ... system log entry. Resolving a network adapter problem: Learn about the possible problems and service actions that you can perform to resolve a network adapter problem. About this task. Note: To determine the location of the PCIe adapter, see Identifying the location of the PCIe adapter by using the slot number on page 15. Table 2. Network adapter problems and service actions. Problem; Service action. System...

Page 24: ...indicator light on the adapter is off: 1. Verify that the cable functions properly by testing it with a known working connection. 2. Verify that the port or ports on the switch are enabled and functional. 3. Verify that the switch and adapter are compatible. 4. Replace the adapter. Link light on the adapter is on, but there is no communication from the adapter: 1. Verify that the most recent driver is install...

Page 25: ...s installed on the system. Otherwise, install the most recent firmware if it is not already installed. 4. Restart the system. 5. If the GPU is still missing, replace the following items, one at a time, until the problem is resolved. Note: Go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on page 25 to identify the physical location and the removal and replacement procedure. a. GPU b. System proces...

Page 26: ...TX locations on page 25 to identify the physical location and the removal and replacement procedure. a. CPU 0 b. GPU 2 c. GPU 1 d. GPU 0 e. System backplane. This ends the procedure. 4. Does NPU chip 1 appear in the fence error log entry? Yes: Continue with the next step. No: Go to Contacting IBM service and support on page 24. This ends the procedure. 5. Replace the following items, one at a time, until the proble...

Page 27: ... GPUs and PCIe adapters on page 17. Resolving a storage device problem: Learn about the possible problems and service actions that you can perform to resolve a storage device problem. About this task. Note: To determine the location of the storage device, see Identifying the location of the storage device on page 17. Table 4. Storage device problems and service actions. Problem; Service action. System unable...

Page 28: ...h adapter: 1. If the NVMe Flash adapter has an amber LED that is flashing or is on solid, replace the adapter. Go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on page 25 to identify the physical location and removal and replacement procedure. Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you...

Page 29: ...meX is the resource name of the NVMe Flash adapter. Then, test the NVMe Flash adapter again. 2. Ensure that the latest I/O adapter firmware is installed. For instructions, see Getting firmware fixes for IBM I/O adapters by using Fix Central. 3. Ensure that you have the latest device driver service updates by installing the latest Linux distribution fixes. 4. Type the following command and press Enter: nvme s...

Page 30: ...e error message. b. Log in to the operating system with root authority. c. Type the following command and press Enter: lshw -class display d. Determine the GPU slot that is associated with the PCI bus information that you recorded in step a. e. Replace the GPU. Go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on page 25 and use the slot number information to identify the physical location and...

Page 31: ...r solid-state drive. If your system is an 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX, go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations on page 25 to identify the removal and replacement procedure. This ends the procedure. 3. The storage device location is determined in the drive removal and replacement procedures for your system. Use the following table to find the correct removal an...

Page 32: ...o: Continue with the next step. 4. Ensure that the following requirements are met: a. The quick connects between the 8335-GTW or 8335-GTX system and the water manifold are mated and connected to the proper circuits of the manifold. The supply hose must be connected to the supply manifold circuit, which is the manifold circuit that is located toward the inside of the rack. The return hose must be connected...

Page 33: ...Replacing a system processor module in an 8335-GTW or 8335-GTX system and complete the steps for removing and installing a new TIM pad. This ends the procedure. No: Continue with the next step. 7. Is a GPU overheating, but the other GPUs and the processors are not overheating? Yes: Replace the thermal interface material (TIM) between the cold plate and the GPU that is overheating. Go to Removing t...

Page 34: ...stem, the thermal mode setting is lost and must be reapplied. Table 8. Thermal mode setting for the 8335-GTG or 8335-GTH system. Adapter feature code; Adapter description; Cable type; Thermal mode. EC62: PCIe4 x16 1-Port EDR 100 Gb InfiniBand ConnectX-5 CAPI-capable adapter; Copper: DEFAULT; Optical: CUSTOM. EC64: PCIe4 x16 2-Port EDR 100 Gb InfiniBand ConnectX-5 CAPI-capable adapter; Copper: DEFAULT; Optical: CUSTO...

Page 35: ... thermal mode to HEAVY_IO, type the following command and press Enter: openbmctool -U <username> -P <password> -H <BMC IP address or BMC host name> thermal modes set -m HEAVY_IO -z 0 To set the thermal mode to MAX_BASE_FAN_FLOOR, type the following command and press Enter: openbmctool -U <username> -P <password> -H <BMC IP address or BMC host name> thermal modes set -m MAX_BASE_FAN_FLOOR -z 0 Identifying a service action b...
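The two commands above differ only in the mode name, so a small helper can assemble the argument list for either. This is a minimal sketch, not IBM tooling: the function name `build_thermal_mode_command` is hypothetical, the host and credentials are placeholders, and the dash-prefixed flags (`-U`, `-P`, `-H`, `-m`, `-z`) assume the usual openbmctool spelling, because the page extract lost its punctuation.

```python
from typing import List

# Mode names that appear in this excerpt; the full manual may list others.
KNOWN_MODES = {"DEFAULT", "CUSTOM", "HEAVY_IO", "MAX_BASE_FAN_FLOOR"}


def build_thermal_mode_command(username: str, password: str, bmc_host: str,
                               mode: str, zone: int = 0) -> List[str]:
    """Return an argv list that sets the thermal mode (hypothetical helper)."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown thermal mode: {mode!r}")
    return [
        "openbmctool", "-U", username, "-P", password, "-H", bmc_host,
        "thermal", "modes", "set", "-m", mode, "-z", str(zone),
    ]


if __name__ == "__main__":
    # Prints the command instead of running it; all values are placeholders.
    argv = build_thermal_mode_command("user", "secret", "bmc.example.com", "HEAVY_IO")
    print(" ".join(argv))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting. Because the text notes that the thermal mode setting is lost in some situations and must be reapplied, a helper like this is convenient to rerun.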

Page 36: ...rifying a repair: Learn how to verify hardware operation after you make repairs to the system. Procedure: 1. Power on the system. 2. Did you replace a graphics processing unit (GPU), PCIe adapter, disk drive, or solid-state drive? Yes: Go to step 5 on page 22. No: Continue with the next step. 3. Scan the system event logs (SELs) for serviceable events that occurred after system hardware was replaced. For info...

Page 37: ...mi -L at the command prompt of the operating system and press Enter. Verify that the GPU is listed. b. Type nvidia-smi -q at the command prompt of the operating system and press Enter. Verify that no errors are listed. Network adapter: Complete the following steps: a. At the command prompt of the operating system, type ethtool ethx, where x is the number of the physical port that you are testing. Verify that t...
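The "verify that the GPU is listed" check can be automated once the listing output has been captured. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the line shape `GPU <index>: <name> (UUID: ...)` is an assumption about the listing output rather than something this excerpt shows.

```python
import re
from typing import List

# Assumed shape of one line of GPU listing output: "GPU 0: <name> (UUID: <uuid>)".
GPU_LINE = re.compile(r"^GPU (\d+):")


def listed_gpu_indexes(listing_output: str) -> List[int]:
    """Collect the GPU index numbers found in captured listing output."""
    indexes = []
    for line in listing_output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            indexes.append(int(match.group(1)))
    return indexes


def gpu_is_listed(listing_output: str, gpu_index: int) -> bool:
    """Return True when the replaced GPU's index appears in the listing."""
    return gpu_index in listed_gpu_indexes(listing_output)


if __name__ == "__main__":
    # Illustrative text only; not captured from a real system.
    sample = "GPU 0: Example GPU (UUID: GPU-0000)\nGPU 1: Example GPU (UUID: GPU-0001)"
    print(gpu_is_listed(sample, 1))
```

Parsing captured text keeps the sketch runnable on a machine without a GPU; on the serviced system the text would come from the command named in step a.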

Page 38: ...m, or if you are directed to contact support, go to Collecting diagnostic data on page 23. Then, use the information below to contact IBM service and support. Customers in the United States, United States territories, or Canada can place a hardware service request online. To place a hardware service request online, go to the IBM Support Portal (http://www.ibm.com/support/entry/portal/product/power/scale-out_l...

Page 39: ...ters, which begin with 1. Rack views: The following diagrams show field replaceable unit (FRU) layouts in the system. Use these diagrams with the following tables. Rear view. Figure 1. Front view. Table 12. Front view locations. Index number; FRU description; FRU removal and replacement procedures. 1: Fan 0. See Removing and replacing fans in the 8335-GTC, 8335-GTG, or 8335-GTH or Removing and replacing fans in the ...

Page 40: ...lacing the power switch and cable in the 8335-GTW; Removing and replacing the power switch and cable in the 8335-GTX. 8: USB cable and connector. Note: 8335-GTC and 8335-GTW systems do not support this location. See Removing and replacing the USB cable and connector in the 8335-GTG or 8335-GTH or Removing and replacing the USB cable and connector in the 8335-GTX. Figure 2. Top view ...

Page 41: ...replacing the system backplane in the 8335-GTG or 8335-GTH; Removing and replacing the system backplane in the 8335-GTW; Removing and replacing the system backplane in the 8335-GTX. 12: CPU 0. See Removing and replacing a system processor module in the 8335-GTC, 8335-GTG, or 8335-GTH or Removing and replacing a system processor module in the 8335-GTW or 8335-GTX. 13: CPU 1. 14: GPU 0. See Removing and replaci...

Page 42: ...cedures. 22: PSU 0. See Removing and replacing a power supply in the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX. 23: PSU 1. 24: PCIe adapter 1. See Removing and replacing PCIe adapters in the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX. 25: PCIe adapter 2. 26: PCIe adapter 3. 27: PCIe adapter 4. 28: Baseboard management controller (BMC) card. See Removing and replacing the BMC card in the 8335-GTC, 8335-GTG ...

Page 43: ...Figure 3. Memory locations. The following table provides the memory locations. ...

Page 44: ...8335-GTG, or 8335-GTH or Removing and replacing memory in the 8335-GTW or 8335-GTX. 30: DIMM 1; 31: DIMM 2; 32: DIMM 3; 33: DIMM 4; 34: DIMM 5; 35: DIMM 6; 36: DIMM 7; 37: DIMM 8; 38: DIMM 9; 39: DIMM 10; 40: DIMM 11; 41: DIMM 12; 42: DIMM 13; 43: DIMM 14; 44: DIMM 15 ...

Page 45: ...attaching screws; 8335-GTC, 8335-GTG, 8335-GTH, or 8335-GTX. 1 | 01EM209 | 1 | Fixed rail kit (contains left and right fixed rails and attaching screws); 8335-GTW. 2 | 00E4260 | 1 | Slide rail kit (contains left and right slide rails and attaching screws); 8335-GTC, 8335-GTG, or 8335-GTH. 3 | 74Y9063 | 1 | Cable management arm assembly; 8335-GTC, 8335-GTG, or 8335-GTH. Note: This part can only be used with a slide rail kit. 4 | 45W8836 | 1...

Page 46: ...and attaching screws; 8335-GTW. 5 | 00E4260 | 1 | Slide rail kit (contains left and right slide rails and attaching screws); 8335-GTC or 8335-GTG. 6 | 00E7329 | 1 | Electronic Industries Association (EIA) bracket, right side. 7 | 02CL350 | 1 | Bezel. 8 | 00E7328 | 1 | EIA bracket, left side ...

Page 47: ...System parts. Figure 5. System parts ...

Page 48: ... power switch cable; 8335-GTC or 8335-GTW. 10 | | 1 | Screw. 11 | 00E4252 | 0-2 | Drive filler. 11 | 00LY460 | 0-2 | 960 GB solid-state drive. 11 | 00LY461 | 0-2 | 1.92 TB solid-state drive. 11 | 00LY462 | 0-2 | 3.84 TB solid-state drive. 12 | 01EM065 | 3-4 | Fan. Note: 8335-GTC, 8335-GTG, or 8335-GTH systems have four fans; 8335-GTW or 8335-GTX systems have three fans. 13 | 01NN923 | 1 | Disk drive and fan card. 14 | 78P4191 | 16 | 8 GB 2666 Mbps DDR4 RDIMM...

Page 49: ...Additional system parts (8335-GTC, 8335-GTG, or 8335-GTH air-cooled system). Figure 6. Additional system parts (8335-GTC, 8335-GTG, or 8335-GTH air-cooled system) ...

Page 50: ...em processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump; 8335-GTC or 8335-GTG. 02CL564 | 2 | DD2.2 16-core 2.7 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump); 8335-GTH. 02CM214 | 2 | DD2.3 16-core 2.7 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driv...

Page 51: ...erent speeds or differing numbers of cores. To determine the DDx.y level, type the following command and press Enter: openbmctool -U <username> -P <password> -H <BMC IP address or BMC host name> fru print The DDx.y level is the CPU version number in the format xy. For example, a CPU with version number 23 is DD2.3. ...
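The version-number rule above (a CPU version of 23 means DD2.3) is simple enough to encode directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes the two-digit version string has already been read from the FRU listing, because the layout of that listing is not shown in this excerpt. The second helper reflects the parts note that these systems do not support mixing processors with different DDx.y levels.

```python
from typing import Iterable


def dd_level(cpu_version: str) -> str:
    """Convert a two-digit CPU version number such as "23" into "DD2.3"."""
    version = cpu_version.strip()
    if len(version) != 2 or not (version.isascii() and version.isdigit()):
        raise ValueError(f"expected a two-digit version number, got {cpu_version!r}")
    return f"DD{version[0]}.{version[1]}"


def same_dd_level(cpu_versions: Iterable[str]) -> bool:
    """Return True when every processor reports the same DDx.y level."""
    return len({dd_level(version) for version in cpu_versions}) <= 1


if __name__ == "__main__":
    print(dd_level("23"))               # DD2.3, matching the example in the text
    print(same_dd_level(["22", "23"]))  # False: mixed levels
```

Checking both installed processors with `same_dd_level` before ordering a replacement module kit is one way to avoid a mismatched pair.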

Page 52: ...5-GTX water-cooled system with 4 GPUs. Figure 7. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs) ...

Page 53: ...ld plate assembly for systems with 4 GPUs (includes cold plates, tweezers, 4 mm hex driver, and installation tool). 28 | 01KL428 | 2 | Pipe holder 1; 8335-GTW or 8335-GTX. 29 | 01KL429 | 2 | Pipe holder 2; 8335-GTW or 8335-GTX. 30 | 01EM006 | 1 | Baseboard management controller (BMC) card air baffle; 8335-GTW or 8335-GTX. 31 | 02AU282 | 1 | BMC card; 8335-GTW. 02PX051 | 1 | BMC card; 8335-GTX. 32 | 01EM325 | 1 | System backplane kit for systems wit...

Page 54: ...processor tray, 4 mm hex driver, module replacement tool, and air pump; 8335-GTW. 02CL567 | 2 | DD2.2 22-core 2.8 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump); 8335-GTX. 02CM217 | 2 | DD2.3 22-core 2.8 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and a...

Page 55: ...Additional system parts (8335-GTW or 8335-GTX water-cooled system with 6 GPUs). Figure 8. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 6 GPUs) ...

Page 56: ...late assembly for systems with 6 GPUs (includes cold plates, tweezers, 4 mm hex driver, and installation tool); 8335-GTW or 8335-GTX. 38 | 01KL429 | 2 | Pipe bracket 2; 8335-GTW, 8335-GTX. 39 | 01EM006 | 1 | Baseboard management controller (BMC) card air baffle; 8335-GTW or 8335-GTX. 40 | 02AU282 | 1 | BMC card; 8335-GTW. 02PX051 | 1 | BMC card; 8335-GTX. 41 | 01EM304 | 1 | System backplane kit for systems with 6 GPUs (includes system backplan...

Page 57: ...includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump; 8335-GTX. 02CM217 | 2 | DD2.3 22-core 2.8 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump); 8335-GTX. 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX systems do not support mixing system processors with different DDx.y le...

Page 58: ...screw for the empty GPU slot. There are a total of three screws in this kit. 01EM312 | 1 | System screw kit (includes screws for the system backplane, BMC card, and disk drive and fan card); 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX. 01EM303 | 1 | External USB DVD drive; 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX. 01LU635 | 1 | GPU air baffle; 8335-GTW or 8335-GTX water-cooled system with 4 GPUs. 01EM00...

Page 59: ...uct(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk. IBM may use or distribute any o...

Page 60: ...tent successfully. Overview: The IBM Power Systems servers include the following major accessibility features: keyboard-only operation; operations that use a screen reader. The IBM Power Systems servers use the latest W3C Standard, WAI-ARIA 1.0 (www.w3.org/TR/wai-aria/), to ensure compliance with US Section 508 (www.access-board.gov/guidelines-and-standards/communications-and-it/about-the-section-508-standar...

Page 61: ...fiable information. If the configurations deployed for this Software Offering provide you as the customer the ability to collect personally identifiable information from end users via cookies and other technologies, you should seek your own legal advice about any laws applicable to such data collection, including any requirements for notice and consent. For more information about the use of various te...

Page 62: ... device complies with Part 15 of the FCC rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation. Industry Canada Compliance Statement: CAN ICES-3 (A)/NMB-3(A). European Community Compliance Statement: This product is in conformity with the ...

Page 63: ...nt explains the JEITA statement for products greater than 20 A, single phase. This statement explains the JEITA statement for products greater than 20 A per phase, three-phase. Electromagnetic Interference (EMI) Statement, People's Republic of China. Declaration: This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may need to perform practica...

Page 64: ...hinweis versehen werden: Warnung: Dieses ist eine Einrichtung der Klasse A. Diese Einrichtung kann im Wohnbereich Funk-Störungen verursachen; in diesem Fall kann vom Betreiber verlangt werden, angemessene Maßnahmen zu ergreifen und dafür aufzukommen. Deutschland: Einhaltung des Gesetzes über die elektromagnetische Verträglichkeit von Geräten. Dieses Produkt entspricht dem Gesetz über die elektromagnetisch...

Page 65: ...ipment into an outlet on a circuit different from that to which the receiver is connected. Consult an IBM-authorized dealer or service representative for help. Properly shielded and grounded cables and connectors must be used in order to meet FCC emission limits. Proper cables and connectors are available from IBM-authorized dealers. IBM is not responsible for any radio or television interference caus...

Page 66: ...ins the Japan Electronics and Information Technology Industries Association (JEITA) statement for products less than or equal to 20 A per phase. This statement explains the JEITA statement for products greater than 20 A, single phase. This statement explains the JEITA statement for products greater than 20 A per phase, three-phase. ...

Page 67: ...esrepublik Deutschland. Zulassungsbescheinigung laut dem Deutschen Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG) (bzw. der EMC Richtlinie 2014/30/EU) für Geräte der Klasse B. Dieses Gerät ist berechtigt, in Übereinstimmung mit dem Deutschen EMVG das EG-Konformitätszeichen CE zu führen. Verantwortlich für die Einhaltung der EMV-Vorschriften ist der Hersteller: International Business M...

Page 68: ...the right to withdraw the permissions granted herein whenever, in its discretion, the use of the publications is detrimental to its interest or, as determined by IBM, the above instructions are not being properly followed. You may not download, export or re-export this information except in full compliance with all applicable laws and regulations, including all United States export laws and regulations. I...


Page 70: ...IBM ...
