
Identifying the location of the GPU

The error message provides information to help you determine the location of the graphics processing unit (GPU).

Procedure

1. Does the operating system log contain the slot number? For example, the log might contain an error message similar to the following text:

EEH: PHB#0 failure detected, location: GPU1

If      Then
Yes:    Replace the GPU. Go to “8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations” on page 25 and use the slot number information to identify the physical location and the removal and replacement procedure. This ends the procedure.
No:     Continue with the next step.

2. Does the operating system log contain an error message with only PCI bus information (for example, 0002:01:00.0)?

If      Then
Yes:    Continue with the next step.
No:     Go to “Collecting diagnostic data” on page 23. Then, go to “Contacting IBM service and support” on page 24. This ends the procedure.

3. Use the lshw command to determine the GPU slot. Complete the following steps:

a) Record the PCI bus information that is in the error message.

b) Log in to the operating system with root authority.

c) Type the following command and press Enter:

lshw -class display

d) Determine the GPU slot that is associated with the PCI bus information that you recorded in step a.

e) Replace the GPU. Go to “8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations” on page 25 and use the slot number information to identify the physical location and the removal and replacement procedure. This ends the procedure.
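The decision flow in this procedure can be sketched in Python. This is a minimal illustration, not an IBM-supplied tool. The function names classify_log_line and parse_lshw_display are hypothetical. The sketch assumes that the log line has the form shown in step 1, and that the text output of lshw -class display lists each device in a block that starts with "*-" and contains a "bus info: pci@<address>" field and, where the platform reports it, a "slot:" field. The sample output is invented for illustration; check the field names against the output on your own system.

```python
import re

# Matches lines such as "EEH: PHB#0 failure detected, location: GPU1",
# with or without a leading kernel timestamp.
EEH_RE = re.compile(r"EEH: PHB#[0-9a-fA-F]+ failure detected, location: (\S+)")

# PCI address in domain:bus:device.function form, for example 0002:01:00.0.
PCI_RE = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def classify_log_line(line):
    """Return ("slot", label), ("pci", address), or ("unknown", None)."""
    match = EEH_RE.search(line)
    if match:
        return ("slot", match.group(1))   # step 1: slot is in the log
    match = PCI_RE.search(line)
    if match:
        return ("pci", match.group(0).lower())  # step 2: PCI bus info only
    return ("unknown", None)              # neither: collect diagnostic data


def parse_lshw_display(text):
    """Map each PCI address to its slot label (or None if no slot is listed).

    Assumption: device blocks start with "*-" and carry "bus info:" and
    "slot:" fields, as described in the lead-in paragraph.
    """
    slots = {}
    bus = slot = None
    for raw in text.splitlines() + ["*-end"]:  # sentinel flushes the last block
        line = raw.strip()
        if line.startswith("*-"):
            if bus is not None:
                slots[bus] = slot
            bus = slot = None
        elif line.startswith("bus info:"):
            value = line.split(":", 1)[1].strip()
            bus = value.split("@", 1)[-1].lower()
        elif line.startswith("slot:"):
            slot = line.split(":", 1)[1].strip()
    return slots


# Hypothetical sample of `lshw -class display` output, for illustration only.
SAMPLE_LSHW = """\
  *-display
       description: 3D controller
       bus info: pci@0002:01:00.0
       slot: GPU1
  *-display
       description: 3D controller
       bus info: pci@0004:04:00.0
"""
```

To use the sketch, pass the error message to classify_log_line. If the result is a PCI address, run lshw -class display with root authority, pass its output to parse_lshw_display, and look up the address in the returned dictionary. An address with no slot label, or an unrecognized message, corresponds to the "No" branch in step 2.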

Identifying the location of the NVMe Flash adapter

Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.

Procedure

1. Does the operating system log contain the slot number? For example, the log might contain an error message similar to the following text:

[131779.752714] EEH: PHB#0 failure detected, location: Slot1

If      Then
Yes:    Replace the adapter. Go to “8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations” on page 25 and use the slot number information to identify the physical location and the removal and replacement procedure. This ends the procedure.
No:     Continue with the next step.

2. Locate the NVMe Flash adapter by using the PCI address:

16  Power Systems: Problem analysis, system parts, and locations for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX
