declustered array (DA)

A disjoint subset of the pdisks in a recovery group.

dependent fileset

A fileset that shares the inode space of an existing independent fileset.

DFM

See direct FSP management (DFM).

DHCP

See Dynamic Host Configuration Protocol (DHCP).

direct FSP management (DFM)

The ability of the xCAT software to communicate directly with the Power Systems server's service processor without the use of the HMC for management.

drawer control module (DCM)

Essentially, a SAS expander on a storage enclosure drawer.

Dynamic Host Configuration Protocol (DHCP)

A standardized network protocol that is used on IP networks to dynamically distribute such network configuration parameters as IP addresses for interfaces and services.

E

Elastic Storage System (ESS)

A high-performance GPFS NSD solution made up of one or more building blocks. The ESS software runs on ESS nodes: management server nodes and I/O server nodes.

ESS Management Server (EMS)

An xCAT server that is required to discover the I/O server nodes (working with the HMC), provision the operating system (OS) on the I/O server nodes, and deploy the ESS software on the management node and I/O server nodes. One management server is required for each ESS system composed of one or more building blocks.

encryption key

A mathematical value that allows components to verify that they are in communication with the expected server. Encryption keys are based on a public/private key pair that is created during the installation process. See also file encryption key (FEK), master encryption key (MEK).

ESS

See Elastic Storage System (ESS).

environmental service module (ESM)

Essentially, a SAS expander that attaches to the storage enclosure drives. In the case of multiple drawers in a storage enclosure, the ESM attaches to drawer control modules.

ESM

See environmental service module (ESM).

Extreme Cluster/Cloud Administration Toolkit (xCAT)

Scalable, open-source cluster management software. The management infrastructure of ESS is deployed by xCAT.

F

failback

Cluster recovery from failover following repair. See also failover.

failover

(1) The assumption of file system duties by another node when a node fails. (2) The process of transferring all control of the ESS to a single cluster in the ESS when the other clusters in the ESS fail. (3) The routing of all transactions to a second controller when the first controller fails. See also cluster.
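As an illustration only, sense (3) of failover and the matching failback can be sketched in a few lines of Python. The names Controller and route are hypothetical and are not part of ESS, IBM Spectrum Scale, or xCAT; the sketch assumes a simple boolean health flag per controller.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """Hypothetical stand-in for a storage controller."""
    name: str
    healthy: bool = True


def route(primary: Controller, secondary: Controller) -> Controller:
    """Return the controller that should receive transactions."""
    if primary.healthy:
        return primary          # normal operation, or failback after repair
    if secondary.healthy:
        return secondary        # failover: the first controller has failed
    raise RuntimeError("no healthy controller available")


primary = Controller("controller-a")
secondary = Controller("controller-b")

assert route(primary, secondary) is primary      # normal operation
primary.healthy = False
assert route(primary, secondary) is secondary    # failover
primary.healthy = True
assert route(primary, secondary) is primary      # failback following repair
```

Routing returns to the first controller only once it reports healthy again, which mirrors the failback definition above: recovery from failover following repair.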

64  IBM Elastic Storage System 3000: Service Guide
