Chapter 9. Reliability, availability, and serviceability
Memory subsystem improvements
z13s servers use redundant array of independent memory (RAIM), a concept that is
analogous to what the disk industry knows as RAID. The RAIM design detects and recovers
from DRAM, socket, memory channel, or DIMM failures. It adds one memory channel that is
dedicated to RAS: the parity of the four data DIMMs is stored in the DIMMs that are
attached to a fifth memory channel. Any failure in a memory component can be detected and
corrected dynamically.
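The parity idea can be illustrated with a minimal sketch. It assumes simple byte-wise XOR
parity across four data channels, in the style of a RAID parity stripe. The actual z13s
hardware uses a Reed-Solomon based ECC and is not implemented this way; the function names
parity_block and rebuild_channel are hypothetical and exist only for this illustration.

```python
from functools import reduce


def parity_block(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_channel(channels, failed_index):
    """Recover the block held by one failed channel from the surviving channels.

    With XOR parity, the XOR of all surviving blocks (data and parity)
    equals the missing block, whichever single channel failed.
    """
    survivors = [blk for i, blk in enumerate(channels) if i != failed_index]
    return parity_block(survivors)


# Four data channels plus a fifth channel that holds their parity
# (illustrative values only).
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
channels = data + [parity_block(data)]

# Simulate the loss of the third data channel and rebuild its contents.
assert rebuild_channel(channels, 2) == data[2]
```

The same call also rebuilds the parity channel itself if that is the one that fails, which
is why a single extra channel is enough to tolerate the loss of any one channel in this
simplified model.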
This design takes the RAS of the memory subsystem to another level, making it
essentially a fully fault-tolerant N+1 design. The memory system on z13s servers is
implemented with an enhanced version of the Reed-Solomon error correction code (ECC)
that is known as 90B/64B, and includes protection against memory channel and DIMM
failures.
A precise marking of faulty chips helps ensure timely DRAM replacements. The key cache
on the z13s memory is completely mirrored. For a full description of the memory system
on z13s servers, see 2.4, “Memory” on page 53.
Improved thermal and condensation management
Soft-switch firmware
The capabilities of soft-switching firmware have been enhanced. Enhanced logic in this
function ensures that every affected circuit is powered off during the soft switching of
firmware components. For example, when you upgrade the microcode of a Fibre Channel
connection (FICON) feature, these enhancements avoid the unwanted side effects that
were detected on previous servers.
Server Time Protocol (STP) recovery enhancement
When HCA3-O (12xIFB), HCA3-O Long Reach (LR) (1xIFB), or PCIe-based ICA SR
coupling links are used, an unambiguous “going away signal” is sent when the server on
which the HCA3 is running is about to enter a failed (check stopped) state.
When the “going away signal” sent by the Current Time Server (CTS) in an STP-only
Coordinated Timing Network (CTN) is received by the Backup Time Server (BTS), the
BTS can safely take over as the CTS. It no longer has to rely on the previous recovery
methods: the Offline Signal (OLS) in a two-server CTN, or the Arbiter in a CTN with
three or more servers.
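The takeover decision can be sketched as a small function. This is a simplified model
under stated assumptions, not the STP firmware logic: it reads the text as saying that the
going away signal is sufficient on its own, and that otherwise the earlier methods apply
(OLS with two servers, the Arbiter with three or more). The names TakeoverBasis and
takeover_basis are hypothetical.

```python
from enum import Enum


class TakeoverBasis(Enum):
    """What the BTS relies on before taking over as CTS (illustrative only)."""
    GOING_AWAY_SIGNAL = "going away signal"
    OFFLINE_SIGNAL = "Offline Signal (OLS)"
    ARBITER = "Arbiter"


def takeover_basis(going_away_received: bool, servers_in_ctn: int) -> TakeoverBasis:
    """Return the mechanism that the BTS would rely on in this simplified model."""
    if servers_in_ctn < 2:
        raise ValueError("a CTN with a BTS needs at least two servers")
    if going_away_received:
        # The failing CTS announced its own failure, so no other evidence is needed.
        return TakeoverBasis.GOING_AWAY_SIGNAL
    if servers_in_ctn == 2:
        # Assumed fallback to the earlier method for a two-server CTN.
        return TakeoverBasis.OFFLINE_SIGNAL
    # Assumed fallback to the earlier method for three or more servers.
    return TakeoverBasis.ARBITER


# With the going away signal, the size of the CTN no longer matters.
assert takeover_basis(True, 2) is TakeoverBasis.GOING_AWAY_SIGNAL
assert takeover_basis(True, 4) is TakeoverBasis.GOING_AWAY_SIGNAL
```

The point of the sketch is the ordering of the checks: the going away signal is evaluated
first, and the configuration-dependent methods are consulted only when it is absent.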
Enhanced Console Assisted Recovery (ECAR) is new with z13s and z13 GA2, and
provides better recovery algorithms when the Preferred Time Server (PTS) fails. It uses
communication over the HMC/SE network to speed up the process of BTS takeover. See
“Enhanced Console Assisted Recovery” on page 419 for more information.
9.3.2 Scheduled outages
Concurrent hardware upgrades, concurrent parts replacement, concurrent driver upgrade,
and concurrent firmware fixes are available with z13s servers, and help eliminate
scheduled outages. Furthermore, the following indicators and functions that address
scheduled outages are included:
Double memory data bus lane sparing
This sparing reduces the number of repair actions for memory.
Single memory clock sparing
Double-dynamic random access memory (DRAM) IBM Chipkill tolerance
Field repair of the cache fabric bus
Power distribution N+2 design