Memory Subsystem Behaviors
The zx2 chip in the server provides increased reliability of memory DIMMs and memory expanders.
For example, previous entry-class servers with zx1 chips provided error detection and correction
of all single-bit memory DIMM errors and error detection of most multi-bit errors within a memory
DIMM quad, or 4 bits per rank (this feature is called chip sparing).
The zx2 chip doubles memory rank error correction from 4 bytes to 8 bytes of a 128-byte cache
line during cache line misses initiated by processor cache controllers and by Direct Memory Access
(DMA) operations initiated by I/O devices. This feature is called double DRAM sparing, because 2 of
the 72 DRAMs in any DIMM quad can fail without any loss of server performance.
Corrective action (DIMM or memory expander replacement) is required in any of the following cases:
a threshold is reached for multiple double-byte errors from one or more memory DIMMs in the same
rank; any uncorrectable memory error (more than 2 bytes) occurs; or no quad of like memory DIMMs
is loaded in rank 0 of side 0. All other causes of memory DIMM errors are corrected by zx2 and
reported to the Page Deallocation Table (PDT) / diagnostic LED panel.
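The decision described above can be sketched in a few lines of Python. This is only an illustrative model, not the zx2 or firmware logic: the function name, the `Action` enum, and the parameter names are hypothetical, and the source does not give the double-byte threshold, so `double_byte_threshold=10` is a placeholder assumption.

```python
from enum import Enum


class Action(Enum):
    CORRECTED = "corrected by zx2; reported to PDT / diagnostic LED panel"
    REPLACE = "replace DIMM or memory expander"


def classify_memory_event(bytes_in_error, double_byte_count=0,
                          double_byte_threshold=10, rank0_quad_loaded=True):
    """Return the action for one memory event (hypothetical helper).

    double_byte_count is the number of double-byte errors seen so far in the
    same rank, including this event. The threshold value is an assumption;
    the manual only says "a threshold".
    """
    # Configuration error: no quad of like DIMMs in rank 0 of side 0.
    if not rank0_quad_loaded:
        return Action.REPLACE
    # More than 2 bytes in error is uncorrectable.
    if bytes_in_error > 2:
        return Action.REPLACE
    # Double-byte errors are corrected until the threshold is reached.
    if bytes_in_error == 2 and double_byte_count >= double_byte_threshold:
        return Action.REPLACE
    return Action.CORRECTED
```

For instance, `classify_memory_event(1)` returns `Action.CORRECTED`, while `classify_memory_event(3)` returns `Action.REPLACE`.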
Customer Messaging Policy
• Only light a diagnostic LED for memory DIMM errors when the error is isolated to a specific
memory DIMM. If there is any uncertainty about a specific DIMM, point the customer to the SEL
for any action and do not light the suspect DIMM CRU LED on the diagnostic panel.
• For configuration-style errors (for example, no memory DIMMs installed in rank 0 of side 0),
follow the HP policy of lighting all of the CRU LEDs on the diagnostic LED panel for all of the
DIMMs that are missing.
• No diagnostic messages are reported for single-byte errors that are corrected in both zx2
caches and memory DIMMs during corrected platform error (CPE) events. Diagnostic messages
are reported for CPE events when thresholds are exceeded for both single-byte and double-byte
errors; all fatal memory subsystem errors cause global MCA events.
• PDT logs for all double-byte errors are permanent; single-byte errors are initially logged
as transient errors. If the server logs 2 single-byte errors within 24 hours, they are upgraded
to permanent in the PDT.
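The PDT logging rule in the last bullet can be illustrated with a small Python model. It is a sketch under stated assumptions, not the real PDT format: the class name `PdtSketch`, its methods, and the choice to track errors per page address are hypothetical (the policy text does not say whether the 24-hour window applies per page or per server).

```python
from datetime import datetime, timedelta

UPGRADE_WINDOW = timedelta(hours=24)  # window taken from the policy text


class PdtSketch:
    """Toy model of PDT entry states; not the real firmware structure."""

    def __init__(self):
        self.state = {}        # page -> "transient" or "permanent"
        self.last_single = {}  # page -> time of the latest single-byte error

    def log_error(self, page, bytes_in_error, when):
        """Record a corrected error and return the resulting entry state."""
        if bytes_in_error == 2:
            # Double-byte errors are always logged as permanent.
            self.state[page] = "permanent"
        elif bytes_in_error == 1:
            if self.state.get(page) != "permanent":
                previous = self.last_single.get(page)
                # Second single-byte error within 24 hours: upgrade.
                if previous is not None and when - previous <= UPGRADE_WINDOW:
                    self.state[page] = "permanent"
                else:
                    self.state[page] = "transient"
            self.last_single[page] = when
        else:
            # Larger errors are uncorrectable and outside this sketch.
            raise ValueError("only single- and double-byte errors are modeled")
        return self.state[page]
```

In this model, two single-byte errors on the same page 23 hours apart end in a permanent entry, while the same pair 25 hours apart leaves the entry transient.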
Table 53 lists the memory subsystem events that light the diagnostic panel LEDs.
Table 53 Memory Subsystem Events That Light Diagnostic Panel LEDs
| Diagnostic LEDs | Sample IPMI Events | Cause | Source | Notes |
|---|---|---|---|---|
| Memory Carriers | Type 02h, 02h:07h:03h VOLTAGE_DEGRADES_TO_NON_RECOVERABLE | Voltage on memory expander is inadequate | BMC | A voltage on the memory expander is out of range (likely too low) |
| DIMMs | Type E0h, 208d:04d MEM_NO_DIMMS_INSTALLED | No memory DIMMs installed (in rank 0 of cell 0) | SFW | Light all DIMM LEDs in rank 0 of cell 0 |
| DIMMs | Type E0h, 172d:04d MEM_DIMM_SPD_CHECKSUM | A DIMM has a serial presence detect (SPD) EEPROM with a bad checksum | SFW | Either EEPROM is misprogrammed or this DIMM is incompatible |
| DIMMs | Type E0h, 4652d:26d WIN_AGT_PREDICT_MEM_FAIL | This memory rank is correcting too many single-bit errors | WIN Agent | Memory rank is about to fail or environmental conditions are causing more errors than usual |