The channel between the memory buffer and the DIMM is the DDR channel. Because up to three DIMMs
reside on the same DDR channel and two DDR channels might be configured in lockstep (RAS mode
enabled), a single faulty DIMM can affect up to six DIMMs. It is important to distinguish faulty or
suspect DIMMs from healthy DIMMs that happen to reside on the same bus.
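The figure of six follows directly from the two limits stated above. A minimal sketch of that arithmetic, where the constant and function names are hypothetical and only the numbers come from the text:

```python
# Limits taken from the paragraph above; the names are illustrative only.
DIMMS_PER_DDR_CHANNEL = 3  # up to three DIMMs reside on one DDR channel
CHANNELS_IN_LOCKSTEP = 2   # two DDR channels paired when RAS mode is enabled


def max_dimms_affected(lockstep_enabled: bool) -> int:
    """Return the largest number of DIMMs one faulty DIMM can affect."""
    channels = CHANNELS_IN_LOCKSTEP if lockstep_enabled else 1
    return DIMMS_PER_DDR_CHANNEL * channels


print(max_dimms_affected(lockstep_enabled=True))   # 6
print(max_dimms_affected(lockstep_enabled=False))  # 3
```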
On a new installation, DDR training failures can result from DIMMs being partially unseated during
shipping. A common symptom of a partially unseated DIMM is a MEM_DIMM_NO_VALID_DELAY event.
If the machine is still in the installation phase and has not been released to the customer, try removing
and reinstalling all the DIMMs on that DDR channel before replacing a DIMM. A DIMM that has been in use
for some time is unlikely to be spontaneously unseated.
If a DIMM suffers a correctable or uncorrectable error at runtime and needs to be replaced, a DIMM pair
might be identified and indicted. A DIMM pair consists of two DIMMs on the same memory buffer with the
same loading letter, such as 19A and 24A. In this case, replace both DIMMs in the pair.
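The pairing rule can be sketched as a lookup on the loading letter. This is an illustration only: it assumes a location label is a slot number followed by one letter (as in 19A), and it takes the list of locations that share a memory buffer as an input, because the text does not give that mapping. The function names and the sample list are hypothetical.

```python
import re


def loading_letter(location: str) -> str:
    """Return the loading letter of a location label such as '19A'."""
    match = re.fullmatch(r"(\d+)([A-Z])", location)
    if match is None:
        raise ValueError(f"unexpected DIMM location: {location!r}")
    return match.group(2)


def pair_partners(location: str, same_buffer_locations: list[str]) -> list[str]:
    """Return other DIMMs on the same memory buffer with the same loading letter.

    same_buffer_locations must be supplied by the caller; which slots share
    a memory buffer is not derived here.
    """
    letter = loading_letter(location)
    return [
        other
        for other in same_buffer_locations
        if other != location and loading_letter(other) == letter
    ]


# Hypothetical population of one memory buffer, for illustration only.
print(pair_partners("19A", ["19A", "20B", "23B", "24A"]))  # ['24A']
```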
CAE generates error events that identify faulty or suspect DIMMs as indicted. These DIMMs should be
replaced.
Health Repository, the EFI info mem command, and IPMI events might also identify additional
deconfigured DIMMs, sometimes called partner-deconfigured DIMMs, lockstep-disabled DIMMs, or
sibling-disabled DIMMs. These DIMMs are healthy and should not be replaced.
To identify a possible faulty DIMM, use the HR SHOW INDICT command. Replace DIMMs that are
indicted. Do not replace DIMMs that are deconfigured unless there are other indications of a faulty DIMM,
such as being specifically identified with DIMMERR.
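The replacement rule above can be summarized as a small decision function. This is a sketch under stated assumptions, not a tool that ships with the system: the DimmState record and its field names are hypothetical, and the status values would have to be read manually from HR SHOW INDICT, the EFI info mem command, or DIMMERR output.

```python
from dataclasses import dataclass


@dataclass
class DimmState:
    """Hypothetical record of what the diagnostic tools report for one DIMM."""
    location: str
    indicted: bool
    deconfigured: bool
    other_fault_indication: bool = False  # e.g. specifically identified with DIMMERR


def should_replace(dimm: DimmState) -> bool:
    """Apply the rule from the text: replace indicted DIMMs; leave
    deconfigured DIMMs alone unless something else points at them."""
    if dimm.indicted:
        return True
    if dimm.deconfigured and dimm.other_fault_indication:
        return True
    # Deconfigured-only (partner, lockstep, or sibling) DIMMs are healthy.
    # DIMMs that are neither indicted nor deconfigured are not covered by
    # the rule, so this sketch does not flag them.
    return False


print(should_replace(DimmState("19A", indicted=True, deconfigured=True)))   # True
print(should_replace(DimmState("24A", indicted=False, deconfigured=True)))  # False
```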
Solution 3
Cause
Using DIMMERR
If there are memory errors that do not clearly indicate which hardware is at fault, the HR dimmerr
command can be used to look for patterns of memory failures.
You can use DIMMERR as follows:
• To corroborate other errors that correspond to a specific DIMM or blade
• To indicate memory training faults
• To look for DIMM errors in newly installed or replaced DIMMs
• To look for DIMM errors during partition boot as part of a system installation
IMPORTANT:
DIMMERR will show memory events that were correctable. It is important to note that correctable
errors are expected on large memory systems and all systems will show several correctable errors
over time. Correctable errors only result in indictment after reaching a certain threshold.
DIMMs should not be replaced for normal correctable errors.
From the Health Repository viewer, enter dimmerr <location>, where <location> is the DIMM slot or
a blade.
Example: dimmerr blade-1/1 returns information about all DIMMs for a server blade in slot 1 of
cabinet 1.
DIMM INFO for Cabinet: 1 Board Slot: 1
dimm-1/1/0/1 Location: 1A
Status: OK No Errors Logged.
dimm-1/1/0/2 Location: 2C
Status: OK No Errors Logged.
dimm-1/1/0/3 Location: 3B
Row Bank Col Type Errors First Detected Last Detected