3 Software/Firmware Description
RAS support
The CPM uses Reliability, Availability and Serviceability (RAS) features to support enhanced boot reliability and reduce system downtime. The following features promote RAS support:
• BIOS protection and redundancy
• Memory error handling
• PCIe error handling
• Processor and integrated memory controller error handling
• POST error handling
• Watchdog support
• BIOS crisis recovery
The system BIOS monitors errors and manages pre-boot and boot events to enhance system uptime. These BIOS actions assist with the management activities performed by the IPMI subsystem and system OS to affect overall system RAS.
BIOS protection and redundancy
Prior to OS boot, the BIOS checks the checksum of the primary boot image and notifies the IPMC if there is a problem. In that case, the IPMC switches to the standby version and resets the system. After boot and during normal operation, the IPMC Corrupt Flash Watchdog forces a switch to the secondary boot image if a corrupt boot image is detected that could prevent validation or cause a system hang.
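The pre-boot check and failover described above can be sketched as follows. This is an illustration only: the source does not specify the checksum algorithm, so the sketch assumes an 8-bit additive checksum in which all image bytes sum to zero modulo 256. The names `checksum_ok`, `select_boot_image`, and `with_checksum` are hypothetical and do not correspond to any documented BIOS or IPMC interface.

```python
def checksum_ok(image: bytes) -> bool:
    """Return True if all bytes sum to zero modulo 256 (assumed scheme)."""
    return sum(image) % 256 == 0


def select_boot_image(primary: bytes, standby: bytes) -> str:
    """Model the decision: use the primary image unless its checksum fails.

    In the real system the BIOS notifies the IPMC, and the IPMC performs
    the switch and resets the system; here that is reduced to a return value.
    """
    if checksum_ok(primary):
        return "primary"
    return "standby"


def with_checksum(payload: bytes) -> bytes:
    """Append one byte so the whole image sums to zero modulo 256."""
    return payload + bytes([(-sum(payload)) % 256])


# Example images (made-up contents, for illustration only).
good = with_checksum(b"\x10\x20\x30")
corrupt = b"\xff" + good[1:]  # first byte damaged

choice_when_good = select_boot_image(good, good)
choice_when_corrupt = select_boot_image(corrupt, good)
```

With an intact primary image the sketch selects the primary; with the damaged one it falls back to the standby image, mirroring the switch the IPMC performs.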
During normal operation, the BIOS maintains active write protection on the primary boot image stored in Flash. A jumper is also available at the Customer header to add physical write protection for the BIOS image.
Memory error handling
Memory error detection and much of the memory error handling are performed by the integrated memory controllers (iMCs) in the CPUs, but the BIOS supports only a subset of the memory RAS features detected by the iMC. At POST, the system BIOS uses the SPD data on each memory module to let the iMC “train” the module, that is, find and set the optimal operating point for that module. If a module cannot be trained, or if any errors are detected during the memory training process, the error is reported to the IPMC.
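As a rough sketch of the reporting flow above, the following models POST iterating over the installed modules, attempting to train each one, and collecting the failures that would be reported to the IPMC. The `Module` type, the `train` callback, the slot labels, and the SPD field name are all hypothetical; the source does not describe the actual BIOS interfaces or the IPMC message format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Module:
    slot: str   # hypothetical slot label
    spd: dict   # stand-in for decoded SPD data


def train_all(modules: List[Module],
              train: Callable[[Module], bool]) -> List[str]:
    """Try to train every module; return the slots to report to the IPMC."""
    failed = []
    for m in modules:
        try:
            ok = train(m)
        except Exception:
            ok = False  # any error during training counts as a failure
        if not ok:
            failed.append(m.slot)
    return failed


# Made-up example: the second module has no usable SPD data.
modules = [Module("DIMM_A0", {"speed_mts": 1600}), Module("DIMM_B0", {})]
failed = train_all(modules, lambda m: "speed_mts" in m.spd)
```

Here `failed` holds the slots whose training did not succeed. Treating an exception the same as an unsuccessful result is a design choice of this sketch, reflecting the statement that any error during training is reported.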