Relion 1900e/2900e Manual
Revision 1.3
At BMC startup, the BMC will check for the fault condition and set the sensor state accordingly. The BMC also
checks for this fault condition at each attempt to DC power-on the system. At each DC power-on attempt, a
beep code is generated if this fault is detected.
The following steps are used to correct the fault condition and clear the sensor state:
1. AC power down the server
2. Install a processor into the CPU_1 socket
3. AC power on the server
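The check-and-clear behavior above can be sketched as a small model. This is an illustration only, not BMC firmware: the class `MissingCpuMonitor` and its members are hypothetical, and the sketch assumes (inferring from step 2) that the fault is an empty CPU_1 socket.

```python
class MissingCpuMonitor:
    """Toy model of the fault check; every name here is hypothetical."""

    def __init__(self, cpu1_present):
        self.cpu1_present = cpu1_present
        self.beep_codes = 0
        # The sensor state is set once, when the BMC starts (AC power-on).
        self.sensor_asserted = not cpu1_present

    def dc_power_on_attempt(self):
        """Check the fault on a DC power-on attempt; return True if detected."""
        fault = not self.cpu1_present
        if fault:
            self.beep_codes += 1  # a beep code is generated on each attempt
        return fault


# Fault present: the sensor is asserted and every attempt beeps.
faulty = MissingCpuMonitor(cpu1_present=False)
faulty.dc_power_on_attempt()
faulty.dc_power_on_attempt()

# After the AC power cycle with a processor installed, the BMC restarts
# and the sensor state is clear.
repaired = MissingCpuMonitor(cpu1_present=True)
```

Modeling the repair as a fresh object mirrors the procedure: the sensor state clears only because the BMC restarts after AC power is reapplied.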
7.3.11.3 ERR2 Timeout Monitoring
The BMC supports an ERR2 Timeout Sensor (1 per CPU) that asserts if a CPU’s ERR[2] signal has been
asserted for longer than a fixed time period (> 90 seconds). ERR[2] is a processor signal that indicates when
the IIO (Integrated IO module in the processor) has a fatal error which could not be communicated to the
core to trigger an SMI. ERR[2] events are fatal error conditions, where the BIOS and OS will attempt to gracefully
handle the error, but may not always be able to do so reliably. A continuously asserted ERR[2] signal is an
indication that the BIOS cannot service the condition that caused the error. This is usually because that
condition prevents the BIOS from running.
When an ERR2 timeout occurs, the BMC asserts/de-asserts the ERR2 Timeout Sensor, and logs a SEL event
for that sensor. The default behavior for BMC core firmware is to initiate a system reset upon detection of an
ERR2 timeout. The BIOS setup utility provides an option to disable or enable system reset by the BMC upon
detection of this condition.
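The timeout logic described in this section can be sketched as follows. This is a minimal model under stated assumptions, not the BMC implementation: the class `Err2TimeoutMonitor`, the polling interface, and the SEL record format are all hypothetical. Only the 90-second threshold, the SEL logging, and the configurable reset come from the text. De-assertion of the sensor is not modeled.

```python
ERR2_TIMEOUT_SECONDS = 90  # fixed period stated in the text


class Err2TimeoutMonitor:
    """Hypothetical per-CPU monitor; create one instance per CPU."""

    def __init__(self, reset_on_timeout=True):
        # Models the BIOS setup option; reset is enabled by default.
        self.reset_on_timeout = reset_on_timeout
        self.asserted_since = None  # time ERR[2] was first seen asserted
        self.sensor_asserted = False
        self.sel = []               # stand-in for the System Event Log
        self.resets = 0             # count of system resets requested

    def poll(self, err2_asserted, now):
        """Sample the ERR[2] signal at time `now` (in seconds)."""
        if not err2_asserted:
            self.asserted_since = None  # signal released; restart the timer
            return
        if self.asserted_since is None:
            self.asserted_since = now
        held_for = now - self.asserted_since
        if held_for > ERR2_TIMEOUT_SECONDS and not self.sensor_asserted:
            self.sensor_asserted = True
            self.sel.append(("ERR2 Timeout", now))
            if self.reset_on_timeout:
                self.resets += 1


monitor = Err2TimeoutMonitor()
for t in (0, 45, 90, 91, 200):
    monitor.poll(True, t)
```

In this run the sensor asserts at the sample taken at 91 seconds, the first one where ERR[2] has been held for longer than 90 seconds, and the later sample does not log a second event.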
7.3.11.4 CATERR Sensor
The BMC supports a CATERR sensor for monitoring the system CATERR signal.
The CATERR signal is defined as having 3 states:
• high (no event)
• pulsed low (possibly fatal; may be able to recover)
• low (fatal)
All processors in a system have their CATERR pins tied together. The pin is used as a communication path to
signal a catastrophic system event to all CPUs. The BMC has direct access to this aggregate CATERR signal.
The BMC only monitors for the “CATERR held low” condition. A pulsed low condition is ignored by the BMC.
If a CATERR-low condition is detected, the BMC logs an error message to the SEL against the CATERR sensor.
The default action after logging the SEL entry is to reset the system. The BIOS setup utility provides an
option to disable or enable system reset by the BMC upon detection of this condition.
The sensor is rearmed on power-on (AC or DC power on transitions). It is not rearmed on system resets in
order to avoid multiple SEL events that could occur due to a potential reset loop if the CATERR keeps
recurring, which would be the case if the CATERR was due to an MSID mismatch condition.
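The detection and rearm rules above can be sketched as a small state model. This is an illustrative sketch, not the BMC firmware: the class `CaterrSensor`, the state strings, and the method names are hypothetical. It also assumes that an unarmed sensor neither logs nor requests a reset, which the text does not state.

```python
class CaterrSensor:
    """Hypothetical model of the CATERR sensor's log and rearm behavior."""

    def __init__(self, reset_on_caterr=True):
        # Models the BIOS setup option; reset is enabled by default.
        self.reset_on_caterr = reset_on_caterr
        self.armed = True
        self.sel = []    # stand-in for the System Event Log
        self.resets = 0  # count of system resets requested

    def on_signal(self, state):
        """`state` is one of 'high', 'pulsed_low', or 'low'."""
        # Only the held-low condition is monitored; a pulse is ignored.
        if state != "low" or not self.armed:
            return
        self.armed = False
        self.sel.append("CATERR")
        if self.reset_on_caterr:
            self.resets += 1

    def on_power_on(self):
        """AC or DC power-on transition: the sensor is rearmed."""
        self.armed = True

    def on_system_reset(self):
        """System reset: deliberately not rearmed, to avoid repeated SEL
        events if the CATERR recurs in a reset loop."""


sensor = CaterrSensor()
sensor.on_signal("pulsed_low")  # ignored
sensor.on_signal("low")         # logged, reset requested
sensor.on_system_reset()
sensor.on_signal("low")         # recurring CATERR after reset: no new event
```

Keeping the rearm in `on_power_on` only is what bounds the SEL to one entry per power cycle in a reset loop.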
When the BMC detects that this aggregate CATERR signal has asserted, it can then go through PECI to query
each CPU to determine which one was the source of the error and write an OEM code identifying the CPU
slot into an event data byte in the SEL entry. If PECI is non-functional (functionality is not guaranteed in this
situation), then the OEM code should indicate that the source is unknown.
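The source-identification step can be sketched as below. This is a hedged illustration, not a real PECI API: the function `identify_caterr_source`, the `peci_query` callback, the use of the slot number as the OEM code, and the value of `UNKNOWN_SOURCE` are all hypothetical. The actual OEM codes are defined by the event data bytes of the SEL entry.

```python
UNKNOWN_SOURCE = 0xFF  # hypothetical OEM code meaning "source unknown"


def identify_caterr_source(cpu_slots, peci_query):
    """Return a code for the CPU slot that raised CATERR.

    `peci_query(slot)` is a hypothetical callback that returns True if the
    CPU in `slot` reports itself as the error source. It may raise OSError
    when PECI is non-functional, which is possible after a CATERR.
    """
    for slot in cpu_slots:
        try:
            if peci_query(slot):
                return slot  # assumption: the slot number doubles as the code
        except OSError:
            continue  # PECI failed for this CPU; try the remaining ones
    return UNKNOWN_SOURCE


# Example with a stubbed query in which the CPU in slot 2 is the source.
code = identify_caterr_source([1, 2], lambda slot: slot == 2)
```

Falling back to `UNKNOWN_SOURCE` rather than raising keeps the SEL entry writable even when no CPU can be queried.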
Event data byte 2 and byte 3 for CATERR sensor SEL events
Relion 1900e/2900e Manual, Revision 1.3, April 2016, Intel Server Boards and Systems