Chapter 3. RAS and manageability
Figure 3-1 Schematic of Fault Isolation Register implementation
The FIRs are important because they enable an error to be uniquely identified, thus enabling
the appropriate action to be taken. Appropriate actions might include such things as a bus
retry, ECC correction, or system firmware recovery routines. Recovery routines can include
dynamic deallocation of potentially failing components.
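As a rough illustration of how a captured FIR value could be mapped to an action, the following Python sketch decodes a register value bit by bit. The bit positions, checker names, and the action table are hypothetical and are not taken from the OpenPower 720 hardware; only the example actions (bus retry, ECC correction, firmware recovery) come from the text above.

```python
# Hypothetical FIR layout: one bit per error checker.
# These assignments are illustrative only, not the real register layout.
FIR_BITS = {
    0: ("bus_parity", "bus retry"),
    1: ("memory_single_bit", "ECC correction"),
    2: ("l2_cache_uncorrectable", "system firmware recovery routine"),
}


def decode_fir(value):
    """Return (checker, action) pairs for every known bit set in value.

    Bits that are not in FIR_BITS are ignored in this sketch.
    """
    return [
        entry
        for bit, entry in sorted(FIR_BITS.items())
        if value & (1 << bit)
    ]
```

Because each checker owns its own bit in this model, a single register value identifies which checkers fired, which is the property the text describes as a unique fingerprint.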
Errors are logged into the system non-volatile random access memory (NVRAM) and the SP
event history log. Diagnostic Error Log Analysis (diagela) routines analyze the error log
entries and invoke a suitable action such as issuing a warning message. If the error can be
recovered, or after suitable maintenance, the service processor resets the FIRs so that they
can accurately record any future errors.
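The log, analyze, and reset cycle described above can be sketched as a small Python model. The class name, field names, and log entry format are assumptions made for illustration; the real service processor firmware and NVRAM log format are not described in this document.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceProcessorModel:
    """Toy model of FIR capture, logging, and reset (hypothetical)."""

    fir: int = 0
    nvram_log: list = field(default_factory=list)

    def capture(self, fir_bit):
        # An error checker sets its bit in the FIR.
        self.fir |= 1 << fir_bit

    def log_and_reset(self, recoverable):
        """Log the current FIR value; clear it only if the error was recovered."""
        entry = {"fir": self.fir, "recoverable": recoverable}
        self.nvram_log.append(entry)
        if recoverable:
            self.fir = 0
        return entry

    def maintenance_done(self):
        # After suitable maintenance the FIR is reset so that
        # future errors are recorded accurately.
        self.fir = 0
```

In this sketch an unrecovered error leaves its bit set until maintenance_done() is called, mirroring the two reset conditions named in the text.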
The ability to correctly diagnose any pending or firm errors is a key requirement before any
dynamic or persistent component deallocation or any other reconfiguration can take place.
For more information, see “Dynamic or persistent deallocation” on page 51.
3.1.3 Permanent monitoring
The SP included in the OpenPower 720 provides a means to monitor the system even when
the main processor is inoperable. See the following subsections for a more detailed
description of monitoring functions in an OpenPower 720.
Mutual surveillance
The SP can monitor the operation of the firmware during the boot process, and it can monitor
the operating system for loss of control. This allows the service processor to take appropriate
action, including calling for service, when it detects that the firmware or the operating system
has lost control. Mutual surveillance also allows the operating system to monitor for service
processor activity and to request a service processor repair action if necessary.
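One common way to implement this kind of two-way monitoring is a pair of heartbeat timers. The Python sketch below assumes that approach; the document does not state how the OpenPower 720 implements surveillance, and the class, function, timeout values, and action strings are all hypothetical.

```python
class Heartbeat:
    """Tracks the last time a monitored party reported in."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None

    def beat(self, now):
        self.last_seen = now

    def lost(self, now):
        # A party that has never reported in is treated as lost.
        if self.last_seen is None:
            return True
        return now - self.last_seen > self.timeout_s


def surveillance_actions(os_hb, sp_hb, now):
    """Return the actions each side would take at time now."""
    actions = []
    if os_hb.lost(now):
        actions.append("SP: call for service (operating system lost control)")
    if sp_hb.lost(now):
        actions.append("OS: request service processor repair action")
    return actions
```

Passing the current time in explicitly, instead of reading a clock inside the class, keeps the sketch deterministic and easy to exercise with fixed timestamps.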
Environmental monitoring
Environmental monitoring related to power, fans, and temperature is done by the System
Power Control Network (SPCN). Critical and non-critical environmental conditions generate
Early Power-Off Warning (EPOW) events. Critical events (for example, Class 5 AC power
loss) trigger appropriate signals from the hardware to the affected components to prevent
data loss without operating system or firmware involvement. Non-critical environmental
events are logged and reported through Event Scan.
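The two handling paths for EPOW events can be summarized in a short Python sketch. The event dictionary format and the returned strings are hypothetical; the only facts taken from the text are that critical events are handled by hardware signals and that non-critical events are logged and reported through Event Scan.

```python
def route_epow(events):
    """Split EPOW events into the two handling paths described above.

    Each event is a dict with a "name" and a boolean "critical" flag
    (a hypothetical format chosen for this sketch).
    """
    hardware_signalled = []
    event_scan_log = []
    for event in events:
        if event["critical"]:
            # Handled by hardware, without OS or firmware involvement.
            hardware_signalled.append(event["name"])
        else:
            # Logged and later reported through Event Scan.
            event_scan_log.append(event["name"])
    return hardware_signalled, event_scan_log
```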
[Figure 3-1 labels: CPU, L1 Cache, L2/L3 Cache, Memory, Disk; Error Checkers; Fault Isolation Register (FIR) (unique fingerprint of each error captured); Service Processor; Log Error; Non-volatile RAM]