Chapter 4. Continuous availability and manageability
With the instruction retry function, when an error is encountered in the core, in caches, or in
certain logic functions, the processor first automatically retries the instruction. If
the source of the error was truly transient, the instruction succeeds and the system
continues as before.
Alternate processor retry
Hard failures are more difficult to handle because they are permanent errors that recur each
time the instruction is repeated. Retrying the instruction on the same core does not help in
this situation because the instruction will continue to fail.
Using a capability introduced with POWER6, the processor can extract the failing
instruction from the faulty core and retry it elsewhere in the system. The failing core is then
dynamically deconfigured and scheduled for replacement.
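The escalation from instruction retry to alternate processor retry can be sketched as a small simulation. This is a minimal illustration and not IBM firmware or any real interface: the Core class, the CoreFault exception, the run_instruction function, and the choice of exactly one same-core retry are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


class CoreFault(Exception):
    """Raised by the hypothetical Core model when an instruction fails."""


@dataclass
class Core:
    """Hypothetical model of a processor core (not an IBM interface)."""
    name: str
    transient_faults: int = 0   # number of upcoming attempts that will fail
    hard_fault: bool = False    # permanent fault: every attempt fails
    configured: bool = True

    def execute(self, instruction: Callable[[], int]) -> int:
        if self.hard_fault:
            raise CoreFault(f"{self.name}: hard fault")
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise CoreFault(f"{self.name}: transient fault")
        return instruction()


def run_instruction(instruction: Callable[[], int],
                    cores: List[Core], log: List[str]) -> int:
    """Try cores[0], retry once there, then fall back to another core."""
    primary = cores[0]
    # Assumption: one initial attempt plus one retry on the same core.
    for attempt in ("first attempt", "retry"):
        try:
            result = primary.execute(instruction)
        except CoreFault:
            log.append(f"{primary.name}: {attempt} failed")
        else:
            log.append(f"{primary.name}: {attempt} succeeded")
            return result
    # The retry did not help, so treat the fault as hard: move the
    # instruction to another configured core, then deconfigure the primary.
    for alternate in cores[1:]:
        if alternate.configured:
            result = alternate.execute(instruction)
            log.append(f"{alternate.name}: alternate retry succeeded")
            primary.configured = False
            log.append(f"{primary.name}: deconfigured, scheduled for replacement")
            return result
    raise CoreFault("no alternate core available")
```

In this sketch, a core created with transient_faults=1 fails once and succeeds on the retry, so it stays configured. A core created with hard_fault=True fails both attempts, the instruction completes on the next configured core, and the faulty core is marked as deconfigured.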
Dynamic processor deallocation
Dynamic processor deallocation enables automatic deconfiguration of processor cores when
patterns of recoverable core-related faults are detected. Dynamic processor deallocation
prevents a recoverable error from escalating to an unrecoverable system error, which might
otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on
the service processor’s ability to use recoverable error information generated by FFDC to
notify the POWER Hypervisor when a processor core reaches its predefined error limit. The
POWER Hypervisor then dynamically deconfigures the failing core and notifies the system
administrator that a replacement is needed. The entire process is transparent to the partition
that owns the failing instruction.
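The threshold mechanism behind dynamic processor deallocation can be illustrated with a simple counter. The sketch below is hypothetical: the DeallocationMonitor class, its method names, and the numeric error limit are assumptions for illustration, and the real limits and interfaces between FFDC, the service processor, and the POWER Hypervisor are not described here.

```python
from collections import Counter
from typing import Iterable, List, Set


class DeallocationMonitor:
    """Hypothetical stand-in for per-core recoverable-error accounting."""

    def __init__(self, error_limit: int, cores: Iterable[str]) -> None:
        self.error_limit = error_limit        # assumed predefined limit
        self.active: Set[str] = set(cores)    # cores still configured
        self.errors: Counter = Counter()      # recoverable errors per core
        self.notifications: List[str] = []    # messages for the administrator

    def record_recoverable_error(self, core: str) -> bool:
        """Count one recoverable error; return True if the core was deconfigured."""
        if core not in self.active:
            return False                      # already deconfigured
        self.errors[core] += 1
        if self.errors[core] >= self.error_limit:
            # Limit reached: deconfigure before the fault can escalate
            # to an unrecoverable error, and request a replacement.
            self.active.discard(core)
            self.notifications.append(f"{core} deconfigured; replacement needed")
            return True
        return False
```

With an error limit of 3, the first two recoverable errors on a core only increment its counter. The third removes the core from the active set and queues a single notification, while the other cores are unaffected.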
Single processor checkstop
As in the POWER6 processor, the processor provides single-core checkstopping
for certain processor logic, command, or control errors that cannot be handled by the
availability enhancements in the preceding section.
This approach significantly reduces the probability of any one processor affecting total system
availability by containing most processor checkstops to the partition that was using the
processor at the time the full checkstop goes into effect.
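The containment idea can be shown with a small mapping from cores to partitions. This is only a sketch of the contained case under assumed names: contain_checkstop, the core and partition labels, and the "affected" and "unaffected" states are hypothetical, and the text above notes that most, not all, checkstops are contained this way.

```python
from typing import Dict


def contain_checkstop(core_owner: Dict[str, str], failed_core: str) -> Dict[str, str]:
    """Return the state of each partition after a contained checkstop.

    core_owner maps a core name to the partition using it at the time
    of the checkstop. Only that partition is marked as affected.
    """
    hit_partition = core_owner[failed_core]
    return {
        partition: "affected" if partition == hit_partition else "unaffected"
        for partition in sorted(set(core_owner.values()))
    }


# Example: core2 belongs to lparB, so only lparB is affected.
owners = {"core0": "lparA", "core1": "lparA", "core2": "lparB"}
states = contain_checkstop(owners, "core2")
```

Here states is {"lparA": "unaffected", "lparB": "affected"}: the partition using the failed core absorbs the checkstop and the rest of the system keeps running.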
Even with all of these availability enhancements in play to prevent processor errors from
affecting system-wide availability, some errors can still result in a system-wide outage.
4.2.3 Memory protection
A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore, a
variety of protection methods are used in all POWER processor-based systems to avoid
uncorrectable errors in memory.
Memory protection plans must take into account many factors, including these items:
Size
Desired performance
Memory array manufacturing characteristics
Before POWER6: On IBM systems prior to POWER6, such an error typically caused a
checkstop.