Chapter 4. Continuous availability and manageability
Dynamic processor deallocation
Dynamic processor deallocation enables automatic deconfiguration of processor cores when
patterns of recoverable core-related faults are detected. Dynamic processor deallocation
prevents a recoverable error from escalating to an unrecoverable system error, which might
otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on
the service processor’s ability to use FFDC-generated recoverable error information to notify
the POWER Hypervisor when a processor core reaches its predefined error limit. Then, the
POWER Hypervisor dynamically deconfigures the failing core, which is called out for
replacement. The entire process is transparent to the partition owning the failing instruction.
If inactivated processor cores or CoD processor cores are available, the system brings one
of them into operation after an activated processor core is determined to no longer be
operational. In this way, the server retains its full processor capacity.
If no CoD processor cores are available system-wide, total processor capacity is
lowered below the licensed number of cores.
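The deallocation flow described above can be sketched as a small model. This is an illustrative sketch only, not the firmware's actual logic: the class names, the state labels, and the error limit of 3 are all hypothetical, because the source says only that the limit is predefined.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    core_id: int
    state: str = "active"        # "active", "inactive" (spare or CoD), or "deconfigured"
    recoverable_errors: int = 0


@dataclass
class CorePool:
    cores: list
    error_limit: int = 3         # hypothetical value; the real limit is predefined by the platform
    callouts: list = field(default_factory=list)

    def active_count(self):
        return sum(1 for c in self.cores if c.state == "active")

    def report_recoverable_error(self, core_id):
        """Record one recoverable error and deconfigure the core at the limit."""
        core = next(c for c in self.cores if c.core_id == core_id)
        if core.state != "active":
            return
        core.recoverable_errors += 1
        if core.recoverable_errors >= self.error_limit:
            self._deconfigure(core)

    def _deconfigure(self, core):
        core.state = "deconfigured"
        self.callouts.append(core.core_id)   # call the core out for replacement
        # Bring a spare (inactivated or CoD) core into operation if one exists.
        spare = next((c for c in self.cores if c.state == "inactive"), None)
        if spare is not None:
            spare.state = "active"


# Two activated cores plus one spare; core 0 reaches the error limit.
pool = CorePool([Core(0), Core(1), Core(2, state="inactive")])
before = pool.active_count()
for _ in range(3):
    pool.report_recoverable_error(0)
after = pool.active_count()
print(before, after, pool.callouts)
```

In this model the active core count is the same before and after the first failure because the spare takes over. A second failing core would find no spare left, so capacity would drop, which mirrors the case where no CoD cores remain.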
Single processor checkstop
As in POWER6, the POWER7 and POWER7+ processors provide single-processor check-stopping for
certain processor logic, command, or control errors that cannot be handled by the availability
enhancements in the preceding section.
This significantly reduces the probability of any one processor affecting total system
availability by containing most processor checkstops to the partition that was using the
processor at the time that the full checkstop goes into effect.
Even with all these availability enhancements to prevent processor errors from affecting
system-wide availability, some errors might still result in a system-wide outage.
4.2.3 Memory protection
A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore, a
variety of protection methods is used in POWER processor-based systems to avoid
uncorrectable errors in memory.
Memory protection plans must take into account many factors, including the following items:
Size
Desired performance
Memory array manufacturing characteristics
POWER7 and POWER7+ processor-based systems have a number of protection schemes
designed to prevent, protect, or limit the effect of errors in main memory. These capabilities
include the following items:
64-byte ECC code
This innovative ECC algorithm from IBM research allows a full 8-bit device-kill to be
corrected dynamically. This ECC code mechanism works on DIMM pairs on a rank basis.
(Depending on the size, a DIMM might have one, two, or four ranks.) With this ECC code,
an entirely bad DRAM chip can be marked as bad (chip mark). After marking the DRAM
as bad, the code corrects all the errors in the bad DRAM. It can additionally mark a 2-bit
symbol as bad and correct the 2-bit symbol, providing a double-error detect or single-error
correct ECC, or a better level of protection in addition to the detection or correction of a
chipkill event.