Chapter 4. Reliability, availability, and serviceability
Draft Document for Review October 14, 2014 10:19 am
4.3.5 Alternative processor recovery and Partition Availability Priority
If Processor Instruction Retry for a fault within a core occurs multiple times without success,
the fault is considered to be a solid failure. In some instances, PowerVM can work with the
processor hardware to migrate a workload running on the failing processor to a spare or
alternative processor. This migration is accomplished by migrating the pertinent processor
core state from one core to another with the new core taking over at the instruction that failed
on the faulty core. A successful migration keeps the application running throughout, so the
affected application does not need to be terminated.
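The sequence described above (retry on the same core, then migrate to a spare) can be sketched as follows. This is only an illustrative model, not PowerVM or hardware code: the `Core` class, the `run_with_recovery` function, the `state` dictionary, and the `MAX_RETRIES` threshold are all hypothetical names and values chosen for the example. The source says only that the retry occurs "multiple times".

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

MAX_RETRIES = 3  # hypothetical threshold; the text only says "multiple times"


@dataclass
class Core:
    core_id: int
    faulty: bool = False
    # Stand-in for the pertinent processor core state (registers and so on).
    state: dict = field(default_factory=dict)

    def execute(self, instruction: str) -> bool:
        """Return True if the instruction completes, False if it faults."""
        return not self.faulty


def run_with_recovery(instruction: str, core: Core,
                      spare: Optional[Core]) -> Tuple[Core, str]:
    # Step 1: Processor Instruction Retry on the original core.
    for _ in range(1 + MAX_RETRIES):
        if core.execute(instruction):
            return core, "completed"

    # Step 2: repeated failure is treated as a solid failure, so try
    # alternative processor recovery if a spare core exists.
    if spare is None:
        return core, "no spare core"

    # Move the core state to the spare; the spare resumes at the
    # instruction that failed on the faulty core.
    spare.state = dict(core.state)
    core.state.clear()
    if spare.execute(instruction):
        return spare, "migrated"
    return spare, "failed on spare"
```

In this model, a call with a faulty core and a healthy spare returns the spare core with the status "migrated", and the spare holds the state that was copied from the faulty core.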
Successful migration requires enough spare capacity that the overall processing capacity of
the system can be reduced by one processor core. Typically, in highly
virtualized environments, the requirements of partitions can be reduced to accomplish this
task without any further impact to running applications.
In systems without sufficient reserve capacity, it might be necessary to terminate at least one
partition to free resources for the migration. PowerVM users can identify in advance which
partitions have the highest priority and which do not. When you use this Partition Availability
Priority feature of PowerVM, if a partition must be terminated for alternative processor recovery
to complete, the system can terminate lower priority partitions to keep the higher priority
partitions up and running, even when an unrecoverable error occurs on a core running the
highest priority workload.
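As a rough illustration of how a priority rating could drive this choice, the following sketch picks the lowest priority partitions first until enough cores are freed. It is a simplified model under stated assumptions and not the PowerVM algorithm: the `Partition` class, the `partitions_to_terminate` function, and the whole-core accounting are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    name: str
    priority: int  # 0 = lowest priority, 255 = highest priority
    cores: int     # whole cores assigned (a simplification for this sketch)


def partitions_to_terminate(partitions: List[Partition],
                            spare_cores: int,
                            cores_needed: int = 1) -> List[str]:
    """Return the names of partitions to stop, lowest priority first."""
    shortfall = cores_needed - spare_cores
    victims: List[str] = []
    # Walk the partitions from lowest to highest priority.
    for part in sorted(partitions, key=lambda p: p.priority):
        if shortfall <= 0:
            break  # enough capacity has been freed
        victims.append(part.name)
        shortfall -= part.cores
    return victims
```

With no spare capacity and partitions rated 200, 50, and 192, only the partition rated 50 is selected. With one spare core already available, nothing is terminated.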
Partition Availability Priority is assigned to partitions by using a weight value or integer rating.
The lowest priority partition is rated at 0 (zero) and the highest priority partition is rated at
255. The default value is set to 127 for standard partitions and 192 for Virtual I/O Server
(VIOS) partitions. Priorities can be modified through the Hardware Management Console
(HMC).
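The rating range and defaults above can be captured in a small validation helper. This is a sketch only: the `resolve_priority` function and the partition type strings are hypothetical and do not correspond to any HMC command or API.

```python
from typing import Optional

MIN_PRIORITY = 0    # lowest priority rating
MAX_PRIORITY = 255  # highest priority rating

# Default ratings stated in the text, keyed by a hypothetical type label.
DEFAULT_PRIORITY = {"standard": 127, "vios": 192}


def resolve_priority(partition_type: str,
                     requested: Optional[int] = None) -> int:
    """Return the requested rating, or the default for the partition type."""
    if requested is None:
        return DEFAULT_PRIORITY[partition_type]
    if not MIN_PRIORITY <= requested <= MAX_PRIORITY:
        raise ValueError(
            f"priority must be between {MIN_PRIORITY} and {MAX_PRIORITY}")
    return requested
```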
4.3.6 Core Contained Checkstops and other PowerVM error recovery
PowerVM can handle certain other hardware faults without terminating applications, such as
errors in certain data structures (for example, faults in translation tables or lookaside buffers).
Other core hardware faults that alternative processor recovery or Processor Instruction Retry
cannot contain might be handled in PowerVM by a technique called Core Contained
Checkstops. This technique allows PowerVM to be signaled when such faults occur and to
terminate the code that is running on the failing processor core (typically just a partition,
although potentially PowerVM itself if the failing instruction is in a critical area of PowerVM code).
Processor designs without Processor Instruction Retry typically must resort to such
techniques for all faults that can be contained to an instruction in a processor core.
4.3.7 Cache uncorrectable error handling
If a fault occurs within a cache that cannot be corrected with single-error-correcting,
double-error-detecting (SEC/DED) ECC, the faulty cache
element is unconfigured from the system. Typically, this is done by purging and deleting a
single cache line. Such purge and delete operations are contained within the hardware itself,
and prevent a faulty cache line from being reused and causing multiple errors.
During the cache purge operation, the data that is stored in the cache line is corrected where
possible. If correction is not possible, the associated cache line is marked with a special ECC
code that indicates that the cache line itself has bad data.
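The purge and delete behavior can be modeled in software as follows. This is a conceptual sketch only, because the real operation is contained within the hardware. The `Cache`, `CacheLine`, and `WriteBack` classes, the `correctable` flag, and the `bad_data` field (which stands in for the special ECC code) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class CacheLine:
    address: int
    data: bytes
    correctable: bool  # hypothetical flag: can the data still be corrected?


@dataclass
class WriteBack:
    address: int
    data: Optional[bytes]
    bad_data: bool  # stands in for the special ECC code marking bad data


class Cache:
    def __init__(self) -> None:
        self.lines: Dict[int, CacheLine] = {}  # slot index -> cached line
        self.deleted: Set[int] = set()         # slots removed from service

    def fill(self, slot: int, line: CacheLine) -> bool:
        """Store a line; refuse slots that were deleted after a fault."""
        if slot in self.deleted:
            return False
        self.lines[slot] = line
        return True

    def purge_and_delete(self, slot: int) -> WriteBack:
        """Purge the line in a faulty slot and take the slot out of use."""
        line = self.lines.pop(slot)   # purge: the line leaves the cache
        self.deleted.add(slot)        # delete: the slot is never reused
        if line.correctable:
            return WriteBack(line.address, line.data, bad_data=False)
        # Correction is not possible, so flag the data as bad.
        return WriteBack(line.address, None, bad_data=True)
```

After `purge_and_delete` is called for a slot, later `fill` calls for that slot return `False`, which mirrors how deletion prevents a faulty cache line from being reused.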